Contents
1. What is a knowledge graph?
2. Representation of a knowledge graph
3. Storage of a knowledge graph
4. Applications
5. Challenges
6. Conclusion
1. What is a knowledge graph?
A knowledge graph is essentially a semantic network: a graph-based data structure consisting of nodes and edges. In a knowledge graph, each node represents an "entity" that exists in the real world, and each edge represents a "relationship" between entities. A knowledge graph is the most direct way to express relationships. Generally speaking, it is a relational network that connects all kinds of heterogeneous information, and it gives us the ability to analyze problems from the perspective of "relationships".
The concept of the knowledge graph was first proposed by Google, mainly to improve its existing search engine. Unlike traditional keyword-based search, a knowledge graph can answer more complex relational queries, understand users' intent at the semantic level, and improve search quality. For example, if you type "Bill Gates" into Google's search box, information related to Bill Gates, such as his date of birth and family, appears on the right side of the results page.
Moreover, for slightly more complex queries such as "Who is Bill Gates' wife?", Google can accurately return his wife, Melinda Gates. This shows that, through the knowledge graph, the search engine genuinely understands the user's intent.
The knowledge graphs mentioned above are general-purpose ones, built to improve search and question answering in open domains. Next, let's look at the representation and application of domain-specific knowledge graphs, which is also the topic the industry cares about most.
2. Representation of a knowledge graph
Suppose we use a knowledge graph to describe the fact "Zhang San is Li Si's father". The entities here are Zhang San and Li Si, and the relationship is "father" (is_father_of). Of course, Zhang San and Li Si may also have relationships with other people (omitted for now). When we add a phone number to the knowledge graph as a node (a phone number is also an entity), we can define a relationship between a person and a phone number, has_phone, meaning that the number belongs to that person. The figure below shows these two different relationships.
In addition, we can attach time as an attribute of the has_phone relationship, indicating when the number was activated. Such attributes can be attached not only to relationships but also to entities. When this kind of information is added as attributes of relationships or entities, the resulting graph is called a property graph. Both the property graph and the traditional RDF format can be used to represent and store a knowledge graph, but there are differences between them, which will be briefly explained in the following sections.
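To make the property-graph idea concrete, here is a minimal sketch in Python using the networkx library. The entity names, the is_father_of and has_phone relation labels, the phone number, and the activation date are the illustrative values from this section, not a fixed schema.

```python
import networkx as nx

# A property graph: nodes are entities, edges are relationships,
# and both can carry arbitrary key-value attributes.
kg = nx.MultiDiGraph()

# Entities (nodes) with attributes.
kg.add_node("Zhang San", type="Person")
kg.add_node("Li Si", type="Person")
kg.add_node("135-0000-0000", type="Phone")    # a phone number is also an entity

# Relationships (edges) with attributes.
kg.add_edge("Zhang San", "Li Si", relation="is_father_of")
kg.add_edge("Zhang San", "135-0000-0000", relation="has_phone",
            since="2018-03")                   # attribute on the relationship

# Traverse: list every relationship of Zhang San.
for _, tail, attrs in kg.out_edges("Zhang San", data=True):
    print("Zhang San", attrs["relation"], tail, attrs.get("since", ""))
```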
3. Storage of a knowledge graph
A knowledge graph is a graph-based data structure, and there are two main ways to store one: the RDF format and graph databases. For their differences, see [1]. The curves below show how the popularity of different database categories has evolved in recent years; graph-based storage has clearly been the fastest-growing part of the whole database field. The chart comes from DB-Engines, whose statistics show the popularity of graph DBMSs increasing by about 500% over the past two years.
The list below shows the popularity ranking of graph databases. As the ranking shows, Neo4j occupies first place in the graph storage field, while Jena remains the most popular framework on the RDF side. These figures come from the DB-Engines ranking.
Of course, if the knowledge graph to be designed is very simple and queries will never go beyond one degree of relationship, we can also store it in a relational database. But for even slightly complicated relationship networks (and real-world entities and relationships are generally complicated), the advantages of a knowledge graph are obvious. First, compared with traditional storage, the efficiency of relational queries improves significantly: for 2- or 3-degree queries, graph-based storage can be thousands or even millions of times faster. Second, graph-based storage is very flexible in design, and generally only local changes are needed; for example, if we have a new data source, we simply insert it into the existing graph. By contrast, relational storage is inflexible: its schema is defined in advance, and changing it later is very costly. Finally, storing entities and relationships as a graph is the representation that best matches the logic of the domain itself.
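As a small illustration of why multi-hop queries are natural on a graph, the sketch below finds every entity within two hops of a starting entity with a plain breadth-first traversal; in a relational database the same query would need one extra self-join per hop. The adjacency data and names are made up for illustration.

```python
from collections import deque

# Adjacency-list view of a small knowledge graph (illustrative data).
graph = {
    "borrower": ["Zhang San", "Li Si"],
    "Zhang San": ["borrower", "Company A"],
    "Li Si": ["borrower", "Company A", "Wang Wu"],
    "Company A": ["Zhang San", "Li Si"],
    "Wang Wu": ["Li Si"],
}

def neighbors_within(graph, start, max_depth=2):
    """Return every node reachable from `start` in at most `max_depth` hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    seen.pop(start)
    return seen   # node -> hop distance

print(neighbors_within(graph, "borrower"))
# e.g. {'Zhang San': 1, 'Li Si': 1, 'Company A': 2, 'Wang Wu': 2}
```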
4. Applications
This article mainly discusses the application of knowledge graphs in the internet finance industry, though many of the scenarios and ideas can be extended to other industries. The scenarios mentioned here are only the tip of the iceberg; knowledge graphs can show their value in many other applications, which we will continue to discuss in subsequent articles.
Anti-fraud
Anti-fraud is a very important part of risk control. The difficulty of big-data-based anti-fraud lies in integrating data from different sources (structured and unstructured), building an anti-fraud engine, and effectively identifying fraud (such as identity fraud, organized fraud, and agency packaging). Many fraud cases involve complex relationship networks, which brings new challenges to fraud review. As a direct representation of relationships, a knowledge graph can address both problems well. First, as mentioned earlier, it provides a very convenient way to add new data sources. Second, representing relationships in the knowledge graph itself helps us analyze the potential risks hidden in complex relationship networks more effectively.
The core of anti-fraud is people. First, all data sources related to the borrower must be connected and combined into a knowledge graph, turning them into structured knowledge that a machine can understand. Here we can integrate not only the borrower's basic information (such as what is filled in on the application) but also the borrower's consumption records, behavior records, and browsing history, and then analyze and predict over the whole graph. One difficulty is that much of this data is unstructured data collected from the web, which has to be turned into structured data using machine learning and natural language processing.
Inconsistency verification
Inconsistency verification can be used to judge a borrower's fraud risk, similar to cross-checking. For example, borrower Zhang San and borrower Li Si fill in the same company phone number, but the companies they name are completely different. This becomes a risk point that deserves the reviewers' special attention.
For another example, the borrower claims that Zhang San is his friend and that Li Si is his father. When we try to add the borrower's information to the knowledge graph, the "consistency verification" engine is triggered. The engine first reads the existing relationship between Zhang San and Li Si and checks whether this "triangle relationship" can hold. The graph already records that Zhang San is Li Si's father, so the borrower's claims would make his own friend the father of his father. These relationships clearly cannot all hold at once, so there is an obvious inconsistency.
Inconsistency verification involves knowledge reasoning. Generally speaking, knowledge reasoning can be understood as "link prediction": deriving new relationships or links from the existing relationship graph. For example, if Zhang San and Li Si are friends, and Zhang San and the borrower are friends, we can infer that the borrower and Li Si may also be friends.
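A minimal sketch of how such a consistency rule might be encoded: each relation type is given a "generation offset" (a father is one generation above his child, friends are assumed to be in the same generation), generations are propagated through the graph, and every claimed relationship is then re-checked. The relation names, the generation assumption, and the triangle below are illustrative, not a real rule base.

```python
from collections import deque, defaultdict

# Generation offset implied by each relation: gen(head) - gen(tail).
OFFSET = {"is_father_of": 1, "friend_of": 0}

def check_consistency(triples):
    """Assign a generation level to each person via BFS over the constraint
    graph, then verify every claimed relationship; return the first conflict."""
    adj = defaultdict(list)                 # person -> [(other, offset delta)]
    for head, rel, tail in triples:
        d = OFFSET[rel]
        adj[head].append((tail, -d))        # gen(tail) = gen(head) - d
        adj[tail].append((head, d))         # gen(head) = gen(tail) + d
    gen = {}
    for start in adj:                       # one BFS per connected component
        if start in gen:
            continue
        gen[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other, delta in adj[node]:
                if other not in gen:
                    gen[other] = gen[node] + delta
                    queue.append(other)
    for head, rel, tail in triples:         # final verification pass
        if gen[head] - gen[tail] != OFFSET[rel]:
            return False, (head, rel, tail)
    return True, None

claims = [
    ("Zhang San", "is_father_of", "Li Si"),   # already in the graph
    ("Li Si", "is_father_of", "Borrower"),    # claimed by the borrower
    ("Borrower", "friend_of", "Zhang San"),   # claimed by the borrower
]
print(check_consistency(claims))  # (False, ...): the triangle cannot hold
```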
Group fraud
Compared with identifying false identities, uncovering organized group fraud is much harder. Such groups hide inside very complex relationship networks and are not easy to spot; only by untangling the hidden relationship network can we analyze and discover the potential risk. As a natural tool for analyzing relationship networks, a knowledge graph can help us identify this kind of risk more easily. For a simple example, some members of a fraud ring apply for loans under false identities, but part of the information they submit is shared. The figure below roughly illustrates this situation: Zhang San, Li Si, and Wang Wu have no direct relationship with one another, but through the network we can easily see that all three share certain information, which immediately alerts us to the risk of fraud. Group fraud takes many forms, but a knowledge graph provides a better and more convenient way to analyze it than other tools.
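A minimal sketch of this idea: build a graph that links each applicant to the information items they submitted (phone numbers, addresses, and so on, all illustrative values here) and then look at connected components; applicants with no direct relationship end up in the same component as soon as they share information.

```python
import networkx as nx

# Bipartite graph: applicants on one side, submitted information on the other.
g = nx.Graph()
applications = {
    "Zhang San": {"phone": "135-0000-0000", "address": "1 Example Road"},
    "Li Si":     {"phone": "135-0000-0000", "address": "2 Sample Street"},
    "Wang Wu":   {"phone": "136-1111-1111", "address": "2 Sample Street"},
    "Zhao Liu":  {"phone": "137-2222-2222", "address": "9 Other Lane"},
}
for person, fields in applications.items():
    for field, value in fields.items():
        g.add_edge(person, f"{field}:{value}")   # applicant -- submitted item

# Applicants that fall into the same connected component share information.
for component in nx.connected_components(g):
    people = sorted(p for p in component if p in applications)
    if len(people) > 1:
        print("possible fraud ring:", people)
# -> possible fraud ring: ['Li Si', 'Wang Wu', 'Zhang San']
```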
Anomaly detection
Anomaly detection is an important topic in data mining. Put simply, it means finding "abnormal" points in the given data; in our application, such abnormal points may be related to fraud. Since a knowledge graph can be regarded as a graph, anomaly detection on a knowledge graph is mostly based on the structure of the graph. Because a knowledge graph contains different entity types and relationship types, anomaly detection also needs to take this extra information into account. Most graph-based anomaly detection requires a large amount of computation, so it can be done offline. In our application framework, anomaly detection falls into two categories, static analysis and dynamic analysis, discussed in turn below.
-Static analysis
Static analysis means finding abnormal structures (such as abnormal subgraphs) in a given graph at a fixed point in time. In the figure below, we can clearly see that five nodes are connected to one another unusually densely and may form a fraud ring, so these abnormal structures deserve further analysis.
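One very simple static check, sketched below: for each connected group of nodes, compute the edge density (edges present divided by edges possible) and flag groups that are unusually dense. The graph data and the 0.8 threshold are illustrative assumptions; real systems would use richer structural features.

```python
import networkx as nx
from itertools import combinations

g = nx.Graph()
# Illustrative data: five nodes that are almost fully interconnected (a
# suspicious cluster), plus a sparse chain of "normal" nodes.
ring = ["A", "B", "C", "D", "E"]
g.add_edges_from(combinations(ring, 2))
g.remove_edge("A", "E")
g.add_edges_from([("F", "G"), ("G", "H"), ("H", "I")])

DENSITY_THRESHOLD = 0.8   # assumed threshold, tuned per application

for component in nx.connected_components(g):
    if len(component) < 4:
        continue
    sub = g.subgraph(component)
    d = nx.density(sub)                     # edges present / edges possible
    if d >= DENSITY_THRESHOLD:
        print("abnormally dense subgraph:", sorted(component), round(d, 2))
# -> abnormally dense subgraph: ['A', 'B', 'C', 'D', 'E'] 0.9
```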
-Dynamic analysis
Dynamic analysis means analyzing how the graph structure changes over time. Our assumption is that, over a short period, the structure of the knowledge graph should not change much; a large change suggests a possible anomaly and deserves further attention. Analyzing structural change over time involves time-series analysis and graph-similarity computation, and interested readers can consult the related literature.
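A minimal sketch of the dynamic idea: take periodic snapshots of the graph's edge set and compute the Jaccard similarity between consecutive snapshots; a sudden drop in similarity signals that the structure changed more than expected. The snapshots and the 0.7 threshold are illustrative assumptions.

```python
def jaccard(edges_a, edges_b):
    """Jaccard similarity between two edge sets (1.0 = identical structure)."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0

# Illustrative weekly snapshots of the graph, as sets of (node, node) edges.
snapshots = [
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")},              # small change
    {("a", "b"), ("x", "y"), ("y", "z"), ("x", "z"), ("p", "q")},  # big change
]

SIMILARITY_THRESHOLD = 0.7   # assumed threshold

for week, (prev, cur) in enumerate(zip(snapshots, snapshots[1:]), start=1):
    sim = jaccard(prev, cur)
    flag = "ANOMALY" if sim < SIMILARITY_THRESHOLD else "ok"
    print(f"week {week} -> {week + 1}: similarity={sim:.2f} {flag}")
```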
Lost-contact customer management
In addition to pre-loan risk control, a knowledge graph can also play a powerful role after the loan is issued. For example, in managing customers who go out of contact after receiving a loan, a knowledge graph can help us discover more potential new contacts and thus improve the success rate of collections.
In reality, many borrowers simply disappear after the loan is issued and cannot be reached, and even the other contacts the borrower provided cannot reach them either. The borrower has entered the so-called "lost contact" state, and the collections staff have nothing to work with. The next question, then, is whether we can find new contacts for the borrower in this situation: people who do not yet appear in our knowledge graph as listed contacts. If we can dig out more potential new contacts, the success rate of collections will improve greatly. For example, in the figure below, the borrower has a direct relationship with Li Si, but we cannot reach Li Si. Can we analyze the 2-degree relationships to predict which of Li Si's contacts are likely to know the borrower? This involves analyzing the structure of the graph.
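A toy sketch of this 2-degree analysis: starting from the unreachable contact (Li Si), list his other contacts and rank them by how many connections they share with the borrower; the more overlap, the more likely they know the borrower. All names and edges here are illustrative.

```python
# Undirected contact graph (illustrative data).
contacts = {
    "Borrower":  {"Li Si", "Company A"},
    "Li Si":     {"Borrower", "Wang Wu", "Zhao Liu", "Company A"},
    "Wang Wu":   {"Li Si", "Company A"},
    "Zhao Liu":  {"Li Si"},
    "Company A": {"Borrower", "Li Si", "Wang Wu"},
}

def candidate_contacts(graph, borrower, unreachable):
    """Rank the unreachable contact's neighbors by overlap with the borrower."""
    known = graph[borrower] | {borrower}
    candidates = graph[unreachable] - known          # 2-degree contacts only
    scored = {
        person: len(graph[person] & graph[borrower])  # shared connections
        for person in candidates
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

print(candidate_contacts(contacts, "Borrower", "Li Si"))
# -> [('Wang Wu', 2), ('Zhao Liu', 1)]
```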
Intelligent search and visualization
Based on the knowledge graph, we can also provide intelligent search and data visualization services. The intelligent search function is similar to how Google and Baidu apply their knowledge graphs: for every keyword searched, we can return richer and more comprehensive information through the knowledge graph. For example, when someone searches for an ID number, our intelligent search engine can return the historical loan records, contact information, behavioral features, and labels (such as blacklist, peers, etc.) of every entity related to that person. In addition, the benefits of visualization are self-evident: complex information is presented in a very intuitive way, so we can grasp the ins and outs of hidden information at a glance.
Precision marketing
Michele Goetz, a principal analyst at Forrester Research, put it this way: "Knowledge graphs allow you to capture the core information about customers, including their names, addresses, and contact information, and connect it with the other people they know and how they interact online."
A smart enterprise can tap potential customers more effectively than its competitors. In the Internet age there are all kinds of marketing channels, but however many there are, they all come back to one core task: analyzing and understanding users. A knowledge graph can combine multiple data sources and analyze the relationships between entities, giving us a better understanding of user behavior. For example, a company's marketing manager can use the knowledge graph to analyze the relationships among users, find what a given group has in common, and then design a marketing strategy for that group. Only by better understanding users' needs can we market to them well.
5. Challenges
Knowledge graphs are not yet widely used in industry, and even among the enterprises moving in this direction, many are still at the research stage. The main reason is that many enterprises do not yet know or understand knowledge graphs deeply. One thing is certain, though: judging from the current trend, knowledge graphs will become a popular tool in industry within the next few years. Of course, the knowledge graph is still a relatively new tool, and practical applications inevitably run into a number of challenges.
Data noise
First, the data contains a lot of noise. Even data already sitting in our database cannot be assumed to be 100% accurate. There are two main issues. First, the accumulated data contains errors that need to be corrected; the simplest correction is the offline inconsistency verification mentioned earlier. Second, the data is redundant. For example, borrower Zhang San fills in his company name as "Puhui", borrower Li Si fills in "Puhui Finance", and borrower Wang Wu fills in "Puhui Finance Information Service Co., Ltd.". Although all three belong to the same company, the computer treats them as three different companies because the names differ. The next question is how to find these ambiguous names in massive data and merge them into one, which involves entity disambiguation in natural language processing.
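As a first, deliberately naive approximation, a string-similarity heuristic can group obviously related company names before a human or a heavier NLP model confirms the merge. The sketch below uses Python's difflib; the names, the normalization, and the containment rule are illustrative assumptions, and real disambiguation would also use address, phone, and other context.

```python
from difflib import SequenceMatcher

names = ["Puhui", "Puhui Finance",
         "Puhui Finance Information Service Co., Ltd.",
         "Another Trading Co., Ltd."]

def normalize(name):
    return name.lower().replace(",", "").replace(".", "").strip()

def same_company(a, b, cutoff=0.75):
    """Naive heuristic: one name contains the other, or they are very similar."""
    a, b = normalize(a), normalize(b)
    if a in b or b in a:
        return True
    return SequenceMatcher(None, a, b).ratio() >= cutoff

# Greedy clustering: put each name into the first cluster it matches.
clusters = []
for name in names:
    for cluster in clusters:
        if any(same_company(name, member) for member in cluster):
            cluster.append(name)
            break
    else:
        clusters.append([name])

print(clusters)
# -> [['Puhui', 'Puhui Finance', 'Puhui Finance Information Service Co., Ltd.'],
#     ['Another Trading Co., Ltd.']]
```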
Ability to process unstructured data
In the era of big data, much of the data is unprocessed, unstructured data such as text, images, audio, and video. In the internet finance industry in particular, we often face large amounts of text. Extracting valuable information from this unstructured data is a very challenging task, and it raises the bar for machine learning, data mining, and natural language processing capabilities.
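As a tiny illustration of turning text into structured facts, the sketch below pulls phone numbers and a claimed employer out of a free-text application note with regular expressions. The text and patterns are illustrative only; a real system would use trained NER and relation-extraction models rather than hand-written patterns.

```python
import re

note = ("Applicant Zhang San says he works at Puhui Finance and can be "
        "reached at 135-0000-0000; his colleague Li Si's number is 136-1111-1111.")

# Hand-written patterns as a stand-in for a real NER / relation-extraction model.
phone_pattern = re.compile(r"\b\d{3}-\d{4}-\d{4}\b")
employer_pattern = re.compile(r"works at ([A-Z][\w&., ]+?)(?: and| with|[.;])")

match = employer_pattern.search(note)
facts = {
    "phones": phone_pattern.findall(note),
    "employer": match.group(1) if match else None,
}
print(facts)
# -> {'phones': ['135-0000-0000', '136-1111-1111'], 'employer': 'Puhui Finance'}
```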
Knowledge reasoning
Reasoning is an important feature of human intelligence: it lets us discover implicit knowledge from existing knowledge. Reasoning generally needs the support of rules. For example, "a friend's friend" can suggest a "friend" relationship, and "the father of one's father" implies a "grandfather" relationship. For another example, if many of Zhang San's friends are also Li Si's friends, we can speculate that Zhang San and Li Si are probably friends too, though only with some probability. When the amount of information is very large, the key question is how to combine this evidence effectively with the reasoning algorithm. Commonly used approaches include logic-based reasoning and reasoning based on distributed representations. As deep learning becomes increasingly important in artificial intelligence, reasoning based on distributed representations has become a research hotspot; interested readers can refer to recent work in this area [4, 5, 6, 7].
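A minimal sketch of rule-based reasoning by relation composition: each rule says that if relation r1 connects A to B and r2 connects B to C, a new relation between A and C can be inferred, possibly with a confidence attached. The people (including "Xiao Ming"), the rules, and the confidence values are illustrative assumptions, not a real rule base.

```python
# Composition rules: (first relation, second relation) -> (inferred relation, confidence)
RULES = {
    ("is_father_of", "is_father_of"): ("is_grandfather_of", 1.0),
    ("friend_of", "friend_of"):       ("friend_of", 0.3),   # only probably true
}

def infer(triples):
    """Apply every composition rule once and return newly inferred triples."""
    known = set(triples)
    inferred = []
    for a, r1, b in triples:
        for b2, r2, c in triples:
            if b2 != b or a == c:
                continue
            rule = RULES.get((r1, r2))
            if rule and (a, rule[0], c) not in known:
                inferred.append((a, rule[0], c, rule[1]))
    return inferred

facts = [
    ("Zhang San", "is_father_of", "Li Si"),
    ("Li Si", "is_father_of", "Xiao Ming"),
    ("Zhang San", "friend_of", "Wang Wu"),
    ("Wang Wu", "friend_of", "Borrower"),
]
for head, rel, tail, conf in infer(facts):
    print(head, rel, tail, f"(confidence {conf})")
# -> Zhang San is_grandfather_of Xiao Ming (confidence 1.0)
#    Zhang San friend_of Borrower (confidence 0.3)
```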
Big data, small samples, and an effective closed loop
Although the amount of available data is huge, we still face a small-sample problem: the number of labeled samples is small. Suppose we want to build an anti-fraud scoring system based on machine learning. We first need fraud samples, but in practice the number we can obtain is very small: even with millions of loan applications, perhaps only tens of thousands end up labeled as fraud. This poses a bigger challenge for machine-learning modeling. Every fraud sample was obtained at a high cost, and although we will collect more over time, the room for growth is limited. This differs from traditional machine-learning tasks such as image recognition, where obtaining hundreds of thousands or even millions of samples is not difficult.
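One standard way to cope with very few fraud labels, sketched here with scikit-learn, is to weight the classes inversely to their frequency so that the rare fraud class still influences the model. The synthetic data below stands in for real loan features, and class_weight="balanced" is just one of several options (resampling, anomaly detection, and so on).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for loan-application features: 10,000 normal, 100 fraud.
X_normal = rng.normal(0.0, 1.0, size=(10_000, 5))
X_fraud = rng.normal(1.5, 1.0, size=(100, 5))
X = np.vstack([X_normal, X_fraud])
y = np.array([0] * 10_000 + [1] * 100)

# class_weight="balanced" re-weights samples inversely to class frequency,
# so the 1% fraud class is not simply ignored by the classifier.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X, y)

print("predicted fraud rate:", model.predict(X).mean())
```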
Under such small-sample conditions, building an effective closed loop is particularly important. By a closed loop we mean an effective self-feedback system that feeds outcomes back to our model in real time, so the model can keep optimizing itself and improving its accuracy. To build this self-learning system, we need not only to improve the existing data pipeline but also to go deep into each business line and optimize the corresponding processes. This is a necessary part of the whole anti-fraud chain; after all, the whole process is an adversarial game, so we must constantly adjust our strategy based on feedback signals.
6. Conclusion
Knowledge graphs are attracting more and more attention from both academia and industry. Besides the applications mentioned in this article, knowledge graphs can also be applied to other fields such as permission management and human resource management, which will be discussed in detail in subsequent articles.
References
[1] De Abreu, D., Flores, A., Palma, G., Pestana, V., Piñero, J., Queipo, J., ... & Vidal, M. E. (2013). Choosing between graph databases and RDF engines for consuming and mining linked data. In COLD (Consuming Linked Data workshop).
[2] User Behavior Tutorial.
[3] Liu Zhiyuan. Knowledge Graph: The Knowledge Base in the Machine Brain, Chapter 2.
[4] Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2016). A review of relational machine learning for knowledge graphs. Proceedings of the IEEE.
[5] Socher, R., Chen, D., Manning, C. D., & Ng, A. (2013). Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems (pp. 926-934).
[6] Bordes, A., Usunier, N., Garcia-Durán, A., Weston, J., & Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems (pp. 2787-2795).
[7] Jenatton, R., Le Roux, N., Bordes, A., & Obozinski, G. (2012). A latent factor model for highly multi-relational data. In Advances in Neural Information Processing Systems (pp. 3167-3175).