The quality of word segmentation is related to the accuracy of query and the size of generated index. In the development of Chinese word segmentation, binary word segmentation is often used in the early stage. The basic principle of this method is to divide sentences containing Chinese into binary words, regardless of the meaning, and only index the binary words. Therefore, this method separates a large number of words, resulting in a huge number of indexes, and useless data will be retrieved in the query. The advantage is that the algorithm is simple and the retrieved data will not be missed. Then the maximum matching word segmentation method is developed, which is divided into forward maximum word segmentation and reverse maximum word segmentation. Its principle is similar to looking up a dictionary, generating a dictionary for commonly used words, and matching the words in the dictionary to the maximum extent in the process of analyzing sentences, thus splitting sentences into meaningful word chains. The forward word segmentation method in the maximum matching method is easy to make mistakes when distinguishing formal words. For example, "jewelry and clothing" will use "kimono" as a single word. The improved reverse maximum word segmentation method is used in the big dream database, which improves the correct rate compared with the forward one. The most complicated is the method of word segmentation by statistical means. This method uses hidden Markov chain, that is, the probability of the last word depends on the probability of the previous word, and finally counts the maximum probability of all words as the basis of word segmentation. The recognition rate of this method for new nouns and place names is much higher than that of the maximum matching method, and the accuracy increases with the increase of the number of sample texts.
The binary word segmentation method and statistical method are independent of the dictionary, while the maximum matching word segmentation method depends on the dictionary, and the content of the dictionary determines the quality of the word segmentation structure.
The index of full-text retrieval is called inverted index, which is called inverted index because each word is regarded as an index item, and the text containing the word is searched according to the index item. Therefore, indexes are all words, and the labels of the only recorded text have a one-to-many relationship. Sort the index words, and locate the text containing these words according to the sorted words.
Step 1) Read a whole sentence into the variable str and go to step 2.
Step 2) Read 1 words into variable words at the end of clause, and go to step 3.
Step 3) Look up the words saved in word in the dictionary. If it exists, save word and go to step 4, otherwise go to step 5).
Step 4) If it is the largest word in the dictionary or exceeds the maximum number of words (recognized as new words), remove the word from the end of the clause and return to step 2.
Step 5) read the previous word into Word to form a new word, and go to step 3).
Storage data structure of thesaurus and matching algorithm of words in thesaurus
Words in memory are stored in a hierarchical structure.
Suppose there are the following words in the dictionary: China, Republic of China, National People's Democracy.
Arranged in the memory by layers as follows, where each square represents a word and the arrow points to the previous word of the word.