Indexing technology in search engines

Indexing is one of the core technologies of a search engine. The engine sorts, classifies, and indexes the information it has collected to produce an index database. For Chinese search engines, the core of this process is word segmentation: using rules and a vocabulary (thesaurus) to split a sentence into words in preparation for automatic indexing. A non-clustered method is often used in indexing, and it depends heavily on an understanding of the language and its writing system. Specifically, it involves the following points:

(1) Store a grammar database that works together with a vocabulary database to separate the words in a sentence;

(2) Along with the vocabulary itself, store each word's usage frequency and its common collocations;

(3) Because the vocabulary is broad, it can be divided into separate specialized databases to make it easier to process specialized documents;

(4) For sentences that cannot be segmented, treat each character as a word.
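
The dictionary lookup and the fallback for unsegmentable text can be illustrated with a minimal sketch. It assumes forward maximum matching, which is one common dictionary-based approach and not necessarily the one any particular engine uses. The function name `segment`, the `max_word_len` parameter, and the tiny `VOCABULARY` set are hypothetical, and the grammar database and usage frequencies from points (1) and (2) are left out.

```python
def segment(sentence, vocabulary, max_word_len=4):
    """Forward maximum matching: at each position take the longest vocabulary word."""
    tokens = []
    i = 0
    while i < len(sentence):
        longest = min(max_word_len, len(sentence) - i)
        # Try the longest candidate first, shrinking down to two characters.
        for length in range(longest, 1, -1):
            candidate = sentence[i:i + length]
            if candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
        else:
            # No vocabulary word starts here: treat the single character as a word.
            tokens.append(sentence[i])
            i += 1
    return tokens


# Hypothetical vocabulary; a real one would hold many thousands of entries.
VOCABULARY = {"搜索", "引擎", "搜索引擎", "索引", "技术"}

print(segment("搜索引擎的索引技术", VOCABULARY))
```

Because the longest match is tried first, "搜索引擎" is kept as one word instead of being split into "搜索" and "引擎", while "的", which is not in the vocabulary, falls through to the single-character case.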

The indexer generates a relational index table that maps keywords to URLs. Index tables generally use some form of inverted list, in which the corresponding URLs are looked up by index term. The index table should also record the position of each index term within the document, so that the searcher can work out whether index terms are adjacent or close to one another. The table is stored on disk in a specific data structure.
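
As a rough illustration of such a table, the sketch below builds an in-memory positional inverted index and uses the recorded positions to test proximity. The names `build_index` and `near`, the `max_gap` parameter, and the example URLs are assumptions made for this example. A real indexer would also serialize the structure to disk, which is omitted here.

```python
from collections import defaultdict


def build_index(pages):
    """Map each term to {url: [positions]}, i.e. a positional inverted index."""
    index = defaultdict(dict)
    for url, tokens in pages.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(url, []).append(position)
    return index


def near(index, term_a, term_b, max_gap=1):
    """Return URLs where term_a and term_b occur within max_gap positions."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = []
    # Only documents containing both terms can match.
    for url in postings_a.keys() & postings_b.keys():
        if any(abs(a - b) <= max_gap
               for a in postings_a[url]
               for b in postings_b[url]):
            hits.append(url)
    return sorted(hits)


# Hypothetical, already-tokenized pages.
PAGES = {
    "http://example.com/a": ["search", "engine", "index"],
    "http://example.com/b": ["index", "of", "a", "search", "engine"],
    "http://example.com/c": ["search", "the", "web", "with", "an", "engine"],
}
INDEX = build_index(PAGES)

print(sorted(INDEX["index"]))             # URLs containing "index"
print(near(INDEX, "search", "engine"))    # URLs where the two terms are adjacent
```

Storing positions costs extra space compared with a plain term-to-URL list, but without them the searcher could only tell that two terms appear in the same page, not how close together they are.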

Different search engine systems may adopt different indexing methods. For example, WebCrawler uses full-text retrieval and indexes every word in a web page; Lycos indexes only selected words, such as the page name, the title, and the most important annotation words; Infoseek provides concept retrieval and phrase retrieval, and supports Boolean operators such as AND, OR, NEAR, and NOT. The indexing methods of search engines can be roughly divided into three categories: automatic indexing, manual indexing, and user submission (site registration).
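
Boolean operators of this kind are commonly evaluated as set operations over the URL sets that an index returns for each term. The following is only a sketch of that idea, not a description of how Infoseek implemented it. `POSTINGS`, `lookup`, and the `boolean_*` helpers are hypothetical names, and NEAR is left out because it needs term positions as well as URL sets.

```python
# Hypothetical postings: each term maps to the set of URLs that contain it.
POSTINGS = {
    "search": {"u1", "u2", "u3"},
    "engine": {"u1", "u3"},
    "index": {"u2", "u3"},
}
ALL_URLS = set().union(*POSTINGS.values())


def lookup(term):
    """URLs containing the term; an unknown term matches nothing."""
    return POSTINGS.get(term, set())


def boolean_and(a, b):
    return lookup(a) & lookup(b)   # pages containing both terms


def boolean_or(a, b):
    return lookup(a) | lookup(b)   # pages containing either term


def boolean_not(a):
    return ALL_URLS - lookup(a)    # pages that do not contain the term


print(sorted(boolean_and("engine", "index")))
print(sorted(boolean_or("engine", "index")))
print(sorted(boolean_not("engine")))
```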