Search Engine Architecture - ptoolis's Diary

I have recently been reading a book on information retrieval. The purpose of a search engine is to find the documents that match the query a user enters.
Issues involved in search engine architecture are:
1. Document tokenization
2. Index construction
3. Query parsing and index lookup
4. Results ordering
Tokenization means choosing the atomic unit into which a document is divided for indexing. In English documents, tokens are usually words, and tokenization mostly amounts to splitting on the whitespace between words. There are edge cases, however, such as words with apostrophes (should "it's" be one token or two?). In Asian languages written without spaces, such as Japanese and Chinese, deciding where to divide a document is a hard problem. Some machine learning-based tokenizers exist. A common alternative is a k-gram tokenizer, which treats each run of k consecutive characters as a token.
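The two approaches to tokenization can be sketched roughly as follows. This is only a minimal illustration: the function names `word_tokens` and `kgram_tokens` are made up for this example, keeping the apostrophe inside a word is just one possible answer to the "it's" question, and the k-gram version assumes that overlapping k-grams are wanted.

```python
import re

def word_tokens(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes.

    Assumption: an apostrophe stays inside its word, so "it's" is one token.
    """
    return re.findall(r"[a-z0-9']+", text.lower())

def kgram_tokens(text, k=2):
    """Return every overlapping run of k consecutive characters.

    Whitespace is removed first, so this also works for text with no spaces.
    """
    chars = "".join(text.split())
    return [chars[i:i + k] for i in range(len(chars) - k + 1)]

print(word_tokens("It's a search engine"))
print(kgram_tokens("検索エンジン", k=2))
```

A text shorter than k characters produces no k-grams at all, so a real tokenizer would probably need a fallback for very short inputs.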
The dictionary (the list of terms in the index) can be stored in several data structures. One is a hash table keyed on term (or term ID) whose value is a pointer to a postings list. Another is a binary tree or B-tree (a tree in which the number of branches at each node falls within a fixed range), which keeps terms in sorted order and so supports wildcard lookup. For example, the query "word*" would return all documents containing terms with the prefix "word". A third structure is the "permuterm index", a tree that stores every rotation of every term, with a "$" appended to mark the end of the term. For example, "chaos" is stored as {chaos$, haos$c, aos$ch, os$cha, s$chao, $chaos}; in each rotation the characters to the right of "$" are the start of the term and the characters to the left are its end. This makes queries with "*" in the middle possible: the query is rotated so that the "*" falls at the end, the "*" is dropped, and the remainder is matched as a prefix against the tree. For example, "c*s" becomes "s$c*", and the prefix "s$c" matches the rotation "s$chao" of "chaos".
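A small sketch of the permuterm idea is below. It is not a real tree: a sorted Python list searched with `bisect` stands in for the B-tree, and the names `rotations`, `build_permuterm`, and `wildcard_lookup` are invented for this example. It also assumes the query contains exactly one "*".

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of term + '$', e.g. 'ab' -> ['ab$', 'b$a', '$ab']."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def build_permuterm(terms):
    """Sorted list of (rotation, original term) pairs, standing in for the tree."""
    return sorted((rot, term) for term in terms for rot in rotations(term))

def wildcard_lookup(index, query):
    """Resolve a query with exactly one '*' against the permuterm index."""
    head, tail = query.split("*")
    # Rotate the query so that '*' falls at the end, then drop the '*'.
    prefix = tail + "$" + head
    # Jump to the first rotation that is >= the prefix in sorted order.
    start = bisect_left(index, (prefix,))
    matches = set()
    for rot, term in index[start:]:
        if not rot.startswith(prefix):
            break  # rotations sharing the prefix are contiguous
        matches.add(term)
    return sorted(matches)

index = build_permuterm(["chaos", "chess", "class", "word", "words"])
print(wildcard_lookup(index, "ch*s"))
print(wildcard_lookup(index, "word*"))
```

The same lookup handles a trailing "*" (the prefix becomes "$word") and a leading "*", at the cost of storing one entry per character of every term, plus one for the "$".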
An index maps terms (or term IDs) to the documents containing those terms. The list of terms is known as the dictionary. The basic algorithm for constructing an index passes over each document, extracting all term-document ID pairs. It then sorts the pairs by term and merges all document IDs for a single term into a "postings list". At query time, the words in the query are looked up in the index, yielding a set of postings lists. The algorithm then intersects these lists to find the documents that contain every word in the query.
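The construction and lookup steps can be sketched in a few lines. This is a toy, in-memory version under simple assumptions: documents are already tokenized, document IDs are comparable integers, and the helper names `build_index`, `intersect`, and `search` are made up for this example.

```python
from itertools import groupby

def build_index(docs):
    """docs maps document ID -> list of terms; returns term -> sorted postings list."""
    # Extract unique (term, doc ID) pairs, then sort by term and by doc ID.
    pairs = sorted({(term, doc_id) for doc_id, terms in docs.items() for term in terms})
    # Merge all doc IDs that share a term into a single postings list.
    return {term: [doc_id for _, doc_id in group]
            for term, group in groupby(pairs, key=lambda pair: pair[0])}

def intersect(a, b):
    """Intersect two sorted postings lists by walking both in step."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def search(index, query_terms):
    """Return the IDs of documents that contain every query term."""
    if not query_terms:
        return []
    # Start from the shortest list so intermediate results stay small.
    postings = sorted((index.get(term, []) for term in query_terms), key=len)
    result = list(postings[0])
    for plist in postings[1:]:
        result = intersect(result, plist)
    return result

docs = {1: ["new", "home", "sales"], 2: ["home", "sales", "rise"], 3: ["new", "rise"]}
index = build_index(docs)
print(search(index, ["home", "sales"]))
```

Keeping each postings list sorted by document ID is what lets the intersection run in a single pass over both lists.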
Ordering of results involves assigning scores to documents based on:
1. how many times the query terms occur in the document
2. how rare the query terms it contains are across the collection
3. the relative position of the query words in the document (e.g. words adjacent in the query should be adjacent in the document).
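The first two criteria can be combined in a single score. The sketch below uses TF-IDF weighting (term count multiplied by the log of the inverse document frequency), which is one common choice and not necessarily the exact formula a given engine uses. It ignores the third criterion, word position, and the names `tf_idf_scores` and `rank` are invented for this example.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query_terms):
    """docs maps document ID -> list of terms; returns document ID -> score."""
    n_docs = len(docs)
    counts = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    # Document frequency: the number of documents containing each query term.
    df = {term: sum(1 for c in counts.values() if term in c) for term in query_terms}
    scores = {}
    for doc_id, tf in counts.items():
        score = 0.0
        for term in query_terms:
            if df[term] and tf[term]:
                # Frequent in this document (tf) and rare overall (idf) scores high.
                score += tf[term] * math.log(n_docs / df[term])
        scores[doc_id] = score
    return scores

def rank(docs, query_terms):
    """Document IDs ordered from highest to lowest score."""
    scores = tf_idf_scores(docs, query_terms)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

docs = {
    1: ["search", "engine", "search"],
    2: ["search", "index"],
    3: ["cooking", "recipes"],
}
print(rank(docs, ["search", "index"]))
```

In this toy collection document 2 ranks above document 1 even though "search" appears more often in document 1, because "index" is the rarer term and carries more weight.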