信息检索与搜索引擎
Introduction to Information Retrieval
GESC1007
Philippe Fournier-Viger
Full professor
School of Natural Sciences and Humanities
******@
Spring 2020
1
Last week
We discussed:
Hashing (散列) and search trees (搜索树)
Wildcard queries
Spell correction
QQ Group: 1059666166
Website: PPTs…
Homework
The first homework was announced last week.
Please submit your answers no later than 30 March 2020 at 23:59.
Course schedule (日程安排)
Lecture 1
Introduction
Boolean retrieval (布尔检索模型)
Lecture 2
Term vocabulary and posting lists
Lecture 3
Dictionaries and tolerant retrieval
Lecture 4
Index construction and compression
Lecture 5
Scoring, weighting, and the vector space model
Lecture 6
Computing scores, and a complete search system
Lecture 7
Evaluation in information retrieval
Web search engines, advanced topics, and conclusion
PHONETIC (语音的) CORRECTION
Write…
Right… Rite… Wright
Phonetic correction
Misspellings are often caused by a user typing a query that sounds like the target term.
Phonetic hashing: try to group together all terms that sound similar.
Soundex algorithms
Turn every term to be indexed into a 4-character reduced form, e.g. Hermann → H655
Use these reduced forms to build an inverted index (dictionary 词典) from each reduced form to the original terms. This index is called the “soundex index”
Do the same with query terms
When a new query arrives, search using the soundex index.
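The indexing and lookup steps above can be sketched in Python as follows. This is only a minimal illustration: the names build_soundex_index, lookup, TOY_CODES and toy_encode are hypothetical, and a small hand-written table of codes stands in for a real Soundex encoder so that the example stays self-contained.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set

# Assumption: any function mapping a term to its 4-character code can be used.
Encoder = Callable[[str], str]


def build_soundex_index(terms: Iterable[str], encode: Encoder) -> Dict[str, Set[str]]:
    """Map each reduced form to the set of original terms that share it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for term in terms:
        index[encode(term)].add(term)
    return index


def lookup(index: Dict[str, Set[str]], query_term: str, encode: Encoder) -> List[str]:
    """Encode the query term the same way and return the matching terms."""
    return sorted(index.get(encode(query_term), set()))


# Hypothetical stand-in encoder: codes worked out by hand for a few names.
TOY_CODES = {
    "hermann": "H655",
    "herman": "H655",
    "harman": "H655",
    "robert": "R163",
    "rupert": "R163",
}


def toy_encode(term: str) -> str:
    return TOY_CODES[term.lower()]


index = build_soundex_index(["Hermann", "Herman", "Robert"], toy_encode)
matches = lookup(index, "Harman", toy_encode)
print(matches)
```

The query term "Harman" is not in the vocabulary, but it shares the code H655 with two indexed terms, so both are returned as candidates.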
How to calculate the 4-character codes?
Retain the first letter of the term.
Change all occurrences of the following letters to ’0’ (zero): ’A’, ’E’, ’I’, ’O’, ’U’, ’H’, ’W’, ’Y’
Change letters to digits as follows: B, F, P, V to 1. C, G, J, K, Q, S, X, Z to 2. D, T to 3. L to 4. M, N to 5. R to 6.
Repeatedly remove one out of each pair of consecutive identical digits.
Remove all zeros from the resulting text. Pad the resulting text with trailing zeros and return the first four positions, which will consist of a letter followed by three digits.
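The steps above can be sketched in Python as follows. This is a minimal sketch that follows the listed steps literally, not a reference implementation: the function name soundex and the table name CODES are hypothetical, characters outside A to Z are simply skipped (an assumption), and the retained first letter is not compared with the digit that follows it, which other Soundex variants may handle differently.

```python
# Letter-to-digit table taken from the steps above.
CODES = {}
for letters, digit in [
    ("AEIOUHWY", "0"),
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
]:
    for ch in letters:
        CODES[ch] = digit


def soundex(term: str) -> str:
    """Return the 4-character code for a term, or "" if it has no letters."""
    # Assumption: ignore anything that is not an ASCII letter.
    letters = [ch for ch in term.upper() if ch in CODES]
    if not letters:
        return ""

    # Step 1: retain the first letter.
    first, rest = letters[0], letters[1:]

    # Steps 2 and 3: replace the remaining letters by digits.
    digits = [CODES[ch] for ch in rest]

    # Step 4: collapse runs of consecutive identical digits.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)

    # Step 5: remove zeros, pad with trailing zeros, keep four positions.
    kept = "".join(d for d in collapsed if d != "0")
    return (first + kept + "000")[:4]


print(soundex("Hermann"))  # H655, the example given earlier
```

Because zeros are removed only after identical digits are collapsed, two letters with the same digit that are separated by a vowel are both kept, as with the M and N in Hermann.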