Information Retrieval and Search Engines (信息检索与搜索引擎)
Introduction to Information Retrieval
GESC1007
Philippe Fournier-Viger
Full professor
School of Natural Sciences and Humanities
Spring 2021
Last week
We discussed in more detail how indexes are created.
Tokenization, normalization, lemmatization…
Phrase queries
using positional indexes
QQ Group: 596340260
Website: PPTs…
Course schedule (日程安排)
Lecture 1: Introduction; Boolean retrieval (布尔检索模型)
Lecture 2: Term vocabulary and postings lists
Lecture 3: Dictionaries and tolerant retrieval
Lecture 4: Index construction and compression
Lecture 5: Scoring, weighting, and the vector space model
Lecture 6: Computing scores in a complete search system
Lecture 7: Evaluation in information retrieval
Lecture 8: Web search engines, advanced topics, and conclusion
About last course
Normalization (规范化): the process of converting tokens to a standard form
Stemming: removing the endings of words (a simple approach)
cars ⇒ car
airplanes ⇒ airplane
Lemmatization: converting a word to a common base form called a “lemma” (more complicated)
am, are, is ⇒ be
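The difference between the two approaches can be sketched in a few lines of Python. This is only a toy illustration: the function names, the single plural-stripping rule, and the small `LEMMAS` table are assumptions made for this example, not the rules of any real stemmer or lemmatizer.

```python
# Toy illustration of normalization, stemming, and lemmatization.
# All names and rules here are hypothetical and deliberately minimal.

def normalize(token: str) -> str:
    """Convert a token to a standard form (here: lowercase only)."""
    return token.lower()


def stem(token: str) -> str:
    """Crude stemming: strip a trailing plural 's' (but not 'ss')."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


# Lemmatization needs knowledge about words, modeled here as a lookup table.
LEMMAS = {"am": "be", "are": "be", "is": "be"}


def lemmatize(token: str) -> str:
    """Map a word to its lemma if it is in the table, else return it as is."""
    return LEMMAS.get(token, token)


for word in ["Cars", "airplanes", "is", "Are"]:
    t = normalize(word)
    print(word, "-> stem:", stem(t), "| lemma:", lemmatize(t))
```

Stemming only looks at the characters of the word, which is why it is simple; lemmatization has to know that "am", "are", and "is" belong to the same word, which is why it is harder.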
Chapter 3 – Dictionaries and tolerant retrieval
PDF -…
Previous weeks
Boolean retrieval model (布尔检索模型), using Boolean operators: Shenzhen AND food
Phrase (短语) queries: “Airplane tickets from Beijing”
Proximity queries: “Shenzhen” within 5 words of “City”
To find documents, we have used a dictionary (词典), which is the term-lookup part of an inverted index (倒排索引).
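The three query types recalled above can be sketched on a tiny positional index. The documents, function names, and the whitespace tokenization below are assumptions made for illustration; a real system would add normalization and more efficient list merging.

```python
from collections import defaultdict

# Toy collection; document IDs and texts are made up for illustration.
DOCS = {
    1: "shenzhen food is famous",
    2: "shenzhen is a big city with good food",
    3: "airplane tickets from beijing to shenzhen",
    4: "beijing food",
}


def build_positional_index(docs):
    """Map term -> {doc_id -> [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, t1, t2):
    """Documents that contain both terms."""
    return sorted(set(index.get(t1, {})) & set(index.get(t2, {})))


def proximity(index, t1, t2, k):
    """Documents where t1 and t2 occur within k words of each other."""
    hits = []
    for doc_id in boolean_and(index, t1, t2):
        p1, p2 = index[t1][doc_id], index[t2][doc_id]
        if any(abs(a - b) <= k for a in p1 for b in p2):
            hits.append(doc_id)
    return hits


def phrase(index, terms):
    """Documents that contain the terms consecutively and in order."""
    if not terms:
        return []
    candidates = set(index.get(terms[0], {}))
    for t in terms[1:]:
        candidates &= set(index.get(t, {}))
    hits = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        # The i-th term must appear exactly i positions after the first one.
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits


index = build_positional_index(DOCS)
print(boolean_and(index, "shenzhen", "food"))
print(proximity(index, "shenzhen", "city", 5))
print(phrase(index, "airplane tickets from beijing".split()))
```

Storing positions is what makes phrase and proximity queries possible; a Boolean AND only needs to know which documents contain each term.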
Today
How to deal with typographical errors (打字错误)?
Shenzhen vs Shenzhennn
These are often made by accident (无意地).
How to deal with different spellings (拼法)?
Color vs Colour
analyze vs analyse
How to deal with phonetically similar terms (发音相似的词)?
concede vs conceed
right vs write vs rite vs wright
Wildcard queries (通配符查询)
Wildcard (*) query: a query containing the wildcard (通配符) character “*”
* = any sequence of characters (possibly empty)
e.g. automat* to search for terms that begin with “automat”
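A query with a single trailing wildcard, such as automat*, can be answered by looking up a prefix in a sorted list of terms. The sketch below is one possible way to do this; the vocabulary and the function name `trailing_wildcard` are hypothetical. Note that a term equal to the prefix itself is also returned; filter it out if * is meant to require at least one character.

```python
import bisect

# Hypothetical vocabulary, kept sorted so that terms sharing a prefix
# are stored next to each other.
VOCABULARY = sorted([
    "automatic", "automation", "automobile", "autumn",
    "book", "automated", "auto",
])


def trailing_wildcard(vocabulary, query):
    """Return all terms matching a query of the form 'prefix*'."""
    if not query.endswith("*") or "*" in query[:-1]:
        raise ValueError("only a single trailing '*' is supported")
    prefix = query[:-1]
    # Binary search for the first term that is >= the prefix.
    start = bisect.bisect_left(vocabulary, prefix)
    matches = []
    # Terms with this prefix are contiguous in sorted order.
    for term in vocabulary[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


print(trailing_wildcard(VOCABULARY, "automat*"))
```

Each matching term can then be looked up in the index as in an ordinary query, and the results combined.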