Information Retrieval and Search Engines (信息检索与搜索引擎)
Introduction to Information Retrieval
GESC1007
Philippe Fournier-Viger
Full professor
School of Natural Sciences and Humanities
Spring 2021
Last time
We discussed:
how to calculate scores
the vector model
Course schedule (日程安排)
Week 1: Introduction (Chapter 1); Boolean retrieval
Week 2: Term vocabulary and posting lists (Chapter 2)
Week 3: Dictionaries and tolerant retrieval (Chapter 3)
Week 4: Index construction (Chapter 4)
Week 5 (last week): Scoring, term weighting, the vector space model (Chapter 6)
Week 6: A complete search system (Chapter 7)
Week 7: Evaluation in information retrieval
Week 8: Web search engines, advanced topics, conclusion
Final exam (date to be announced)
Term frequency (TF, 词频): how many times a term appears in a document.
Document frequency (DF, 文档频率): how many documents in a collection contain a term.
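The two counts above can be sketched in a few lines of Python. This is only an illustration: the toy collection and the function names term_frequency and document_frequency are hypothetical, and each document is assumed to be already split into lowercase tokens.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """TF: number of times `term` appears in one document."""
    return Counter(document_tokens)[term]


def document_frequency(term, collection):
    """DF: number of documents in the collection that contain `term`."""
    return sum(1 for tokens in collection if term in tokens)


# Hypothetical toy collection: each document is a list of lowercase tokens.
collection = [
    ["shenzhen", "is", "near", "hong", "kong"],
    ["beijing", "and", "shenzhen", "and", "beijing"],
    ["beijing", "is", "the", "capital"],
]

print(term_frequency("beijing", collection[1]))    # TF of "beijing" in the second document
print(document_frequency("shenzhen", collection))  # DF of "shenzhen" in the collection
```

Note that TF is computed for one document, while DF is computed over the whole collection.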
Inverse document frequency (IDF, 逆文档频率) of a term t:
IDF_t = log(N / DF_t)
N = number of documents in the collection
DF_t = document frequency of the term t
Example
N = 806,791 documents
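As a rough sketch of how IDF behaves, assuming the common definition IDF_t = log(N / DF_t) with a base-10 logarithm (the base is an assumption made here), the code below compares a rare term and a common term. The two DF values are hypothetical and are not taken from the example collection.

```python
import math


def inverse_document_frequency(n_documents, df):
    """IDF_t = log10(N / DF_t). The base-10 logarithm is an assumption."""
    if df <= 0:
        raise ValueError("DF must be positive: the term must occur in at least one document")
    return math.log10(n_documents / df)


N = 806_791          # collection size from the example
rare_df = 100        # hypothetical DF of a rare term
common_df = 100_000  # hypothetical DF of a common term

print(inverse_document_frequency(N, rare_df))
print(inverse_document_frequency(N, common_df))
```

The rare term gets the larger IDF, and a term that appears in every document (DF_t = N) gets an IDF of 0.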
TF-IDF
The TF-IDF of a term t for a document d:
TF-IDF(t, d) = TF(t, d) × IDF_t
TF(t, d): term frequency (词频), the number of times that the term t appears in the document d
IDF_t: inverse document frequency (逆文档频率) of the term t
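A minimal sketch of the TF-IDF score, assuming it is the product of the term frequency and IDF_t = log10(N / DF_t). The function name tf_idf and the toy collection are hypothetical, and returning 0 for a term that appears in no document is a convention chosen for this sketch.

```python
import math


def tf_idf(term, document_tokens, collection):
    """TF-IDF of `term` for one document of `collection` (lists of tokens)."""
    tf = document_tokens.count(term)
    df = sum(1 for tokens in collection if term in tokens)
    if df == 0:
        return 0.0  # convention for this sketch: unseen terms score 0
    idf = math.log10(len(collection) / df)
    return tf * idf


# Hypothetical toy collection of four tokenized documents.
collection = [
    ["shenzhen", "shenzhen", "beijing"],
    ["beijing"],
    ["beijing", "guangzhou"],
    ["beijing", "shanghai"],
]

print(tf_idf("shenzhen", collection[0], collection))  # frequent in doc, rare in collection
print(tf_idf("beijing", collection[0], collection))   # appears in every document
```

In this toy collection, "beijing" appears in every document, so its IDF (and therefore its TF-IDF) is 0, while "shenzhen" scores higher because it is frequent in one document and rare in the collection.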
Vector model (矢量模型)
Documents can be viewed as vectors:
vector(doc1) = [, ]
vector(doc2) = [, ]
vector(doc3) = [, ]
vector(doc4) = [, ]
[Figure: documents plotted as vectors. Axes: Shenzhen and Beijing, each a score computed using TF-IDF or TF]
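To illustrate how a document becomes a vector with one coordinate per term, here is a small sketch using the two terms from the figure. The documents are hypothetical, and raw TF is used as the score for simplicity (TF-IDF could be used instead).

```python
def document_vector(document_tokens, vocabulary):
    """One coordinate per vocabulary term; the score used here is the raw TF."""
    return [document_tokens.count(term) for term in vocabulary]


vocabulary = ["shenzhen", "beijing"]  # the two axes

# Hypothetical tokenized documents.
docs = {
    "doc1": ["shenzhen", "shenzhen", "airport"],
    "doc2": ["beijing", "shenzhen", "train"],
    "doc3": ["beijing", "beijing", "beijing"],
}

vectors = {name: document_vector(tokens, vocabulary) for name, tokens in docs.items()}
for name, vector in vectors.items():
    print(name, vector)
```

Terms outside the vocabulary (such as "airport") are ignored, so every document maps to a vector of the same length.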
The vector space model can be used to calculate how similar two documents are.
Two documents should be similar if their vectors are close to each other.
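One common way to measure how close two vectors are is the cosine of the angle between them. The sketch below assumes cosine similarity as the measure (other measures are possible), and the three vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


# Hypothetical vectors: [score for "Shenzhen", score for "Beijing"].
doc1 = [3, 1]
doc2 = [6, 2]  # same direction as doc1, only longer
doc3 = [0, 4]  # mostly about "Beijing"

print(cosine_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc3))
```

Because the cosine depends only on direction, doc1 and doc2 are judged maximally similar even though doc2 has larger scores, while doc1 and doc3 point in different directions and get a lower similarity.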