Table of contents
Plan
Free text queries
Incidence matrices
Example
Overlap matching
Scoring: density-based
Term-document count matrices
Bag of words view of a doc
Counts vs. frequencies
Digression: terminology
Term frequency tf
Weighting term frequency: tf
Score computation
Weighting should depend on the term overall
Document frequency
tf x idf term weights
Summary: tf x idf (or tf.idf)
Real-valued term-document matrices
Documents as vectors
Why turn docs into vectors?
Intuition
The vector space model
Desiderata for proximity
First cut
Cosine similarity
Normalized vectors
Cosine similarity exercises
Digression: spamming indices
Summary: What’s the real point of using vector spaces?
Interaction: vectors and phrases
Vectors and Boolean queries
Vectors and wild cards
Vector spaces and other operators
Query language vs. scoring
Exercises
Efficient cosine ranking
Computing a single cosine
Encoding document frequencies
Computing the k largest cosines: selection vs. sorting
Use heap for selecting top k
Bottleneck
Removing bottlenecks
Can we avoid this?
Best m candidates
Precision-recall curves
One Problem With Boolean Queries: Feast or Famine
Resources for this lecture