Table of Contents

Page 1

Plan

Page 3

Page 4

Free text queries

Page 6

Incidence matrices

Example

Overlap matching

Page 10

Scoring: density-based

Term-document count matrices

Bag of words view of a doc

Counts vs. frequencies

Digression: terminology

Term frequency tf

Weighting term frequency: tf

Score computation

Weighting should depend on the term overall

Document frequency

tf x idf term weights

Summary: tf x idf (or tf.idf)

Real-valued term-document matrices

Documents as vectors

Why turn docs into vectors?

Intuition

The vector space model

Desiderata for proximity

First cut

Cosine similarity

Page 31

Page 32

Normalized vectors

Cosine similarity exercises

Page 35

Page 36

Digression: spamming indices

Summary: What’s the real point of using vector spaces?

Interaction: vectors and phrases

Vectors and Boolean queries

Vectors and wild cards

Vector spaces and other operators

Query language vs. scoring

Exercises

Efficient cosine ranking

Page 46

Computing a single cosine

Encoding document frequencies

Computing the k largest cosines: selection vs. sorting

Use heap for selecting top k

Bottleneck

Removing bottlenecks

Can we avoid this?

Best m candidates

Page 55

Page 56

Page 57

Precision-recall curves

Page 59

One Problem With Boolean Queries: Feast or Famine

Page 61

Resources for this lecture