Table of Contents
Page 1
Page 2
Plan
Topic-Specific PageRank [Have02]
Page 5
Page 6
Influencing PageRank (“Personalization”)
Non-uniform Teleportation
Interpretation of Composite Score
Interpretation
Page 11
Page 12
Web vs. hypertext search
Page 14
Query-doc popularity matrix B
Issues to consider
Vector space implementation
Issues
Basic Assumption
Validity of Basic Assumption
Variants
Page 22
Page 23
Crawling and Corpus Construction
Crawling Issues
Page 26
Crawl Order
Stanford WebBase (179K pages, 1998) [Cho98]
Web Wide Crawl (328M pages, 2000) [Najo01]
BFS & Spam (Worst case scenario)
Adversarial IR (Spam)
A few spam technologies
Can you trust words on the page?
PowerPoint Presentation
Page 35
The war against spam
Page 37
Duplicate/Near-Duplicate Detection
Computing Near Similarity
Shingles + Set Intersection
Page 41
Computing Sketch[i] for Doc1
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
However…
Question
Mirror Detection
Mirror Detection example
Repackaged Mirrors
Motivation
Bottom Up Mirror Detection [Cho00]
Can we use URLs to find mirrors?
Top Down Mirror Detection [Bhar99, Bhar00c]
Implementation
WebIR Infrastructure
Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]
Usage
ID assignment
Adjacency List Compression - I
Adjacency List Compression - II
Term Vector Database [Stat00]
Architecture