Table of Contents

Slide 1

Slide 2

Plan

Topic-Specific PageRank [Have02]

Slide 5

Slide 6

Influencing PageRank (“Personalization”)

Non-uniform Teleportation

Interpretation of Composite Score

Interpretation

Slide 11

Slide 12

Web vs. hypertext search

Slide 14

Query-doc popularity matrix B

Issues to consider

Vector space implementation

Issues

Basic Assumption

Validity of Basic Assumption

Variants

Slide 22

Slide 23

Crawling and Corpus Construction

Crawling Issues

Slide 26

Crawl Order

Stanford WebBase (179K, 1998) [Cho98]

Web Wide Crawl (328M pages, 2000) [Najo01]

BFS & Spam (Worst case scenario)

Adversarial IR (Spam)

A few spam technologies

Can you trust words on the page?

PowerPoint Presentation

Slide 35

The war against spam

Slide 37

Duplicate/Near-Duplicate Detection

Computing Near Similarity

Shingles + Set Intersection

Slide 41

Computing Sketch[i] for Doc1

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

However…

Question

Mirror Detection

Mirror Detection example

Repackaged Mirrors

Motivation

Bottom Up Mirror Detection [Cho00]

Can we use URLs to find mirrors?

Top Down Mirror Detection [Bhar99, Bhar00c]

Implementation

WebIR Infrastructure

Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]

Usage

ID assignment

Adjacency List Compression - I

Adjacency List Compression - II

Term Vector Database [Stat00]

Architecture