python, locality-sensitive hashing, MinHash, Hamming similarity, main-text extraction from HTML, map-reduce, mincemeatpy

MikhailErofeev/duplicate-pages-finding


  1. parse the HTML docs, extract the main text, split it into shingles, and compute an int32 hash per shingle
  2. bucket the documents with locality-sensitive hashing
  3. estimate similarity with MinHash
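Steps 1 and 3 can be sketched in plain Python. This is not the repository's code: the shingle size, the number of permutations, the fixed seed, and the md5-based 32-bit hash are all illustrative assumptions.

```python
import hashlib
import random


def shingles(text, k=5):
    """Split text into word k-grams (shingles)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def int32_hash(shingle):
    """Map a shingle to a 32-bit integer (md5 digest truncated to 4 bytes)."""
    digest = hashlib.md5(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def minhash_signature(hashes, num_perm=100, prime=4294967311):
    """MinHash signature: the minimum of each random affine permutation
    (a*h + b) mod prime over the document's shingle hashes."""
    rng = random.Random(42)  # fixed seed so signatures are comparable across docs
    coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime))
              for _ in range(num_perm)]
    return [min((a * h + b) % prime for h in hashes) for a, b in coeffs]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, which approximates
    the Jaccard similarity of the two shingle sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Two near-duplicate pages share most shingles, so their signatures agree in most positions; unrelated pages agree almost nowhere.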

All steps run as map-reduce jobs, with a bottleneck between steps 2 and 3: all documents are kept on a single server (see main.py).

dependencies:

pip install beautifulsoup4 scipy

running:

python main.py

./stupid_worker.sh (start as many workers as you wish; if you run more than one server, point news.py at the right one)

LSH and MinHash overview: http://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf

LSH in map-reduce: http://architects.dzone.com/articles/location-sensitive-hashing

Simple Python map-reduce framework: https://github.com/michaelfairley/mincemeatpy

Lecture (in Russian): http://compscicenter.ru/program/lecture/7329

The text-parsing grammar, the tests, and most of the example code are in Russian only :)
