MikhailErofeev/duplicate-pages-finding: Python, locality-sensitive hashing, MinHash, Hamming similarity, main text retrieval from HTML, map-reduce, mincemeatpy

Everything runs as map-reduce (with a bottleneck at the 2->3 step, where all docs are stored on one server; see main.py).
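For readers unfamiliar with the techniques named above, here is a minimal Python 3 sketch of MinHash signatures, position-wise (Hamming-style) signature similarity, and LSH banding. It is not taken from this repo's src: the function names (`shingles`, `minhash_signature`, `signature_similarity`, `lsh_band_keys`), the 3-word shingles, the MD5-based hash family, and the 64 hashes / 16 bands are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed tokenization)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, item):
    """Deterministic 64-bit hash of an item; the seed selects the hash function."""
    data = ("%d:%s" % (seed, item)).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, it) for it in items)
            for seed in range(num_hashes)]


def signature_similarity(a, b):
    """Fraction of positions where two signatures agree (estimates Jaccard)."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def lsh_band_keys(signature, bands=16):
    """Split a signature into bands; docs sharing any key are candidates."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows]))
            for b in range(bands)]


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river shore",
    "c": "completely unrelated sentence about map reduce workers and servers",
}
sigs = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
print("a vs b:", signature_similarity(sigs["a"], sigs["b"]))
print("a vs c:", signature_similarity(sigs["a"], sigs["c"]))
```

In a map-reduce setting, one plausible arrangement is for the map step to emit each band key with the document id and for the reduce step to compare only documents that share a key; whether main.py does exactly this is not confirmed here.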

pip install beautifulsoup4

pip install scipy

running:

python main.py

./stupid_worker.sh (start as many workers as you wish; if they run on several servers, change news.py to point to the right server)

Only Russian is supported in the text-parsing grammar, the tests, and most of the example code :)

