This repository has been archived by the owner on Sep 9, 2024. It is now read-only.

Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents


mattilyra/LSH


pylsh

pylsh is a Python implementation of locality sensitive hashing (LSH) with MinHash, intended for detecting near-duplicate text documents.

The implementation uses the MurmurHash v3 library to create document fingerprints.
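To make the idea of a MinHash fingerprint concrete, here is a minimal pure-Python sketch. It is not pylsh's code or API: the function names (shingles, seeded_hash, fingerprint, estimate_jaccard) and the defaults are hypothetical, and it uses hashlib.blake2b from the standard library in place of MurmurHash3 so that it runs without compiled extensions.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of overlapping character n-grams in text."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(shingle, seed):
    """Hash a shingle to a 64-bit integer; the seed selects the hash function.

    Assumption: blake2b with a salt stands in for seeded MurmurHash3 here.
    """
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def fingerprint(text, num_seeds=100, n=5):
    """MinHash fingerprint: the minimum hash over all shingles, once per seed."""
    grams = shingles(text, n)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(num_seeds)]


def estimate_jaccard(fp_a, fp_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(fp_a, fp_b))
    return matches / len(fp_a)


if __name__ == "__main__":
    a = "The quick brown fox jumps over the lazy dog near the river bank."
    b = "The quick brown fox jumps over the lazy cat near the river bank."
    print(estimate_jaccard(fingerprint(a), fingerprint(b)))
```

Two documents that share most of their shingles agree in most fingerprint positions, so comparing short fingerprints approximates comparing the full shingle sets.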

Cython is needed only if you want to regenerate the .cpp files for the hashing and shingling code. By default, the setup script uses the pregenerated .cpp sources; you can change this with the USE_CYTHON flag in setup.py.

NumPy is needed to run the code.

The MurmurHash3 library is distributed under the MIT license. More information: https://github.com/aappleby/smhasher

examples

For an overview of how LSH works and how to set the parameters, see this notebook. The notebook is also available in the examples directory.
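As a rough guide to what the parameters control, the standard banding analysis for MinHash LSH splits a fingerprint into b bands of r rows each; two documents with Jaccard similarity s then become a candidate pair with probability 1 - (1 - s^r)^b. The sketch below evaluates that textbook formula. The function names are hypothetical, and how pylsh's own parameters map onto b and r is an assumption to check against the notebook.

```python
def candidate_probability(similarity, bands, rows):
    """Probability that two documents share at least one identical band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def approximate_threshold(bands, rows):
    """Similarity near which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    # Four ways to split a 100-value fingerprint into bands of equal size.
    for bands, rows in [(5, 20), (10, 10), (20, 5), (50, 2)]:
        threshold = approximate_threshold(bands, rows)
        p_similar = candidate_probability(0.8, bands, rows)
        print(f"bands={bands:2d} rows={rows:2d} "
              f"threshold~{threshold:.2f} P(candidate | s=0.8)={p_similar:.3f}")
```

More bands with fewer rows each lower the threshold and catch more pairs at the cost of more false positives; fewer, longer bands do the opposite.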

installation

> git clone https://github.com/mattilyra/LSH
> cd LSH
> python setup.py install
