This is a command-line application that quickly checks whether files are duplicates of each other.
It takes a directory as an argument and scans the entire directory tree, examining files in all subfolders as well as the top-level folder. The program detects duplicate files even if they have different names or live in different (sub-)directories.
To run the program, just pass a directory as the only argument:
java -jar dupecheck.jar /some/folder
If the directory path contains spaces, quote it:
java -jar dupecheck.jar "C:/My Documents"
The implementation uses file sizes and hashing. It maps each file by its size, and only when two files share the same size does it compute and store a hash of each. (Hashes are small, 32 bytes each, so they are cheap to keep in memory.) When the program later encounters a file whose contents match one it has already scanned, the sizes match, then the hashes match, and the duplicate is reported.
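A minimal sketch of this size-then-hash idea is shown below. The class name, method names, and console output are illustrative only, not the program's actual API, and the sketch re-hashes earlier same-size files lazily rather than tracking which ones have already been hashed.

import java.io.IOException;
import java.nio.file.*;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

// Hypothetical sketch of the size-then-hash approach described above.
public class SizeThenHashSketch {

    // Files grouped by size; only files that share a size ever get hashed.
    private final Map<Long, List<Path>> filesBySize = new HashMap<>();
    // SHA-256 digest (hex) -> first file seen with that digest.
    private final Map<String, Path> filesByHash = new HashMap<>();

    public void scan(Path root) throws IOException {
        try (var paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(this::check);
        }
    }

    private void check(Path file) {
        try {
            long size = Files.size(file);
            List<Path> sameSize = filesBySize.computeIfAbsent(size, s -> new ArrayList<>());
            if (!sameSize.isEmpty()) {
                // Size collision: hash the earlier files, then this one.
                // (A real implementation would remember which files were already hashed.)
                for (Path earlier : sameSize) {
                    filesByHash.putIfAbsent(sha256(earlier), earlier);
                }
                Path existing = filesByHash.putIfAbsent(sha256(file), file);
                if (existing != null) {
                    System.out.println("Duplicate: " + file + " == " + existing);
                }
            }
            sameSize.add(file);
        } catch (IOException e) {
            System.err.println("Skipping unreadable file: " + file);
        }
    }

    private static String sha256(Path file) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(Files.readAllBytes(file));
            return HexFormat.of().formatHex(digest);   // requires Java 17+
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);        // SHA-256 is always available
        }
    }
}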
Hashing is done using SHA-256. On a CPU-bound consumer computer, this program can easily hash and compare about 200 MB/s of file data, in some cases reaching 300 MB/s. In practice the program gets through much more data than that, and it is extremely fast over non-duplicate files, because in typical scenarios non-duplicate files rarely have the same file size and therefore never need to be hashed at all.
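Since only same-size files are hashed and individual files can be large, the hashing itself can be done in a streaming fashion so the whole file never has to sit in memory. The helper below is a hypothetical sketch of that, not the program's actual code:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes a file's SHA-256 digest without loading the file into memory.
public final class FileHasher {

    public static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buffer = new byte[64 * 1024];   // 64 KiB read buffer
            while (in.read(buffer) != -1) {
                // reading through the DigestInputStream updates the digest as a side effect
            }
        }
        return md.digest();                        // 32-byte SHA-256 digest
    }
}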
The program uses a HashMap to store each File it hashes: the file's SHA-256 hash is the key, and the file itself is the value. Currently, the hash codes used to place keys in the map come from a Java hashCode() computed over the SHA-256 hash. This could easily be made faster/better by overriding hashCode() to simply return the first 4 bytes of the SHA-256 hash rather than computing a Java hash code over it, on the assumption that the first 4 bytes of a SHA-256 digest are less likely to collide than a Java-computed hash code over those bytes (which is most likely the case).
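An illustrative key type for that optimization might look like the following; the class name is hypothetical, and the point is only that SHA-256 output is already uniformly distributed, so its leading bytes can serve directly as the bucket index:

import java.util.Arrays;

// Hypothetical HashMap key: wraps a SHA-256 digest and uses its first 4 bytes
// directly as the Java hash code, as described above.
public final class Sha256Key {
    private final byte[] digest;   // expected to be 32 bytes

    public Sha256Key(byte[] digest) {
        this.digest = digest.clone();
    }

    @Override
    public int hashCode() {
        // The first 4 bytes of a SHA-256 digest are effectively random,
        // so no further mixing is needed.
        return (digest[0] & 0xFF) << 24
             | (digest[1] & 0xFF) << 16
             | (digest[2] & 0xFF) << 8
             | (digest[3] & 0xFF);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof Sha256Key key && Arrays.equals(digest, key.digest);
    }
}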