Program to scan and search for file duplicates. (~300MB/s)

Gartham/file-duplicate-checker

File Duplication Checker

This is a command line application that quickly checks to see if files are duplicates of each other.

It takes a directory as an argument and scans the entire directory tree, searching files in all subfolders as well as the parent folder. The program detects duplicate files even if they have different names or are in different (sub-)directories.

Calling

To run the program, just pass a directory as the only argument:

java -jar dupecheck.jar /some/folder

If the directory's pathname contains spaces, wrap it in quotes:

java -jar dupecheck.jar "C:/My Documents"

Implementation

The implementation uses file sizes and hashing. It stores files mapped by their file size. When it finds two files with the same size, it generates and stores a hash of each of them. (Hashes are small, ~32B, so they are easy to keep in memory.) Whenever it encounters another file with the same contents as one it has already scanned, it compares the lengths, then the hashes, and finds a collision.

Hashing is done using SHA-256. On a CPU-bound consumer computer, this program can easily hash and compare about 200MB/s of file data, in some cases reaching 300MB/s. The program can typically scan through much more data than that, though, and works extremely fast over non-duplicate files: in typical scenarios, non-duplicate files rarely have the same file size, so they never need to be hashed.
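The size-then-hash strategy described above can be sketched roughly as follows. This is a minimal illustration, not the program's actual source: the class name `DupeSketch` and the methods `sha256` and `findDuplicates` are hypothetical, and the sketch assumes that a file is only hashed once a second file of the same size turns up.

```java
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; names are illustrative and not taken from the project.
class DupeSketch {

    /** Streams a file through SHA-256 and returns the digest as a 64-character hex string. */
    static String sha256(Path file) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 must be provided by every Java platform
        }
        byte[] buf = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return String.format("%064x", new BigInteger(1, md.digest()));
    }

    /** Returns groups of regular files under root that share the same SHA-256 hash. */
    static List<List<Path>> findDuplicates(Path root) throws IOException {
        Map<Long, List<Path>> bySize = new HashMap<>();   // file size -> files seen with that size
        Map<String, List<Path>> byHash = new HashMap<>(); // hex digest -> hashed files

        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {      // recurses into all subfolders
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        for (Path file : files) {
            List<Path> sameSize = bySize.computeIfAbsent(Files.size(file), k -> new ArrayList<>());
            sameSize.add(file);
            if (sameSize.size() == 2) {
                // First size match: the earlier file was never hashed, so hash it now.
                Path first = sameSize.get(0);
                byHash.computeIfAbsent(sha256(first), k -> new ArrayList<>()).add(first);
            }
            if (sameSize.size() >= 2) {
                byHash.computeIfAbsent(sha256(file), k -> new ArrayList<>()).add(file);
            }
        }

        // Any hash bucket holding more than one file is a group of duplicates.
        List<List<Path>> groups = new ArrayList<>();
        for (List<Path> group : byHash.values()) {
            if (group.size() > 1) {
                groups.add(group);
            }
        }
        return groups;
    }

    public static void main(String[] args) throws IOException {
        for (List<Path> group : findDuplicates(Paths.get(args[0]))) {
            System.out.println(group);
        }
    }
}
```

In this sketch a file with a unique size is never read at all, which is why non-duplicate files are cheap to scan.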

Storage

The program uses a HashMap to store each File it hashes. Each file's hash is stored as the key, and the file itself as the value. Right now, the hash codes computed for keys in the map come from a Java hash code implementation applied to the SHA-256 hash. This could easily be made faster by modifying the hashCode() function to simply return the first 4 bytes of the SHA-256 hash, rather than computing a Java hash code over the SHA-256 hash. That assumes the first 4 bytes of a SHA-256 hash are less likely to result in a conflict than Java's computed hash code on those bytes, which is most likely the case.
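One way the proposed hashCode() change might look is sketched below. The `DigestKey` class is hypothetical and is not taken from the project. It assumes the map key wraps the raw 32-byte SHA-256 digest and reads the first 4 bytes as a big-endian int.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical map key wrapping a raw digest; not the project's actual key type.
final class DigestKey {
    private final byte[] digest;

    DigestKey(byte[] digest) {
        if (digest.length < 4) {
            throw new IllegalArgumentException("digest must be at least 4 bytes");
        }
        this.digest = digest.clone(); // defensive copy so the key stays immutable
    }

    @Override
    public int hashCode() {
        // Interprets the first four digest bytes as one big-endian int.
        return ByteBuffer.wrap(digest).getInt();
    }

    @Override
    public boolean equals(Object o) {
        // Equality still compares the full digest, so truncating the hash code is safe.
        return o instanceof DigestKey && Arrays.equals(digest, ((DigestKey) o).digest);
    }

    public static void main(String[] args) {
        byte[] fakeDigest = new byte[32]; // stand-in for a real SHA-256 digest
        fakeDigest[3] = 7;
        Map<DigestKey, String> files = new HashMap<>();
        files.put(new DigestKey(fakeDigest), "/some/folder/example.txt");
        System.out.println(files.containsKey(new DigestKey(fakeDigest)));
    }
}
```

Because equals() compares all digest bytes, two keys that happen to share their first 4 bytes land in the same bucket but are still treated as distinct entries.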