This is a command-line application that quickly checks whether files are duplicates of each other.
It takes a directory as an argument and scans the entire directory tree, examining files in all subfolders as well as the top-level folder. The program detects duplicate files even if they have different names or live in different (sub-)directories.
To run the program, just pass a directory as the only argument:
java -jar dupecheck.jar /some/folder
If the directory path contains spaces, quote it:
java -jar dupecheck.jar "C:/My Documents"
The implementation uses file sizes and hashing. It maps each file by its size, and only when two files share the same size does it compute and store a hash of each. (Hashes are small, 32 bytes each, so they are cheap to keep in memory.) When the program later encounters a file whose contents match one it has already scanned, the sizes match, then the hashes match, and the duplicate is reported.
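A minimal sketch of this size-then-hash idea is shown below. The class name, method names, and console output are illustrative only, not the program's actual API, and the sketch re-hashes earlier same-size files lazily rather than tracking which ones have already been hashed.

import java.io.IOException;
import java.nio.file.*;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

// Hypothetical sketch of the size-then-hash approach described above.
public class SizeThenHashSketch {

    // Files grouped by size; only files that share a size ever get hashed.
    private final Map<Long, List<Path>> filesBySize = new HashMap<>();
    // SHA-256 digest (hex) -> first file seen with that digest.
    private final Map<String, Path> filesByHash = new HashMap<>();

    public void scan(Path root) throws IOException {
        try (var paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(this::check);
        }
    }

    private void check(Path file) {
        try {
            long size = Files.size(file);
            List<Path> sameSize = filesBySize.computeIfAbsent(size, s -> new ArrayList<>());
            if (!sameSize.isEmpty()) {
                // Size collision: hash the earlier files, then this one.
                // (A real implementation would remember which files were already hashed.)
                for (Path earlier : sameSize) {
                    filesByHash.putIfAbsent(sha256(earlier), earlier);
                }
                Path existing = filesByHash.putIfAbsent(sha256(file), file);
                if (existing != null) {
                    System.out.println("Duplicate: " + file + " == " + existing);
                }
            }
            sameSize.add(file);
        } catch (IOException e) {
            System.err.println("Skipping unreadable file: " + file);
        }
    }

    private static String sha256(Path file) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(Files.readAllBytes(file));
            return HexFormat.of().formatHex(digest);   // requires Java 17+
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);        // SHA-256 is always available
        }
    }
}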
Hashing is done using SHA-256. On a CPU-bound consumer computer, this program can easily hash and compare about 200 MB/s of file data, in some cases reaching 300 MB/s. In practice the program gets through much more data than that, and it is extremely fast over non-duplicate files, because in typical scenarios non-duplicate files rarely have the same file size and therefore never need to be hashed at all.
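Since only same-size files are hashed and individual files can be large, the hashing itself can be done in a streaming fashion so the whole file never has to sit in memory. The helper below is a hypothetical sketch of that, not the program's actual code:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes a file's SHA-256 digest without loading the file into memory.
public final class FileHasher {

    public static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buffer = new byte[64 * 1024];   // 64 KiB read buffer
            while (in.read(buffer) != -1) {
                // reading through the DigestInputStream updates the digest as a side effect
            }
        }
        return md.digest();                        // 32-byte SHA-256 digest
    }
}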
The program uses a HashMap to store each File it hashes: the file's SHA-256 hash is the key, and the file itself is the value. Currently, the hash codes used to place keys in the map come from a Java hashCode() computed over the SHA-256 hash. This could easily be made faster/better by overriding hashCode() to simply return the first 4 bytes of the SHA-256 hash rather than computing a Java hash code over it, on the assumption that the first 4 bytes of a SHA-256 digest are less likely to collide than a Java-computed hash code over those bytes (which is most likely the case).
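An illustrative key type for that optimization might look like the following; the class name is hypothetical, and the point is only that SHA-256 output is already uniformly distributed, so its leading bytes can serve directly as the bucket index:

import java.util.Arrays;

// Hypothetical HashMap key: wraps a SHA-256 digest and uses its first 4 bytes
// directly as the Java hash code, as described above.
public final class Sha256Key {
    private final byte[] digest;   // expected to be 32 bytes

    public Sha256Key(byte[] digest) {
        this.digest = digest.clone();
    }

    @Override
    public int hashCode() {
        // The first 4 bytes of a SHA-256 digest are effectively random,
        // so no further mixing is needed.
        return (digest[0] & 0xFF) << 24
             | (digest[1] & 0xFF) << 16
             | (digest[2] & 0xFF) << 8
             | (digest[3] & 0xFF);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof Sha256Key key && Arrays.equals(digest, key.digest);
    }
}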