Eliminating File Duplication

Sat Aug 15 10:00:40 PDT 2015

It seems that I am constantly fighting a lack of disk space. One reason is that I am making increased use of virtual machines, which means storing a complete image of one machine on another machine's hard disk. This uses up disk space rapidly.

The other problem is that operating system and 'office' program updates regularly exceed a gigabyte. These rapidly eat through whatever disk space I manage to free up.

One tool to fight against wasted space is 'rdfind'. This is an efficiently written duplicate file finder. It is intended for dealing with backups where several dumps of a machine or set of machines are being managed and removing file duplication saves time and space.

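rdfind uses various tricks to avoid comparing every file to every other file in its search path. It uses file size as an initial check, so files with a unique size are dropped immediately. It then compares the first few bytes of the remaining files, then the last few bytes, and computes a full checksum only for the files that still match.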
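The staged filtering idea can be sketched in shell. This is only an illustration, not rdfind's implementation: the function names (keep_shared_keys, rekey, head_key, full_key, find_duplicates), the 64-byte prefix, and the choice of cksum and sha256sum are assumptions made for the example. It also assumes GNU find and coreutils, and file names without tabs or newlines.

```shell
#!/bin/sh
# Illustrative sketch only; NOT how rdfind itself is implemented.
# Assumes GNU find and coreutils, and file names without tabs or newlines.

TAB=$(printf '\t')

# Read "key<TAB>path" lines; print only the lines whose key occurs more than once.
keep_shared_keys() {
    awk -F'\t' '
        { count[$1]++; line[NR] = $0; key[NR] = $1 }
        END { for (i = 1; i <= NR; i++) if (count[key[i]] > 1) print line[i] }
    '
}

# Read "key<TAB>path" lines; extend each key with the output of the command
# given as arguments, which is called with the path as its last argument.
rekey() {
    while IFS="$TAB" read -r old path; do
        printf '%s:%s\t%s\n' "$old" "$("$@" "$path")" "$path"
    done
}

# Cheap key: CRC of the first 64 bytes (hypothetical prefix length).
head_key() { head -c 64 -- "$1" | cksum | cut -d' ' -f1; }

# Expensive key: SHA-256 of the whole file.
full_key() { sha256sum -- "$1" | cut -d' ' -f1; }

# Print "size:crc:sha256<TAB>path" for every file that has at least one duplicate.
# Stage 1 keeps files that share a size, stage 2 keeps those that also share
# a prefix checksum, stage 3 keeps those that also share a full checksum.
find_duplicates() {
    find "$1" -type f -printf '%s\t%p\n' |
        keep_shared_keys |
        rekey head_key | keep_shared_keys |
        rekey full_key | keep_shared_keys |
        sort
}
```

Calling find_duplicates ./ prints one line per duplicated file, with identical files sorted next to each other. The point of the ordering is that the expensive whole-file checksum is only ever computed for files that have already survived the two cheap checks.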

Here is a typical command line that finds local file duplicates without doing anything about them (the '-n true' option requests a dry run). It creates a file called 'results.txt' in the directory in which you run rdfind, describing what rdfind uncovers.

rdfind -n true ./
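Once 'results.txt' exists, a small helper can estimate how much space the duplicates occupy. The sketch below is hypothetical (reclaimable_bytes is not part of rdfind). It assumes that comment lines start with '#', that data lines have the whitespace-separated columns duptype, id, depth, size, device, inode, priority, name, and that the file kept from each group is marked DUPTYPE_FIRST_OCCURRENCE. Check the header comment in your own results.txt before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many bytes the duplicates occupy.
# Assumed column layout: duptype id depth size device inode priority name

reclaimable_bytes() {
    awk '
        /^#/ || NF < 8 { next }                           # skip comments and short lines
        $1 != "DUPTYPE_FIRST_OCCURRENCE" { total += $4 }  # count every extra copy
        END { printf "%.0f\n", total }
    ' "${1:-results.txt}"
}
```

Running reclaimable_bytes with no argument reads results.txt in the current directory and prints a single number of bytes.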

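rdfind also has various options for acting on what it finds: '-deleteduplicates true' removes duplicates, which is useful for trimming the files on various backup disks, and '-makehardlinks true' or '-makesymlinks true' replaces duplicates with links.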
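To make the idea of replacing a duplicate with a link concrete, here is a minimal sketch of the operation for a single pair of files. rdfind does this itself across a whole tree. The helper link_duplicate is hypothetical and only illustrates the principle; it assumes both files live on the same filesystem, since hard links cannot cross filesystems.

```shell
#!/bin/sh
# Hypothetical helper: replace DUP with a hard link to KEEP, but only if the
# two files are byte-for-byte identical. Both must be on the same filesystem.

link_duplicate() {
    keep=$1
    dup=$2
    if [ "$keep" -ef "$dup" ]; then
        return 0                      # already the same file (same inode)
    fi
    if cmp -s -- "$keep" "$dup"; then
        ln -f -- "$keep" "$dup"       # dup now shares keep's data blocks
    else
        echo "not identical, leaving untouched: $dup" >&2
        return 1
    fi
}
```

After link_duplicate original copy, both names refer to the same data on disk, so the space used by the copy is freed. Note that editing either name afterwards changes both, which is why a dry run first is a sensible habit.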

To install rdfind, simply download the source, build, and install the program. Thank you Paul Dreik!

Posted by ZFS | File under: bash