This post originally appeared in our sysadvent series and has been moved here following the discontinuation of the sysadvent microsite.

Modern file systems, and even storage systems, might have built-in deduplication, but the common file systems still do not. So checking for redundant data and deduplicating where possible can save disk space.

Once upon a time, there was a system where we had a 6TB spool of binary files on a production ext4 file system, and the volume was running out of disk space. The owner of the data thought it likely that there were duplicates among the vast number of files, and wanted this checked. We checked using fdupes, and yes, there were a lot of duplicates.
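
Before touching anything, fdupes can summarize how many duplicate files exist and how much space they waste. A small sketch, using a hypothetical spool path:

# -r: recurse into sub-directories, -m: print a summary of duplicates
fdupes -r -m /srv/spool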

Running over the file tree with hardlink, we actually saved 30% of the disk space, and could postpone the change of storage solution for some months.
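
hardlink saves space by replacing files with identical content with hard links to a single inode, so the data blocks are stored only once. A minimal manual sketch of the same idea, with hypothetical file names:

# Two identical files occupy two sets of data blocks
echo "same content" > a
cp a b
du -s .            # the blocks are counted twice

# Replace b with a hard link to a; the content is now stored only once
ln -f a b
du -s .            # the blocks are counted once
stat -c '%h' a     # the link count of a is now 2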

For testing, let’s make a tree of directories with variable-depth sub-directories and some diverse data:

#!/bin/bash
# Build a scratch tree of randomly nested directories and small files.
# With only 100 possible file contents, duplicates are very likely.
cd "$(mktemp -d)"
mkdir foo; pushd foo
for n in $(seq 1 100); do
  depth=$((RANDOM%10))
  for i in $(seq 1 $depth); do
    dir=dir$((RANDOM%10))
    mkdir -p "$dir"
    pushd "$dir"
  done
  echo $((RANDOM%100)) > file$((RANDOM%100))
  for i in $(seq 1 $depth); do popd; done
done
find; echo; ls

Install hardlink. Note that there are different implementations of hardlink for Red Hat and Debian based distributions. While the Red Hat variant of hardlink is faster, the Debian variant has more fine-grained options for ignoring attributes like ownership, file mode and timestamps, and can even filter file names with regular expressions.
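
As an illustration, a dry run with the Debian variant could look something like the following; the -x exclusion option and exact flags may differ between versions, so check the man page of your installed variant before relying on it:

# Debian variant: ignore file mode (-p), owner (-o) and timestamp (-t),
# dry run (-n), verbose (-v), exclude file names matching a regex (-x)
hardlink -pot -n -v -x '\.tmp$' .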

The following was tested using the Red Hat variant. On Debian and derivatives, replace -c with -pot, or read the hardlink man page.

sudo yum install hardlink

Run hardlink on the current directory. The -n switch makes this a dry run, so nothing is changed yet. It will run for a while; on a large file system, it might run for a very long while. Finally, it will show you a list of duplicates.

hardlink -c -vv -n .

Note: DO LOOK OVER THE OUTPUT BEFORE DELETING ANY PRODUCTION DATA. You have been warned. If you break something, you keep the parts.

Make a copy just to compare. Then run hardlink without the -n switch:

cp -a ./ ../bar
hardlink -c -v .

Check that the copies are equal in content, though not in disk space:

popd
diff -Naur foo bar && echo They are equal
du -s foo bar
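
After hardlink has run, the former duplicates share an inode, which shows up as a link count greater than one. For example:

# List files in foo that have more than one hard link,
# showing inode number and link count
find foo -type f -links +1 -exec ls -li {} +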

To sum up: While a bit cumbersome and time-consuming, it is possible to use quite simple file tools to do deduplication, even on existing filled-up file systems.

Test and consider thoroughly before using hardlink in production. Changes in the tree while hardlink is running might cause unpredictable results.

Ingvar Hagelund

Team Lead, Application Management for Media at Redpill Linpro

Ingvar has been a system administrator at Redpill Linpro for more than 20 years. He is also a long time contributor to the Fedora and EPEL projects.
