I have a huge number of images (about 50 million, with 3-4 versions of each one), organized in a nested tree of directories.
This immense catalog of images is replicated from our internal "master" to a CDN-like box. Sometimes the replication gets out of sync, and some images are destroyed on the master but not on the slave.
It’s not a surprise that Rsync has the right set of options to deal with this:
rsync --recursive --delete --ignore-existing --existing --prune-empty-dirs --verbose src/ dst/
Let me explain each option.
--recursive will explore the whole directory tree, not just the first level.
--delete will remove files in dst that are not in src.
--ignore-existing will not update any file in dst.
--existing will not create any file in dst.
--prune-empty-dirs will drop directories that contain no files from the file list, so that --delete removes empty directories in dst too, not just files.
--verbose will log what it does.
By not trying to compare the files, it’s much faster, but of course it’s only cleanup, not a real synchronization.
You can also run this a first time with --dry-run to print each action instead of executing it, to verify that Rsync does what you want.
The complete list of options is available in the man page.