- a 500GB HDD in my MacBook, which hit “full” some time back before I freed up some space by moving stuff to an external HDD
- an external 320GB HDD, initially to store photos and videos, in practice filled with undefined junk, most of it mine, some of it others’
- an external 250GB HDD, initially to store a mirror of my MacBook HDD when it was only 250GB, then filled with undefined junk, most of it mine, some of it others’
- an external 110GB HDD, containing disk images of various installation DVDs, and quite a lot of undefined junk, most of it mine, some of it others’
As you can see, “undefined junk” comes back often. What is it?
- “I don’t have quite enough space on my MacBook HDD anymore, let’s move this onto an external drive”
- “heck, do I have a second copy of this data somewhere? let’s make one here just in case”
- “Sally, let me just make a copy of your user directory here before I upgrade your OS/put in a bigger hard drive, just in case things go wrong”
- “eeps, I haven’t made a backup in some time, let me put a copy of my home directory somewhere” (pre-Time Machine)
See the idea?
Enter dupeGuru. I’ve wanted a programme like this for ages, without really taking the time to find it. Thanks to a kind soul on IRC, I have finally found the de-duping love of my life. (It works on OS X, Windows, and Linux.) It has been invaluable in showing me where my huge chunks of redundant data are. Plus, it’s released as Fairware, which I find a very interesting compensation model: as long as there are uncompensated hours of work on the project, you’re encouraged to contribute to it, and the whole process is visible online.
Back to data. I quickly realized (no surprise) that I had huge amounts of redundant data. This prompted me to coin the following law:
Lack of a clear backup strategy leads to massive, uncontrolled and disorganized data redundancy.
The first thing I did was create a directory on my home server and copy all my external hard drives there. Easier to clean if everything is in one place! I also used my (now clean) 500GB drive to copy over some folder structures I knew were clean.
Now, one nice thing about dupeGuru is that you can specify a “reference” folder when you choose where to hunt for duplicates. That means you tell dupeGuru “stuff in here is good, don’t touch it, but I want to know if I have duplicate copies of that content lying around”. Once you’ve found duplicates, you can choose to view only the duplicates, sort them by size or folder, delete, copy or move them.
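To make the “reference” idea concrete, here is a minimal sketch in Python. It is not how dupeGuru works internally (I haven't looked), just one simple way to get the same behaviour: hash every file in the reference folder, then report files elsewhere whose content matches. The names `file_digest`, `walk_files` and `find_duplicates` are made up for this illustration, and the sketch assumes the other folders do not sit inside the reference folder.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def walk_files(root):
    """Yield the path of every regular (non-symlink) file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path


def find_duplicates(reference_dir, other_dirs):
    """Map each reference file to the copies of it found in other_dirs.

    Files under reference_dir are treated as "good": they only ever appear
    as keys, never as candidates for deletion. Assumes other_dirs are not
    located inside reference_dir.
    """
    by_digest = {}
    for path in walk_files(reference_dir):
        # Keep the first reference file seen for a given content hash.
        by_digest.setdefault(file_digest(path), path)

    dupes = defaultdict(list)
    for root in other_dirs:
        for path in walk_files(root):
            original = by_digest.get(file_digest(path))
            if original is not None:
                dupes[original].append(path)
    return dict(dupes)
```

Note that this only finds copies of reference files. Duplicates that exist solely among the messy folders are ignored here, and hashing every file is slow on big drives; a real tool would likely compare file sizes first.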
As with any duplicate-finder programme, you cannot just use it blindly, but it’s an invaluable assistant in freeing space.
I ran it on my well-organized Music folder and discovered 5GB of duplicate data in there, in less than a minute!
Now that I’ve cleaned up most of my mess, I realize that instead of having 800 or 900GB of data like I imagined, reality is closer to 300GB. Not bad, eh?
So, here are my clean-up tips, if you have a huge mess like mine, with huge folder structures duplicated at various levels of your storage devices:
- start small, and grow: pick a folder to start with that’s reasonably under control, clean it up, then add more folders using it as reference
- scan horribly messy structures to identify redundant branches (maybe you have mymess/somenastydirectory and mymess/documents/old/documents/june/somenastydirectory), copy those similar branches to the same level (I do that because it makes it easier for my brain to follow what I’m doing), mark one of them as reference and prune the other; then copy the remaining files into the first one, if there are any
- if you need to quickly make space, sort your dupes by size
- if dupeGuru is suggesting you get rid of the copy of a file which is in a directory you want to keep, go back and mark that directory as reference
- keep an eye on the bottom of the screen, which tells you how much data the dupes represent (if it’s 50MB and hundreds of small files in as many little folders, you probably don’t want to bother, unless you’re really obsessed with organizing your stuff, in which case you probably wouldn’t have ended up in a situation requiring dupeGuru in the first place)
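The “sort by size” and “how much data do the dupes represent” tips can be sketched in a few lines of Python, if you ever want to do the same sums outside dupeGuru. This is only an illustration under my own assumptions: `dupes` is a hypothetical dictionary mapping each file you keep to the list of redundant copies on disk, and `reclaimable` and `human` are names invented for this example.

```python
import os


def reclaimable(dupes):
    """Return (copies sorted largest first, total bytes they occupy).

    `dupes` maps a kept file to a list of paths of redundant copies.
    Each entry of the returned list is a (size_in_bytes, path) tuple.
    """
    sizes = [(os.path.getsize(p), p) for copies in dupes.values() for p in copies]
    sizes.sort(reverse=True)
    total = sum(size for size, _path in sizes)
    return sizes, total


def human(n_bytes):
    """Format a byte count using decimal units (1 GB = 1000 MB)."""
    size = float(n_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        if size < 1000:
            return f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} TB"
```

If the total comes out small and is spread over hundreds of tiny files, that is the “don't bother” case; if a handful of entries at the top of the list account for most of it, start there.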
Happy digital spring-cleaning!