DupeGuru: You Own Less Data Than You Think [en]

[fr] To hunt down duplicates on Mac, Windows or Linux, I warmly recommend trying dupeGuru! (Bonus: an interesting developer compensation model, Fairware.)

One of the consequences of putting an SSD into my MacBook and using CrashPlan and an Amahi home server to store my data and backups is that I have been forced to do a little digital spring-cleaning.

I had:

  • a 500GB HDD in my MacBook, which hit “full” some time back, before I freed up some space by moving stuff to an external HDD
  • an external 320GB HDD, initially meant to store photos and videos, in practice filled with undefined junk, most of it mine, some of it others’
  • an external 250GB HDD, initially meant to store a mirror of my MacBook HDD back when it was only 250GB, then filled with undefined junk, most of it mine, some of it others’
  • an external 110GB HDD, containing disk images of various installation DVDs, and quite a lot of undefined junk, most of it mine, some of it others’

As you can see, “undefined junk” comes back often. What is it?

  • “I don’t have quite enough space on my MacBook HDD anymore, let’s move this onto an external drive”
  • “heck, do I have a second copy of this data somewhere? let’s make one here just in case”
  • “Sally, let me just make a copy of your user directory here before I upgrade your OS/put in a bigger hard drive, just in case things go wrong”
  • “eeps, I haven’t made a backup in some time, let me put a copy of my home directory somewhere” (pre-Time Machine)

See the idea?

Enter dupeGuru. I’ve wanted a programme like this for ages, without really taking the time to find it. Thanks to a kind soul on IRC, I have finally found the de-duping love of my life. (It works on OS X, Windows, and Linux.) It has been an invaluable assistant in showing me where my huge chunks of redundant data are. Plus, it’s released as Fairware, which I find a very interesting compensation model: as long as there are uncompensated hours of work on the project, you’re encouraged to contribute to it, and the whole process is visible online.

Back to data. I quickly realized (no surprise) that I had huge amounts of redundant data. This prompted me to coin the following law:

Lack of a clear backup strategy leads to massive, uncontrolled and disorganized data redundancy.

The first thing I did was create a directory on my home server and copy the contents of all my external hard drives there. Easier to clean if everything is in one place! I also used my (now clean) 500GB drive to hold copies of some folder structures I knew were clean.

Now, one nice thing about dupeGuru is that you can specify a “reference” folder when you choose where to hunt for duplicates. That means you tell dupeGuru “stuff in here is good, don’t touch it, but I want to know if I have duplicate copies of that content lying around”. Once you’ve found duplicates, you can choose to view only the duplicates, sort them by size or folder, delete, copy or move them.
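If you’re curious how this kind of matching can work under the hood, here is a minimal sketch, assuming a plain “hash the contents, group identical hashes” approach. It’s an illustration only, not dupeGuru’s actual code (its real matching is more sophisticated), and the ~/Music and ~/mess paths are placeholders:

```python
# Minimal illustration of content-based duplicate detection with a
# "reference" folder. Not dupeGuru's code: just the core idea of hashing
# file contents and grouping identical hashes.
import hashlib
import os
from collections import defaultdict

def file_hash(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in 1MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(roots):
    """Group all files under the given roots by content hash."""
    groups = defaultdict(list)
    for root in roots:
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    groups[file_hash(path)].append(path)
                except OSError:
                    pass  # unreadable file, skip it
    return [paths for paths in groups.values() if len(paths) > 1]

def deletable(dup_groups, reference):
    """Paths outside the reference folder whose content also exists inside it."""
    ref = os.path.abspath(reference) + os.sep
    for paths in dup_groups:
        if any(os.path.abspath(p).startswith(ref) for p in paths):
            yield from (p for p in paths
                        if not os.path.abspath(p).startswith(ref))

# Placeholder example: files under ~/mess whose content already lives in ~/Music
dupes = find_duplicates([os.path.expanduser("~/Music"),
                         os.path.expanduser("~/mess")])
for path in deletable(dupes, os.path.expanduser("~/Music")):
    print("redundant copy:", path)
```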

As with any duplicate-finder programme, you cannot just use it blindly, but it’s an invaluable assistant in freeing space.

I ran it on my well-organized Music folder and discovered 5GB of duplicate data in there — in less than a minute!

Now that I’ve cleaned up most of my mess, I realize that instead of having 800 or 900GB of data like I imagined, the reality is closer to 300GB. Not bad, eh?

So, here are my clean-up tips, if you have a huge mess like mine, with entire folder structures duplicated at various levels of your storage devices:

  • start small, and grow: pick a folder to start with that’s reasonably under control, clean it up, then add more folders using it as reference. (Actually, it’s often better to set a big folder as the reference and check whether a smaller folder isn’t already included in it.)
  • scan horribly messy structures to identify redundant branches (maybe you have mymess/somenastydirectory and mymess/documents/old/documents/june/somenastydirectory), copy those similar branches to the same level (I do that because it makes it easier for my brain to follow what I’m doing), mark one of them as reference and prune the other; then copy the remaining files into the first one, if there are any (see the sketch after this list for one way to spot identical branches)
  • if you need to quickly make space, sort your dupes by size
  • if dupeGuru is suggesting you get rid of the copy of a file which is in a directory you want to keep, go back and mark that directory as reference
  • keep an eye on the bottom of the screen, which tells you how much data the dupes represent (if it’s 50MB spread across hundreds of small files in as many little folders, you probably don’t want to bother, unless you’re really obsessed with organizing your stuff, in which case you probably wouldn’t have ended up in a situation requiring dupeGuru in the first place)
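Following up on the “redundant branches” tip above, here is a hedged sketch of one way to spot identical branches automatically: give every directory a fingerprint derived from its entries’ names and contents, so the same branch hashes identically wherever it sits in the tree. The mymess root is just the placeholder from the example above, and again, this is not how dupeGuru itself works:

```python
# Sketch: fingerprint every directory from its entries' names and contents,
# so identical branches get identical fingerprints wherever they live.
# Reads whole files into memory and ignores symlinks, so it's only a rough pass.
import hashlib
import os

def fingerprint(path, index):
    """Recursively hash a directory; record every path under its digest."""
    h = hashlib.sha256()
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        h.update(entry.encode())
        if os.path.isdir(full):
            h.update(fingerprint(full, index).encode())
        elif os.path.isfile(full):
            with open(full, "rb") as f:
                h.update(hashlib.sha256(f.read()).digest())
    digest = h.hexdigest()
    index.setdefault(digest, []).append(path)
    return digest

index = {}
fingerprint("mymess", index)  # placeholder root from the example above
for paths in index.values():
    if len(paths) > 1:
        print("identical branches:", ", ".join(paths))
```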

Happy digital spring-cleaning!

Disqus Plugin Aftermath: Removing Duplicate Comments [en]

[fr] How to get rid of 5000 duplicate comments in your WordPress database!

Now that Disqus integrates FriendFeed comments, I might be tempted to give it another try, had I not spent an hour yesterday cleaning up my database after an earlier attempt to use Disqus on this blog. After the story, here’s how I did it — in case you’re in the same mess and could use the help.

Back in August, I installed the Disqus plugin for WordPress. Things started off not too badly, though I was a bit concerned that the plugin seemed to have duplicated all the comments in my database. The duplicates didn’t seem to show up on the blog, so I didn’t worry too much.

After a few months, I was a bit frustrated with Disqus and the plugin (clearly an older version than the Disqus plugin available now). Moderating comments through the WordPress interface seemed to work erratically, and some spam simply refused to stay in the spambox. I never really tried to pin down the exact problems, I have to admit, but things were not working the way I expected them to.

Then a few (unrelated) people told me they had completely failed to comment on my blog with the new system. At some point, I got fed up and uninstalled it. Unfortunately, the duplicate comments which had been hidden from view remained there after uninstalling the plugin, so all the old comments appeared on the blog twice. I let the problem sit for a long time before attempting to fix it, in the wild hope there might be a ready-made script out there I could just run… in vain.

Here’s how I tackled the problem this weekend and ended up removing the duplicate comments without too much trouble, through phpMyAdmin (PMA for short).

  • In PMA, I made sure that the duplication seemed consistent — it was
  • I discovered that the duplicate comments had “DISQUS” in the user-agent field
  • I dug around until I identified the last duplicate comment (when I installed the Disqus plugin, actually; I sorted the database comments table by comment date to do that)
  • I did a search, selecting comments dated on or before the last duplicate comment AND with “DISQUS” in the user-agent field (the date bit is important, because comments posted while the plugin was active also have “DISQUS” in the user-agent field but are not duplicates)
  • Then, I deleted everything that came up in the search — about 5000 comments (it helps to tell PMA to display 3000 lines per page when doing that :-)); the sketch after this list shows roughly the equivalent query
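For anyone comfortable skipping the PMA interface, here is what the clean-up could look like as a small script. This is a hedged sketch, not exactly what I ran: it assumes the standard WordPress schema (a wp_comments table with comment_agent and comment_date columns), and the credentials and cutoff date are placeholders to adapt. And once more: back up your database before running anything like this.

```python
# Hedged sketch of the phpMyAdmin clean-up as a script. Assumes the standard
# WordPress schema (wp_comments with comment_agent and comment_date columns);
# credentials, table prefix and the cutoff date are placeholders.
import mysql.connector  # pip install mysql-connector-python

# Placeholder: the date of the last duplicate comment found in the table.
CUTOFF = "2008-08-20 00:00:00"

db = mysql.connector.connect(
    host="localhost", user="wp_user", password="secret", database="wordpress"
)
cur = db.cursor()

# Always preview first: count the rows the criteria would match.
criteria = "comment_agent LIKE %s AND comment_date <= %s"
cur.execute(f"SELECT COUNT(*) FROM wp_comments WHERE {criteria}",
            ("%DISQUS%", CUTOFF))
print("rows matching:", cur.fetchone()[0])

# Only run the DELETE once the count looks right (and the DB is backed up!).
cur.execute(f"DELETE FROM wp_comments WHERE {criteria}",
            ("%DISQUS%", CUTOFF))
db.commit()
print("deleted:", cur.rowcount)
```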

Hope this can help somebody, and remember: always back up your database first!