DupeGuru: You Own Less Data Than You Think [en]

[fr] To hunt down duplicates on Mac, Windows or Linux, I warmly recommend trying dupeGuru! (Plus, it has an interesting system for paying its developers: Fairware.)

One of the consequences of putting an SSD into my MacBook and using CrashPlan and an Amahi home server to store my data and backups is that I have been forced to do a little digital spring-cleaning.

I had:

  • a 500GB HDD in my MacBook, which hit “full” some time back before I freed up some space by moving stuff to an external HDD
  • an external 320GB HDD, initially to store photos and videos, in practice filled with undefined junk, most of it mine, some of it others’
  • an external 250GB HDD, initially to store a mirror of my MacBook HDD when it was only 250GB, then filled with undefined junk, most of it mine, some of it others’
  • an external 110GB HDD, containing disk images of various installation DVDs, and quite a lot of undefined junk, most of it mine, some of it others’

As you can see, “undefined junk” comes back often. What is it?

  • “I don’t have quite enough space on my MacBook HDD anymore, let’s move this onto an external drive”
  • “heck, do I have a second copy of this data somewhere? let’s make one here just in case”
  • “Sally, let me just make a copy of your user directory here before I upgrade your OS/put in a bigger hard drive, just in case things go wrong”
  • “eeps, I haven’t made a backup in some time, let me put a copy of my home directory somewhere” (pre-Time Machine)

See the idea?

Enter dupeGuru. I’ve wanted a programme like this for ages, without really taking the time to find it. Thanks to a kind soul on IRC, I have finally found the de-duping love of my life. (It works on OS X, Windows, and Linux.) It’s been invaluable in showing me where my huge chunks of redundant data are. Plus, it’s released as Fairware, which I find a very interesting compensation model: as long as there are uncompensated hours of work on the project, you’re encouraged to contribute to it, and the whole process is visible online.

Back to data. I quickly realized (no surprise) that I had huge amounts of redundant data. This prompted me to coin the following law:

Lack of a clear backup strategy leads to massive, uncontrolled and disorganized data redundancy.

The first thing I did was create a directory on my home server and copy all my external hard drives there. Easier to clean if everything is in one place! I also used my (now clean) 500GB drive to copy some folder structures I knew were clean.

Now, one nice thing about dupeGuru is that you can specify a “reference” folder when you choose where to hunt for duplicates. That means you tell dupeGuru “stuff in here is good, don’t touch it, but I want to know if I have duplicate copies of that content lying around”. Once you’ve found duplicates, you can choose to view only the duplicates, sort them by size or folder, delete, copy or move them.
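
To make the “reference folder” idea concrete, here is a minimal Python sketch of that kind of scan. It is not how dupeGuru itself works internally (I haven’t looked); it simply assumes that two files are duplicates when their contents hash to the same SHA-256 value. The names `file_digest` and `find_duplicates` are mine, made up for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(reference_dir, scan_dir):
    """List (duplicate, original) pairs: files under scan_dir whose content
    also exists under reference_dir. Nothing is deleted or moved."""
    ref_root = Path(reference_dir).resolve()

    # Index the reference folder by content hash (assumption: same hash
    # means same file). Keep the first path seen for each hash.
    reference = {}
    for path in sorted(Path(reference_dir).rglob("*")):
        if path.is_file():
            reference.setdefault(file_digest(path), path)

    pairs = []
    for path in sorted(Path(scan_dir).rglob("*")):
        if not path.is_file():
            continue
        # Never report files that live inside the reference folder itself.
        if ref_root in path.resolve().parents:
            continue
        original = reference.get(file_digest(path))
        if original is not None:
            pairs.append((path, original))
    return pairs
```

Calling `find_duplicates("clean/Music", "mess")` (both paths hypothetical) would list every file under `mess` that already exists somewhere under `clean/Music`, leaving the decision to delete, copy or move entirely to you.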

As with any duplicate-finder programme, you cannot just use it blindly, but it’s an invaluable assistant in freeing space.

I ran it on my well-organized Music folder and discovered 5GB of duplicate data in there, in less than a minute!

Now that I’ve cleaned up most of my mess, I realize that instead of having 800 or 900GB of data like I imagined, reality is closer to 300GB. Not bad, eh?

So, here are my clean-up tips, if you have a huge mess like mine, with huge folder structures duplicated at various levels of your storage devices:

  • start small, and grow: pick a folder to start with that’s reasonably under control, clean it up, then add more folders using it as reference. Actually, it’s better to set a big folder as reference and check whether a smaller folder is already included in it
  • scan horribly messy structures to identify redundant branches (maybe you have mymess/somenastydirectory and mymess/documents/old/documents/june/somenastydirectory), copy those similar branches to the same level (I do that because it makes it easier for my brain to follow what I’m doing), mark one of them as reference and prune the other; then copy the remaining files into the first one, if there are any
  • if you need to quickly make space, sort your dupes by size
  • if dupeGuru is suggesting you get rid of the copy of a file which is in a directory you want to keep, go back and mark that directory as reference
  • keep an eye on the bottom of the screen, which tells you how much data the dupes represent (if it’s 50MB spread over hundreds of small files in as many little folders, you probably don’t want to bother, unless you’re really obsessed with organizing your stuff, in which case you probably wouldn’t have ended up in a situation requiring dupeGuru in the first place)
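
The “sort by size” and “how much data do the dupes represent” tips boil down to very little arithmetic. Here is a rough Python sketch, assuming you already have a list of duplicate file paths from some scan; `summarize_duplicates` is a made-up name, not something dupeGuru provides.

```python
import os


def summarize_duplicates(duplicate_paths):
    """Return (paths sorted largest first, total bytes they occupy)."""
    sized = [(os.path.getsize(p), str(p)) for p in duplicate_paths]
    sized.sort(reverse=True)  # biggest space hogs first
    total = sum(size for size, _ in sized)
    return [path for _, path in sized], total
```

If the total comes out tiny and the list is long, that’s your cue not to bother.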

Happy digital spring-cleaning!

A Data Management Fantasy [en]

[fr] My dream: a system that would cache, in a set amount of space on my SSD (say 50GB), the most recently opened files from my external hard drive. That way I'd have all my current files close at hand and on a fast disk. Do you know of a solution that does this?

I’m now running a happy MacBook with a 120GB SSD (too big or too small depending on how you look at it, but I was in a hurry and dependent on what was in stock in the shop). I have an external 500GB HDD to store all my junk on.

And here’s my dream. Wouldn’t it be nice if I could devote a certain amount of space on the SSD to my files, say 50GB, and have that space occupied by cached copies of the files from the external drive that I most recently used? When I modify the files, the cached copies and those on the HDD would sync. And if I haven’t touched a file for long enough, it would be removed from the cache to free up space.

That way, my “current” files would be on the super-fast SSD and close at hand when I’m on the road.
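
What I’m describing is more or less a “least recently used” cache with a size budget. Here is a small Python sketch of just the bookkeeping: it decides which files would sit in the cache and which get pushed out, but it does not copy or sync anything, and it only evicts when space runs out (not after a period of inactivity). The class name `RecentFileCache` and its methods are invented for this example.

```python
from collections import OrderedDict


class RecentFileCache:
    """Track which files would sit in a fixed-size cache, evicting the
    least recently used ones when the budget is exceeded."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.entries = OrderedDict()  # path -> size, least recent first
        self.used_bytes = 0

    def touch(self, path, size):
        """Record that `path` was opened; return the paths evicted to make room."""
        if path in self.entries:
            # Forget the old entry so the file moves to the "most recent" end.
            self.used_bytes -= self.entries.pop(path)
        evicted = []
        if size > self.budget_bytes:
            return evicted  # too big to cache at all; it stays on the HDD
        self.entries[path] = size
        self.used_bytes += size
        while self.used_bytes > self.budget_bytes:
            old_path, old_size = self.entries.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_path)
        return evicted
```

With a 50GB budget you would create `RecentFileCache(50 * 10**9)` and call `touch` every time a file on the external drive is opened; whatever it returns is what a real tool would delete from the SSD copy.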

I’m sure a solution to do this already exists — heard of anything?

BarCamp Lausanne: Wuala (Dominik Grolimund) [en]

[fr] Wuala: for sharing data online.

[Wuala](http://wua.la/) launched last week. Demo. (Closed Alpha.)

Storing, sharing, publishing files. Desktop application. Allows you to search for files. “Free, simple, and secure.”

What’s new about it? Different technology. Decentralized.

Advantages:

1. Free because uses resources provided by participating computers.
2. You get 1GB free, and get more by trading unused disk space. ***steph-note: so basically, this is a service that allows you to convert local disk space into online storage; it doesn’t give you significant extra storage.*** You need to be online for at least 4 hours a day to do that. *steph-note: I find 1GB very little. Gmail offers 3 times that.*
3. No traffic limits.
4. No file size limits.
5. Fast downloads. P2P. Like BitTorrent.
6. You can stream music and video files.
7. “We think it’s a great application.” Drag’n drop. Upload in the background.
8. Simple.
9. All in one place.
10. Security and privacy. All the files are encrypted on your computer. Your password never leaves your computer. **Not even Wuala people can see your files.**

Demo.

*steph-note: I feel annoying. I always ask if there are “[buddy list groups](http://climbtothestars.org/archives/2007/08/16/we-need-structured-portable-social-networks-spsn/)” or complain about their non-existence (Facebook).*

Use: mainly to share a few files with other people. *steph-note: not sure I’d call this “online storage” as I find it a little misleading (gives the impression you get extra storage space outside your local drives) — this is really **file sharing**, Pownce-like but without the timeline.* For example, Dominik’s mum is going to use this to share photographs with him, because she’s not comfortable putting them on Flickr as it’s “on the internet”. *steph-note: I see this as an interesting alternative to dropsend and the like.*

Question: what is the business model? Ads in the client. *steph-note: alternate business model would be to make people pay to have more actual “storage”.*

Privacy: in Switzerland, there are “anti-spying” laws which would protect Wuala from having to surrender data to the CIA etc., for example. Wuala doesn’t see what is private or shared (regarding content). Very strong emphasis on privacy. *steph-note: “illegal” music and TV series sharing system of choice, if there is more storage. Problem with this strong emphasis on privacy is when people start using the service to trade kiddie porn.* Dominik says one of the solutions to this could be to limit the size of groups.

Careful! If you lose the password, you lose your files. *steph-note: ouch! this sounds unacceptable to me… no possibility to reset it? Secret questions: I hate them, because they are usually very weak. What’s the point of having great encryption and secure passwords if people give secret answers to secret questions which aren’t so secret?*

Should keep a local version of all the files you share. *steph-note: so this is really not extra storage.* What makes it so different from a prettily dressed up FTP client, besides the fact that the underlying technology is different? From a user point of view? Encryption, and sharing with friends/groups.

*steph-note: a bit skeptical about this, though parts do indeed sound interesting. Not sure what I’d use it for. Maybe to swap music.*

Bridging the gap between me and orthodox GTD [en]

[fr] Here I note a few differences between the organization system I have set up and what the book "Getting Things Done" directly recommends (as it doesn't seem to exist in French, I'll talk about it at more length, perhaps in person, when the occasion arises).

(Whatever “orthodox” GTD is, to start with.)

I started to try to [hack together some implementation of GTD](http://climbtothestars.org/archives/2006/08/02/if-you-missed-hearing-my-voice/) based on what I had read at [43folders](http://43folders.com), and, I have to say, I didn’t do too badly. I’ve now received The Book and am starting to read it. Of course, there were some missing elements I’m now understanding, and I’m preparing to set aside enough time to start implementing a good system for myself. Roughly 2 days’ work to gather and process all the “open loops” in my life. Most people I talk to tell me they would have expected more time to be required, but I still think it’s a lot of time when I look at my overbooked calendar. Still, I’m really looking forward to doing it, and I already know it will be worth it.

For the moment, I’ve noted that my filing system isn’t really DA’s-GTD-compatible: I use 26 hanging folders, and stick translucent folders (labeled! I got that bit right! love using my labeler, in fact!) under the right letter. But I’m already noticing that letter C is bulging (don’t ask me, but clients as well as administrivia tend to collect under the letter C). I’m not going to get a hanging folder for each file (way too expensive), and even the cardboard (manila-type) folders we can get here don’t really come cheap. The transparent plastic ones are really nice, but I’m not sure they’d stand up on their own.

In addition to that, the box I’m storing them in is a bit deep, and I had to line the bottom with the lid of another box to have the daily folders of my tickler file stand upright at the right place. I’m not quite sure which solution I’ll come up with. How much do manila folders (A4 size) cost, and can I order them online without the shipping fees killing me?

Another huge gap I’ve noted is that I store the tickler and A-Z reference in the same box. That’s not going to be enough space for very long, so I’ll have to go and buy other boxes to store on/under the desk.

Also, I’ve been noting “action items” on small index cards (A8), and DA suggests using a whole sheet of paper per item. I’m looking forward to reading through chapter 7 to understand where that comes in handy. I’m also looking forward to figuring out good lists to use, and of course, going through the initial collecting phase (though I’m a bit frightened I might end up putting my whole flat in the inbox).

Will keep you posted.
