DupeGuru: You Own Less Data Than You Think [en]

[fr] Pour faire la chasse aux doublons sur Mac, Windows ou Linux, je vous recommande chaudement d'essayer dupeGuru! (En plus, système de rémunération des développeurs intéressant: Fairware.)

One of the consequences of putting an SSD into my MacBook and using CrashPlan and an Amahi home server to store my data and backups is that I have been forced to do a little digital spring-cleaning.

I had:

  • a 500Gb HDD in my MacBook, which hit “full” some time back before I freed up some space by moving stuff to an external HDD
  • an external 320Gb HDD, initially to store photos and videos, in practice filled with undefined junk, most of it mine, some of it others’
  • an external 250Gb HDD, initially to store a mirror of my MacBook HDD when it was only 250Gb, then filled with undefined junk, most of it mine, some of it others’
  • an external 110Gb HDD, containing disk images of various installation DVDs, and quite a lot of undefined junk, most of it mine, some of it others’

As you can see, “undefined junk” comes back often. What is it?

  • “I don’t have quite enough space on my MacBook HDD anymore, let’s move this onto an external drive”
  • “heck, do I have a second copy of this data somewhere? let’s make one here just in case”
  • “Sally, let me just make a copy of your user directory here before I upgrade your OS/put in a bigger hard drive, just in case things go wrong”
  • “eeps, I haven’t made a backup in some time, let me put a copy of my home directory somewhere” (pre-Time Machine)

See the idea?

dupeGuru logo.Enter dupeGuru. I’ve wanted a programme like this for ages, without really taking the time to find it. Thanks to a kind soul on IRC, I have finally found the de-duping love of my life. (It works on OSX, Windows, and Linux.) It’s been an invaluable assistance in showing me where my huge chunks of redundant data are. Plus, it’s released as Fairware, which I find a very interesting compensation model: as long as there are uncompensated hours of work on the project, you’re encouraged to contribute to it, and the whole process is visible online.

Back to data. I quickly realized (no surprise) that I had huge amounts of redundant data. This prompted me to coin the following law:

Lack of a clear backup strategy leads to massive, uncontrolled and disorganized data redundancy.

The first thing I did was create a directory on my home server and copy all my external hard drives there. Easier to clean if everything is in one place! I also used my (now clean) 500Gb to copy some folder structures I knew were clean.

Now, one nice thing about dupeGuru is that you can specify a “reference” folder when you choose where to hunt for duplicates. That means you tell dupeGuru “stuff in here is good, don’t touch it, but I want to know if I have duplicate copies of that content lying around”. Once you’ve found duplicates, you can choose to view only the duplicates, sort them by size or folder, delete, copy or move them.

As with any duplicate-finder programme, you cannot just use it blindly, but it’s an invaluable assistant in freeing space.

I ran it on my well-organized Music folder and discovered 5Gb of duplicate data in there — in less than a minute!

Now that I’ve cleaned up most of my mess, I realize that instead of having 8 or 900Gb of data like I imagined, reality is closer to 300Gb. Not bad, eh?

So, here are my clean-up tips, if you have a huge mess like mine, with huge folder structures duplicated at various levels of your storage devices:

  • start small, and grow: pick a folder to start with that’s reasonably under control, clean it up, then add more folders using it as reference actually, better to set a big folder as reference and check to see if a smaller folder isn’t already included in it
  • scan horribly messy structures to identify redundant branches (maybe you have mymess/somenastydirectory and mymess/documents/old/documents/june/somenastydirectory), copy those similar branches to the same level (I do that because it makes it easier for my brain to follow what I’m doing), mark one of them as reference and prune the other; then copy the remaining files into the first one, if there are any
  • if you need to quickly make space, sort your dupes by size
  • if dupeGuru is suggesting you get rid of the copy of a file which is in a directory you want to keep, go back and mark that directory as reference
  • keep an eye on the bottom of the screen, which tells you how much data the dupes represent (if it’s 50Mb and hundreds of small files in as many little folders, you probably don’t want to bother, unless you’re really obsessed with organizing your stuff, in which case you probably won’t have ended up in a situation requiring dupeGuru in the first place)

Happy digital spring-cleaning!

I Hate FTP [en]

[fr] Je hais le FTP. Donnez-moi un accès SSH et screen sur le serveur, et me voilà heureuse.

Ever since I discovered the magical combination of SSH + screen, I have come to loathe FTP. Although some of you will cringe at the idea, I like working directly on the server. No stray copies lying around, dated I-don’t-know-what. No chance of mistakenly overwriting your last set of changes.

Screen is a terminal multiplexer (just learned the term). What you do, basically, is climb inside it when you’re on the server, and do everything from there. The advantage is that:

  • when you disconnect your SSH connection, screen keeps running, so your workspace is how you left it next time you come in
  • you can have multiple “screens” (ie, terminal windows) you can easily switch around, so you can have your IRC channel running in one screen, be editing a file in another, etc. (basically, multi-tasking like you would do with windows in a graphical environment).

I learnt shell commands as I went along. Those I use the most are:

  • wget http://wordpress.org/latest.zip to download (instantly!) the latest version of WordPress directly on the server
  • unzip latest.zip to unzip it, still directly on the server
  • mv wp old-200910 to archive an old installation of wordpress (or move other files around)
  • cp -Rf plugins/* ../../wordpress/wp-content/plugins/ to copy all my plugins to the freshly unzipped install of WordPress
  • nano wp-config-sample.php to add my settings to the file and save it as wp-config.php

These are just a few examples. Once you know these commands and have them at the tip of your fingers, how fast you work is only limited by how fast you can type them. And you’re doing things directly on the web server. You’re not stuck looking at the “real world” (= the server) through the imperfect lens of an FTP client, waiting for uploads to happen (or downloads), paying attention not to overwrite stuff, having everything ready on your computer before pressing the magic button and hoping everything will be all right, because otherwise you’re in for another bout of download, edit, upload…

Some of my clients have WordPress installations on servers with no shell access. Obviously, I don’t have as much practice doing things the FTP way, but I swear it takes me 5 times as much time to do things with no SSH access. When you know how to use it, the command-line is wickedly fast.

The only situation where I actually do like FTP is when I’m using CSSEdit, because coupled to an FTP client, I can be editing my CSS file with the added power of the programme on my Mac, and have it upload and update the file on the server each time I hit save. Because yes, it’s nicer to write CSS in CSSEdit than in nano.

But for managing files and moving them around and minor edits… I’m much happier sitting on my server inside my screen.

Browser Language Detection and Redirection [en]

[fr] Une explication de la méthode que j'ai suivie pour que http://stephanie-booth.com redirige le visiteur soit vers la version anglaise du site, soit la française, en fonction des préférences linguistiques définies dans son navigateur.

Update, 29.12.2007: scroll to the bottom of this post for a more straightforward solution, using Multiviews.

I’ve been working on stephanie-booth.com today. One of my objectives is the add an English version to the previously French-only site.

I’m doing this by using two separate installations of WordPress. The content in both languages should be roughly equivalent, and I’ll write a WordPress plugin which allows to “automate” the process of linking back and forth from equivalent content in different languages.

What I did today is solve a problem I’ve been wanting to attack for some time now: use people’s browser settings to direct them to the “correct” language for the site. Here is what I learnt in the process, and how I did it. It’s certainly not the most elegant way to do things, so let me know if you have a better solution by using the comments below.

First, what I needed to know was that the browser language preferences are sent in the HTTP_ACCEPT_LANGUAGE header (HTTP header). First, I thought of capturing the information through PHP, but I discovered that Apache (logical, if you think of it) could handle it directly.

This tutorial was useful in getting me started, though I think it references an older version of Apache. Out of the horses mouth, Apache content negotiation had the final information I needed.

I’ll first explain the brief attempt I did with Multiviews (because it can come in handy) before going through the setup I currently have.

Multiviews

In this example, you request a file, e.g. test.html which doesn’t physically exist, and Apache uses either test.html.en or test.html.fr depending on your language preferences. You’ll still see test.html in your browser bar, though.

To do this, add the line:

Options +Multiviews

to your .htaccess file. Create the files test.html.en and test.html.fr with sample text (“English” and “French” will do if you’re just trying it out).

Then, request the file test.html in your browser. You should see the test content of the file corresponding to your language settings appear. Change your browser language prefs and reload to see what happens.

This is pretty neat, but it forces you to open a file — and I wanted / to redirect either to /en/ or to /fr/.

It’s explained pretty well in this tutorial I already linked to, and this page has some useful information too.

Type maps

I used a type map and some PHP redirection magic to achieve my aim. A type map is not limited to languages, but this is what we’re going to use it for here. It’s a text file which you can name menu.var for example. In that case, you need to add the following line to your .htaccess so that the file is dealth with as a type map:

AddHandler type-map .var

Here is the content of my type-map, which I named menu.var:

URI: en.php
Content-Type: text/html
Content-Language: en, en-us, en-gb

URI: fr.php
Content-Type: text/html
Content-Language: fr, fr-ch, fr-qc

Based on my tests, I concluded that the value for URI in the type map cannot be a directory, so I used a little workaround. This means that if you load menu.var in the browser, Apache will serve either en.php or fr.php depending on the content-language the browser accepts, and these two PHP files redirect to the correct URL of the localized sites. Here is what en.php looks like:

And fr.php, logically:

Just in case somebody came by with a browser providing neither English nor French in the HTTP_ACCEPT_LANGUAGE header, I added this line to my .htaccess to catch any 406 errors (“not acceptable”):

ErrorDocument 406 /en.php

So, if something goes wrong, we’re redirected to the English version of the site.

The last thing that needs to be done is to have menu.var (the type map) load automatically when we go to stephanie-booth.com. I first tried by adding a DirectoryIndex directive to .htaccess, but that messed up the use of index.php as the normal index file. Here’s the line for safe-keeping, if you ever need it in other circumstances, or if you want to try:

DirectoryIndex menu.var

Anyway, I used another PHP workaround. I created an index.php file with the following content:

And there we are!

Accepted language priority and regional flavours

In my browser settings, I’ve used en-GB and fr-CH to indicate that I prefer British English and Swiss French. Unfortunately, the header matching is strict. So if the order of your languages is “en-GB, fr-CH, fr, en” you will be shown the French page (en-GB and fr-CH are ignored, and fr comes before en). It’s all explained in the Apache documentation:

The server will also attempt to match language-subsets when no other match can be found. For example, if a client requests documents with the language en-GB for British English, the server is not normally allowed by the HTTP/1.1 standard to match that against a document that is marked as simply en. (Note that it is almost surely a configuration error to include en-GB and not en in the Accept-Language header, since it is very unlikely that a reader understands British English, but doesn’t understand English in general. Unfortunately, many current clients have default configurations that resemble this.) However, if no other language match is possible and the server is about to return a “No Acceptable Variants” error or fallback to the LanguagePriority, the server will ignore the subset specification and match en-GB against en documents. Implicitly, Apache will add the parent language to the client’s acceptable language list with a very low quality value. But note that if the client requests “en-GB; q=0.9, fr; q=0.8”, and the server has documents designated “en” and “fr”, then the “fr” document will be returned. This is necessary to maintain compliance with the HTTP/1.1 specification and to work effectively with properly configured clients.

Apache, Content Negotiation

This means that I added regional language codes to the type map (“fr, fr-ch, fr-qc”) and also that I changed the order of my language preferences in Firefox, making sure that all variations of one language were grouped together, in the order in which I prefer them:

Language Prefs in Firefox

Catching old (now invalid) URLs

There are lots of incoming links to pages of the French site, where it used to live — at the web root. For example, the contact page address used to be http://stephanie-booth.com/contact, but it is now http://stephanie-booth.com/fr/contact. I could write a whole list of permanent redirects in my .htaccess file, but this is simpler. I just copied and modified the rewrite rules that WordPress provides, to make sure that the correct blog installation does something useful with those old URLs (bold is my modification):

# BEGIN WordPress

RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . **/fr**/index.php [L]


# END WordPress

In this way, as you can check, http://stephanie-booth.com/contact is not broken.

Next steps

My next mission is to write a small plugin which I will install on both WordPress sites (I’ve got to write it for a client too, so double benefit). This plugin will do the following:

  • add a field to the write/edit post field in which to type the post slug of the correponding page/post in the other language *(e.g. “particuliers” in French will be “individuals” in English)
  • add a link to each post pointing to the equivalent page in the other language

It’s pretty basic, but it beats manual links, and remains very simple. (I like simple.)

As I said, if you have a better (simpler!) way of doing all this, please send it my way.

A simpler solution [Added 29.12.2007]

For each language, create a file named index.php.lg where “lg” is the language code. For French, you would create index.php.fr with the following content:

Repeat for each language available.

Do not put an index.php file in your root directory, just the index.php.lg files.

Add the two following lines to your .htaccess:

Options +Multiviews
ErrorDocument 406 /fr/

…assuming French is the default language you want your site to show up in if your visitor’s browser doesn’t accept any of the languages you provide your site in.

You’re done!

Invalid argument supplied for foreach() in wp-capabilities.php: Case Cracked! [en]

[fr] Le problème avec wp-capabilities.php qui fait qu'on peut se retrouver "exfermé" (enfermé dehors) de son blog WordPress (typiquement en cas de changement de serveur) semble avoir sa source dans le contenu du champ wp_user_roles dans la table wp_options. En particulier, pour la version française, "Abonné" est un rôle d'utilisateur, et en cas de problèmes d'encodage MySQL, le caractère accentué sera corrompu, causant ainsi l'erreur.

Il suffit de remplacer le caractère fautif dans PhpMyAdmin, et on retrouve l'accès à son blog. Bon, reste ensuite à régler les questions d'encodage... mais c'est déjà ça!

Finally. At last. Endlich. Enfin.

Once more, while trying to transfer a WordPress installation from one server to another, I found myself facing the dreaded problem which locks me out of my WordPress install with a rather cryptic message:

Warning: Invalid argument supplied for foreach() in /home/user/wp/wp-includes/capabilities.php on line 31

(Your lineage may vary.)

What happens is that WordPress cannot read user roles, and therefore, even though your password is accepted, you get a message telling you that you’re not welcome in the wp-admin section:

Vous n’avez pas les droits suffisants pour accéder à cette page.

Or, in English:

You do not have sufficient permissions to access this page.

A quick search on the WordPress forums told me that I was not alone in my fight with wp-capabilities.php, but that many problems had not been resolved, and more importantly, that suggested solutions often did not work for everyone.

I’ve bumped into this problem a couple of times before, and I knew that it was linked to encoding problems in the database. (I’ve had my share of encoding problems: once, twice, thrice — “once” being on of the most-visited posts on this blog, by the way, proof if needed that I’m not alone with mysql encoding issues either.)

I’ll leave the detailed resolution of how to avoid/cure the MySQL problems later (adding
mysql_query("SET NAMES 'utf8'");
to wp-db.php as detailed in this thread, and as zedrdave had already previously told me to do — should have listened! — should prevent them). So anyway, adding that line to my working WordPress install showed me that the problem was not so much in the database dumping process than in the way WordPress itself interacted with the database, because the dreaded wp-capabilities.php problem suddenly appeared on the original blog.

Now, this is where I got lucky. Browsing quickly through the first dozen or so of forum threads about wp-capability.php problems, this response caught my eye. It indicated that the source of the problem was the content of the wp_user_roles field (your prefix may vary). In this case, it had been split on more than one line.

I headed for the database, looked at the field, and didn’t see anything abnormal about it at first. All on one line, no weird characters… just before giving up, I moved the horizontal scrollbar to the end of the line, and there — Eurêka! I saw it.

Abonné

“Contributor”, in French, is “abonné”, with an accent. Accent which got horribly mangled by the MySQL problems which I’ll strive to resolve shorty. Mangled character which caused the foreach() loop to break in wp-capabilities.php, which caused the capabilities to not be loaded, which caused me to be locked out of my blog.

So, in summary: if you’re locked out of your blog and get a warning/error about wp-capabilities and some invalid foreach() loop thingy, head for PhpMyAdmin, and look carefully through the wp_user_roles field in the wp_options table. If it’s split over two or more lines, or contains funky characters, you have probably found the source of your problem.

Good luck!

Hairy .htaccess Dreamhost WordPress Problem Solved! [en]

[fr] Résolution d'un problème qui m'a littéralement empoisonné mes vacances. Ouf.

Thanks to grimboy, my “parent” .htaccess now has two extra lines and looks like this (and the problem that has kept me awake for the last week is solved):

AddDefaultCharset OFF
# BEGIN WordPress

RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_URI} ^/membres.*$ [OR]
RewriteCond %{REQUEST_URI} ^/failed_auth.html$
RewriteRule ^.*$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php

# END WordPress

Thanks so much!

Hairy .htaccess Dreamhost WordPress Problem [en]

[fr] Un des derniers problèmes qui me résistent sur le nouveau serveur.

Here’s roughly what I wrote in a support ticket I sent Dreamhost this morning. If you have any suggestions, I’ll take them.

> Hello,

> I have a site http://cafecafe.ch on which I have installed wordpress
(http://cafecafe.ch/wp/ -> displays as http://cafecafe.ch/blog with
Filosofo Homepage-Control plugin).

> That server has a subdirectory http://cafecafe.ch/membres/ which is
password-protected using .htaccess. Inside is another wordpress
install http://cafecafe.ch/membres/wp.

> I had this set up on my previous host and it worked fine.

> Now, if I go to http://cafecafe.ch/membres/ the request is caught be
the blog installed in http://cafecafe.ch/wp/, and I’m shown the page
http://cafecafe.ch/blog/.

> To make sure it wasn’t a conflict between the two wordpress installs,
I created an empty directory http://cafecafe.ch/test/ which I tried to
password-protect in the same way. The problem is the same (going to
http://cafecafe.ch/test/ displays http://cafecafe.ch/blog). If I
comment out the “request valid-user” line of the .htaccess, I get to
see the directory listing.

> Similarly, if I come back to http://cafecafe.ch/membres and comment
out that line in .htaccess, both wordpress installs work fine, with
permalinks and all (only the private blog isn’t protected anymore,
which won’t do it).

> I’ve tried not doing the password protection manually, and using what
is provided in the panel for that, but the problem remains exactly the
same.

> Weird, isn’t it?

> Hope you can help me out on this. Tried checking error logs but they
were empty.