Disk drive recovery: ddrescue, dd_rescue, dd_rhelp, by John Gilmore

From: http://www.toad.com/gnu/sysadmin/index.html#ddrescue

If you have a disk drive with errors on it, and you'd like to read the recoverable data off it, GNU ddrescue is your best friend.

It is modeled after two preceding programs, dd_rescue (with an underbar) and dd_rhelp. But GNU ddrescue is far better than both -- I've tried all three on the same drive, as well as plain old "dd". You should skip my learning process and head straight for the best tool, which is GNU ddrescue. I'll tell you about it.

So, a brief tutorial on things I learned about copying disk drives. "dd" will make a copy of a disk drive with errors, if you set "conv=noerror" so it will keep going after errors. The catch is that it just *removes* the erroneous sectors from its output, as if they didn't exist, which totally screws up the file system image. Fsck will tell you just how unhappy it is with such an image; it's unrecoverable without massive manual work, shifting big blocks of data around. Instead, you can use "dd conv=noerror,sync", which will write an output record (zeros) even if the input record has an error. You had better do this on single disk sectors, thus "dd bs=512 conv=noerror,sync". If you use a larger blocksize (reading multiple disk sectors at once), the first sector that has an error will stop the read, and what gets written out will be zeroes not only for the bad sector but for all subsequent sectors in that block.
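
For example, a whole-drive copy with plain dd looks like this (the device and file names here are examples; substitute your own failing drive and destination):

dd if=/dev/hda of=/mnt/backup/hda.img bs=512 conv=noerror,sync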

"dd bs=512 conv=noerror,sync" works, but has many drawbacks. It's slow even on the error-free stuff since it's doing tiny reads and writes. It spends a lot of time chewing through the erroneous parts of the drive, rather than reading as much error-free stuff as it can, THEN going back to do the hard stuff. (When your drive is crapping out, it has a tendency to die the big death at any moment. You'd like to get as much info off it as possible before that happens. One example is if small particles of stuff are rattling around inside the drive; they make more and more errors, as you run the drive. Sometimes, putting the drive in the freezer for a few hours, in a ziploc bag to keep the moisture off, will revive it briefly; electronics work better at low temperatures than when they get hot.)

Kurt Garloff's dd_rescue was the first attempt to improve on this. It reads and copies bigger blocks until it sees an error, then slows down and goes back, and reads single sectors. After a while it speeds up again. It can also read backward, and can quit after it gets some specified number of errors. It keeps a 3-line display updated in your text window so you can see what it's doing. If you run it simply, it just does what "dd bs=64k" does until it sees an error, then backs up and does "dd bs=512". If it gets an error reading a sector, it doesn't write to that sector of the output file, but it skips past it to write the next good one, so everything stays in sync. It seeks the input and output in parallel so it makes an exact copy of the parts that it can read.
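
A basic dd_rescue run needs no options at all, since the defaults already do the 64k-then-512 dance described above (names here are examples; "-r", if I remember the flag right, makes it work backward from the end -- check the man page):

dd_rescue /dev/hda /mnt/backup/hda.img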

LAB Valentin's dd_rhelp is a complex shell script that runs dd_rescue many times, trying to be strategic about copying the drive. It copies forward until it gets errors, then jumps forward by a big jump, looking for either the end of the drive or more easy-to-read stuff. Once it finds the end of the drive, it starts working backward, trying to close up the "hole" that it hasn't read yet. As it encounters errors, it skips around looking for more error-free parts of the drive. It only reads each sector once. It reads the logfile output of dd_rescue to see what happened and to figure out what to do next.

One problem with dd_rhelp is that it's a shell script, so it's really slow and consumes massive resources. On one of my drives that had about 2900 bad sectors on it, dd_rhelp would waste upwards of 15 minutes deciding what blocks to tell dd_rescue to try reading next. During that time it makes about 100 new Unix processes every second.

Antonio Diaz Diaz's GNU ddrescue learned from these experiences. It combines dd_rescue's ability to read big blocks and then shift gears with dd_rhelp's ability to remember which parts of the disk have already been looked at. It keeps this info in a really simple logfile format, updated every 30 seconds and whenever it stops or is interrupted. It's written in C++ and it's small and fast.
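
The basic invocation is just input, output, and logfile (names here are examples; the logfile is what lets an interrupted run pick up exactly where it left off):

ddrescue /dev/hda /mnt/backup/hda.img rescue.log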

It starts off running like "dd", blasting through large error-free areas. When it gets an error, it writes out any partial data that it received during that read, and KEEPS GOING to the next big block. It notes in the logfile that a bunch of sectors (the first erroneous one, plus whatever ones followed it in the multi-sector read) were skipped. And keeps going. So it reads through the entire disk in big blocks first. Then it goes back to "split" the skipped parts, trying to read each sector individually. The compact logfile always shows which chunks of disk have been read OK, have been read with errors, have been read one sector at a time with errors, or have never been read yet.
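
You can also make the two phases explicit as two separate runs, which is roughly the recipe the ddrescue manual suggests (I'm recalling the options from memory, so verify them: "-n" skips the slow sector-splitting phase, and "-r 3" retries each bad sector up to three times; use the same logfile both times so the second run only touches what the first one skipped):

ddrescue -n /dev/hda /mnt/backup/hda.img rescue.log
ddrescue -r 3 /dev/hda /mnt/backup/hda.img rescue.log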

One catch about GNU ddrescue is that the author has some strange ideas about what "ought to" be in the C++ library. So in the current version (1.1), you'll be lucky to get it to compile without errors. All the errors are minor, and are not in key parts of the software, so you can dike them out if you need to. Sometime soon I'll make some nice clean patches for these parts, and submit them, but it's been this way for a year and people complain about this every month on the bug-ddrescue mailing list, and the maintainer doesn't fix it, so I'm not optimistic that the patches will be accepted. But use his software anyway; other than this quirk, it's really nice.

As an aside, it takes a lot of time and screen space for the kernel to log all the error messages from when you're reading from a failing disk drive. The messages also tend to screw up the screen that you're trying to work in. To speed up the logging, you can edit /etc/syslog.conf and insert a "-" before "/var/log/messages", then restart the syslog daemon. This tells syslog to not do an "fsync" after every log message it writes out. If you crash you'll be missing the last few messages, but if you don't crash you'll run about eight times as fast. Also, to make it stop printing those messages on your console, you have to edit the arguments to the "klogd" daemon, which is usually started by the same script that starts the syslog daemon (/etc/init.d/syslog). On my Red Hat 7.3 system, you can edit /etc/sysconfig/syslog and change the KLOGD_OPTIONS line so it includes " -c 0 " which will suppress all console messages, then restart the syslog daemon. (If you can figure out the "logging level" of the disk error messages, you can set the level higher than 0, but I was in a hurry.) Change these things back when you're done doing disk recovery, so you'll see kernel error messages, and so they'll get logged reliably, when the kernel is crashing a year later for some totally unrelated reason.
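
Concretely, the two edits look something like this (the selector list on the left side of the syslog.conf line varies by distribution; the "-" prefix on the filename is the part that matters):

# /etc/syslog.conf: "-" means don't fsync after every message
*.info;mail.none;authpriv.none    -/var/log/messages

# /etc/sysconfig/syslog on Red Hat: "-c 0" suppresses console messages
KLOGD_OPTIONS="-c 0"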

Speculation about fixing bad disk blocks to keep using the disk
Once you have copied your entire disk drive off to some other drive (or to a file on a bigger drive, which is often easier), what do you do with the failing drive? If there are only a few errors, and the drive is fairly modern, you can probably just rewrite those sectors, and the drive will reallocate them automatically to new, "spare" sectors that it keeps lying around for just this purpose. From the drive's point of view, it doesn't matter what you write to those sectors -- could be zeros, garbage, or good data; it will either reallocate them and write your data there, or it will just try writing your new data over the bad data and see if it "sticks" (is readable afterward).

When I have a drive with only one or two failing sectors, I often find them with "smartctl -t long" (one at a time, sigh) and then write to them with a complicated series of commands. The very smart smartmontools maintainer, Bruce Allen, has written a BadBlockHowTo.txt about how to do this. (I hope somebody automates this error-prone process soon.) I have done this many times, on many generations of disk drives; on old 1980s SCSI drives there were utility programs that would reallocate individual sectors. I know of no free software program for doing this kind of low-level drive formatting, unfortunately. Modern drives just do it when you write.
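
The rough shape of that command series, paraphrased from the BadBlockHowTo (the device name and LBA here are made up; get the real LBA from the self-test log, and triple-check the seek value, because this write is destructive):

smartctl -t long /dev/hda
smartctl -l selftest /dev/hda
# suppose the self-test log reports the first error at LBA 1234567;
# write zeros over exactly that one sector:
dd if=/dev/zero of=/dev/hda bs=512 count=1 seek=1234567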

OK, this next part is pure uninformed speculation. I HAVEN'T DONE THIS ON MY DISKS, AND YOU SHOULD NOT DO THIS TO YOUR DISKS UNLESS YOU ARE A WIZARD AND YOU KNOW WHAT YOU ARE DOING, AND YOU'RE WILLING TO TAKE THE RISKS YOURSELF WITHOUT WHINING.

It occurs to me that it OUGHT to work to do this: First, copy the entire drive to somewhere else with GNU ddrescue. The logfile will show you exactly where the erroneous sectors are. Look at it with a text editor. Make sure that's what it says. Then, make a copy of that logfile, and (here's the tricky part) run GNU ddrescue in a very strange way to write zeroes onto those bad sectors:

ddrescue -r1 /dev/zero /dev/baddisk my-logfile-copy

Note that your bad disk is the OUTPUT of this command, while the system file "/dev/zero" (as many zeroes as you ever wanted) is the INPUT to the command.

If you hadn't specified the logfile and the "-r1", this would copy zeroes over your entire bad disk. The logfile points out what disk blocks it couldn't read, so ddrescue will only try to read the parts it hasn't read before. The "-r1" tells it to go back and try to read them again, even though it failed the first time. But since it's reading from /dev/zero this time, all of those reads are going to succeed -- and then it will write those zeros into the exact places on the bad disk drive where you need to write new data to reallocate the bad sectors.

In the process, it destroys the logfile, which is why you made a copy of it. When it's done, the logfile will report that the entire disk was readable, because it was able to read from every "bad" sector in /dev/zero.
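
Put together, the whole speculative sequence, end to end, would look something like this (device and file names are examples):

ddrescue /dev/baddisk /mnt/backup/baddisk.img rescue.log
cp rescue.log rescue.log.copy
ddrescue -r1 /dev/zero /dev/baddisk rescue.log.copy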

LIKE I SAID ABOVE, this is all pure speculation. I have never done this on my own drives. I might try it someday; the drive with >2000 errors would be a good candidate, except that it's a very old 2GB drive, it has already reallocated every spare sector, and I don't know what trashed those 2000 sectors, so I don't really trust the drive anyway. Better to try it on a 200GB or larger drive (newer, with better error-recovery firmware) with only a few errors that probably weren't caused by some massive internal problem. Don't blame me if you trash your drive this way. But do tell me what you think of the idea.

Disk drive recovery: ddrescue on MacOSX
My friend's Macintosh started acting very strange and slow today, and we eventually figured out that it's the disk drive, which shows failing SMART status. (MacOS is too stupid to give access to the real SMART info the way the free smartctl does; it just provides a red FAILED indication. However, smartctl may have been ported to MacOSX by now.)

So the race is on to get as much data as possible off the drive.

We immediately copied off a few key documents onto a USB Flash drive, then hooked up a spare USB hard drive. Then the question became: what software do we have to copy this failing drive to the external USB drive?

The snooty Mac utility Carbon Copy Cloner refused to copy to a USB drive. (And I don't know whether it would handle a source drive full of errors, anyway.)

I looked for MacOS ports of ddrescue, but there weren't any in a form that I could simply download and run (or even download, install, and run). Instead there was Darwinports, which wanted me to install almost a gigabyte of Apple proprietary versions of the GNU tools (Xcode), with a license agreement that you MUST CLICK AGREE on or it REFUSES TO INSTALL, even though the GNU license does not work that way. Plus their own small wart on the side of that, which seems to be some scripts that suck down sources and run them through "./configure; make; make install". I aborted that all-night download as soon as I found an alternative; I don't know if this disk drive is up to installing a gigabyte of software and then configuring and compiling a GNU program.

Fink at least has the concept of installing binaries rather than compiling everything from sources. But I downloaded it and tried running its installer, and the installer failed with an error message that didn't appear in the documentation, saying that the installation had failed but I should retry it. When I retried it, of course, it said it couldn't install on top of an existing installation. When I tried to run the half-installed one, it would instantly fail. And when I removed it according to the documentation, and tried a reinstall, I got the exact same problem. Yeah, I know, it's free software, so it's my problem. I don't recommend Fink to you.

So I have fallen back on trusty old "dd conv=sync,noerror" and I hope it doesn't lose my friend too much of the drive. My hat's off to the Mac community for the elaborateness of its infrastructure. Doesn't it need a few more bells and whistles, guys? Someday if I ever waste a week installing a compiler on a Mac, I'll build a simple binary of ddrescue, and put it up on this web site to help the next person who's trying to rescue their data from a failing drive.
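
For the record, the fallback is the same dd incantation as anywhere else, just with Mac-style device names (the disk numbers here are examples; check "diskutil list" first, because getting them backwards copies the wrong way):

dd if=/dev/disk1 of=/dev/disk2 bs=512 conv=noerror,sync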

OK, so John Perry Barlow came over, and had the development environment handy, so we built ddrescue by downloading the sources (ddrescue-1.1.tar.bz2), unpacking them in the Finder, opening a terminal, going into the unpacked ddrescue-1.1 folder, and typing "./configure" and "make". Here's the resulting binary of ddrescue and the matching documentation on how to use it. As I type this, it is copying my friend's whole dying hard drive onto an identical Firewire drive.
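
In Terminal form, the whole build amounts to something like this (assuming the tarball is sitting in the current directory and a compiler is installed):

tar xjf ddrescue-1.1.tar.bz2
cd ddrescue-1.1
./configure
make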

If your disk dies on MacOS, grab a copy of this binary and use it to copy the whole disk onto an identical (or larger) spare disk drive. If it works at all, you'll probably recover 99+% of your files. In fact, why don't you grab it now and put it on your Mac, so it'll be handy when you need it?
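
On the Mac, find the BSD device names with "diskutil list" and unmount both disks before copying; a run would look something like this (again, the disk numbers are examples -- get them backwards and you'll overwrite the disk you're trying to rescue):

diskutil unmountDisk /dev/disk1
diskutil unmountDisk /dev/disk2
./ddrescue /dev/disk1 /dev/disk2 rescue.log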