Use of the data recovery software included in Ubuntu-rescue-remix is detailed here:
http://help.ubuntu.com/community/DataRecovery
Before version 8.04, a shortened version of that page is found on the CD as a file named CheatSheet.txt. The cheat sheet is in the /install directory on the live CD, as well as in /root from within the live CD session.
So to view the cheat sheet once you have booted Ubuntu Rescue Remix, run
cat /root/CheatSheet.txt | less
As of version 8.04Alpha, the cheat sheet is part of the ubuntu-rescue-remix metapackage and can be found in /usr/share/doc/ubuntu-rescue-remix/
The live CD and USB images are made using the instructions found here:
https://help.ubuntu.com/community/LiveCDCustomizationFromScratch
This file is part of Testdisk:
/usr/share/doc/testdisk/html/advanced_ntfs_boot_and_mft_repair.html
If the NTFS boot sector is damaged, data cannot be accessed. Windows will prompt "The drive is not formatted, do you want to format it now?" Linux mount will display "wrong fs type, bad option, bad superblock".
TestDisk lets you manipulate and fix the boot sector of NTFS partitions. Select the partition you want to modify and choose Boot.
Recovering NTFS Boot Sector on NTFS Partitions using its backup
TestDisk can use the backup boot sector to fix a corrupted NTFS boot sector. The primary boot sector is sector zero of the filesystem; the backup NTFS boot sector is located near the end of the filesystem. TestDisk checks the boot sector and the backup boot sector. If the boot sector and backup boot sector do not match, you can overwrite the NTFS boot sector with the backup boot sector (Backup BS) or vice versa (Org. BS). Dump can be used to display the boot sector content in both hexadecimal and ASCII.
Rebuilding NTFS Boot Sector on NTFS Partitions
If both NTFS boot sectors are corrupted, you need to rebuild the NTFS boot sector. TestDisk searches for the MFT (Master File Table: $MFT) and its backup ($MFTMirr). It reads the MFT record size and computes the cluster size, then reads the Size of Index Allocation Entry in the root directory index. Using all these values, TestDisk can provide a new boot sector. Finally, it lets the user list the files before writing.
Repair NTFS MFT
The MFT (Master File Table) is sometimes corrupted. If Microsoft Check Disk (chkdsk) failed to repair the MFT, run TestDisk and in the Advanced menu, select your NTFS partition and choose Repair MFT. TestDisk will try to repair the MFT using MFT mirror, its backup.
Content is available under GNU Free Documentation License 1.2.
by Christophe GRENIER
http://smartmontools.sourceforge.net/BadBlockHowTo.txt
THIS DOCUMENT SHOWS HOW TO IDENTIFY THE FILE ASSOCIATED WITH AN
UNREADABLE DISK SECTOR, AND HOW TO FORCE THAT SECTOR TO REALLOCATE.
Assumptions: Linux OS, ext2 or ext3 file system.
Bruce Allen
Thanks to Sergey Vlasov, Theodore Ts'o, Michael Bendzick, and others
for explaining this to me. I would like to add text showing how to do
this for other file systems, in particular ReiserFS, XFS, and JFS:
please email me if you can provide this information.
NOTE: Starting with GNU coreutils release 5.3.0, dd on Linux includes
options 'iflag=direct' and 'oflag=direct'. Using these with the dd commands
below should be helpful, because adding these flags should avoid any interaction
with the block buffering IO layer in Linux and permit direct reads/writes
from the raw device. Use 'dd --help' to see if your version of dd supports
these options. If not, build the latest code from
ftp://alpha.gnu.org/gnu/coreutils.
In this example, the disk is failing self-tests at Logical Block
Address LBA = 0x016561e9 = 23421417. The LBA counts sectors in units
of 512 bytes, and starts at zero.
-----------------------------------------------------------------------------------------------
root]# smartctl -l selftest /dev/hda
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 217 0x016561e9
-----------------------------------------------------------------------------------------------
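The self-test log reports the failing LBA in hexadecimal, while the later steps work in decimal. A minimal sketch of the conversion using shell arithmetic follows; the variable names are illustrative only and not part of any tool.

```shell
# LBA_of_first_error as printed by smartctl (hexadecimal).
lba_hex=0x016561e9
# Shell arithmetic accepts 0x-prefixed constants and yields decimal.
lba_dec=$(( $lba_hex ))
echo "$lba_dec"
```

This should print 23421417, the decimal LBA used in the rest of the example.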
Note that other signs that there is a bad sector on the disk can be
found in the non-zero value of the Current Pending Sector count:
-----------------------------------------------------------------------------------------------
root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1
-----------------------------------------------------------------------------------------------
First Step: We need to locate the partition on which this sector of
the disk lives:
-----------------------------------------------------------------------------------------------
root]# fdisk -lu /dev/hda
Disk /dev/hda: 123.5 GB, 123522416640 bytes
255 heads, 63 sectors/track, 15017 cylinders, total 241254720 sectors
Units = sectors of 1 * 512 = 512 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 63 4209029 2104483+ 83 Linux
/dev/hda2 4209030 5269319 530145 82 Linux swap
/dev/hda3 5269320 238227884 116479282+ 83 Linux
/dev/hda4 238227885 241248104 1510110 83 Linux
-----------------------------------------------------------------------------------------------
The partition /dev/hda3 starts at LBA 5269320 and extends past the
'problem' LBA. The 'problem' LBA is offset 23421417 - 5269320 =
18152097 sectors into the partition /dev/hda3.
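The lookup above can be sketched in shell. This is only an illustration: find_partition is a hypothetical helper, and the partition table is copied by hand from the fdisk -lu output rather than parsed from it.

```shell
# find_partition (hypothetical helper): print the partition that contains
# the given LBA, followed by the LBA's offset into that partition.
# Table columns: device, start sector, end sector (from fdisk -lu above).
find_partition() {
    lba=$1
    while read -r name start end; do
        if [ "$lba" -ge "$start" ] && [ "$lba" -le "$end" ]; then
            echo "$name $((lba - start))"
        fi
    done <<EOF
/dev/hda1 63 4209029
/dev/hda2 4209030 5269319
/dev/hda3 5269320 238227884
/dev/hda4 238227885 241248104
EOF
}

result=$(find_partition 23421417)
echo "$result"
```

For the LBA in this example the output is "/dev/hda3 18152097".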
To verify the type of the file system and the mount point, look in
/etc/fstab:
-----------------------------------------------------------------------------------------------
root]# grep hda3 /etc/fstab
/dev/hda3 /data ext2 defaults 1 2
-----------------------------------------------------------------------------------------------
You can see that this is an ext2 file system, mounted at /data.
Second Step: we need to find the blocksize of the file system
(normally 4096 bytes for ext2):
-----------------------------------------------------------------------------------------------
root]# tune2fs -l /dev/hda3 | grep Block
Block count: 29119820
Block size: 4096
-----------------------------------------------------------------------------------------------
In this case the block size is 4096 bytes.
Third Step: we need to determine which File System Block contains this
LBA. The formula is:
b = (int)((L-S)*512/B)
where:
b = File System block number
B = File system block size in bytes
L = LBA of bad sector
S = Starting sector of partition as shown by fdisk -lu
and (int) denotes the integer part.
In our example, L=23421417, S=5269320, and B=4096. Hence the
'problem' LBA is in block number
b = (int)(18152097*512/4096) = (int)2269012.125
so b=2269012.
Note: the fractional part of 0.125 indicates that this problem LBA is
actually the second of the eight sectors that make up this file system
block.
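The formula can be checked with integer shell arithmetic, which truncates in the same way as (int). This is a sketch only; lba_to_block and sector_in_block are hypothetical helper names, not commands from any package.

```shell
# lba_to_block (hypothetical helper): b = (int)((L-S)*512/B)
# Arguments: L (bad LBA), S (partition start sector), B (block size in bytes).
lba_to_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# sector_in_block (hypothetical helper): zero-based position of the bad
# sector inside its file system block (same arguments as above).
sector_in_block() {
    echo $(( ($1 - $2) % ($3 / 512) ))
}

b=$(lba_to_block 23421417 5269320 4096)
s=$(sector_in_block 23421417 5269320 4096)
echo "block $b, sector index $s"
```

The sector index is counted from zero, so an index of 1 corresponds to the second of the eight sectors, matching the fractional part 0.125.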
Fourth Step: we use debugfs to locate the inode that owns this block,
and the file that corresponds to that inode:
-----------------------------------------------------------------------------------------------
root]# debugfs
debugfs 1.32 (09-Nov-2002)
debugfs: open /dev/hda3
debugfs: icheck 2269012
Block Inode number
2269012 41032
debugfs: ncheck 41032
Inode Pathname
41032 /S1/R/H/714197568-714203359/H-R-714202192-16.gwf
-----------------------------------------------------------------------------------------------
In this example, you can see that the problematic file (with the mount
point included in the path) is:
/data/S1/R/H/714197568-714203359/H-R-714202192-16.gwf
To force the disk to reallocate this bad block we'll write zeros to
the bad block, and sync the disk:
-----------------------------------------------------------------------------------------------
root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=1 seek=2269012
root]# sync
-----------------------------------------------------------------------------------------------
NOTE: THIS LAST STEP HAS PERMANENTLY AND IRRETRIEVABLY DESTROYED SOME
OF THE DATA THAT WAS IN THIS FILE. DON'T DO THIS UNLESS YOU DON'T
NEED THE FILE OR YOU CAN REPLACE IT WITH A FRESH OR CORRECT VERSION.
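Because the write is destructive, it can help to print the dd command and check it before running it. The sketch below only echoes the command; zero_block_cmd is a hypothetical helper, and the oflag=direct option is the one mentioned in the note near the top of this document (it needs a dd that supports it).

```shell
# zero_block_cmd (hypothetical helper): print, but do NOT run, the dd command
# that would overwrite file system blocks with zeros.
# Arguments: device, block size in bytes, first block number, number of blocks.
zero_block_cmd() {
    echo "dd if=/dev/zero of=$1 bs=$2 count=$4 seek=$3 oflag=direct"
}

cmd=$(zero_block_cmd /dev/hda3 4096 2269012 1)
echo "$cmd"
```

Compare the printed device, block size and seek value against the results of the earlier steps before typing the command for real.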
Now everything is back to normal: the sector has been reallocated.
Compare the output just below to similar output near the top of this
article:
-----------------------------------------------------------------------------------------------
root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 1
-----------------------------------------------------------------------------------------------
Note: for some disks it may be necessary to update the SMART Attribute values by using
smartctl -t offline /dev/hda
The disk now passes its self-tests again:
-----------------------------------------------------------------------------------------------
root]# smartctl -t long /dev/hda [wait until test completes, then]
root]# smartctl -l selftest /dev/hda
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 239 -
# 2 Extended offline Completed: read failure 90% 217 0x016561e9
# 3 Extended offline Completed: read failure 90% 212 0x016561e9
# 4 Extended offline Completed: read failure 90% 181 0x016561e9
# 5 Extended offline Completed without error 00% 14 -
# 6 Extended offline Completed without error 00% 4 -
-----------------------------------------------------------------------------------------------
and no longer shows any offline uncorrectable sectors:
-----------------------------------------------------------------------------------------------
root]# smartctl -A /dev/hda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 1
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 1
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
-----------------------------------------------------------------------------------------------
A SECOND EXAMPLE
On this drive, the first sign of trouble was this email from smartd:
To: ballen
Subject: SMART error (selftest) detected on host: medusa-slave166.medusa.phys.uwm.edu
This email was generated by the smartd daemon running on host:
medusa-slave166.medusa.phys.uwm.edu in the domain: master001-nis
The following warning/error was logged by the smartd daemon:
Device: /dev/hda, Self-Test Log error count increased from 0 to 1
Running smartctl -a /dev/hda confirmed the problem:
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 80% 682 0x021d9f44
Note that the failing LBA reported is 0x021d9f44 (base 16) = 35495748 (base 10)
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 3
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 3
and one can see above that there are 3 sectors on the list of pending
sectors that the disk can't read but would like to reallocate.
The device also shows errors in the SMART error log:
Error 212 occurred at disk power-on lifetime: 690 hours
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 12 46 9f 1d e2 Error: UNC 18 sectors at LBA = 0x021d9f46 = 35495750
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Timestamp Command/Feature_Name
-- -- -- -- -- -- -- -- --------- --------------------
25 00 12 46 9f 1d e0 00 2485545.000 READ DMA EXT
Signs of trouble at this LBA may also be found in SYSLOG:
[root]# grep LBA /var/log/messages | awk '{print $12}' | sort | uniq
LBAsect=35495748
LBAsect=35495750
So I decide to do a quick check to see how many bad sectors there
really are. Using the bash shell I check 70 sectors around the trouble
area:
[root]# export i=35495730
[root]# while [ $i -lt 35495800 ]
> do echo $i
> dd if=/dev/hda of=/dev/null bs=512 count=1 skip=$i
> let i+=1
> done
35495734
1+0 records in
1+0 records out
35495735
dd: reading `/dev/hda': Input/output error
0+0 records in
0+0 records out
35495751
dd: reading `/dev/hda': Input/output error
0+0 records in
0+0 records out
35495752
1+0 records in
1+0 records out
which shows that the seventeen sectors 35495735-35495751 (inclusive)
are not readable.
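The same read-only scan can be wrapped in a function that prints only the sectors dd fails to read. This is a sketch under assumptions: scan_sectors is a hypothetical helper, and the demonstration runs against a scratch file created with mktemp rather than a real disk.

```shell
# scan_sectors (hypothetical helper): read sectors first..last (inclusive)
# one at a time and print the number of each sector that dd cannot read.
# Arguments: device or file, first sector, last sector.
scan_sectors() {
    dev=$1
    i=$2
    last=$3
    while [ "$i" -le "$last" ]; do
        if ! dd if="$dev" of=/dev/null bs=512 count=1 skip="$i" 2>/dev/null; then
            echo "$i"
        fi
        i=$((i + 1))
    done
}

# Harmless demonstration on a 16-sector scratch file instead of a disk.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=512 count=16 2>/dev/null
bad=$(scan_sectors "$img" 0 15)
echo "unreadable sectors: ${bad:-none}"
rm -f "$img"
```

On the failing drive the equivalent call would be scan_sectors /dev/hda 35495730 35495799, which only reads from the disk.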
Next, we identify the files at those locations. The partitioning
information on this disk is identical to the first example above, and
as in that case the problem sectors are on the third partition
/dev/hda3. So we have:
L=35495735 to 35495751
S=5269320
B=4096
so that b=3778301 to 3778303 are the three bad blocks in the file
system.
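The block range follows from applying the formula of the first example to both ends of the sector range. A short sketch, with illustrative variable names:

```shell
S=5269320            # start sector of /dev/hda3
B=4096               # file system block size in bytes
first_lba=35495735   # first unreadable sector
last_lba=35495751    # last unreadable sector

first_block=$(( (first_lba - S) * 512 / B ))
last_block=$(( (last_lba - S) * 512 / B ))
count=$(( last_block - first_block + 1 ))

# Disk sectors actually covered by those whole blocks.
cover_first=$(( first_block * (B / 512) + S ))
cover_last=$(( (last_block + 1) * (B / 512) - 1 + S ))

echo "blocks $first_block to $last_block ($count blocks)"
echo "covering disk sectors $cover_first to $cover_last"
```

The three blocks span sectors 35495728 to 35495751, so a block-sized write also overwrites the seven readable sectors just before the bad range.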
[root]# debugfs
debugfs 1.32 (09-Nov-2002)
debugfs: open /dev/hda3
debugfs: icheck 3778301
Block Inode number
3778301 45192
debugfs: icheck 3778302
Block Inode number
3778302 45192
debugfs: icheck 3778303
Block Inode number
3778303 45192
debugfs: ncheck 45192
Inode Pathname
45192 /S1/R/H/714979488-714985279/H-R-714979984-16.gwf
debugfs: quit
And finally, just to confirm that this is really the damaged file:
[root]# md5sum /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf
md5sum: /data/S1/R/H/714979488-714985279/H-R-714979984-16.gwf: Input/output error
Finally we force the disk to reallocate the three bad blocks:
[root]# dd if=/dev/zero of=/dev/hda3 bs=4096 count=3 seek=3778301
[root]# sync
We could also probably use:
[root]# dd if=/dev/zero of=/dev/hda bs=512 count=17 seek=35495735
At this point we now have:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
which is encouraging, since the pending sectors count is now zero.
Note that the drive reallocation count has not yet increased: the
drive may now have confidence in these sectors and have decided not to
reallocate them.
A device self test:
[root]# smartctl -t long /dev/hda
(then wait about an hour) shows no unreadable sectors or errors:
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 692 -
# 2 Extended offline Completed: read failure 80% 682 0x021d9f44
[USEFUL HINTS ADDED BY OTHERS]
---------------------------------------------------------------------------
From: Kay Diederichs
I read your badblocks-howto at
http://smartmontools.sourceforge.net/BadBlockHowTo.txt and greatly
benefitted from it. One thing that's (maybe) missing is that often the
"smartctl -t long" scan finds a bad sector which is _not_ assigned to
any file. In that case it does not help to run debugfs, or rather
debugfs reports the fact that no file owns that sector. Furthermore,
it is somewhat laborious to come up with the correct numbers for
debugfs, and debugfs is slow ...
So what I suggest in the case of presence of
Current_Pending_Sector/Offline_Uncorrectable errors is to create a
huge file on that filesystem.
dd if=/dev/zero of=/some/mount/point/zerofile bs=4k
creates the file. Leave it running until the partition/filesystem is
full. This will make the disk reallocate those sectors which do not
belong to a file. Check the "smartctl -a" output after that and make
sure that the sectors are reallocated. If any remain, use the debugfs
method. Of course the usual caveats apply - back it up first, and so
on.
---------------------------------------------------------------------------
From: Frederic BOITEUX
HOW TO LOCATE AND REPAIR BAD BLOCKS ON AN LVM VOLUME
* Smartd reports an error in a short test :
-------------------------------------------
# smartctl -a /dev/hdb
...
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 66 37383668
So the disk has a bad block located in LBA block 37383668
* In which physical partition is the bad block ?
------------------------------------------------
# sfdisk -lu /dev/hdb
Disk /dev/hdb: 9729 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0
Device Boot Start End #sectors Id System
/dev/hdb1 63 996029 995967 82 Linux swap / Solaris
/dev/hdb2 * 996030 1188809 192780 83 Linux
/dev/hdb3 1188810 156296384 155107575 8e Linux LVM
/dev/hdb4 0 - 0 0 Empty
It's in the /dev/hdb3 partition, a LVM2 partition.
From the LVM2 partition beginning, the bad block has an offset of
(37383668 - 1188810) = 36194858
We have to find which LVM2 logical partition the block belongs to.
* In which logical partition is the bad block ?
-----------------------------------------------
*IMPORTANT* : LVM2 can use different schemes dividing its physical
partitions to logical ones : linear, striped, contiguous or
not... The following example assumes that allocation is linear !
The physical partition used by LVM2 is divided in PE (Physical Extent)
units of the same size, starting at 'pe_start' 512 bytes blocks from
the beginning of the physical partition.
The 'pvdisplay' command gives the size of the PE (in KB) of the
LVM partition :
# part=/dev/hdb3 ; pvdisplay -c $part | awk -F: '{print $8}'
4096
To get its size in LBA block size (512 bytes or 0.5 KB), we multiply this
number by 2 : 4096 * 2 = 8192 blocks for each PE.
To find the offset from the beginning of the physical partition is a
bit more difficult : if you have a recent LVM2 version, try :
# pvs -o+pe_start $part
Otherwise, you can look in /etc/lvm/backup :
# grep pe_start $(grep -l $part /etc/lvm/backup/*)
pe_start = 384
Then, we search in which PE is the badblock, calculating the PE rank
in which the faulty block of the partition is :
physical partition's bad block number / sizeof(PE) =
36194858 / 8192 = 4418.3176
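The PE arithmetic can be reproduced with integer shell arithmetic. This is a sketch with illustrative variable names; it also checks that subtracting pe_start (384 in this example) before dividing gives the same PE.

```shell
pe_kb=4096                         # PE size in KB, from pvdisplay
pe_start=384                       # from /etc/lvm/backup
pe_sectors=$(( pe_kb * 2 ))        # 512-byte blocks per PE
offset=$(( 37383668 - 1188810 ))   # bad LBA minus start of /dev/hdb3

pe_index=$(( offset / pe_sectors ))
pe_index_exact=$(( (offset - pe_start) / pe_sectors ))

echo "PE size: $pe_sectors blocks, bad block is in PE $pe_index"
echo "with pe_start subtracted first: PE $pe_index_exact"
```

Both divisions give 4418 here, but near a PE boundary the two results could differ, in which case the value with pe_start subtracted is the one consistent with the later offset calculation.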
So we have to find in which LVM2 logical partition is used the PE
number 4418 (count starts from 0) :
# lvdisplay --maps |egrep 'Physical|LV Name|Type'
LV Name /dev/WDC80Go/racine
Type linear
Physical volume /dev/hdb3
Physical extents 0 to 127
LV Name /dev/WDC80Go/usr
Type linear
Physical volume /dev/hdb3
Physical extents 128 to 1407
LV Name /dev/WDC80Go/var
Type linear
Physical volume /dev/hdb3
Physical extents 1408 to 1663
LV Name /dev/WDC80Go/tmp
Type linear
Physical volume /dev/hdb3
Physical extents 1664 to 1791
LV Name /dev/WDC80Go/home
Type linear
Physical volume /dev/hdb3
Physical extents 1792 to 3071
LV Name /dev/WDC80Go/ext1
Type linear
Physical volume /dev/hdb3
Physical extents 3072 to 10751
LV Name /dev/WDC80Go/ext2
Type linear
Physical volume /dev/hdb3
Physical extents 10752 to 18932
So the PE #4418 is in the /dev/WDC80Go/ext1 LVM logical partition.
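Reading the extent ranges by eye is error-prone, so the lookup can be sketched with awk. lv_for_pe is a hypothetical helper; it assumes linear allocation with a single extent range per logical volume, as in the listing above, and the sample input is an excerpt of that listing.

```shell
# lv_for_pe (hypothetical helper): read 'lvdisplay --maps' style text on
# standard input and print the LV whose extent range contains the given PE.
lv_for_pe() {
    awk -v pe="$1" '
        $1 == "LV" && $2 == "Name" { name = $3 }
        $1 == "Physical" && $2 == "extents" {
            if (pe + 0 >= $3 + 0 && pe + 0 <= $5 + 0) print name
        }
    '
}

lv=$(lv_for_pe 4418 <<EOF
LV Name /dev/WDC80Go/home
Physical extents 1792 to 3071
LV Name /dev/WDC80Go/ext1
Physical extents 3072 to 10751
LV Name /dev/WDC80Go/ext2
Physical extents 10752 to 18932
EOF
)
echo "$lv"
```

On a live system the same function could be fed from lvdisplay --maps through a pipe instead of the here-document.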
* Size of logical block of filesystem on /dev/WDC80Go/ext1 :
------------------------------------------------------------
It's a ext3 fs, so I get it like this :
# dumpe2fs /dev/WDC80Go/ext1 | grep 'Block size'
dumpe2fs 1.37 (21-Mar-2005)
Block size: 4096
* bad block number for the filesystem :
---------------------------------------
The logical partition begins on PE 3072 :
(# PE's start of partition * sizeof(PE)) + partition offset[pe_start] =
(3072 * 8192) + 384 = 25166208
512b block of the physical partition, so the bad block number for the
filesystem is :
(36194858 - 25166208) / (sizeof(fs block) / 512)
= 11028650 / (4096 / 512) = 1378581.25
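The two calculations above can be combined in a few lines of shell arithmetic. The variable names are illustrative; the values are those of this example.

```shell
pe_sectors=8192      # 512-byte blocks per PE
pe_start=384         # 512-byte blocks before the first PE
lv_first_pe=3072     # first PE of /dev/WDC80Go/ext1
offset=36194858      # bad block, relative to the start of /dev/hdb3
fs_block_size=4096   # from dumpe2fs

# Start of the logical partition, in 512-byte blocks of the physical partition.
lv_start=$(( lv_first_pe * pe_sectors + pe_start ))
# Integer division drops the fractional part (.25 in this example).
fs_block=$(( (offset - lv_start) / (fs_block_size / 512) ))

echo "LV starts at block $lv_start, bad fs block is $fs_block"
```

The integer result 1378581 is the block number used in the dd commands that follow.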
* Test of the fs bad block :
dd if=/dev/WDC80Go/ext1 of=block1378581 bs=4096 count=1 skip=1378581
If this dd command succeeds, without any error message in console or
syslog, then the block number calculation is probably wrong ! *Don't*
go further, re-check it and if you don't find the error, please
give up !
* Search / correction follows the same scheme as for simple
partitions :
- find possible impacted files with debugfs (icheck
then ncheck)
- reallocate bad block writing zeros in it, *using the fs block size* :
dd if=/dev/zero of=/dev/WDC80Go/ext1 count=1 bs=4096 seek=1378581
Et voilà !
---------------------------------------------------------------------------
This document is version $Id: BadBlockHowTo.txt,v 1.9 2006/06/12 02:16:50 ballen4705 Exp $
It is Copyright Bruce Allen (2004-6) and distributed under GPL2.
From: http://www.toad.com/gnu/sysadmin/index.html#ddrescue
Disk drive recovery: ddrescue, dd_rescue, dd_rhelp
If you have a disk drive with errors on it, that you'd like to be able to read the recoverable data from, GNU ddrescue is your best friend.
It is modeled after the two preceding programs, dd_rescue (with an underbar), and dd_rhelp. But GNU ddrescue is far better than both -- I've tried all three, on the same drive, as well as trying to use plain old "dd". You should skip my learning process and just head straight for the best way, which is GNU ddrescue. I'll tell you about it.
So, a brief tutorial on things I learned about copying disk drives. "dd" will make a copy of a disk drive with errors, if you set "conv=noerror" so it will keep going after errors. The catch is that it just *removes* the erroneous sectors from its output, as if they didn't exist, which totally screws up the file system image. Fsck will tell you just how unhappy it is with such an image; it's unrecoverable without massive manual work, shifting big blocks of data around. Instead, you can use "dd conv=noerror,sync" which will write an output record (zeros) even if the input record has an error. You had better do this on single disk sectors, thus "dd bs=512 conv=noerror,sync". If you use a larger blocksize (read multiple disk sectors) at once, the first one that has an error will stop the read, and what will get written out will be zeroes for not only the bad sector, but for all subsequent sectors in that block.
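The padding behaviour of "sync" can be seen without a failing disk. The sketch below feeds dd a 3-byte input record through a pipe; it demonstrates only how a short record is padded to the full block size, not an actual read error.

```shell
# A 3-byte input record is padded with NUL bytes to the full 512-byte block,
# so the output stays aligned with the input block boundaries.
padded=$(printf 'abc' | dd bs=512 conv=noerror,sync 2>/dev/null | wc -c | tr -d ' ')
echo "$padded"
```

The output size is 512 bytes, which is why a larger bs turns one bad sector into a whole block of zeros.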
"dd bs=512 conv=noerror,sync" works, but has many drawbacks. It's slow even on the error-free stuff since it's doing tiny reads and writes. It spends a lot of time chewing through the erroneous parts of the drive, rather than reading as much error-free stuff as it can, THEN going back to do the hard stuff. (When your drive is crapping out, it has a tendency to die the big death at any moment. You'd like to get as much info off it as possible before that happens. One example is if small particles of stuff are rattling around inside the drive; they make more and more errors, as you run the drive. Sometimes, putting the drive in the freezer for a few hours, in a ziploc bag to keep the moisture off, will revive it briefly; electronics work better at low temperatures than when they get hot.)
Kurt Garloff's dd_rescue was the first attempt to improve on this. It reads and copies bigger blocks until it sees an error, then slows down and goes back, and reads single sectors. After a while it speeds up again. It can also read backward, and can quit after it gets some specified number of errors. It keeps a 3-line display updated in your text window so you can see what it's doing. If you run it simply, it just does what "dd bs=64k" does until it sees an error, then backs up and does "dd bs=512". If it gets an error reading a sector, it doesn't write to that sector of the output file, but it skips past it to write the next good one, so everything stays in sync. It seeks the input and output in parallel so it makes an exact copy of the parts that it can read.
LAB Valentin's dd_rhelp is a complex shell script that runs dd_rescue many times, trying to be strategic about copying the drive. It copies forward until it gets errors, then jumps forward by a big jump looking for either the end of the drive, or more easy-to-read stuff. Once it finds the end of the drive, then it starts working backward, trying to close up the "hole" that it hasn't read yet. As it encounters errors, it skips around looking for more error-free parts of the drive. It only reads each sector once. It reads the logfile output of dd_rescue to see what happened and to figure out what to do next.
One problem with dd_rhelp is that it's a shell script, so it's really slow and consumes massive resources. On one of my drives that had about 2900 bad sectors on it, dd_rhelp would waste upwards of 15 minutes deciding what blocks to tell dd_rescue to try reading next. During that time it makes about 100 new Unix processes every second.
Antonio Diaz Diaz's GNU ddrescue learned from these experiences. It combines both dd_rescue's ability to read big blocks and then shift gears, with dd_rhelp's ability to remember what parts of the disk have been looked at already. It keeps this info in a really simple logfile format, and keeps it updated every 30 seconds, or whenever it stops or is interrupted. It's written in C++ and it's small and fast.
It starts off running like "dd", blasting through large error-free areas. When it gets an error, it writes out any partial data that it received during that read, and KEEPS GOING to the next big block. It notes in the logfile that a bunch of sectors (the first erroneous one, plus whatever ones followed it in the multi-sector read) were skipped. And keeps going. So it reads through the entire disk in big blocks first. Then it goes back to "split" the skipped parts, trying to read each sector individually. The compact logfile always shows which chunks of disk have been read OK, have been read with errors, have been read one sector at a time with errors, or have never been read yet.
One catch about GNU ddrescue is that the author has some strange ideas about what "ought to" be in the C++ library. So in the current version (1.1), you'll be lucky to get it to compile without errors. All the errors are minor, and are not in key parts of the software, so you can dike them out if you need to. Sometime soon I'll make some nice clean patches for these parts, and submit them, but it's been this way for a year and people complain about this every month on the bug-ddrescue mailing list, and the maintainer doesn't fix it, so I'm not optimistic that the patches will be accepted. But use his software anyway; other than this quirk, it's really nice.
As an aside, it takes a lot of time and screen space for the kernel to log all the error messages from when you're reading from a failing disk drive. The messages also tend to screw up the screen that you're trying to work in. To speed up the logging, you can edit /etc/syslog.conf and insert a "-" before "/var/log/messages", then restart the syslog daemon. This tells syslog to not do an "fsync" after every log message it writes out. If you crash you'll be missing the last few messages, but if you don't crash you'll run about eight times as fast. Also, to make it stop printing those messages on your console, you have to edit the arguments to the "klogd" daemon, which is usually started by the same script that starts the syslog daemon (/etc/init.d/syslog). On my Red Hat 7.3 system, you can edit /etc/sysconfig/syslog and change the KLOGD_OPTIONS line so it includes " -c 0 " which will suppress all console messages, then restart the syslog daemon. (If you can figure out the "logging level" of the disk error messages, you can set the level higher than 0, but I was in a hurry.) Change these things back when you're done doing disk recovery, so you'll see kernel error messages, and so they'll get logged reliably, when the kernel is crashing a year later for some totally unrelated reason.
Speculation about fixing bad disk blocks to keep using the disk
Once you have copied your entire disk drive off to some other drive (or to a file on a bigger drive, which is often easier), what do you do with the failing drive? If there are only a few errors, and the drive is fairly modern, you can probably just rewrite those sectors, and the drive will reallocate those sectors automatically to new, "spare" sectors that it keeps lying around for just this purpose. From the drive's point of view, it doesn't matter what you write to those sectors -- could be zeros, garbage, or good data; it will either reallocate them and write your data there, or it will just try writing your new data overtop of the bad data and see if it "sticks" (is readable afterward).
When I have a drive with only one or two failing sectors, I often find them with "smartctl -t long" (one at a time, sigh) and then write to them with a complicated series of commands. The very smart smartmontools maintainer, Bruce Allen, has written a BadBlockHowTo.txt about how to do this. (I hope somebody automates this error-prone process soon.) I have done this many times, on many generations of disk drives; on old 1980s SCSI drives there were utility programs that would reallocate individual sectors. I know of no free software program for doing this kind of low-level drive formatting, unfortunately. Modern drives just do it when you write.
OK, this next part is pure uninformed speculation. I HAVEN'T DONE THIS ON MY DISKS, AND YOU SHOULD NOT DO THIS TO YOUR DISKS UNLESS YOU ARE A WIZARD AND YOU KNOW WHAT YOU ARE DOING, AND YOU'RE WILLING TO TAKE THE RISKS YOURSELF WITHOUT WHINING.
It occurs to me that it OUGHT to work to do this: First, copy the entire drive to somewhere else with GNU ddrescue. The logfile will show you exactly where the erroneous sectors are. Look at it with a text editor. Make sure that's what it says. Then, make a copy of that logfile, and (here's the tricky part) run GNU ddrescue in a very strange way to write zeroes onto those bad sectors:
ddrescue -r1 /dev/zero /dev/baddisk my-logfile-copy
Note that your bad disk is the OUTPUT of this command, while the system file "/dev/zero" (as many zeroes as you ever wanted) is the INPUT to the command.
If you hadn't specified the logfile and the "-r1", this would copy zeroes over your entire bad disk. The logfile points out what disk blocks it couldn't read, so ddrescue will only try to read the parts it hasn't read before. The "-r1" tells it to go back and try to read them again, even though it failed the first time. But since it's reading from /dev/zero this time, all of those reads are going to succeed -- and then it will write those zeros into the exact places on the bad disk drive where you need to write new data to reallocate the bad sectors.
In the process, it destroys the logfile, which is why you made a copy of it. When it's done, the logfile will report that the entire disk was readable, because it was able to read from every "bad" sector in /dev/zero.
LIKE I SAID ABOVE, this is all pure speculation. I have never done this on my own drives. I might try it someday (the drive with >2000 errors would be a good candidate, except that it's a very old 2GB drive, it has already reallocated every spare sector in it, and I don't know what trashed those 2000 sectors so I don't really trust the drive anyway. Better to try it on a 200GB or up drive (newer, with better error recovery firmware) and with only a few errors (that probably weren't caused by some massive internal problem).) Don't blame me if you trash your drive this way. But do tell me what you think of the idea.
Disk drive recovery: ddrescue on MacOSX
My friend's Macintosh started acting very strange and slow today, and we eventually figured out that it's the disk drive, which shows failing SMART status. (MacOS is too stupid to give access to the real SMART info like the free smartctl, it just provides a red FAILED indication. However, smartctl may have been ported to MacOSX by now.)
So the race is on to get as much data as possible off the drive.
We immediately copied off a few key documents onto a USB Flash drive, then hooked up a spare USB hard drive. Then the question became: what software do we have to copy this failing drive to the external USB drive?
The snooty Mac utility Carbon Copy refused to copy to a USB drive. (And I don't know whether it would handle a source drive full of errors, anyway.)
I looked for MacOS ports of ddrescue, but there weren't any in a form that I could simply download and run (or even download, install, and run). Instead there was Darwinports, which wanted me to install almost a gigabyte of Apple proprietary versions of the GNU tools (Xcode), with a license agreement that you MUST CLICK AGREE on or it REFUSES TO INSTALL, even though the GNU license does not work that way. Plus their own small wart on the side of that, which seems to be some scripts that suck down sources and run them through "./configure; make; make install". I aborted that all-night download as soon as I found an alternative; I don't know if this disk drive is up to installing a gigabyte of software and then configuring and compiling a GNU program.
Fink at least has the concept of installing binaries rather than compiling everything from sources. But I downloaded it and tried running its installer, and the installer failed with an error message that didn't appear in the documentation, saying that the installation had failed but I should retry it. When I retried it, of course, it said it couldn't install on top of an existing installation. When I tried to run the half-installed one, it would instantly fail. And when I removed it according to the documentation, and tried a reinstall, I got the exact same problem. Yeah, I know, it's free software, so it's my problem. I don't recommend Fink to you.
So I have fallen back on trusty old "dd conv=sync,noerror" and I hope it doesn't lose my friend too much of the drive. My hat's off to the Mac community for the elaborateness of its infrastructure. Doesn't it need a few more bells and whistles, guys? Someday if I ever waste a week installing a compiler on a Mac, I'll build a simple binary of ddrescue, and put it up on this web site to help the next person who's trying to rescue their data from a failing drive.
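For anyone in the same spot, here is roughly what that dd fallback looks like, as a sketch. The wrapper name copy_ignoring_errors and the device names in the usage comment are placeholders, not taken from the session described above. The block size is kept small because, with conv=noerror,sync, every block that fails to read is replaced by a block of zeroes, so a big block size throws away good data around each bad sector.

```shell
# copy_ignoring_errors SOURCE DEST
# Copy SOURCE to DEST with dd, carrying on past read errors.
#   noerror : keep going after a read error instead of stopping
#   sync    : pad every short or failed input block with zeroes to the full
#             block size, so later data stays at the right offset
copy_ignoring_errors() {
    dd if="$1" of="$2" bs=512 conv=noerror,sync
}

# Hypothetical usage (check the device names first; DEST is overwritten):
#   copy_ignoring_errors /dev/disk0 /dev/disk1
```

A 512-byte block size is slow, but it limits the loss to one sector per read error. Unlike ddrescue, dd keeps no logfile, so there is no record of which sectors were bad and no way to retry just those.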
OK, so John Perry Barlow came over, and had the development environment handy, so we built ddrescue by downloading the sources (ddrescue-1.1.tar.bz2), unpacking them in the Finder, opening a terminal, going into the unpacked ddrescue-1.1 folder, and typing "./configure" and "make". Here's the resulting binary of ddrescue and the matching documentation on how to use it. As I type this, it is copying my friend's whole dying hard drive onto an identical Firewire drive.
If your disk dies on MacOS, grab a copy of this binary and use it to copy the whole disk onto an identical (or larger) spare disk drive. If it works at all, you'll probably recover 99+% of your files. In fact, why don't you grab it now and put it on your Mac, so it'll be handy when you need it?
This page is by Carlo Wood, Mar 2008
Eindhoven University of Technology
Department of Mathematics and Computer Science
Master's Thesis
Measuring and Improving the Quality of File Carving Methods
by S.J.J. Kloet
Supervisor: Prof. Dr. W.J. Fokkink
Almere, October 29, 2007
http://www.uitwisselplatform.nl/frs/download.php/461/thesis.pdf
This is an excerpt from this web page: Link
NOTE: These steps are only for really bad hard disk muck-ups and accidentally deleted files. For normal filesystem inconsistencies, don't use these steps!
1. Once you realize that you've lost data, don't do anything else on that partition - you may cause that data to be overwritten by new data.
2. Unmount that partition, e.g.: umount /home
3. Find out what actual device this partition refers to. You can usually get this information from the file /etc/fstab. We'll assume here that the device is /dev/hda3.
4. Run the command: reiserfsck --rebuild-tree -S -l /root/recovery.log /dev/hda3
You need to be root to do this. Read the reiserfsck man page for what these options do and for more options. Other options worth a look are --rebuild-sb and --check.
After the command finishes, which might take a long time for a big partition, you can take a look at the logfile /root/recovery.log if you wish.
5. Mount your partition: mount /home
6. Look for the lost+found directory in the root of the partition. Here, that would be: /home/lost+found
7. This directory contains all the files that could be recovered. Unfortunately, the filenames are not preserved for a lot of files. You'll find some sub-directories - filenames within those are preserved!
8. Look through the files and copy back what you need.
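Step 3 can be scripted. The sketch below looks up the device for a mount point in an fstab-style file. The function name device_for_mountpoint is invented for this example, and it assumes the usual whitespace-separated fstab layout with the device in the first field and the mount point in the second. On some systems the first field is a UUID= or LABEL= entry rather than a /dev path, in which case you still need to resolve it to a device yourself.

```shell
# device_for_mountpoint MOUNTPOINT [FSTAB]
# Print the first field of the fstab entry whose mount point is MOUNTPOINT.
# FSTAB defaults to /etc/fstab.
device_for_mountpoint() {
    awk -v mp="$1" '$1 !~ /^#/ && $2 == mp { print $1 }' "${2:-/etc/fstab}"
}

# Example, following the steps above (prints the command, does not run it):
#   dev=$(device_for_mountpoint /home)
#   echo "reiserfsck --rebuild-tree -S -l /root/recovery.log $dev"
```

Printing the reiserfsck command instead of running it gives you a chance to confirm the device name before anything is rebuilt.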
The Ubuntu-rescue-remix live system provides a complete Unix command-line environment without any graphical user interface. Do you find the GNU/Linux command-line difficult to learn?
Here are some links to help beginners get used to the command line interface:
https://help.ubuntu.com/8.04/basic-commands/C/
https://help.ubuntu.com/community/UsingTheTerminal
https://help.ubuntu.com/community/CommandlineHowto
Please suggest any other resources to add to the list. Also, if you know of any way to improve the usability of the rescue remix, please suggest it in the Help and Discussion section.
Thanks.