File systems and data corruption

Some bad HDDs that were causing data corruption gave us the opportunity to check what really happens to several file systems when the underlying block device is unreliable.

We created a file system on a "bad HDD" and saved some files, knowing that the HDD would not store all the data as written and that corruption would therefore occur. In such a scenario corruption is unavoidable, but to minimise the damage it is important to detect it as early as possible.
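On a file system that does not checksum data, early detection has to be done from user space. The following is a minimal sketch of one way to do that, not the procedure used in this test: record a SHA-256 digest per file right after writing, then re-read and compare later. The function names (sha256_of, build_manifest, find_corrupted) are hypothetical. Note that a re-read shortly after writing may be served from the page cache rather than from the disk.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (relative path) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_corrupted(root, manifest):
    """Return relative paths whose current digest differs from the manifest.

    A file that cannot be read at all (for example because the kernel
    returns an I/O error) is reported as corrupted too.
    """
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        try:
            actual = sha256_of(root / rel)
        except OSError:
            actual = None
        if actual != expected:
            bad.append(rel)
    return bad
```

The manifest should be kept on a different, trusted device; otherwise it is exposed to the same corruption as the data it describes.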

Btrfs (best)

Reading corrupted file(s) resulted in "Input/output error" (i.e. "Cannot read source file"). The following was written to "/var/log/messages" and "/var/log/kern.log":

Mar 27 06:01:51 deblabr kernel: [430667.328062] btrfs csum failed ino 259 off 125612032 csum 1322675045 private 4050170413
Mar 27 06:01:51 deblabr kernel: [430667.328250] btrfs csum failed ino 259 off 125612032 csum 1322675045 private 4050170413
Mar 27 06:01:53 deblabr kernel: [430670.096957] btrfs csum failed ino 259 off 125612032 csum 1322675045 private 4050170413
Mar 27 06:01:53 deblabr kernel: [430670.106365] btrfs csum failed ino 259 off 125612032 csum 1322675045 private 4050170413
Mar 27 06:02:02 deblabr kernel: [430678.980359] btrfs csum failed ino 259 off 125612032 csum 1322675045 private 4050170413
Mar 27 06:02:02 deblabr kernel: [430678.982592] btrfs csum failed ino 259 off 125612032 csum 1322675045 private 4050170413
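The same failure is logged several times because each read attempt hits the same block. A short sketch like the one below can collapse such lines into one entry per inode and offset. The regular expression is written against the message format shown above only; other kernel versions word this message differently, so treat the pattern and the function name count_csum_failures as assumptions.

```python
import re
from collections import Counter

# Matches the "btrfs csum failed" message format seen in the log above.
CSUM_FAILED = re.compile(
    r"btrfs csum failed ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<csum>\d+) private (?P<private>\d+)"
)


def count_csum_failures(lines):
    """Count checksum failures per (inode, offset) pair."""
    counts = Counter()
    for line in lines:
        match = CSUM_FAILED.search(line)
        if match:
            counts[(int(match["ino"]), int(match["off"]))] += 1
    return counts
```

The inode number can then be mapped to a path with "btrfs inspect-internal inode-resolve" where the installed btrfs-progs provides it.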

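Btrfs can scan itself to detect corruption. Running "btrfs scrub start -B /mnt/tmp" (the -B option keeps the scrub in the foreground and prints statistics when it finishes) produced: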

scrub done for 0b1a9d7d-28ad-4dc9-a195-ff3a19dff23d
    scrub started at Sun Mar 31 01:26:51 2013 and finished after 59 seconds
    total bytes scrubbed: 3.91GB with 995 errors
    error details: csum=995
    corrected errors: 0, uncorrectable errors: 995, unverified errors: 0
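For monitoring, the scrub summary can be reduced to a few numbers. The sketch below parses only the two summary lines in the format shown above; newer btrfs-progs releases print the summary differently, so the patterns and the function name parse_scrub_summary are assumptions tied to this output.

```python
import re


def parse_scrub_summary(text):
    """Extract error counters from a scrub summary in the format above.

    Returns a dict that contains only the fields that were found.
    """
    summary = {}
    total = re.search(r"total bytes scrubbed: (\S+) with (\d+) errors", text)
    if total:
        summary["scrubbed"] = total.group(1)
        summary["errors"] = int(total.group(2))
    fixes = re.search(
        r"corrected errors: (\d+), uncorrectable errors: (\d+), "
        r"unverified errors: (\d+)",
        text,
    )
    if fixes:
        summary["corrected"] = int(fixes.group(1))
        summary["uncorrectable"] = int(fixes.group(2))
        summary["unverified"] = int(fixes.group(3))
    return summary
```

A non-zero "uncorrectable" value, as in this run, means the affected files have to be restored from a backup.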

During scrub the following was logged to "/var/log/kern.log":

Mar 31 01:26:51 deblabr kernel: [759603.622059] btrfs: checksum error at logical 432504832 on dev /dev/sdr1, sector 1642176, root 5, inode 257, offset 3244032, length 4096, links 1 (path: itest.tar.xz)
Mar 31 01:26:51 deblabr kernel: [759603.622071] btrfs: bdev /dev/sdr1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Mar 31 01:26:51 deblabr kernel: [759603.622076] btrfs: unable to fixup (regular) error at logical 432504832 on dev /dev/sdr1
Mar 31 01:26:51 deblabr kernel: [759603.628914] btrfs: checksum error at logical 432508928 on dev /dev/sdr1, sector 1642184, root 5, inode 257, offset 3248128, length 4096, links 1 (path: itest.tar.xz)
Mar 31 01:26:51 deblabr kernel: [759603.628925] btrfs: bdev /dev/sdr1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Mar 31 01:26:51 deblabr kernel: [759603.628929] btrfs: unable to fixup (regular) error at logical 432508928 on dev /dev/sdr1
Mar 31 01:26:51 deblabr kernel: [759603.636107] btrfs: checksum error at logical 432513024 on dev /dev/sdr1, sector 1642192, root 5, inode 257, offset 3252224, length 4096, links 1 (path: itest.tar.xz)
Mar 31 01:26:51 deblabr kernel: [759603.636118] btrfs: bdev /dev/sdr1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Mar 31 01:26:51 deblabr kernel: [759603.636122] btrfs: unable to fixup (regular) error at logical 432513024 on dev /dev/sdr1

and to "/var/log/messages":

Mar 31 01:26:51 deblabr kernel: [759603.622059] btrfs: checksum error at logical 432504832 on dev /dev/sdr1, sector 1642176, root 5, inode 257, offset 3244032, length 4096, links 1 (path: itest.tar.xz)
Mar 31 01:26:51 deblabr kernel: [759603.628914] btrfs: checksum error at logical 432508928 on dev /dev/sdr1, sector 1642184, root 5, inode 257, offset 3248128, length 4096, links 1 (path: itest.tar.xz)
Mar 31 01:26:51 deblabr kernel: [759603.636107] btrfs: checksum error at logical 432513024 on dev /dev/sdr1, sector 1642192, root 5, inode 257, offset 3252224, length 4096, links 1 (path: itest.tar.xz)
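Because the scrub messages include the file path, the damaged files and byte ranges can be listed directly from the log. The following is a sketch under the assumption that the lines look exactly like the ones above; the pattern and the function name damaged_ranges are illustrative, not part of any Btrfs tool.

```python
import re
from collections import defaultdict

# Matches the scrub "checksum error" message format seen in the log above.
SCRUB_ERROR = re.compile(
    r"btrfs: checksum error at logical \d+ on dev (?P<dev>[^,]+), "
    r"sector \d+, root \d+, inode (?P<inode>\d+), "
    r"offset (?P<offset>\d+), length (?P<length>\d+), links \d+ "
    r"\(path: (?P<path>.+)\)\s*$"
)


def damaged_ranges(lines):
    """Group (offset, length) pairs of checksum errors by file path."""
    ranges = defaultdict(list)
    for line in lines:
        match = SCRUB_ERROR.search(line)
        if match:
            ranges[match["path"]].append(
                (int(match["offset"]), int(match["length"]))
            )
    return dict(ranges)
```

Applied to the three messages above, this reports three consecutive 4096-byte blocks of itest.tar.xz as damaged.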

NILFS2 (worst)

Unlike other file systems, which move unchanged data only during defragmentation, NILFS2 runs the nilfs_cleanerd process, which relocates unmodified data and therefore amplifies the damage from corruption.

NILFS2 does not verify data on read, so more corruption occurs during periods of nilfs_cleanerd activity.
Eventually, when an error hits a btree node, messages like the following are logged to "/var/log/kern.log":

Mar 31 01:17:30 deblabr kernel: [759042.984783] NILFS: bad btree node (blocknr=938583): level = 192, flags = 0x73, nchildren = 49956
Mar 31 01:17:30 deblabr kernel: [759042.984850] NILFS: GC failed during preparation: cannot read source blocks: err=-5

Eventually, because of these errors, NILFS2 re-mounts itself read-only:

Mar 30 19:56:59 deblabr kernel: [739821.894963] NILFS: bad btree node (blocknr=1086570306): level = 239, flags = 0xe2, nchildren = 10392
Mar 30 19:56:59 deblabr kernel: [739821.894969] NILFS error (device dm-0): nilfs_bmap_last_key: broken bmap (inode number=1225452)
Mar 30 19:56:59 deblabr kernel: [739821.894969] 
Mar 30 19:56:59 deblabr kernel: [739821.894971] Remounting filesystem read-only
Mar 30 19:56:59 deblabr kernel: [739821.894973] NILFS warning (device dm-0): nilfs_truncate_bmap: failed to truncate bmap (ino=1225452, err=-5)
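A forced read-only remount is easy to miss until writes start failing. As a rough sketch, the mount table can be checked for read-only file systems of the types of interest. The function name read_only_mounts and the default type list are assumptions; the input is expected to be the text of /proc/mounts.

```python
def read_only_mounts(mounts_text, fstypes=("nilfs2", "ext4")):
    """Return (device, mountpoint, fstype) for read-only mounts.

    mounts_text is the content of /proc/mounts, where each line has the
    fields: device, mount point, type, comma-separated options, dump, pass.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mountpoint, fstype, options = fields[:4]
        # Compare whole options so "errors=remount-ro" is not mistaken for "ro".
        if fstype in fstypes and "ro" in options.split(","):
            found.append((device, mountpoint, fstype))
    return found
```

On a live system the input would come from reading /proc/mounts. File systems that were deliberately mounted read-only are reported as well, so the result needs to be compared with the intended configuration.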

Little can be done to recover from this condition because NILFS2 lacks an fsck repair tool.

ext4

When corruption affects ext4 metadata, the file system can re-mount itself read-only.
Recovering ext4 with fsck.ext4 is straightforward, and old data is not corrupted further unless defragmentation is run.

Conclusion

Btrfs is strategically important for data integrity.

The other Linux file systems tested do nothing to verify that data is read back exactly as it was written. Unless Btrfs is used, data corruption is likely to be detected much later, by which time more damage will have been done.
