Friday, November 25, 2016

What is XFS/filesystem corruption and whose fault is it?

Sometimes OSDs fail to start. One reason this can happen is that the OSD's data store is in a partition that is formatted with an XFS filesystem, and this filesystem has been corrupted in some way.

In theory, XFS is "corruption-proof", but several preconditions must be fulfilled before this can be relied upon:

  1. the disk controller must be in "write through" mode
  2. alternatively, if the controller is in "write back" mode, it must be equipped with a Battery Backup Unit (BBU)
  3. if relying on "write back" mode with a BBU, the battery in the BBU must be in good condition
  4. even if the BBU battery is in good condition, it can only preserve the contents of the controller's write cache (including pending filesystem journal writes) for a limited time, so a sufficiently long outage can still lose them
  5. the filesystem must not be mounted with the "nobarrier" mount option
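
The last precondition is the easiest one to check from software. The following is a minimal sketch, assuming a Linux host where the kernel exposes the mount table in /proc/mounts (whitespace-separated fields: device, mount point, filesystem type, comma-separated options). The function names are made up for illustration and are not part of Ceph or XFS tooling. The controller cache mode and BBU state cannot be checked this way; they need the controller vendor's own utility.

```python
def xfs_mounts_with_nobarrier(mounts_text):
    """Return mount points of XFS filesystems mounted with 'nobarrier'.

    mounts_text is expected to be in /proc/mounts format:
    device, mount point, filesystem type, options, dump, pass.
    """
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # Compare whole option names so a substring match cannot mislead us.
        if fs_type == "xfs" and "nobarrier" in options.split(","):
            flagged.append(mount_point)
    return flagged


def check_proc_mounts(path="/proc/mounts"):
    """Read the live mount table (assumed path) and return flagged mount points."""
    with open(path) as f:
        return xfs_mounts_with_nobarrier(f.read())
```

Calling check_proc_mounts() on an OSD host returns a list of mount points; an empty list means no mounted XFS filesystem currently shows the "nobarrier" option. It does not tell you how the filesystem was mounted at the time of an earlier crash, so also check /etc/fstab and any deployment tooling that sets mount options.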

If you experience a power outage or other crash, and XFS filesystems fail to mount afterwards, please double-check that these conditions were fulfilled before opening a bug against XFS or any systems, such as Ceph, that rely on XFS filesystem consistency.

Also, in any bug you open, provide detailed information about your disk controller: how it is configured, whether a BBU is present and the state of its battery, how long the power was out, and so on.