smartctl notes

Below is a list of smartctl commands I frequently use to quickly verify disk health and status, especially when smartd is logging errors to the messages log file.

  • Print all SMART (Self-Monitoring, Analysis and Reporting Technology) information for drive /dev/sda (Primary Master).

    smartctl -a /dev/sda

  • Enable SMART on device.

    smartctl --smart=on /dev/sda

  • Get info about the device:

    smartctl -i /dev/sda

  • Show the capabilities of the drive. Also reports progress while a test is running.

    smartctl -c /dev/sda

  • Basic health status:

    smartctl -H /dev/sda

  • Display attributes. The attributes to watch on a failing disk are Reallocated_Sector_Ct, Reallocated_Event_Count, Current_Pending_Sector and Offline_Uncorrectable. Their RAW_VALUE should normally be 0.

    smartctl -A /dev/sda

  • Immediate offline test, which updates the attribute values. Good to run after a badblocks/fsck check, before looking at the attribute values.

    smartctl -t offline /dev/sda

  • Run a thorough long test if the -A option mentioned above shows suspect attributes.

    smartctl -t long /dev/sda

  • Examine self-test log. Shows if tests failed or passed.

    smartctl -l selftest /dev/sda

  • Display most recent error log.

    smartctl -l error /dev/sda

There are more examples in man smartctl.
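
To avoid eyeballing the whole attribute table, the four attributes mentioned above could be filtered with a small helper. This is only a sketch: the function name failing_attrs is my own, and it assumes the usual 10-column attribute table where RAW_VALUE is the last column.

```shell
#!/bin/sh
# failing_attrs (hypothetical helper): read `smartctl -A` output on stdin and
# print the name and raw value of each watched attribute whose RAW_VALUE is
# not 0. Exit status is 1 if any such attribute was found, 0 otherwise.
failing_attrs() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable)$/ {
            # Assumption: RAW_VALUE is the 10th column of the attribute table
            if ($10 + 0 != 0) { print $2, $10; bad = 1 }
        }
        END { exit bad + 0 }
    '
}

# Typical use (needs root and a real drive):
#   smartctl -A /dev/sda | failing_attrs || echo "check /dev/sda"
```

The helper only reads text, so it can also be run against attribute output saved earlier.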

Comments

Resolving sector errors on raid partition

On software raid partitions, CurrentPendingSector or OfflineUncorrectableSector errors logged in syslog can often be corrected by failing and removing the partition from the array and then re-attaching it, so that the partition is rebuilt and the problem sectors get overwritten.

Below, I have 4 CurrentPending and OfflineUncorrectable sectors:

# smartctl -A /dev/sdb | grep "Current_Pending_Sector\|Offline_Uncorrectable"
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       4

The self-test log shows the LBA of the first error:

# smartctl -l selftest /dev/sdb
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     18654         3166126
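
If you want to pick that LBA out of the log in a script, something like the sketch below might do. The function name first_error_lba is my own, and it assumes the log layout shown above, where entries start with "#" and LBA_of_first_error is the last column ("-" when the test found no error).

```shell
#!/bin/sh
# first_error_lba (hypothetical helper): read `smartctl -l selftest` output on
# stdin and print the LBA from the first listed entry that recorded one.
# Entries without an error have "-" in the last column and are skipped.
first_error_lba() {
    awk '/^# *[0-9]+ / && $NF ~ /^[0-9]+$/ { print $NF; exit }'
}

# Typical use (needs root and a real drive):
#   smartctl -l selftest /dev/sdb | first_error_lba
```

It prints nothing when no entry in the log recorded an error.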

Sector 3166126 lies in the second partition:

# fdisk -lu /dev/sdb

Disk /dev/sdb: 750.1 GB, 750156374016 bytes
255 heads, 63 sectors/track, 91201 cylinders, total 1465149168 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *          63      401624      200781   fd  Linux raid autodetect
/dev/sdb2          401625  1465144064   732371220   fd  Linux raid autodetect
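
The lookup is just a range check against the Start and End columns: 401625 <= 3166126 <= 1465144064, so the sector is in /dev/sdb2, at offset 3166126 - 401625 = 2764501 sectors from the start of that partition. A rough sketch of the same check in shell follows. The function name partition_of_lba is my own, and it assumes the older `fdisk -lu` layout shown above, with sector units and an optional "*" boot flag column.

```shell
#!/bin/sh
# partition_of_lba (hypothetical helper): given an LBA as $1 and `fdisk -lu`
# output on stdin, print the partition containing that sector and the
# sector's offset from the start of the partition. Returns 1 if no
# partition contains it.
partition_of_lba() {
    _lba=$1
    # awk prints "<device> <start sector>" for the partition containing the LBA
    _hit=$(awk -v lba="$_lba" '
        /^\/dev\// {
            # the boot flag "*" shifts the Start and End columns right by one
            if ($2 == "*") { first = $3; last = $4 } else { first = $2; last = $3 }
            if (lba + 0 >= first + 0 && lba + 0 <= last + 0) { print $1, first; exit }
        }
    ')
    [ -n "$_hit" ] || return 1
    # offset of the sector from the start of its partition
    echo "${_hit% *} $(( _lba - ${_hit#* } ))"
}

# Typical use (needs root and a real drive):
#   fdisk -lu /dev/sdb | partition_of_lba 3166126
```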

Locate the raid partition:

# grep sdb2 /proc/mdstat
md1 : active raid10 sdb2[4] sdd2[3] sdc2[2] sda2[0]

Mark the partition as faulty and remove it from the array:

# mdadm --manage /dev/md1 -f /dev/sdb2
# mdadm --manage /dev/md1 -r /dev/sdb2

Re-attach the partition and let it rebuild:

# mdadm --manage /dev/md1 -a /dev/sdb2
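
The rebuild can take a while, and progress shows up in /proc/mdstat. Below is a minimal sketch for checking it from a script. The function name md_rebuild_status is my own, and it assumes the usual /proc/mdstat layout, where the progress line is indented under the array's stanza and contains "recovery =" or "resync =".

```shell
#!/bin/sh
# md_rebuild_status (hypothetical helper): read /proc/mdstat-style text on
# stdin and print the recovery or resync progress line for the array named
# in $1, if there is one. Returns 0 while a rebuild is in progress, 1 otherwise.
md_rebuild_status() {
    awk -v md="$1" '
        $1 == md && $2 == ":" { in_array = 1; next }   # start of our stanza
        /^[^ \t]/             { in_array = 0 }         # any other unindented line ends it
        in_array && /(recovery|resync) *=/ {
            sub(/^[ \t]+/, "")                         # drop the leading indent
            print
            found = 1
        }
        END { exit !found }
    '
}

# Typical use: poll once a minute until the rebuild of md1 has finished
#   while md_rebuild_status md1 < /proc/mdstat; do sleep 60; done
```

mdadm also has a --wait option that blocks until resync or recovery on the given array has finished, which may be simpler if you do not need the progress line.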

Once the rebuild is complete, rerun the self-test and check for errors:

# smartctl -t long /dev/sdb
# smartctl -A /dev/sdb | grep "Current_Pending_Sector\|Offline_Uncorrectable"
# smartctl -l selftest /dev/sdb

The drive keeps spare space available to "remap" bad sectors. This happens automatically. If uncorrectable sector errors do not clear, or keep coming back, it means the spare sectors are used up and the drive will probably fail soon, so it is best to just replace the drive.
