Article originally appeared as SMART Tools: Preventing Drive Failures (Linux Lab by Charles McColm, Full Circle Magazine issue 82, Feb 2014)
At our local computer refurbishing project, the top sources of hardware failure that we see are power supplies, CMOS batteries, RAM, and hard drives. The first three failures can cause systems not to POST (Power On Self-Test) correctly.
Hard drive failures are a bit more tricky. A really bad hard drive can cause a system to hang while displaying POST messages, or cause a system to randomly reboot (we see more of this on Windows systems), or slow a system down to a crawl, or it might not appear to do anything at all. Knowing a drive has issues before the drive fails can save a lot of work.
Of course everyone should be backing up their data, but knowing your drive might have an issue in the near future is helpful. Linux has several tools for examining hard drive failures. This month we’ll look at gsmartcontrol, a graphical version of the smartctl tool (from the smartmontools package).
For new Linux users (like many of our volunteers), gsmartcontrol gives a simple but comprehensive look at a hard drive’s health and capabilities, and provides an easy way to run a short or long test on a drive.
Gsmartcontrol is not installed on most systems by default, so you’ll have to install the gsmartcontrol package. Installing gsmartcontrol also installs the smartmontools package (which contains smartctl, the command-line test tool).
When you first run gsmartcontrol, all hard drives which gsmartcontrol can see are displayed. A hard drive does not have to be mounted for gsmartcontrol to see it, and you can have several hard drives in a system. To examine a drive in gsmartcontrol, simply double-click on it. The drive opens to an identity view tab that gives information about the hard drive. Besides listing the model of hard drive, the identity screen lists other useful tidbits of information such as the hard drive’s serial number (useful if you ever have to claim for insurance, or, in our case, report serial numbers to equipment donors), the firmware version of the hard drive (which could be useful when diagnosing problems on particular systems which might have issues with certain drives), the drive’s capacity (size), the last time it was checked, as well as overall SMART health status.
There are several other tabs: Attributes, Capabilities, Error Log, Self-test Logs, and Perform Tests, each of which is useful. When a drive has an issue, the text of some of the tabs might appear red (Attributes and Error Log in our example). This feature makes it easy to spot potential issues. The red text doesn’t mean a drive has failed, but it is a sign that you might want to back up sooner rather than later and look for another drive. Clicking on the red tabs reveals the potential point of failure.
In our example the Hitachi hard drive in my notebook has the Reallocated Sector Count highlighted in pink on the Attributes tab, indicating that at some point the system has come across a bad sector, marked it and reallocated it elsewhere (meaning we won’t have to worry about this sector any more because it’ll appear invisible to the OS). Red highlighting on any of the sections indicates a more serious error.
Wikipedia’s entry on SMART (Self-Monitoring, Analysis and Reporting Technology) is handy because interpreting the Raw values of these Attributes can be tricky. For some attributes it’s better to have a higher raw value while for others it’s better to have a low raw value. You can find the Wikipedia SMART entry here: http://en.wikipedia.org/wiki/S.M.A.R.T.
According to Wikipedia, we should hope for a Raw value (166 here) lower than the Norm-ed value (100). We’re higher, which indicates failure. The higher the value, the more sectors the hard drive has reallocated. Similar, but more problematic, are the Current Pending Sector Count and Uncorrectable Sector Count, both of which indicate failures where the sectors haven’t been rewritten somewhere else. It’s these kinds of errors that can cause a system to seemingly randomly reboot (or blue/black screen) when the OS comes across the sector.
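The same check can be done on the text that smartctl -A prints. The sketch below is only an illustration: the sample table is made up to resemble smartctl’s attribute layout (it is not output from the drive described here), and the attribute names are the ones smartmontools commonly uses, which may differ from drive to drive.

```python
# Attributes the article singles out as worth watching.
# Names follow common smartmontools naming; some drives report others.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# Illustrative sample only: laid out like `smartctl -A` output,
# not copied from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       166
  9 Power_On_Hours          0x0012   066   066   000    Old_age   Always       -       15020
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""


def flag_attributes(table_text):
    """Return {name: raw_value} for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


print(flag_attributes(SAMPLE))
```

Any attribute this returns is a reason to look more closely at the drive and to check that the backups are current.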
The Capabilities tab of gsmartcontrol shows the SMART capabilities of the hard drive. For brevity we won’t go into this tab, since it doesn’t indicate errors and is less useful for preventing drive failures.
The Error Log tab shows up to the last 5 errors. The details section of the Error Log tab is interesting because it shows the exact address where an error occurs. The Lifetime hours section is also interesting because it shows approximately when the error occurred, in this case at the 12756th hour (531.5 days, under 2 years).
The Self-test Logs tab displays information from SMART tests performed on the drive. The BIOS of some systems (HP for example) has a self-test that might show up in the Self-test Logs, as well as any tests performed on the Perform Tests tab of gsmartcontrol. Again, the Lifetime hours is important because it shows the last hour a self-test was performed (15020th hour in our example, or about 625 days).
Recalling that the error on our drive was at the 531 day mark, we’ve gone almost 100 days since the error was found by SMART.
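The arithmetic behind those figures is simple enough to sketch. The function name below is made up for illustration. Keep in mind that Lifetime hours counts time the drive has been powered on, so these are days of running time rather than calendar days.

```python
HOURS_PER_DAY = 24


def hours_to_days(power_on_hours):
    """Convert a SMART lifetime-hours figure into days of powered-on time."""
    return power_on_hours / HOURS_PER_DAY


error_at = hours_to_days(12756)      # hour of the logged error
last_test_at = hours_to_days(15020)  # hour of the last self-test

print(error_at)                           # 531.5
print(round(last_test_at - error_at, 1))  # 94.3
```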
To run a self-test on your drive, open the Perform Tests tab, choose either the Short Self-test or the Extended Self-test, then click the Execute button. Short Self-tests typically take 1 to 2 minutes, while the Extended Self-test can take 30 minutes or more.
At the computer recycling project, we test each drive first using the Short Self-test; if the drive fails, or we suspect a drive might be failing despite passing the short test (the sound of the drive for example), we then run the Extended Self-test. Before finding gsmartcontrol we used to run the manufacturer test for each drive.
Running manufacturer’s tests is ultimately the best way to test a drive, but there are a few problems with using a manufacturer’s tool:
- Most manufacturers’ tools require you to boot from their software, meaning you have to reboot your computer into their tool, which is not good if you don’t want downtime.
- Tools get upgraded by the manufacturer and don’t always work on their older drives (or on newer drives, in the case of old versions of the software).
- One manufacturer’s tool often won’t work on a drive by another manufacturer, so if you have a mix of drives you have to get each manufacturer’s tool.
Gsmartcontrol (and smartctl) work on a large number of drives from a wide range of manufacturers; gsmartcontrol is Free Libre Open Source Software, and has an extensive but understandable user interface.
The command-line tool, smartctl, is also installed when you install gsmartcontrol (smartctl is in the smartmontools package). Both tools need to be run with root/administrative privileges.
Smartctl can display all the information gsmartcontrol displays, but doesn’t need a graphical user interface to do so. And, like gsmartcontrol, smartctl doesn’t require taking your system down. We won’t cover smartctl this month, but it’s worth mentioning since it’s handy for monitoring drives over an SSH connection and because you can run it in a cron/anacron job.
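As a rough pointer for anyone who wants to script this, the sketch below maps each gsmartcontrol tab to the smartctl option that shows roughly the same information. The mapping table, the function names and the /dev/sda device path are assumptions for illustration; check the smartctl man page on your own system before relying on them.

```python
import subprocess

# gsmartcontrol tab -> smartctl option(s) showing roughly the same information.
TAB_OPTIONS = {
    "identity": ["-i"],
    "health": ["-H"],
    "attributes": ["-A"],
    "capabilities": ["-c"],
    "error log": ["-l", "error"],
    "self-test logs": ["-l", "selftest"],
    "short self-test": ["-t", "short"],
    "extended self-test": ["-t", "long"],
}


def build_command(tab, device):
    """Build the smartctl argument list for one tab and one device."""
    return ["smartctl", *TAB_OPTIONS[tab], device]


def run(tab, device):
    """Run smartctl and return its text output.

    Needs root privileges and the smartmontools package installed.
    """
    result = subprocess.run(
        build_command(tab, device), capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # /dev/sda is a placeholder; substitute the drive you want to inspect.
    print(" ".join(build_command("attributes", "/dev/sda")))
```

Only build_command is exercised when the file is run directly, so nothing touches a drive until run is called as root.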
Charles McColm is the author of Instant XBMC, and the project manager of a not-for-profit computer reuse project. When not building PCs, removing malware, encouraging people to use Linux, and hosting local Ubuntu hours, Charles blogs at http://www.charlesmccolm.com/.