My Home Network

The latest installment in Frank Crawford's ongoing series of articles about his experiences maintaining a home network.


Well, welcome to a new year and an entirely different format for AUUGN. Things are a little late, but it should be up to the same quality that you have come to expect.

Of course, while it is a new year, the same old problems still exist, and much of this column is a continuation of my hardware issues from late last year. In particular, I intend to go through, in detail, some tools to monitor for problems, and then techniques to offer protection from them.

If you remember my last column, you will have seen that I had some issues with the reliability of the disk in my server; I even included an email notification of a problem. The problem continued for a few months, since the disk didn't fail completely. More important for this column, the tools that did the monitoring came from the smartmontools package, which uses smartd and smartctl to manage and report on disk status.

Stepping back a bit, most modern ATA and SCSI hard disks have built-in Self-Monitoring, Analysis and Reporting Technology (SMART), which can provide advance warning of disk degradation and failure. The smartmontools package, available from http://smartmontools.sourceforge.net/, provides a daemon, smartd, to perform continuous monitoring and a utility, smartctl, to control and monitor the SMART system built into disks. In addition, there is a configuration file, /etc/smartd.conf, with initial options for smartd.

The best place to start is with smartctl, which is often needed to enable SMART on the disk in the first place. There are a number of ways to enable SMART monitoring on a disk: some motherboards have a BIOS option; starting a recent version of the smartd daemon will do it; or you can do it manually with smartctl --smart=on /dev/hda (or whatever device). Once enabled, the specifications say that SMART settings should be preserved across power-cycling.

The smartmontools package can give all sorts of useful information about your disk, including such items as temperature and error rate, but the various test options are almost certainly the most important. There are really two useful test levels, short and long. The short test usually takes under ten minutes (in my case about 2 minutes) and is good for a quick check, while the long test can take an hour or more (mine takes about 110 minutes) and is far more thorough. A test is invoked with the --test option, e.g.

smartctl --test=short /dev/hda

While these commands initiate the various tests, the tests themselves run in the background, which still allows normal system operations (although performance does suffer). The results of these tests, along with other hardware errors, are kept in the device and can be obtained with the --log option: --log=selftest for the background tests and --log=error for the device error log.
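As a rough illustration of that sequence, the following Python sketch starts a short test and later reads back the self-test log. The helper names and the fixed wait are assumptions made for the example; only the smartctl options come from the text above, and running it for real needs root and a SMART-capable disk.

```python
import subprocess
import time


def smartctl_command(device, *options):
    """Build a smartctl command line (hypothetical helper for this sketch)."""
    return ["smartctl", *options, device]


def run_short_test(device, wait_seconds=120):
    """Start a short self-test, wait, then return the self-test log text."""
    # The test runs inside the drive, so this call returns almost at once.
    # smartctl's exit status is a bit mask, so it is not treated as fatal.
    subprocess.run(smartctl_command(device, "--test=short"))
    # Crude fixed wait; the drive reports its own recommended polling time.
    time.sleep(wait_seconds)
    result = subprocess.run(
        smartctl_command(device, "--log=selftest"),
        capture_output=True,
        text=True,
    )
    return result.stdout

# Example (as root): print(run_short_test("/dev/hda"))
```

The same approach would work for the long test by passing --test=long and a much longer wait.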

Probably the easiest way to view this is with the --all option, which also prints out additional information about the disk (sorry, but this is for a good disk, so you won't see any errors):

[root@bits frank]# smartctl --all /dev/hda
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST3160023A
Serial Number:    4LJ0DAGT
Firmware Version: 8.01
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Sun Jun 12 16:20:42 2005 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 111) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   057   055   006    Pre-fail  Always       -       79303433
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       10100168
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       890
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   048   056   000    Old_age   Always       -       48
195 Hardware_ECC_Recovered  0x001a   057   055   000    Old_age   Always       -       79303433
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       879         -
# 2  Extended offline    Completed without error       00%       714         -
# 3  Extended offline    Completed without error       00%       549         -
# 4  Extended offline    Completed without error       00%       384         -
# 5  Extended offline    Completed without error       00%       219         -
# 6  Extended offline    Completed without error       00%        54         -
# 7  Extended offline    Completed without error       00%         5         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
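A report like this is easy enough to pick apart in a script. The following is a minimal Python sketch, assuming the ten-column attribute table layout shown above; the function names are made up for the example. It pulls out the attribute table and lists any attribute whose normalised value has dropped to its threshold.

```python
def parse_attributes(report):
    """Return {name: (value, worst, thresh, raw)} from 'smartctl --all' text."""
    attributes = {}
    in_table = False
    for line in report.splitlines():
        fields = line.split()
        if fields[:2] == ["ID#", "ATTRIBUTE_NAME"]:
            in_table = True
            continue
        if in_table:
            # The table ends at the first line that is not an attribute row.
            if len(fields) < 10 or not fields[0].isdigit():
                break
            value, worst, thresh = (int(f) for f in fields[3:6])
            attributes[fields[1]] = (value, worst, thresh, fields[9])
    return attributes


def at_or_below_threshold(attributes):
    """List attributes whose normalised value has reached the threshold."""
    return sorted(
        name
        for name, (value, _worst, thresh, _raw) in attributes.items()
        if thresh > 0 and value <= thresh
    )
```

For the healthy disk above, at_or_below_threshold() would return an empty list.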

As you can see, there is lots of information available on the disk status, but it isn't something that a user would want to check by hand regularly. This is where the daemon, smartd, comes in. smartd polls each disk it is configured to monitor every 30 minutes (the interval is configurable) and logs to syslog and/or sends email when a failure is detected (for example, when the health check fails). In fact, most of what smartd does is controlled by the configuration file /etc/smartd.conf.

In my case I have the following two lines in my /etc/smartd.conf:

/dev/hda -a -m root -s L/../../6/02
/dev/hdc -a -m root -s L/../../7/02

These lines have the following effect:

-a               Default: equivalent to -H -f -t -l error -l selftest.
-m root          Send warning email to root for -H, -l error, -l selftest, and -f.
-s L/../../6/02  Schedule a long test at 2am on Saturday (6) for /dev/hda, or on Sunday (7) for /dev/hdc (see the man page for an explanation).
-t               Track attribute changes and report them to syslog (implied by -a).
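The -s pattern is a regular expression that smartd compares against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 for Monday through 7 for Sunday, and hour). The Python sketch below imitates that check for simple patterns such as mine. The function name is invented for the example, and it assumes that Python's regular expressions and a whole-string match are close enough to what smartd does.

```python
import re
from datetime import datetime


def schedule_matches(pattern, test_type, when):
    """Check a smartd -s pattern against one hour (illustrative helper)."""
    # Build the T/MM/DD/d/HH string for the given moment.
    stamp = "{}/{:02d}/{:02d}/{}/{:02d}".format(
        test_type, when.month, when.day, when.isoweekday(), when.hour
    )
    # Assumption: a whole-string match with Python's re module behaves
    # like smartd's own matching for simple patterns like these.
    return re.fullmatch(pattern, stamp) is not None


# Saturday 11 June 2005 at 2am matches the /dev/hda schedule.
print(schedule_matches("L/../../6/02", "L", datetime(2005, 6, 11, 2, 0)))
```

Trying a few dates this way is a cheap check that a new schedule fires when you expect, before leaving it to smartd.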

Aside from the mail sent out on an error, smartd logs messages to syslog whenever an attribute changes, such as:

Jun 12 15:40:37 bits smartd[5422]: Device: /dev/hdc, SMART Usage Attribute: 194 Temperature_Celsius changed from 47 to 46
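Log lines in this form are simple to turn into data for later analysis. Here is a small Python sketch, with a pattern modelled only on the single line shown above; other smartd versions may word the message differently, and the function name is made up for the example.

```python
import re

# Pattern based on the one sample line above; treat it as an assumption.
CHANGE_RE = re.compile(
    r"Device: (?P<device>[^,\s]+), SMART (?P<kind>\w+) Attribute: "
    r"(?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)


def parse_change(line):
    """Return a dict describing an attribute change, or None if no match."""
    match = CHANGE_RE.search(line)
    if match is None:
        return None
    return {
        "device": match.group("device"),
        "kind": match.group("kind"),
        "id": int(match.group("id")),
        "name": match.group("name"),
        "old": int(match.group("old")),
        "new": int(match.group("new")),
    }
```

Feeding each syslog line through parse_change() and keeping the non-None results gives a history of every attribute change per device.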

Like all good Open Source projects, smartmontools isn't perfect, and it is still developing. In particular, it works best with modern IDE drives, which tend to support the latest standards; it has some support for SCSI drives and little support for disks managed through RAID controllers. In addition, many disks may produce a message that the disk type is unknown, but this is more of a warning and relates to the mapping of attribute numbers to names; in general the tools should still work. The full list of known disks can be seen with the command smartctl -P showall.

While monitoring the reliability of the disks within a server is a major function, monitoring the rest of the hardware is also an important issue. Most modern motherboards include internal monitoring hardware, which reports such details as CPU temperature, fan speed and system voltages. In addition some peripheral cards also include such monitoring hardware. This is supported by the lm_sensors package available for Linux since the 2.4 kernel.

To make use of these features, support from the kernel is needed, although only a limited number of chipsets are in use. To quote from the lm_sensors site, http://secure.netroedge.com/~lm78/index.html, the home of the Linux hardware health monitoring tools:

Most PC's built since late 1997 now come with a hardware health monitoring chip. This chip may be accessed via the ISA bus or the SMBus, depending on the motherboard.

The SMBus or System Management Bus is a 2-wire low-speed serial communications bus used for basic health monitoring and hardware management. In addition, many legacy device functions, such as the keyboard and interrupt controllers, are connected to an internal ISA bus, even if no ISA slots exist any more.

Once kernel support for the various sensor chipsets is available (usually as modules), there are still some difficulties, because different board manufacturers wire up the inputs to the chips differently. As a result, some configuration needs to be performed for each separate motherboard.

The first step in configuring the lm_sensors package for your system is to run the sensors-detect script, which checks which PCI and other devices are available, attempts to load various modules and, based on the results, generates a configuration file for future reboots.

Once the modules have been loaded, you can use sensors to report the current sensor readings. However, in practice, there is one more step that may be required, to modify the conversion factors from the sample definitions to ones that suit your motherboard settings. These are set in the file /etc/sensors.conf, and involve the following possible changes:

comment out sensors that are not valid (not all inputs are used on all motherboards), through the ignore statement,
modify the computation algorithm, such as double the value, through the compute statement,
set the minimum and maximum expected values to allow alarming, through the set statement, and
set labels for output, through the label statement.

For example, the system I'm currently running on generates the following output from sensors:

it8712-isa-0290
Adapter: ISA adapter
VCore:     +1.54 V  (min =  +1.42 V, max =  +1.57 V)
Vcc18:     +1.79 V  (min =  +1.71 V, max =  +1.89 V)
+3.3V:     +3.28 V  (min =  +3.14 V, max =  +3.47 V)
+5V:       +5.13 V  (min =  +4.76 V, max =  +5.24 V)
+12V:     +12.03 V  (min = +11.39 V, max = +12.61 V)
VBat:      +4.08 V
CPU Fan:  2636 RPM  (min =  998 RPM, div = 8)
CPU Temp:    +21°C  (low  =   +15°C, high =   +45°C)   sensor = diode
VID Pin:   +1.50 V

eeprom-i2c-0-50
Adapter: SMBus I801 adapter at 5000
Memory type:            DDR SDRAM DIMM

Memory size (MB):       512

You can see from this the current voltages, CPU fan speed and temperature and the VID value (which is generally tunable). In addition, while it can be suppressed, you will note that it also lists settings from the EEPROM relating to the current memory DIMM installed. This is generated with the following configuration:

chip "it87-*" "it8712-*"

# The values below have been tested on Asus CUSI, CUM motherboards.

#;# Modified by Frank Crawford for the Gigabyte GA-8IE533 Motherboard.

# Voltage monitors as advised in the It8705 data sheet

#;#    label in0 "VCore 1"
    label in0 "VCore"
#;#    label in1 "VCore 2"
    label in1 "Vcc18"
    label in2 "+3.3V"
    label in3 "+5V"
    label in4 "+12V"
    label in5 "-12V"
    label in6 "-5V"
    label in7 "Stdby"
    label in8 "VBat"
    ignore in5
    ignore in6
    ignore in7
    ignore temp1
    ignore temp2
    ignore fan2
    ignore fan3

    # Adjust this if your vid is wrong; see doc/vid
#   set vrm 9.0
    # vid is not monitored by IT8705F
    # comment out if you have IT8712
    #ignore  vid
    label vid "VID Pin"

# Incubus Saturnus reports that the IT87 chip on Asus A7V8X-X seems
# to report the VCORE voltage approximately 0.05V higher than the board's
# BIOS does.  Although it doesn't make much sense physically, uncommenting
# the next line should bring the readings in line with the BIOS' ones in
# this case.
# compute in0 -0.05+@ , @+0.05

# If 3.3V reads 2X too high (Soyo Dragon and Asus A7V8X-X, for example),
# comment out following line.
#;#    compute in2   2*@ , @/2
#
    compute in3 ((6.8/10)+1)*@ ,  @/((6.8/10)+1)
    compute in4 ((30/10) +1)*@  , @/((30/10) +1)
# For this family of chips the negative voltage equation is different from
# the lm78.  The chip uses two external resistor for scaling but one is
# tied to a positive reference voltage.  See ITE8705/12 datasheet (SIS950
# data sheet is wrong)
# Vs = (1 + Rin/Rf) * Vin - (Rin/Rf) * Vref.
# Vref = 4.096 volts, Vin is voltage measured, Vs is actual voltage.

# The next two are negative voltages (-12 and -5).
# The following formulas must be used.  Unfortunately the datasheet
# does not give recommendations for Rin, Rf, but we can back into
# them based on a nominal +2V input to the chip, together with a 4.096V Vref.
# Formula:
#    actual V = (Vmeasured * (1 + Rin/Rf)) - (Vref * (Rin/Rf))
#    For -12V input use Rin/Rf = 6.68
#    For -5V input use Rin/Rf = 3.33
# Then you can convert the formula to a standard form like:
    compute in5 (7.67 * @) - 27.36  ,  (@ + 27.36) / 7.67
    compute in6 (4.33 * @) - 13.64  ,  (@ + 13.64) / 4.33
#
# this much simpler version is reported to work for a
# Elite Group K7S5A board
#
#   compute in5 -(36/10)*@, -@/(36/10)
#   compute in6 -(56/10)*@, -@/(56/10)
#
    compute in7 ((6.8/10)+1)*@ ,  @/((6.8/10)+1)

#;#    set in0_min 1.5 * 0.95
#;#    set in0_max 1.5 * 1.05
    set in0_min vid * 0.95
    set in0_max vid * 1.05
#;#    set in1_min 2.4
#;#    set in1_max 2.6
    set in1_min 1.8 * 0.95
    set in1_max 1.8 * 1.05
    set in2_min 3.3 * 0.95
    set in2_max 3.3 * 1.05
    set in3_min 5.0 * 0.95
    set in3_max 5.0 * 1.05
    set in4_min 12 * 0.95
    set in4_max 12 * 1.05
#;#    set in5_max -12 * 0.95
#;#    set in5_min -12 * 1.05
#;#    set in6_max -5 * 0.95
#;#    set in6_min -5 * 1.05
#;#    set in7_min 5 * 0.95
#;#    set in7_max 5 * 1.05
    #the chip does not support in8 min/max

# Temperature
#
# Important - if your temperature readings are completely whacky
# you probably need to change the sensor type.
# Adjust and uncomment the appropriate lines below.
# The old method (modprobe it87 temp_type=0xXX) is no longer supported.
#
# 2 = thermistor; 3 = thermal diode; 0 = unused
#   set sensor1 3
#   set sensor2 3
#   set sensor3 3
# If a given sensor isn't used, you will probably want to ignore it
# (see ignore statement right below).

    label temp1       "M/B Temp"
#;#    set   temp1_over  40
#;#    set   temp1_low   15
    label temp2       "CPU Temp"
#;#    set   temp2_over  45
#;#    set   temp2_low   15
#   ignore temp3
#;#    label temp3       "Temp3"
    label temp3       "CPU Temp"
    set   temp3_over  45
    set   temp3_low   15

# The A7V8X-X has temperatures inverted, and needs a conversion for
# CPU temp.  Thanks to Preben Randhol for the formula.
#   label temp1       "CPU Temp"
#   label temp2       "M/B Temp"
#   compute temp1     (-15.096+1.4893*@), (@+15.096)/1.4893

# The A7V600 also has temperatures inverted, and needs a different
# conversion for CPU temp.  Thanks to Dariusz Jaszkowski for the formula.
#   label temp1       "CPU Temp"
#   label temp2       "M/B Temp"
#   compute temp1     (@+128)/3, (3*@-128)

# Fans
    label fan1 "CPU Fan"
#;#    set fan1_min 0
    set fan1_min 1000
#;#    set fan2_min 3000
#   ignore fan3
#;#    set fan3_min 3000

# The following is for the Inside Technologies 786LCD which uses either a
# IT8705F or a SIS950 for monitoring with the SIS630.
# You will need to load the it87 module as follows to select the correct
# temperature sensor type.
# modprobe it87 temp_type=0x31
# The sensors-detect program reports lm78 and a sis5595 and lists the it87 as
# a misdetect.  Don't do the modprobe for the lm78 or sis5595 as suggested.
#
# delete or comment out above it87 section and uncomment the following.
#chip "it87-*"
#    label in0 "VCore 1"
#    label in1 "VCore 2"
#    label in2 "+3.3V"
#    label in3 "+5V"
#    label in4 "+12V"
#    label in5 "3.3 Stdby"
#    label in6 "-12V"
#    label in7 "Stdby"
#    label in8 "VBat"
    # in0 will depend on your processor VID value, set to voltage specified in
    # bios setup screen
#    set in0_min 1.7 * 0.95
#    set in0_max 1.7 * 1.05
#    set in1_min 2.4
#    set in1_max 2.6
#    set in2_min 3.3 * 0.95
#    set in2_max 3.3 * 1.05
#    set in3_min 5.0 * 0.95
#    set in3_max 5.0 * 1.05
    # +- 12V are very poor tolerance on this board.  Verified with voltmeter
#    set in4_min 12 * 0.90
#    set in4_max 12 * 1.10
#    set in5_min 3.3 * 0.95
#    set in5_max 3.3 * 1.05
#    set in6_max -12 * 0.90
#    set in6_min -12 * 1.10
#    set in7_min 5 * 0.95
#    set in7_max 5 * 1.05
    # vid not monitored by IT8705F
#    ignore  vid

#    compute in3 ((6.8/10)+1)*@ ,  @/((6.8/10)+1)
#    compute in4 ((30/10) +1)*@  , @/((30/10) +1)
#    compute in6 (1+232/56)*@ - 4.096*232/56, (@ + 4.096*232/56)/(1+232/56)
#    compute in7 ((6.8/10)+1)*@ ,  @/((6.8/10)+1)
    # Temperature
#    label temp1       "CPU Temp"
#    ignore temp2
#    ignore temp3
    # Fans
#    set fan1_min 3000
#    ignore fan2
#    ignore fan3

Most of the valid entries are found through trial and error, comparing the values to those currently reported by the BIOS (where available).
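To make the compute and set lines a little more concrete, here is a short Python sketch of the arithmetic behind the in3 (+5V) entries. In a compute statement the first expression converts the chip's raw reading (@) into the real voltage and the second converts a real voltage, such as a limit, back into chip units. The function names are invented for the example; the scale factor and the 5% tolerance are the ones in my configuration above.

```python
# Mirror of "compute in3 ((6.8/10)+1)*@ , @/((6.8/10)+1)".
IN3_SCALE = (6.8 / 10) + 1  # resistor divider ratio taken from the config


def in3_to_real(raw):
    """First compute expression: raw chip reading to real voltage."""
    return IN3_SCALE * raw


def in3_to_raw(real):
    """Second compute expression: real voltage back to chip units."""
    return real / IN3_SCALE


def limits(nominal, tolerance=0.05):
    """Limits in the style of 'set in3_min 5.0 * 0.95' and its max twin."""
    return nominal * (1 - tolerance), nominal * (1 + tolerance)


low, high = limits(5.0)
print("in3 limits: %.2f V to %.2f V" % (low, high))
print("raw value for the low limit: %.3f V" % in3_to_raw(low))
```

The two expressions have to be exact inverses of each other, otherwise the limits you set will not line up with the readings you see.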

While being able to view the current settings is useful, being able to track the values over time is even more so. This is supported by the sensord program, part of the lm_sensors package. This daemon logs to syslog and optionally to a Round Robin Database (RRD), which, along with a suitable CGI script, allows the various hardware health readings to be plotted.

While the system pretty much automatically generates the RRD file, the CGI script, and in particular its rrdgraph commands, needs to be tailored to local requirements. However, once this is done, you can see some interesting correlations. Even better, you can tie the results from sensord to syslog and generate alarms whenever hardware problems occur, probably through the use of swatch (if you don't know of it, look it up, it is handy).
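If you would rather roll your own alarm, the text printed by sensors can also be checked directly. The Python sketch below assumes the voltage line layout shown in my sensors output earlier (a reading followed by min and max in brackets) and uses a made-up function name; it is not how sensord or swatch work, just a quick alternative.

```python
import re

# Matches lines such as "+5V:  +5.13 V  (min =  +4.76 V, max =  +5.24 V)".
VOLTAGE_RE = re.compile(
    r"^(?P<label>[^:\n]+):[ \t]+(?P<value>[+-]?\d+(?:\.\d+)?) V[ \t]+"
    r"\(min =[ \t]+(?P<min>[+-]?\d+(?:\.\d+)?) V, "
    r"max =[ \t]+(?P<max>[+-]?\d+(?:\.\d+)?) V\)",
    re.MULTILINE,
)


def voltages_out_of_range(sensors_text):
    """Return (label, value, min, max) for each voltage outside its limits."""
    problems = []
    for match in VOLTAGE_RE.finditer(sensors_text):
        value = float(match.group("value"))
        low = float(match.group("min"))
        high = float(match.group("max"))
        if not low <= value <= high:
            problems.append((match.group("label"), value, low, high))
    return problems
```

Run from cron against the output of sensors, a non-empty result could be mailed to root in the same way that smartd reports disk problems.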

As a project for the avid reader you can tie in the temperature readings from the disk drives given by smartd with those from sensord and watch the variation over time. When I find time, I'll do this for myself.

So, with these tools, you can receive early or immediate warning of hardware problems, allowing you to address them quickly. Of course, you really do need a scheme to protect against data loss, but that is an issue for a later column. What do you think? Are these tools useful to protect your systems? Let me know.
