My Home Network
The latest installment in Frank Crawford's ongoing series of articles about his experiences maintaining a home network.
Well, welcome to a new year and an entirely different format for AUUGN. Things are a little late, but it should be up to the same quality that you have come to expect.
Of course, while it is a new year, the same old problems still exist, and much of this column is a continuation of my hardware issues from late last year. In particular, I intend to go through, in detail, some tools to monitor for problems, and then techniques to offer protection from them.
If you remember my last column, you will have seen at the start that I had some issues with the reliability of the disk in my server; I even included an email notification of a problem. This problem continued for a few months, since the disk didn't fail completely. More important for this column, the tools that did this monitoring came from the smartmontools package, which uses smartd and smartctl to manage and report on the disk status.
Stepping back a bit, most modern ATA and SCSI hard disks have built-in Self-Monitoring, Analysis and Reporting Technology (SMART), which can provide advance warning of disk degradation and failure. The smartmontools package, available from http://smartmontools.sourceforge.net/, provides a daemon, smartd, to perform continuous monitoring, and a utility, smartctl, to control and monitor the SMART system built into disks. In addition, there is a configuration file, /etc/smartd.conf, with initial options for smartd.
The best place to start is with smartctl, which is often required to enable SMART activities on the disk. There are a number of ways to enable SMART monitoring on a disk: some motherboards have BIOS options; the specifications say that SMART settings should be preserved across power-cycling, so a disk that has been enabled once should stay enabled; starting a recent version of the smartd daemon enables it; and finally it can be done manually with smartctl --smart=on /dev/hda (or whatever device).
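Whether SMART is already switched on can be read from the information section that smartctl prints. The following is a minimal sketch, assuming the two "SMART support is:" lines look like those printed by smartctl 5.33 later in this column; the function name smart_state is made up for illustration, and other versions may word these lines differently.

```shell
#!/bin/sh
# smart_state: read smartctl information output on stdin and print
# "enabled", "disabled" or "unavailable".
# Assumption: lines of the form "SMART support is: Enabled", as printed
# by smartctl 5.33. The function name is hypothetical.
smart_state() {
    awk '
        /^SMART support is:/ {
            if ($4 == "Enabled")  state = "enabled"
            if ($4 == "Disabled") state = "disabled"
        }
        END { print (state == "" ? "unavailable" : state) }
    '
}

# Typical use (needs root and a real disk, so it is left commented out):
#   smartctl --info /dev/hda | smart_state
```

If this reports anything other than "enabled", the manual smartctl --smart=on step is the one to try.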
The smartmontools package can give all sorts of useful information about your disk, including such items as temperature and error rate, but the various test options are almost certainly the most important. There are really two useful test levels, short and long. The short test usually takes under ten minutes (in my case about 2 mins) and is good for a quick check, while the long test can take an hour or more (mine takes about 110 mins) and is far more thorough. A test is invoked with the --test option, e.g.
smartctl --test=short /dev/hda
While these commands initiate the various tests, the tests themselves run in the background, which still allows normal system operations (although it does impact performance). The results of these tests, along with other hardware errors, are kept in the device and can be obtained with the --log option, in particular --log=selftest for the background tests. The other important one is --log=error for the device error log.
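Checking the self-test log by eye gets tedious, so here is a small sketch that filters it. It assumes log entries start with "#" and a number and that a clean run is reported as "Completed without error", as in the smartctl 5.33 output shown in this column. The function name selftest_problems is hypothetical.

```shell
#!/bin/sh
# selftest_problems: read a SMART self-test log on stdin and print every
# entry whose status is not "Completed without error".
# Returns 0 when all entries are clean and 1 otherwise.
# Assumption: entries look like "# 1  Extended offline  Completed ...".
selftest_problems() {
    awk '
        /^# *[0-9]+ / && !/Completed without error/ { print; bad = 1 }
        END { exit bad + 0 }
    '
}

# Typical use (needs root and a real disk, so it is left commented out):
#   smartctl --log=selftest /dev/hda | selftest_problems || echo "check the disk"
```

Note that a test still in progress or aborted by the host is also printed, since this sketch only recognises the one "clean" status.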
Probably the easiest way to view this is with the --all option, which also prints out additional information about the system (sorry, but this is for a good disk, so you won't see any errors):
[root@bits frank]# smartctl --all /dev/hda
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST3160023A
Serial Number:    4LJ0DAGT
Firmware Version: 8.01
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Sun Jun 12 16:20:42 2005 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 111) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   057   055   006    Pre-fail  Always       -       79303433
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       10100168
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       890
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   048   056   000    Old_age   Always       -       48
195 Hardware_ECC_Recovered  0x001a   057   055   000    Old_age   Always       -       79303433
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       879         -
# 2  Extended offline    Completed without error       00%       714         -
# 3  Extended offline    Completed without error       00%       549         -
# 4  Extended offline    Completed without error       00%       384         -
# 5  Extended offline    Completed without error       00%       219         -
# 6  Extended offline    Completed without error       00%        54         -
# 7  Extended offline    Completed without error       00%         5         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
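In the attribute table, the columns to watch are VALUE and THRESH: an attribute is considered failed once its normalised VALUE drops to the threshold or below. The sketch below lists attributes that are getting close. It assumes the ten-column layout shown above, and the name low_margin_attrs and its LIMIT argument are invented for this example.

```shell
#!/bin/sh
# low_margin_attrs LIMIT: read a SMART attribute table on stdin and print
# "NAME VALUE THRESH" for each attribute whose normalised VALUE is within
# LIMIT points of its failure threshold.
# Assumption: columns are ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED
# WHEN_FAILED RAW_VALUE, as printed by smartctl 5.33.
low_margin_attrs() {
    awk -v limit="$1" '
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-f]+$/ && NF >= 10 {
            value = $4 + 0; thresh = $6 + 0
            # A threshold of 0 is treated here as "no limit set" and skipped.
            if (thresh > 0 && value - thresh <= limit)
                print $2, value, thresh
        }
    '
}

# Typical use (needs root and a real disk, so it is left commented out):
#   smartctl --attributes /dev/hda | low_margin_attrs 10
```

On the healthy disk above, a limit of 5 would list only Spin_Retry_Count, whose value of 100 sits just above its threshold of 97 by design.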
As you can see, there is lots of information available on the disk status, but it isn't something that a user would want to check by hand regularly. This is where the daemon, smartd, comes in. smartd polls each disk it is configured to monitor every 30 minutes (the interval is configurable) and logs to syslog and/or sends email when a failure is detected (for example, when the health check fails). In fact, most of what smartd does is controlled by the configuration file /etc/smartd.conf.
In my case I have the following two lines in my /etc/smartd.conf:
/dev/hda -a -m root -s L/../../6/02
/dev/hdc -a -m root -s L/../../7/02
These lines have the following effect:
Option | Argument | Effect
-a | | Default: equivalent to -H -f -t -l error -l selftest.
-m | root | Send warning email to root for -H, -l error, -l selftest, and -f.
-s | L/../../6/02 | Schedule a long test at 2am on Saturday (6) for /dev/hda or Sunday (7) for /dev/hdc (see man page for explanation).
-t | | Track changes to syslog (implied by -a).
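The argument to -s is a regular expression that smartd compares against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 for Monday, and hour). The sketch below tries a pattern against such a string with grep. It assumes the whole string has to match, and the function names test_due and stamp_now are hypothetical helpers, not part of smartmontools.

```shell
#!/bin/sh
# test_due REGEX STAMP: succeed if the schedule REGEX (as given to -s)
# matches STAMP, a string of the form T/MM/DD/d/HH where T is the test
# type (L = long, S = short), d is the day of the week (1 = Monday,
# 7 = Sunday) and HH is the hour.
# Assumption: the regular expression must match the whole string.
test_due() {
    printf '%s\n' "$2" | grep -E -q -x -e "$1"
}

# stamp_now TYPE: build the T/MM/DD/d/HH string for the current hour.
stamp_now() {
    printf '%s/%s\n' "$1" "$(date +%m/%d/%u/%H)"
}

# Typical use:
#   test_due 'L/../../6/02' "$(stamp_now L)" && echo "long test due now"
```

For example, L/06/11/6/02 (Saturday 11 June, 2am) matches the /dev/hda pattern above, while L/06/12/7/02 only matches the /dev/hdc one.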
Aside from the mail sent out on an error, smartd logs messages to syslog whenever an attribute changes, such as:
Jun 12 15:40:37 bits smartd[5422]: Device: /dev/hdc, SMART Usage Attribute: 194 Temperature_Celsius changed from 47 to 46
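Messages like this are easy to turn into a simple temperature history. The following sketch pulls the timestamp, device and new temperature out of syslog lines, assuming the traditional syslog layout and the exact message wording shown above; disk_temps is a made-up name.

```shell
#!/bin/sh
# disk_temps: read syslog lines on stdin and print
# "Mon DD HH:MM:SS DEVICE TEMP" for each smartd temperature change.
# Assumption: lines look like
#   Jun 12 15:40:37 host smartd[PID]: Device: /dev/hdc, SMART Usage
#   Attribute: 194 Temperature_Celsius changed from 47 to 46
disk_temps() {
    awk '
        $5 ~ /^smartd\[/ && $12 == "Temperature_Celsius" && $13 == "changed" {
            dev = $7
            sub(/,$/, "", dev)      # drop the trailing comma after the device
            print $1, $2, $3, dev, $NF
        }
    '
}

# Typical use (the log file location varies between systems):
#   disk_temps < /var/log/messages
```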
Like all good Open Source projects, smartmontools isn't perfect, and it is still developing. In particular, it works best with modern IDE drives, which tend to support the latest standards; it has some support for SCSI drives, and little support for disks managed through RAID controllers. In addition, many disks may come up with a message that the disk type is unknown, but this is more of a warning and relates to the mapping of attribute numbers to names; in general the disk should still work. The full list of known disks can be seen with the command smartctl -P showall.
While monitoring the reliability of the disks within a server is a major function, monitoring the rest of the hardware is also an important issue. Most modern motherboards include internal monitoring hardware, which reports such details as CPU temperature, fan speed and system voltages. In addition some peripheral cards also include such monitoring hardware. This is supported by the lm_sensors package available for Linux since the 2.4 kernel.
To make use of these features, support from the kernel is needed, although there are only a limited number of chipsets in use. To quote from the lm_sensors site, http://secure.netroedge.com/~lm78/index.html, the home of the Linux hardware health monitoring tools:
Most PC's built since late 1997 now come with a hardware health monitoring chip. This chip may be accessed via the ISA bus or the SMBus, depending on the motherboard.
The SMBus, or System Management Bus, is a 2-wire low-speed serial communications bus used for basic health monitoring and hardware management. In addition, many legacy device functions, such as the keyboard and interrupt controllers, are connected to an internal ISA bus, even if no ISA slots exist any more.
Once kernel support for the various sensor chipsets is available (usually as modules), there are still some difficulties, because different board manufacturers wire up the inputs to the chips differently. As a result, some configuration needs to be performed for each separate motherboard.
The first step in configuring the lm_sensors package for your system is to run the sensors-detect script, which checks which PCI and other devices are available, attempts to load various modules and, based on the results, generates a configuration file for future reboots.
Once the modules have been loaded, you can use sensors to report the current sensor readings. However, in practice, there is one more step that may be required, to modify the conversion factors from the sample definitions to ones that suit your motherboard settings. These are set in the file /etc/sensors.conf, and involve the following possible changes:
- suppress sensors that are not valid (not all inputs are used on all motherboards), through the ignore statement,
- modify the computation algorithm, for example to double the value, through the compute statement,
- set the minimum and maximum expected values to allow alarming, through the set statement, and
- set labels for output, through the label statement.
For example, the system I'm currently running on generates the following output from sensors:
it8712-isa-0290
Adapter: ISA adapter
VCore:     +1.54 V  (min = +1.42 V, max = +1.57 V)
Vcc18:     +1.79 V  (min = +1.71 V, max = +1.89 V)
+3.3V:     +3.28 V  (min = +3.14 V, max = +3.47 V)
+5V:       +5.13 V  (min = +4.76 V, max = +5.24 V)
+12V:     +12.03 V  (min = +11.39 V, max = +12.61 V)
VBat:      +4.08 V
CPU Fan:  2636 RPM  (min = 998 RPM, div = 8)
CPU Temp:    +21°C  (low = +15°C, high = +45°C) sensor = diode
VID Pin:   +1.50 V

eeprom-i2c-0-50
Adapter: SMBus I801 adapter at 5000
Memory type:       DDR SDRAM DIMM
Memory size (MB):  512
You can see from this the current voltages, CPU fan speed and temperature and the VID value (which is generally tunable). In addition, while it can be suppressed, you will note that it also lists settings from the EEPROM relating to the current memory DIMM installed. This is generated with the following configuration:
chip "it87-*" "it8712-*"

# The values below have been tested on Asus CUSI, CUM motherboards.
#;# Modified by Frank Crawford for the Gigabit GA-8IE533 Motherboard.

# Voltage monitors as advised in the It8705 data sheet

#;# label in0 "VCore 1"
    label in0 "VCore"
#;# label in1 "VCore 2"
    label in1 "Vcc18"
    label in2 "+3.3V"
    label in3 "+5V"
    label in4 "+12V"
    label in5 "-12V"
    label in6 "-5V"
    label in7 "Stdby"
    label in8 "VBat"

    ignore in5
    ignore in6
    ignore in7
    ignore temp1
    ignore temp2
    ignore fan2
    ignore fan3

    # Adjust this if your vid is wrong; see doc/vid
    # set vrm 9.0

    # vid is not monitored by IT8705F
    # comment out if you have IT8712
    #ignore vid
    label vid "VID Pin"

# Incubus Saturnus reports that the IT87 chip on Asus A7V8X-X seems
# to report the VCORE voltage approximately 0.05V higher than the board's
# BIOS does. Although it doesn't make much sense physically, uncommenting
# the next line should bring the readings in line with the BIOS' ones in
# this case.
# compute in0 -0.05+@ , @+0.05

# If 3.3V reads 2X too high (Soyo Dragon and Asus A7V8X-X, for example),
# comment out following line.
#;# compute in2 2*@ , @/2
#
    compute in3 ((6.8/10)+1)*@ , @/((6.8/10)+1)
    compute in4 ((30/10) +1)*@ , @/((30/10) +1)

# For this family of chips the negative voltage equation is different from
# the lm78. The chip uses two external resistor for scaling but one is
# tied to a positive reference voltage. See ITE8705/12 datasheet (SIS950
# data sheet is wrong)
# Vs = (1 + Rin/Rf) * Vin - (Rin/Rf) * Vref.
# Vref = 4.096 volts, Vin is voltage measured, Vs is actual voltage.

# The next two are negative voltages (-12 and -5).
# The following formulas must be used. Unfortunately the datasheet
# does not give recommendations for Rin, Rf, but we can back into
# them based on a nominal +2V input to the chip, together with a 4.096V Vref.
# Formula:
#    actual V = (Vmeasured * (1 + Rin/Rf)) - (Vref * (Rin/Rf))
#    For -12V input use Rin/Rf = 6.68
#    For -5V input use Rin/Rf = 3.33
# Then you can convert the forumula to a standard form like:
    compute in5 (7.67 * @) - 27.36 , (@ + 27.36) / 7.67
    compute in6 (4.33 * @) - 13.64 , (@ + 13.64) / 4.33
#
# this much simpler version is reported to work for a
# Elite Group K7S5A board
#
#   compute in5 -(36/10)*@, -@/(36/10)
#   compute in6 -(56/10)*@, -@/(56/10)
#
    compute in7 ((6.8/10)+1)*@ , @/((6.8/10)+1)

#;# set in0_min 1.5 * 0.95
#;# set in0_max 1.5 * 1.05
    set in0_min vid * 0.95
    set in0_max vid * 1.05
#;# set in1_min 2.4
#;# set in1_max 2.6
    set in1_min 1.8 * 0.95
    set in1_max 1.8 * 1.05
    set in2_min 3.3 * 0.95
    set in2_max 3.3 * 1.05
    set in3_min 5.0 * 0.95
    set in3_max 5.0 * 1.05
    set in4_min 12 * 0.95
    set in4_max 12 * 1.05
#;# set in5_max -12 * 0.95
#;# set in5_min -12 * 1.05
#;# set in6_max -5 * 0.95
#;# set in6_min -5 * 1.05
#;# set in7_min 5 * 0.95
#;# set in7_max 5 * 1.05

    #the chip does not support in8 min/max

# Temperature
#
# Important - if your temperature readings are completely whacky
# you probably need to change the sensor type.
# Adujst and uncomment the appropriate lines below.
# The old method (modprobe it87 temp_type=0xXX) is no longer supported.
#
# 2 = thermistor; 3 = thermal diode; 0 = unused
#   set sensor1 3
#   set sensor2 3
#   set sensor3 3
# If a given sensor isn't used, you will probably want to ignore it
# (see ignore statement right below).

    label temp1 "M/B Temp"
#;# set temp1_over 40
#;# set temp1_low 15
    label temp2 "CPU Temp"
#;# set temp2_over 45
#;# set temp2_low 15
#   ignore temp3
#;# label temp3 "Temp3"
    label temp3 "CPU Temp"
    set temp3_over 45
    set temp3_low 15

# The A7V8X-X has temperatures inverted, and needs a conversion for
# CPU temp. Thanks to Preben Randhol for the formula.
#   label temp1 "CPU Temp"
#   label temp2 "M/B Temp"
#   compute temp1 (-15.096+1.4893*@), (@+15.096)/1.4893

# The A7V600 also has temperatures inverted, and needs a different
# conversion for CPU temp. Thanks to Dariusz Jaszkowski for the formula.
#   label temp1 "CPU Temp"
#   label temp2 "M/B Temp"
#   compute temp1 (@+128)/3, (3*@-128)

# Fans
    label fan1 "CPU Fan"
#;# set fan1_min 0
    set fan1_min 1000
#;# set fan2_min 3000
#   ignore fan3
#;# set fan3_min 3000

# The following is for the Inside Technologies 786LCD which uses either a
# IT8705F or a SIS950 for monitoring with the SIS630.
# You will need to load the it87 module as follows to select the correct
# temperature sensor type.
# modprobe it87 temp_type=0x31
# The sensors-detect program reports lm78 and a sis5595 and lists the it87 as
# a misdetect. Don't do the modprobe for the lm78 or sis5595 as suggested.
#
# delete or comment out above it87 section and uncomment the following.
#chip "it87-*"
#   label in0 "VCore 1"
#   label in1 "VCore 2"
#   label in2 "+3.3V"
#   label in3 "+5V"
#   label in4 "+12V"
#   label in5 "3.3 Stdby"
#   label in6 "-12V"
#   label in7 "Stdby"
#   label in8 "VBat"
# in0 will depend on your processor VID value, set to voltage specified in
# bios setup screen
#   set in0_min 1.7 * 0.95
#   set in0_max 1.7 * 1.05
#   set in1_min 2.4
#   set in1_max 2.6
#   set in2_min 3.3 * 0.95
#   set in2_max 3.3 * 1.05
#   set in3_min 5.0 * 0.95
#   set in3_max 5.0 * 1.05
# +- 12V are very poor tolerance on this board. Verified with voltmeter
#   set in4_min 12 * 0.90
#   set in4_max 12 * 1.10
#   set in5_min 3.3 * 0.95
#   set in5_max 3.3 * 1.05
#   set in6_max -12 * 0.90
#   set in6_min -12 * 1.10
#   set in7_min 5 * 0.95
#   set in7_max 5 * 1.05
# vid not monitored by IT8705F
#   ignore vid

#   compute in3 ((6.8/10)+1)*@ , @/((6.8/10)+1)
#   compute in4 ((30/10) +1)*@ , @/((30/10) +1)
#   compute in6 (1+232/56)*@ - 4.096*232/56, (@ + 4.096*232/56)/(1+232/56)
#   compute in7 ((6.8/10)+1)*@ , @/((6.8/10)+1)
# Temperature
#   label temp1 "CPU Temp"
#   ignore temp2
#   ignore temp3
# Fans
#   set fan1_min 3000
#   ignore fan2
#   ignore fan3
Most of the valid entries are found through trial and error, comparing the values to those currently reported by the BIOS (where available).
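When experimenting with compute lines, it helps to check the arithmetic by hand. Each compute statement carries two expressions: the first converts the raw chip reading (the @) into the real value, and the second converts a real value back into chip terms, which is what happens when limits are set. The sketch below works through the in4 (+12V) line from the configuration above; the function names and the raw value of 3.0075 V are illustrative assumptions, not readings taken from the chip.

```shell
#!/bin/sh
# scale_in4 RAW: apply the first expression of
#   compute in4 ((30/10) +1)*@ , @/((30/10) +1)
# turning the raw voltage seen by the chip into the real rail voltage.
scale_in4() {
    awk -v raw="$1" 'BEGIN { printf "%.2f\n", ((30/10) + 1) * raw }'
}

# unscale_in4 VOLTS: apply the second expression, turning a real voltage
# (for example a limit such as 12 * 0.95) back into the chip's raw terms.
unscale_in4() {
    awk -v v="$1" 'BEGIN { printf "%.3f\n", v / ((30/10) + 1) }'
}

# Typical use:
#   scale_in4 3.0075     # a hypothetical raw reading for the +12V input
#   unscale_in4 11.4     # the in4_min limit of 12 * 0.95
```

If the scaled result is far from what the BIOS reports, the divider ratio in the compute line is probably wrong for that board.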
While being able to view the current values is useful, being able to track them over time is even more useful. This is supported by the sensord program, part of the lm_sensors package. Running this daemon logs both to syslog and optionally to a Round Robin Database (RRD), which, along with a relevant CGI script, allows the plotting of various hardware health checks.
While the system pretty much automatically generates the RRD file, the CGI file, and in particular the rrdgraph commands, need to be tailored to local requirements. However, once this is done, you can see some interesting correlations. Even better, you can tie the results from sensord to syslog and generate alarms whenever hardware problems occur, probably through the use of swatch (if you don't know of it, look it up, it is handy).
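For a quick check without sensord or swatch, the output of sensors itself can be scanned for readings outside their limits. This is only a sketch of that simpler alternative, assuming lines of the form "label: reading (min = X, max = Y)" or "(low = X, high = Y)" as in the sensors output shown earlier; out_of_range is a hypothetical name.

```shell
#!/bin/sh
# out_of_range: read sensors output on stdin and report every reading that
# falls outside the limits printed on the same line.
# Assumption: lines look like
#   VCore: +1.54 V (min = +1.42 V, max = +1.57 V)
# Lines with only one limit (such as the fan line) are skipped.
out_of_range() {
    awk '
        /(min|low) += .*(max|high) += / {
            label = $0; sub(/:.*/, "", label)       # text before the colon
            rest = $0;  sub(/^[^:]*:/, "", rest)    # text after the colon
            gsub(/[^0-9.+-]+/, " ", rest)           # keep only the numbers
            split(rest, f, " ")                     # reading, lower, upper
            if (f[1] + 0 < f[2] + 0 || f[1] + 0 > f[3] + 0)
                printf "%s: %s outside %s..%s\n", label, f[1], f[2], f[3]
        }
    '
}

# Typical use (needs lm_sensors configured, so it is left commented out):
#   sensors | out_of_range
```

Run from cron, any output from this could be mailed to root in the same spirit as the smartd warnings.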
As a project for the avid reader, you can tie in the temperature readings from the disk drives given by smartd with those from sensord and watch the variation over time. When I find time, I'll do this for myself.
So, with these tools, you can receive either early or immediate warning of hardware problems, allowing you to address them quickly. Of course, you really do need to have a scheme to protect against data loss, but that is an issue for a later column. What do you think? Are these tools useful to protect your systems? Let me know.