---
title: "Monitoring my Hard Drives with SMART Attributes"
date: 2024-10-17T13:05:59-04:00
draft: false
tags: []
math: false
medium_enabled: false
---
After having a hard drive fail on me once, I've been slowly upgrading my homelab to be more resilient. Currently I take daily backups using [Restic](https://restic.net/), push them offsite to two different services, and have 4 hard drives [set up in a RAID 10 configuration](https://brandonrozek.com/blog/switching-raid-10/).
The RAID 10 configuration means that I can safely lose one hard drive without needing to access my backups. Though you know the saying, "when it rains it pours". This means I can't ignore the possibility that multiple hard drives die at once.
Luckily, in many cases, we can detect early signs of drive failures. This is where SMART attributes come in. The **S**elf-**M**onitoring, **A**nalysis, and **R**eporting **T**echnology (SMART) system reports many different indicators on drive reliability. Unfortunately, which indicators a hard drive reports, and sometimes even how their values are formatted, is vendor-dependent.
To list the values of all the SMART attributes of a given drive (e.g. `/dev/sda`), run:
```bash
sudo smartctl -A /dev/sda
```
Wikipedia maintains a list of [common SMART attributes](https://en.wikipedia.org/wiki/Self-Monitoring%2C_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes). In my server, I have a mix of Western Digital and Seagate drives. I looked at the SMART attributes these drives have in common, and then kept only the ones whose values are vendor-agnostic. This gives me the following table:
| Attribute | Description |
| ---------------------- | ------------------------------------------------------------ |
| Reallocated_Sector_Ct | The number of bad sectors that have been found and reallocated. |
| Current_Pending_Sector | The number of sectors waiting to be remapped due to unrecoverable read errors. |
| Offline_Uncorrectable | The total number of uncorrectable errors when reading/writing a sector. |
To grab the value of a specific SMART attribute, we can filter the `smartctl` output with `grep` and `awk`:
```bash
sudo smartctl -A /dev/sda | grep Reallocated_Sector_Ct | awk '{print $NF}'
```
Hopefully, the last command printed `0` for you...
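As a variation, the filtering can be done in a single `awk` call. The sketch below assumes the usual `smartctl -A` table layout, where the attribute name is the second column and the raw value starts at the tenth. Both `getRawValue` and the sample lines are made up for illustration.

```bash
# Hypothetical helper: prints the raw value of one SMART attribute.
# Reads `smartctl -A` output from stdin; $1 is the attribute name.
getRawValue() {
    # Match the name exactly in column 2 and print column 10,
    # the first token of RAW_VALUE.
    awk -v attr="$1" '$2 == attr { print $10 }'
}

# Made-up sample lines in the smartctl attribute table layout
SAMPLE='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8'

printf '%s\n' "$SAMPLE" | getRawValue Current_Pending_Sector
```

On a real drive this would be `sudo smartctl -A /dev/sda | getRawValue Reallocated_Sector_Ct`. Matching the whole column avoids accidental substring matches, and reading a fixed column still works if a drive appends extra text after the raw value.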
Now that we know which attributes to make sure stay zero and we have a way to [get notified from our server](https://brandonrozek.com/blog/webhook-notifications-on-systemd-service-failure/), we can create a script that runs daily and notifies us only when a bad sector appears.
```bash
# Sends a webhook based on the argument given.
# Feel free to replace this with your own solution.
sendMsg() {
    local MSG=$1
    local CLEAN_MSG
    # JSON-encode the message; printf avoids the trailing newline echo would add
    CLEAN_MSG=$(printf '%s' "$MSG" | jq -Rsa .)
    curl -X POST --data-urlencode "payload={\"text\": $CLEAN_MSG}" "$WEBHOOK_URL"
}
```
There are three failure modes to handle:
1. A drive that we expect to exist doesn't. It's likely too late for the drive at this point...
```bash
sendMissingDevice() {
    local DEVICE=$1
    local MSG="ALERT: '$DEVICE' not found"
    sendMsg "$MSG"
}
```
2. The hard drive does not report one of the three SMART attributes we're checking. In this case, you'll either have to skip the attribute check for this drive or find another common set of attributes to check for.
```bash
sendMissingAttribute() {
    local DEVICE=$1
    local ATTRIBUTE=$2
    local MSG="ALERT: '$DEVICE' is missing attribute '$ATTRIBUTE'"
    sendMsg "$MSG"
}
```
3. The indicator reports a non-zero value. For the attributes we're monitoring, this means that we have a bad sector in our hard drive.
```bash
sendAlert() {
    local DEVICE=$1
    local ATTRIBUTE=$2
    local RAW_VALUE=$3
    # $'\n' is a real newline; "\n" inside double quotes would stay a literal backslash-n
    local MSG="WARNING: '$DEVICE' has a non-zero raw value for attribute '$ATTRIBUTE'."$'\n'"$DEVICE $ATTRIBUTE $RAW_VALUE"
    sendMsg "$MSG"
}
```
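For the second failure mode, one way to skip a check on a particular drive is a small lookup function. This is only a sketch: `shouldSkip` and the device/attribute pair inside it are hypothetical, so substitute whatever your own drives fail to report.

```bash
# Hypothetical helper: succeeds when the attribute check should be
# skipped for this device. $1 is the device, $2 is the attribute.
shouldSkip() {
    case "$1:$2" in
        # Made-up example: pretend /dev/sdc never reports this attribute
        "/dev/sdc:Offline_Uncorrectable") return 0 ;;
        *) return 1 ;;
    esac
}

if shouldSkip "/dev/sdc" "Offline_Uncorrectable"; then
    echo "Skipping Offline_Uncorrectable on /dev/sdc"
fi
```

When looping over attributes, a line such as `shouldSkip "$DEVICE" "$ATTRIBUTE" && continue` placed before the missing-attribute check would silence the alert for just that pair.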
Our main loop then iterates over every device and every attribute we want to check.
```bash
for DEVICE in "${DEVICES[@]}"; do
    echo "Checking $DEVICE..."
    # Check if the device exists
    if [ ! -e "$DEVICE" ]; then
        echo "Device $DEVICE not found."
        sendMissingDevice "$DEVICE"
        continue
    fi
    SMART_OUTPUT=$(smartctl -A "$DEVICE")
    for ATTRIBUTE in "${ATTRIBUTES[@]}"; do
        # Check if the attribute exists in the output
        if ! echo "$SMART_OUTPUT" | grep -q "$ATTRIBUTE"; then
            echo "Attribute '$ATTRIBUTE' not found"
            sendMissingAttribute "$DEVICE" "$ATTRIBUTE"
            continue
        fi
        RAW_VALUE=$(echo "$SMART_OUTPUT" | grep "$ATTRIBUTE" | awk '{print $NF}')
        if [ "$RAW_VALUE" -gt 0 ]; then
            echo "Attribute '$ATTRIBUTE' has raw value of '$RAW_VALUE'"
            sendAlert "$DEVICE" "$ATTRIBUTE" "$RAW_VALUE"
        fi
    done
done
```
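The `-gt` comparison assumes that the last field is a plain integer. If a drive reports something else there, `[` prints an error and the alert is silently skipped. A defensive sketch, where `isNonZeroCount` is a hypothetical helper and not part of the original script:

```bash
# Hypothetical helper: succeeds only when $1 is a plain integer greater than zero.
isNonZeroCount() {
    case "$1" in
        # Reject empty strings and anything containing a non-digit
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}

for VALUE in "0" "8" "20/45)"; do
    if isNonZeroCount "$VALUE"; then
        echo "'$VALUE' counts as a non-zero raw value"
    else
        echo "'$VALUE' is zero or not a plain integer"
    fi
done
```

A stricter script could send a separate notification for the non-integer case instead of treating it like zero.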
Put this all in a script located at `/usr/local/bin/monitor-disks.sh`. In order to have this script run daily, we'll need to first create a systemd service at `/etc/systemd/system/monitor-disks.service`.
```ini
[Unit]
Description=Monitors disks for bad sectors
Requires=
Wants=
# You'll want to run this script after all the
# hard drives come online
After=dev-sda1.device dev-sdb1.device dev-sdc1.device dev-sdd1.device
# See: https://brandonrozek.com/blog/webhook-notifications-on-systemd-service-failure/
# %n expands to this unit's full name; %i is empty for non-template units
OnFailure=webhook-notify@%n.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/monitor-disks.sh
[Install]
WantedBy=multi-user.target
```
Finally, we create a systemd timer at `/etc/systemd/system/monitor-disks.timer`.
```ini
[Unit]
Description=Check for bad sectors daily
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
```
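Neither unit does anything until systemd knows about it and the timer is enabled. A minimal sketch of the remaining steps, assuming the file paths used above and a user with `sudo` access:

```bash
# Make the script executable
sudo chmod +x /usr/local/bin/monitor-disks.sh
# Have systemd pick up the new unit files
sudo systemctl daemon-reload
# Start the timer now and on every boot
sudo systemctl enable --now monitor-disks.timer
# Optional: confirm the next scheduled run
systemctl list-timers monitor-disks.timer
# Optional: trigger one run by hand and read its output
sudo systemctl start monitor-disks.service
journalctl -u monitor-disks.service
```

Enabling the timer is enough for the daily run; the service itself only needs to be started by hand when you want an immediate check.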
For your convenience, the `monitor-disks.sh` file in its entirety:
```bash
#!/bin/bash
DEVICES=("/dev/sda" "/dev/sdb" "/dev/sdc" "/dev/sdd")
ATTRIBUTES=("Reallocated_Sector_Ct" "Current_Pending_Sector" "Offline_Uncorrectable")
WEBHOOK_URL="INSERT_WEBHOOK_URL_HERE"
if [ "$EUID" -ne 0 ]; then
    echo "Please run as root"
    # Exit non-zero so the failure is visible to systemd
    exit 1
fi
if ! command -v smartctl &> /dev/null; then
    echo "smartctl is not installed"
    exit 1
fi
if ! command -v jq &> /dev/null; then
    echo "jq is not installed"
    exit 1
fi
sendMsg() {
    local MSG=$1
    local CLEAN_MSG
    # JSON-encode the message; printf avoids the trailing newline echo would add
    CLEAN_MSG=$(printf '%s' "$MSG" | jq -Rsa .)
    curl -X POST --data-urlencode "payload={\"text\": $CLEAN_MSG}" "$WEBHOOK_URL"
}
sendMissingDevice() {
    local DEVICE=$1
    local MSG="ALERT: '$DEVICE' not found"
    sendMsg "$MSG"
}
sendMissingAttribute() {
    local DEVICE=$1
    local ATTRIBUTE=$2
    local MSG="ALERT: '$DEVICE' is missing attribute '$ATTRIBUTE'"
    sendMsg "$MSG"
}
sendAlert() {
    local DEVICE=$1
    local ATTRIBUTE=$2
    local RAW_VALUE=$3
    # $'\n' is a real newline; "\n" inside double quotes would stay a literal backslash-n
    local MSG="WARNING: '$DEVICE' has a non-zero raw value for attribute '$ATTRIBUTE'."$'\n'"$DEVICE $ATTRIBUTE $RAW_VALUE"
    sendMsg "$MSG"
}
for DEVICE in "${DEVICES[@]}"; do
    echo "Checking $DEVICE..."
    # Check if the device exists
    if [ ! -e "$DEVICE" ]; then
        echo "Device $DEVICE not found."
        sendMissingDevice "$DEVICE"
        continue
    fi
    SMART_OUTPUT=$(smartctl -A "$DEVICE")
    for ATTRIBUTE in "${ATTRIBUTES[@]}"; do
        # Check if the attribute exists in the output
        if ! echo "$SMART_OUTPUT" | grep -q "$ATTRIBUTE"; then
            echo "Attribute '$ATTRIBUTE' not found"
            sendMissingAttribute "$DEVICE" "$ATTRIBUTE"
            continue
        fi
        RAW_VALUE=$(echo "$SMART_OUTPUT" | grep "$ATTRIBUTE" | awk '{print $NF}')
        if [ "$RAW_VALUE" -gt 0 ]; then
            echo "Attribute '$ATTRIBUTE' has raw value of '$RAW_VALUE'"
            sendAlert "$DEVICE" "$ATTRIBUTE" "$RAW_VALUE"
        fi
    done
done
```