diff --git a/content/blog/monitoring-disks-smartattributes.md b/content/blog/monitoring-disks-smartattributes.md
new file mode 100644
index 0000000..d0457a7
--- /dev/null
+++ b/content/blog/monitoring-disks-smartattributes.md
@@ -0,0 +1,244 @@
+---
+title: "Monitoring my Hard Drives with SMART Attributes"
+date: 2024-10-17T13:05:59-04:00
+draft: false
+tags: []
+math: false
+medium_enabled: false
+---
+
+After having a hard drive fail on me once, I've been slowly upgrading my homelab to be more resilient. Currently I take daily backups using [Restic](https://restic.net/), push them offsite to two different services, and have 4 hard drives [set up in a RAID 10 configuration](https://brandonrozek.com/blog/switching-raid-10/).
+
+The RAID 10 configuration means that I can safely lose one hard drive without needing to access my backups. Still, you know the saying: "when it rains, it pours". I can't ignore the possibility that multiple hard drives die at once.
+
+Luckily, in many cases, we can detect early signs of drive failure. This is where SMART attributes come in. The **S**elf-**M**onitoring, **A**nalysis, and **R**eporting **T**echnology (SMART) system reports many different indicators of drive reliability. Unfortunately, which indicators a drive reports, and sometimes even how the values are formatted, is vendor-dependent.
+
+To list the values of all the SMART attributes of a given drive (e.g. `/dev/sda`), run:
+
+```bash
+sudo smartctl -A /dev/sda
+```
+
+Wikipedia maintains a list of [common SMART attributes](https://en.wikipedia.org/wiki/Self-Monitoring%2C_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes). In my server, I have a mix of Western Digital and Seagate drives. I looked at the SMART attributes these drives have in common and kept only the ones whose values are vendor-agnostic. This gives me the following table:
+
+| Attribute              | Description                                                  |
+| ---------------------- | ------------------------------------------------------------ |
+| Reallocated_Sector_Ct  | The number of bad sectors that have been found and reallocated. |
+| Current_Pending_Sector | The number of sectors waiting to be remapped due to unrecoverable read errors. |
+| Offline_Uncorrectable  | The total number of uncorrectable errors when reading/writing a sector. |
+
+To grab the value of a specific SMART attribute, we can filter the `smartctl` output with `grep` and `awk`:
+
+```bash
+sudo smartctl -A /dev/sda | grep Reallocated_Sector_Ct | awk '{print $NF}'
+```
+
+Hopefully, that command printed `0` for you...
+
+Now that we know which attributes should stay at zero and we have a way to [get notified from our server](https://brandonrozek.com/blog/webhook-notifications-on-systemd-service-failure/), we can create a script that runs daily and notifies us only when a bad sector appears.
+
+```bash
+# Sends the given message to a webhook.
+# Feel free to replace this with your own
+# solution.
+sendMsg() {
+    local MSG=$1
+    local CLEAN_MSG
+    CLEAN_MSG=$(printf '%b' "$MSG" | jq -Rsa .)  # %b expands the "\n" in MSG
+    curl -X POST --data-urlencode "payload={\"text\": $CLEAN_MSG}" "$WEBHOOK_URL"
+}
+```
+
+There are three failure modes to handle:
+
+1. A drive that we expect to exist doesn't. It's likely too late for the drive at this point...
+
+```bash
+sendMissingDevice() {
+    local DEVICE=$1
+
+    local MSG="ALERT: '$DEVICE' not found"
+    sendMsg "$MSG"
+}
+```
+
+2. The hard drive does not report one of the three SMART attributes we're checking. In this case, you'll either have to skip the attribute check for this drive or find another common set of attributes to check for.
+
+```bash
+sendMissingAttribute() {
+    local DEVICE=$1
+    local ATTRIBUTE=$2
+
+    local MSG="ALERT: '$DEVICE' is missing attribute '$ATTRIBUTE'"
+    sendMsg "$MSG"
+}
+```
+
+3. The indicator reports a non-zero value. For the attributes we're monitoring, this means that we have a bad sector in our hard drive.
+
+```bash
+sendAlert() {
+    local DEVICE=$1
+    local ATTRIBUTE=$2
+    local RAW_VALUE=$3
+
+    local MSG="WARNING: '$DEVICE' has a non-zero raw value for attribute '$ATTRIBUTE'.\n$DEVICE $ATTRIBUTE $RAW_VALUE"
+    sendMsg "$MSG"
+}
+```
+
+Our main loop then iterates over all the devices and attributes we want to check.
+
+```bash
+for DEVICE in "${DEVICES[@]}"; do
+    echo "Checking $DEVICE..."
+
+    # Check if the device exists
+    if [ ! -e "$DEVICE" ]; then
+        echo "Device $DEVICE not found."
+        sendMissingDevice "$DEVICE"
+        continue
+    fi
+
+    SMART_OUTPUT=$(smartctl -A "$DEVICE")
+
+    for ATTRIBUTE in "${ATTRIBUTES[@]}"; do
+
+        # Check if the attribute exists in the output
+        if ! echo "$SMART_OUTPUT" | grep -q "$ATTRIBUTE"; then
+            echo "Attribute '$ATTRIBUTE' not found"
+            sendMissingAttribute "$DEVICE" "$ATTRIBUTE"
+            continue
+        fi
+
+        RAW_VALUE=$(echo "$SMART_OUTPUT" | grep "$ATTRIBUTE" | awk '{print $NF}')
+
+        if [ "$RAW_VALUE" -gt 0 ]; then
+            echo "Attribute '$ATTRIBUTE' has raw value of '$RAW_VALUE'"
+            sendAlert "$DEVICE" "$ATTRIBUTE" "$RAW_VALUE"
+        fi
+    done
+done
+```
+
+Put all of this in an executable script located at `/usr/local/bin/monitor-disks.sh`. To have this script run daily, we first need to create a systemd service at `/etc/systemd/system/monitor-disks.service`.
+
+```ini
+[Unit]
+Description=Monitors disks for bad sectors
+Requires=
+Wants=
+# You'll want to run this script after all the
+# hard drives come online
+After=dev-sda1.device dev-sdb1.device dev-sdc1.device dev-sdd1.device
+# See: https://brandonrozek.com/blog/webhook-notifications-on-systemd-service-failure/
+OnFailure=webhook-notify@%i.service
+
+[Service]
+Type=oneshot
+ExecStart=/usr/local/bin/monitor-disks.sh
+
+[Install]
+WantedBy=multi-user.target
+
+```
+
+Finally, we create a systemd timer at `/etc/systemd/system/monitor-disks.timer`.
+
+```ini
+[Unit]
+Description=Check for bad sectors daily
+[Timer]
+OnCalendar=daily
+Persistent=true
+[Install]
+WantedBy=timers.target
+```
+
+After creating both files, run `sudo systemctl daemon-reload` and `sudo systemctl enable --now monitor-disks.timer` to activate the timer. For your convenience, here is the `monitor-disks.sh` file in its entirety:
+
+```bash
+#!/bin/bash
+
+DEVICES=("/dev/sda" "/dev/sdb" "/dev/sdc" "/dev/sdd")
+ATTRIBUTES=("Reallocated_Sector_Ct" "Current_Pending_Sector" "Offline_Uncorrectable")
+WEBHOOK_URL="INSERT_WEBHOOK_URL_HERE"
+
+if [ "$EUID" -ne 0 ]; then
+    echo "Please run as root"
+    exit 1
+fi
+
+if ! command -v smartctl &> /dev/null; then
+    echo "smartctl is not installed"
+    exit 1
+fi
+
+if ! command -v jq &> /dev/null; then
+    echo "jq is not installed"
+    exit 1
+fi
+
+sendMsg() {
+    local MSG=$1
+    local CLEAN_MSG
+    CLEAN_MSG=$(printf '%b' "$MSG" | jq -Rsa .)  # %b expands the "\n" in MSG
+    curl -X POST --data-urlencode "payload={\"text\": $CLEAN_MSG}" "$WEBHOOK_URL"
+}
+
+sendMissingDevice() {
+    local DEVICE=$1
+
+    local MSG="ALERT: '$DEVICE' not found"
+    sendMsg "$MSG"
+}
+
+sendMissingAttribute() {
+    local DEVICE=$1
+    local ATTRIBUTE=$2
+
+    local MSG="ALERT: '$DEVICE' is missing attribute '$ATTRIBUTE'"
+    sendMsg "$MSG"
+}
+
+sendAlert() {
+    local DEVICE=$1
+    local ATTRIBUTE=$2
+    local RAW_VALUE=$3
+
+    local MSG="WARNING: '$DEVICE' has a non-zero raw value for attribute '$ATTRIBUTE'.\n$DEVICE $ATTRIBUTE $RAW_VALUE"
+    sendMsg "$MSG"
+}
+
+for DEVICE in "${DEVICES[@]}"; do
+    echo "Checking $DEVICE..."
+
+    # Check if the device exists
+    if [ ! -e "$DEVICE" ]; then
+        echo "Device $DEVICE not found."
+        sendMissingDevice "$DEVICE"
+        continue
+    fi
+
+    SMART_OUTPUT=$(smartctl -A "$DEVICE")
+
+    for ATTRIBUTE in "${ATTRIBUTES[@]}"; do
+
+        # Check if the attribute exists in the output
+        if ! echo "$SMART_OUTPUT" | grep -q "$ATTRIBUTE"; then
+            echo "Attribute '$ATTRIBUTE' not found"
+            sendMissingAttribute "$DEVICE" "$ATTRIBUTE"
+            continue
+        fi
+
+        RAW_VALUE=$(echo "$SMART_OUTPUT" | grep "$ATTRIBUTE" | awk '{print $NF}')
+
+        if [ "$RAW_VALUE" -gt 0 ]; then
+            echo "Attribute '$ATTRIBUTE' has raw value of '$RAW_VALUE'"
+            sendAlert "$DEVICE" "$ATTRIBUTE" "$RAW_VALUE"
+        fi
+    done
+done
+
+```
+