utilities/telegraf-plugins#7: Collect remaining lifetime for SSDs



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Created: 23-Jun-22 18:54



Description

I had a hardware failure earlier this week. SSD endurance monitoring wouldn't have caught it, but it did make me think more about what I'd do if storage died in other systems.

Currently, there doesn't seem to be anything which collects SSD endurance information - I want to throw together a plugin to collect that information from smartctl.

Although I don't have an immediate need, it could later be extended to collect SSD endurance from racadm and other similar vendor-specific utilities.




Activity


assigned to @btasker

The plugin seems to be working (commits should show up soon).

You can pull out the most recent reading for each device with the following Flux query:

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "ssd_lifetime")
  |> filter(fn: (r) => r.host == v.host)
  |> filter(fn: (r) => r._field == "perc_remaining")
  |> aggregateWindow(every: v.windowPeriod, fn: max)
  |> last()
  |> group()
  |> map(fn: (r) => ({
      device: r.device,
      lifetime_remaining: r._value
  }))
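
The query above implies the shape of the points being written: a measurement called ssd_lifetime, a device tag and a perc_remaining field (the host tag is normally added by Telegraf itself). As a rough sketch only, this is how a Python exec-style plugin could print such a point in line protocol. The function names and the choice of an integer field are assumptions, not taken from the actual plugin.

```python
def escape_tag(value: str) -> str:
    """Escape the characters that are special inside a line protocol tag value."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def to_line_protocol(device: str, perc_remaining: int) -> str:
    """Build a single ssd_lifetime point (hypothetical helper, for illustration)."""
    # The trailing "i" marks the field as an integer; that type is an assumption
    return f"ssd_lifetime,device={escape_tag(device)} perc_remaining={int(perc_remaining)}i"


if __name__ == "__main__":
    # Value taken from the Percent_Lifetime_Remain row in the sample output
    print(to_line_protocol("/dev/sda", 94))
```

No timestamp is included, so Telegraf would stamp the point at collection time.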

mentioned in commit github-mirror/telegraf-plugins@7dd2a896f34f82aeb87c4bd37a2693d80d154f2d

Commit: github-mirror/telegraf-plugins@7dd2a896f34f82aeb87c4bd37a2693d80d154f2d 
Author: B Tasker
Date: 2022-06-23T20:15:56.000+01:00 

Message

Add initial implementation of ssd endurance plugin for utilities/telegraf-plugins#7

Currently untested (and the README needs some work too)

+116 -0 (116 lines changed)

It might be better to use a normalised output so we don't have to try and mess about parsing the newer human readable output.

If we do

root@thor:~# smartctl --info --attributes --health -n standby  --format=brief /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     Crucial_CT512MX100SSD1
Serial Number:    14500E0A0AA0
LU WWN Device Id: 5 00a075 10e0a0aa0
Firmware Version: MU01
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jun 23 22:48:12 2022 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode is:    ACTIVE or IDLE

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   000    -    0
  5 Reallocate_NAND_Blk_Cnt PO--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    30737
 12 Power_Cycle_Count       -O--CK   100   100   000    -    1988
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Ave_Block-Erase_Count   -O--CK   094   094   000    -    202
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    85
180 Unused_Reserve_NAND_Blk PO--CK   000   000   000    -    4403
183 SATA_Interfac_Downshift -O--CK   100   100   000    -    0
184 Error_Correction_Count  -O--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   054   041   000    -    46 (Min/Max 7/59)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_ECC_Cnt -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    10
202 Percent_Lifetime_Remain P---CK   094   094   000    -    6
206 Write_Error_Rate        -OSR--   100   100   000    -    0
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   000    -    0
246 Total_LBAs_Written      -O--CK   100   100   000    -    38711241564
247 Host_Program_Page_Count -O--CK   100   100   000    -    1260865850
248 FTL_Program_Page_Count  -O--CK   100   100   000    -    1205219164
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

Then the output should be the same across all....

Nope

root@optimus:/home/ben/Documents/src.old/telegraf-plugins# smartctl --info --attributes --health -n standby  --format=brief /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.13.0-44-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SKHynix_HFS256GD9TNI-L2B0B
Serial Number:                      NJ08N870511009657
Firmware Version:                   11710C10
PCI Vendor/Subsystem ID:            0x1c5c
IEEE OUI Identifier:                0xace42e
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            ace42e 0005e838bb
Local Time is:                      Fri Jun 24 08:23:56 2022 BST

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    28,160,422 [14.4 TB]
Data Units Written:                 16,053,861 [8.21 TB]
Host Read Commands:                 276,475,321
Host Write Commands:                244,833,577
Controller Busy Time:               364
Power Cycles:                       89
Power On Hours:                     12,288
Unsafe Shutdowns:                   42
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               47 Celsius

I guess the second output format is NVMe-specific then
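
For the ATA-style output (the /dev/sda example), the attribute table looks regular enough to split on whitespace rather than needing a regex. A rough sketch in Python, assuming the --format=brief column layout shown earlier holds for other ATA devices, and assuming (as the sample suggests, with VALUE 094 against a raw value of 6) that the normalised VALUE of Percent_Lifetime_Remain is the percentage remaining. parse_brief_attributes and SAMPLE are illustrative names, not taken from the plugin.

```python
# A few rows copied from the /dev/sda output, plus the header and a legend line
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  9 Power_On_Hours          -O--CK   100   100   000    -    30737
194 Temperature_Celsius     -O---K   054   041   000    -    46 (Min/Max 7/59)
202 Percent_Lifetime_Remain P---CK   094   094   000    -    6
                            ||||||_ K auto-keep
"""


def parse_brief_attributes(text: str) -> dict:
    """Return {attribute_name: {...}} for each row of the brief attribute table."""
    attrs = {}
    for line in text.splitlines():
        # At most 8 columns; the raw value may itself contain spaces
        parts = line.split(None, 7)
        # Attribute rows start with a numeric ID; skip headers, legend and info lines
        if len(parts) < 8 or not parts[0].isdigit() or not parts[3].isdigit():
            continue
        attrs[parts[1]] = {
            "id": int(parts[0]),
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": parts[7].strip(),
        }
    return attrs


if __name__ == "__main__":
    attrs = parse_brief_attributes(SAMPLE)
    print(attrs["Percent_Lifetime_Remain"]["value"])
```

Attribute names vary between vendors, so a real implementation would likely need a list of candidate names rather than just Percent_Lifetime_Remain.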

Just found https://github.com/influxdata/telegraf/issues/8701, I'll pass some info on.

Could probably chuck a PR in, but need to figure out how to deal with the NVMe output. It looks like we'd probably need to add a regex (I just don't see a better way around it).
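
A minimal sketch of what that regex could look like, in Python, based only on the "Percentage Used" line in the NVMe output above. The function and constant names are hypothetical. Remaining life is derived as 100 minus the percentage used, and clamped at zero because NVMe allows the used figure to go above 100.

```python
import re

# Matches e.g. "Percentage Used:                    0%"
PERCENT_USED_RE = re.compile(r"^Percentage Used:\s+(\d+)%", re.MULTILINE)


def nvme_perc_remaining(text: str):
    """Return the estimated percentage of life remaining, or None if not reported."""
    match = PERCENT_USED_RE.search(text)
    if match is None:
        return None
    used = int(match.group(1))
    # Percentage Used may exceed 100 on a worn drive, so never go below zero
    return max(0, 100 - used)


if __name__ == "__main__":
    sample = (
        "Available Spare Threshold:          10%\n"
        "Percentage Used:                    0%\n"
        "Data Units Read:                    28,160,422 [14.4 TB]\n"
    )
    print(nvme_perc_remaining(sample))
```

Anchoring on the start of the line keeps it from matching the Available Spare lines, which also end in a percentage.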