#7 Collect remaining lifetime for SSDs : utilities/telegraf-plugins#7

btasker
23-Jun-22 18:54

assigned to @btasker

btasker
23-Jun-22 19:47

The plugin seems to be working (commits should show up soon).

You can pull out the most recent reading for each device with the following Flux query:

from(bucket: "telegraf/autogen")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "ssd_lifetime")
  |> filter(fn: (r) => r.host == v.host)
  |> filter(fn: (r) => r._field == "perc_remaining")
  |> aggregateWindow(every: v.windowPeriod, fn: max)
  |> last()
  |> group()
  |> map(fn: (r) => ({
      device: r.device,
      lifetime_remaining: r._value
  }))
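
For reference, the query above expects points in the `ssd_lifetime` measurement with a `perc_remaining` field and a `device` tag (the `host` tag is normally added by Telegraf itself). A rough sketch of how an exec-style plugin could print such a point in line protocol follows; the function name and the choice of an integer field are assumptions for illustration, not necessarily what the committed plugin does.

```python
def to_line_protocol(device, perc_remaining):
    """Build one line-protocol point for a device (hypothetical helper)."""
    tag = device
    # Line protocol requires commas, equals signs and spaces in tag values to be escaped
    for ch in (",", "=", " "):
        tag = tag.replace(ch, "\\" + ch)
    # Assumption: the field is written as an integer (trailing "i")
    return f"ssd_lifetime,device={tag} perc_remaining={int(perc_remaining)}i"


if __name__ == "__main__":
    print(to_line_protocol("sda", 94))
```

No timestamp is included, so Telegraf would assign one at collection time.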

btasker
23-Jun-22 19:59

verified

mentioned in commit github-mirror/telegraf-plugins@7dd2a896f34f82aeb87c4bd37a2693d80d154f2d

Commit: github-mirror/telegraf-plugins@7dd2a896f34f82aeb87c4bd37a2693d80d154f2d
Author: B Tasker
Date: 2022-06-23T20:15:56.000+01:00

Message

Add initial implementation of ssd endurance plugin for utilities/telegraf-plugins#7

Currently untested (and the README needs some work too)

+116 -0 (116 lines changed)

btasker
24-Jun-22 07:25

It might be better to use a normalised output format so that we don't have to mess about parsing the newer human-readable output.

If we run

root@thor:~# smartctl --info --attributes --health -n standby  --format=brief /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     Crucial_CT512MX100SSD1
Serial Number:    14500E0A0AA0
LU WWN Device Id: 5 00a075 10e0a0aa0
Firmware Version: MU01
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jun 23 22:48:12 2022 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode is:    ACTIVE or IDLE

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   000    -    0
  5 Reallocate_NAND_Blk_Cnt PO--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    30737
 12 Power_Cycle_Count       -O--CK   100   100   000    -    1988
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Ave_Block-Erase_Count   -O--CK   094   094   000    -    202
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    85
180 Unused_Reserve_NAND_Blk PO--CK   000   000   000    -    4403
183 SATA_Interfac_Downshift -O--CK   100   100   000    -    0
184 Error_Correction_Count  -O--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   054   041   000    -    46 (Min/Max 7/59)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_ECC_Cnt -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    10
202 Percent_Lifetime_Remain P---CK   094   094   000    -    6
206 Write_Error_Rate        -OSR--   100   100   000    -    0
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   000    -    0
246 Total_LBAs_Written      -O--CK   100   100   000    -    38711241564
247 Host_Program_Page_Count -O--CK   100   100   000    -    1260865850
248 FTL_Program_Page_Count  -O--CK   100   100   000    -    1205219164
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

Then the output format should be the same across all devices....

Nope

root@optimus:/home/ben/Documents/src.old/telegraf-plugins# smartctl --info --attributes --health -n standby  --format=brief /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.13.0-44-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SKHynix_HFS256GD9TNI-L2B0B
Serial Number:                      NJ08N870511009657
Firmware Version:                   11710C10
PCI Vendor/Subsystem ID:            0x1c5c
IEEE OUI Identifier:                0xace42e
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            ace42e 0005e838bb
Local Time is:                      Fri Jun 24 08:23:56 2022 BST

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    28,160,422 [14.4 TB]
Data Units Written:                 16,053,861 [8.21 TB]
Host Read Commands:                 276,475,321
Host Write Commands:                244,833,577
Controller Busy Time:               364
Power Cycles:                       89
Power On Hours:                     12,288
Unsafe Shutdowns:                   42
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               47 Celsius

I guess the second output format is NVMe-specific then.

btasker
24-Jun-22 07:54

Just found https://github.com/influxdata/telegraf/issues/8701, I'll pass some info on.

I could probably chuck a PR in, but first I need to figure out how to deal with the NVMe output. It looks like we'd probably need to add a regex (I just don't see a better way around it).

btasker
26-Jun-22 13:38

PR at https://github.com/influxdata/telegraf/pull/11391
