collector: add nvmesubsystem collector for NVMe-oF path health #3579

Open
sradco wants to merge 2 commits into prometheus:master from sradco:add_collector_multipath

sradco commented Mar 11, 2026

Add a new disabled-by-default collector that reads
/sys/class/nvme-subsystem/ to expose NVMe over Fabrics subsystem
connectivity metrics.

This complements the existing nvme collector (which reports
per-controller hardware stats) by monitoring the subsystem-level
path redundancy - how many controller paths are live, connecting,
or dead for each NVMe subsystem.

Exposed metrics:

  • node_nvmesubsystem_info
  • node_nvmesubsystem_paths
  • node_nvmesubsystem_paths_live
  • node_nvmesubsystem_path_state
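As a rough illustration of the kind of sysfs walk described above, the sketch below reads each controller's `state` attribute under a subsystem directory. It is not the PR's implementation (which depends on prometheus/procfs#797); the function names `isControllerName` and `readPathStates` are hypothetical, and the directory layout (controller entries named `nvme<N>` containing a `state` file) is an assumption about the kernel's sysfs tree.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isControllerName reports whether name looks like "nvme<digits>".
// This excludes namespace entries such as "nvme0n1" (assumed layout).
func isControllerName(name string) bool {
	if !strings.HasPrefix(name, "nvme") {
		return false
	}
	digits := strings.TrimPrefix(name, "nvme")
	if digits == "" {
		return false
	}
	for _, r := range digits {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// readPathStates maps each controller name to the trimmed contents of its
// "state" attribute for a single subsystem directory.
func readPathStates(subsysDir string) (map[string]string, error) {
	entries, err := os.ReadDir(subsysDir)
	if err != nil {
		return nil, err
	}
	states := make(map[string]string)
	for _, e := range entries {
		if !isControllerName(e.Name()) {
			continue
		}
		raw, err := os.ReadFile(filepath.Join(subsysDir, e.Name(), "state"))
		if err != nil {
			continue // attribute missing or controller removed mid-scan; skip it
		}
		states[e.Name()] = strings.TrimSpace(string(raw))
	}
	return states, nil
}

func main() {
	// On hosts without NVMe subsystems the glob matches nothing and the loop is a no-op.
	dirs, _ := filepath.Glob("/sys/class/nvme-subsystem/nvme-subsys*")
	for _, dir := range dirs {
		states, err := readPathStates(dir)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Println(filepath.Base(dir), states)
	}
}
```

Skipping unreadable controllers rather than failing the whole scan is a design choice made for this sketch only; a real collector may prefer to surface the error.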

Depends on prometheus/procfs#797

Signed-off-by: Shirly Radco <sradco@redhat.com>
Co-authored-by: AI Assistant <noreply@cursor.com>

sradco force-pushed the add_collector_multipath branch from 742c1b1 to a0a146e on March 11, 2026 19:18

sradco commented Mar 11, 2026

Hi @SuperQ , I created this PR for a new multipath collector.
I would appreciate your review.

sradco force-pushed the add_collector_multipath branch from a0a146e to 1fa2099 on March 12, 2026 08:55
sradco changed the title from "Add multipath collector" to "Add multipath collector for NVMe-oF subsystem path health" on Mar 12, 2026
sradco force-pushed the add_collector_multipath branch from 1fa2099 to 635b613 on March 12, 2026 09:13
sradco changed the title from "Add multipath collector for NVMe-oF subsystem path health" to "collector: add nvmesubsystem collector for NVMe-oF path health" on Mar 12, 2026

jsafrane left a comment

The code looks good to me. I tested it with two NVMe over TCP devices and got:

node_nvmesubsystem_info{iopolicy="numa",model="Linux",nqn="tempdisk",serial="bab529a0f32e397e1319",subsystem="nvme-subsys0"} 1
node_nvmesubsystem_path_state{controller="nvme0",state="connecting",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme0",state="dead",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme0",state="live",subsystem="nvme-subsys0",transport="tcp"} 1
node_nvmesubsystem_path_state{controller="nvme0",state="resetting",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme0",state="unknown",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme1",state="connecting",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme1",state="dead",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme1",state="live",subsystem="nvme-subsys0",transport="tcp"} 1
node_nvmesubsystem_path_state{controller="nvme1",state="resetting",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme1",state="unknown",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_paths_live{subsystem="nvme-subsys0"} 2
node_nvmesubsystem_paths_total{subsystem="nvme-subsys0"} 2

This looks reasonable.
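The sample output suggests a one-series-per-state encoding for node_nvmesubsystem_path_state (1 for the controller's current state, 0 for the others), with paths_live counting controllers in the "live" state. A minimal sketch of that encoding follows; `knownStates`, `stateSeries`, and `countLive` are hypothetical names, and the state list is copied from the output above, so it may not match the collector's full set.

```go
package main

import (
	"fmt"
	"sort"
)

// knownStates mirrors the state label values visible in the sample output.
// This list is an assumption and may be incomplete.
var knownStates = []string{"connecting", "dead", "live", "resetting", "unknown"}

// stateSeries returns one value per known state: 1 for the current state,
// 0 for every other. A state outside knownStates is counted as "unknown".
func stateSeries(current string) map[string]float64 {
	series := make(map[string]float64, len(knownStates))
	matched := false
	for _, s := range knownStates {
		if s == current {
			series[s] = 1
			matched = true
		} else {
			series[s] = 0
		}
	}
	if !matched {
		series["unknown"] = 1
	}
	return series
}

// countLive returns how many controllers are currently in the "live" state.
func countLive(states map[string]string) int {
	n := 0
	for _, st := range states {
		if st == "live" {
			n++
		}
	}
	return n
}

func main() {
	states := map[string]string{"nvme0": "live", "nvme1": "live"}

	// Sort controller names so the printed order is stable.
	controllers := make([]string, 0, len(states))
	for c := range states {
		controllers = append(controllers, c)
	}
	sort.Strings(controllers)

	for _, c := range controllers {
		series := stateSeries(states[c])
		for _, s := range knownStates {
			fmt.Printf("path_state{controller=%q,state=%q} %g\n", c, s, series[s])
		}
	}
	fmt.Printf("paths_live %d\npaths_total %d\n", countLive(states), len(states))
}
```

Emitting an explicit 0 for inactive states keeps every series present across scrapes, which tends to make alerting on a single state simpler than relying on series appearing and disappearing.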

Comment on lines +64 to +70
	switch raw {
	case "live", "connecting", "resetting", "dead":
		return raw
	case "deleting", "deleting (no IO)", "new":
		return raw
	default:
		return "unknown"

For the record, I checked that this is a complete list of all states reported by the kernel today.
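For readers who want to exercise the excerpt on its own, it can be wrapped in a self-contained function as sketched below. The function name `normalizeState` and the surrounding program are assumptions; only the switch body comes from the reviewed lines, and the input is assumed to have been trimmed of whitespace already.

```go
package main

import "fmt"

// normalizeState passes through the controller states listed in the
// reviewed excerpt and maps anything else to "unknown".
// The name and signature are hypothetical.
func normalizeState(raw string) string {
	switch raw {
	case "live", "connecting", "resetting", "dead":
		return raw
	case "deleting", "deleting (no IO)", "new":
		return raw
	default:
		return "unknown"
	}
}

func main() {
	for _, s := range []string{"live", "deleting (no IO)", "suspended", ""} {
		fmt.Printf("%q -> %q\n", s, normalizeState(s))
	}
}
```

Mapping unrecognised values to a single "unknown" bucket keeps the label's cardinality bounded even if a future kernel adds states.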

sradco force-pushed the add_collector_multipath branch 3 times, most recently from 51fb644 to 8acec17 on March 16, 2026 10:15
sradco force-pushed the add_collector_multipath branch 2 times, most recently from 17875ea to d7d2b27 on March 16, 2026 17:33
sradco closed this Mar 17, 2026
sradco force-pushed the add_collector_multipath branch from d7d2b27 to 1a4cac6 on March 17, 2026 08:42
sradco reopened this Mar 17, 2026

sradco commented Mar 17, 2026

The PR is failing because it is based on prometheus/procfs#797, which has not been merged yet.

sradco and others added 2 commits March 18, 2026 19:24
sysfs metrics

Add a new disabled-by-default collector that reads
/sys/block/dm-* to discover Device Mapper multipath
devices and expose path health metrics. Multipath
devices are identified by checking that dm/uuid
starts with "mpath-", which distinguishes them from
LVM or other DM device types.
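The prefix check described above could look roughly like the following. This is a sketch, not the commit's code: `isMultipathUUID` is a hypothetical name, and the sample UUID strings are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// isMultipathUUID reports whether the contents of a dm/uuid attribute
// identify a multipath device, i.e. start with "mpath-".
// Surrounding whitespace (such as the trailing newline from sysfs) is ignored.
func isMultipathUUID(uuid string) bool {
	return strings.HasPrefix(strings.TrimSpace(uuid), "mpath-")
}

func main() {
	// Illustrative values only; real UUIDs come from /sys/block/dm-*/dm/uuid.
	samples := []string{
		"mpath-3600508b4000156d700012000000b0000\n",
		"LVM-exampleVolumeGroupAndLogicalVolumeId\n",
		"",
	}
	for _, uuid := range samples {
		fmt.Printf("%q multipath=%v\n", uuid, isMultipathUUID(uuid))
	}
}
```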

The path state is reported as-is from
/sys/block/<dev>/device/state, supporting both
SCSI devices (running, offline, blocked, etc.) and
NVMe devices (live, connecting, dead, etc.) without
hardcoding a fixed set of states.

All device-level metrics include both the DM
friendly name (device) and the kernel block device
name (sysfs_name, e.g. dm-0) to enable direct
correlation with node_disk_* I/O metrics without
recording rules.

No special permissions are required — the collector
reads only world-readable sysfs attributes.

Exposed metrics:
- node_dmmultipath_device_info
- node_dmmultipath_device_active
- node_dmmultipath_device_size_bytes
- node_dmmultipath_device_paths
- node_dmmultipath_device_paths_active
- node_dmmultipath_device_paths_failed
- node_dmmultipath_path_state

Signed-off-by: Shirly Radco <sradco@redhat.com>
Co-authored-by: AI Assistant <noreply@cursor.com>

Add a new disabled-by-default collector that reads
/sys/class/nvme-subsystem/ to expose NVMe over Fabrics subsystem
connectivity metrics.

This complements the existing nvme collector (which reports
per-controller hardware stats) by monitoring the subsystem-level
path redundancy — how many controller paths are live, connecting,
or dead for each NVMe subsystem.

Exposed metrics:
- node_nvmesubsystem_info
- node_nvmesubsystem_paths
- node_nvmesubsystem_paths_live
- node_nvmesubsystem_path_state

Signed-off-by: Shirly Radco <sradco@redhat.com>
Co-authored-by: AI Assistant <noreply@cursor.com>
sradco force-pushed the add_collector_multipath branch from 04fd251 to 705c86e on March 23, 2026 13:23