LNet Health Monitoring Details

LNet health monitoring detects failures in network interfaces (NIs), peers, and routers and responds to them, enabling automatic failover in multi-rail configurations. The feature was introduced in Lustre 2.12, extended to routing in 2.13, and has remained stable through 2.17.0 (as of January 2026). This guide summarizes the architecture, parameters, commands, and examples from the Lustre Operations Manual (updated 2025); see the manual for full details.

Architecture

Scope: Monitors NIs, peers, and routers; integrated with multi-rail and routing.
Tracking: Per-NI and per-peer health values; router health via periodic checks.
Detection: Heartbeats, RPC timeouts, pings; distinguishes local/remote failures.
Integration: Influences NI selection; supports asymmetrical routes (2.13+).

Health Values

Value	Range/Meaning
Health Value (per NI/Peer)	0-1000; starts at 1000; decrements on failure, increments on success.
Status	up/down; shown in lnetctl outputs.
Failure Counters	dropped, resend_count, local_error_count (via stats).
Router Health	Determined by periodic pings; with auto_down enabled, unresponsive routers are marked down automatically.
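
The health-value arithmetic in the table above can be sketched in a few lines of shell. This is an illustrative model rather than LNet code: the function names are hypothetical, and the size of the increment on success is an assumption, since the table says only that the value increments.

```shell
#!/bin/sh
# Illustrative model of a per-NI health value; not LNet source code.
MAX_HEALTH=1000   # top of the 0-1000 range

# on_failure HEALTH SENSITIVITY
# Prints HEALTH reduced by SENSITIVITY, never going below 0.
on_failure() {
    new=$(( $1 - $2 ))
    if [ "$new" -lt 0 ]; then new=0; fi
    echo "$new"
}

# on_success HEALTH STEP
# Prints HEALTH raised by STEP (an assumed increment), capped at MAX_HEALTH.
on_success() {
    new=$(( $1 + $2 ))
    if [ "$new" -gt "$MAX_HEALTH" ]; then new=$MAX_HEALTH; fi
    echo "$new"
}
```

With a sensitivity of 100, on_failure 1000 100 prints 900. A sensitivity of 0 leaves the value unchanged, so failures no longer affect NI selection.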

Parameters

Parameter	Default	Description
health_sensitivity	100	Amount subtracted from the health value on each failure; 0 disables health monitoring.
recovery_interval	1	Seconds between recovery pings.
transaction_timeout	30	Message timeout in seconds.
retry_count	2	Retries for recoverable failures.
peer_timeout	180	Seconds without traffic from a peer before its aliveness is queried.
avoid_asym_router_failure	1	When 1, a route is considered up only if the gateway has a healthy NI on the remote network.
alive_router_check_interval	60	Router ping interval (seconds).
check_routers_before_use	0	When 1, routers are checked before they are first used.

Defaults can differ between Lustre releases; lnetctl global show reports the values currently in effect for the global health settings.

Commands

Command	Purpose
lnetctl global show	View global settings like health_sensitivity.
lnetctl set health_sensitivity <value>	Set sensitivity.
lnetctl net show -v 3	Show NI health (local).
lnetctl peer show -v 3	Show peer NI health (remote).
lnetctl stats show	View failure stats (resend_count, etc.).
lnetctl discover <nid>	Force peer re-discovery.
options ko2iblnd peer_timeout=<seconds>	Set peer timeout (an LND module option, set in the modprobe configuration; ksocklnd accepts the same option).

Recovery Mechanisms

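Failover: Shifts traffic to healthy NIs in multi-rail configurations.
Queues: Unhealthy NIs are placed on a recovery queue and pinged every recovery_interval seconds; each successful ping increments the health value.
Retries: A message is retried up to retry_count times; local and remote failures are handled differently.
Router Checker: Pings both live and dead routers and updates their health accordingly.
Discovery: A PUSH notifies peers of NI state changes.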
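
To get a feel for how long the recovery queue takes to restore an NI, the arithmetic can be sketched as below. This is a rough estimate, not LNet logic: the function name is hypothetical, the per-ping increment (STEP) is an assumption, and every recovery ping is assumed to succeed.

```shell
#!/bin/sh
# Illustrative estimate only; assumes every recovery ping succeeds.
MAX_HEALTH=1000

# recovery_seconds HEALTH STEP INTERVAL
# Prints the seconds needed to climb from HEALTH back to MAX_HEALTH when each
# successful ping adds STEP (assumed) and pings are INTERVAL seconds apart.
recovery_seconds() {
    health=$1
    step=$2
    interval=$3
    deficit=$(( MAX_HEALTH - health ))
    pings=$(( (deficit + step - 1) / step ))   # ceiling division
    echo $(( pings * interval ))
}
```

For example, recovery_seconds 900 1 1 prints 100: after a single failure at sensitivity 100, and assuming an increment of 1 per ping, it takes 100 successful pings at a 1-second recovery_interval to return to full health.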

Local vs. Remote Failures

Type	Impact	Recovery
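Local NI	Affects outbound traffic; the local NI's health is decremented.	Traffic fails over to another local NI; the message is resent or not, depending on the failure type.
Remote NI/Peer	Affects inbound and outbound traffic; detected by timeout.	The message is resent or not, depending on the failure type, so that no duplicates are sent; discovery re-pings the peer.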

Integration with Multi-Rail and Routing

Multi-Rail: Selects the NI with the highest health value for each message.
Routing: A route is up if the gateway has a healthy NI on the remote network; asymmetrical routes are supported.
Resiliency: Health checks steer traffic away from routers that are down.
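
The "highest health wins" rule can be illustrated with a small filter. This is a simplified sketch with a hypothetical function name: it compares health values only, while real selection also weighs other criteria that are not modelled here.

```shell
#!/bin/sh
# pick_healthiest: reads "NID HEALTH" pairs, one per line, on stdin and prints
# the NID with the highest health value (the first one seen wins a tie).
# Simplified model: only the health value is compared.
pick_healthiest() {
    awk '
        NF >= 2 && (!seen || $2 + 0 > best) { best = $2 + 0; nid = $1; seen = 1 }
        END { if (seen) print nid }
    '
}
```

Fed the lines "10.0.0.1@tcp 700" and "10.0.0.2@tcp 1000", the filter prints 10.0.0.2@tcp, mirroring how traffic shifts away from an NI whose health has dropped.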

Recent Enhancements

2.12: Core health feature.
2.13: Multi-rail routing with health; asymmetrical routes.
2.15+: No major changes; indirect benefits from wide striping, etc.

Examples

Basic Tuning

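lnetctl set health_sensitivity 100  # Decrement per failure (the default)
lnetctl set recovery_interval 1     # Ping unhealthy NIs every second
lnetctl set retry_count 3           # Raise retries from the default of 2
lnetctl global show                 # Confirm the values in effect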

Monitoring

lnetctl net show -v 3  # Local NI health
lnetctl peer show -v 3 # Remote health
lnetctl stats show     # Failure stats
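
For routine checks, the verbose output can be filtered down to the NIs that are below full health. The sketch below uses a hypothetical function name and assumes that each NI appears as a "- nid: <NID>" line followed later by a "health value: <N>" line; verify that shape against the output of your release before relying on it.

```shell
#!/bin/sh
# degraded_nis: reads `lnetctl net show -v 3` style text on stdin and prints
# "NID HEALTH" for every NI whose health value is below 1000.
# ASSUMPTION: the output contains "- nid: <NID>" and "health value: <N>" lines.
degraded_nis() {
    awk '
        $1 == "-" && $2 == "nid:"        { nid = $3 }
        $1 == "health" && $2 == "value:" { if ($3 + 0 < 1000) print nid, $3 }
    '
}
```

Typical use would be lnetctl net show -v 3 | degraded_nis; no output means every local NI reports a health value of 1000.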

Router Checker

# LNet module options, e.g. in /etc/modprobe.d/lustre.conf (applied when the lnet module loads)
options lnet alive_router_check_interval=60 check_routers_before_use=1

Client vs. Server: Same mechanisms; servers benefit more from router health in large clusters.