LNet Health Monitoring Details
LNet health monitoring detects and responds to failures in network interfaces (NIs), peers, and routers, enabling automatic failover in multi-rail configurations. It was introduced in Lustre 2.12, extended to routing in 2.13, and remains stable through 2.17.0 (as of January 2026). This guide covers architecture, parameters, commands, and examples drawn from the Lustre Operations Manual (updated 2025); see that manual for full details.
Architecture
- Scope: Monitors NIs, peers, and routers; integrated with multi-rail and routing.
- Tracking: Per-NI and per-peer health values; router health via periodic checks.
- Detection: LND-reported send failures, transaction timeouts, and pings; distinguishes local from remote failures.
- Integration: Influences NI selection; supports asymmetrical routes (2.13+).
Health Values
| Indicator | Range / Meaning |
| --- | --- |
| Health Value (per NI/Peer) | 0-1000; starts at 1000; decrements on failure, increments on success. |
| Status | up/down; shown in lnetctl outputs. |
| Failure Counters | dropped, resend_count, local_error_count (via stats). |
| Router Health | Determined by pings; auto_down marks down. |
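The health value arithmetic in the table can be sketched as a small model. This is an illustration only, not LNet's implementation: the class name `HealthValue` is hypothetical, and the amount regained per success (`recovery_step`) is an assumption, since the table only says the value increments on success.

```python
from dataclasses import dataclass

MAX_HEALTH = 1000  # top of the 0-1000 range given in the table


@dataclass
class HealthValue:
    """Toy model of a per-NI or per-peer health value (hypothetical class)."""
    value: int = MAX_HEALTH   # health starts at the maximum
    sensitivity: int = 100    # health_sensitivity; 0 disables decrements
    recovery_step: int = 1    # ASSUMPTION: amount regained per success

    def record_failure(self) -> int:
        # Decrement by the sensitivity, never dropping below 0.
        self.value = max(0, self.value - self.sensitivity)
        return self.value

    def record_success(self) -> int:
        # Increment by the assumed step, never exceeding the maximum.
        self.value = min(MAX_HEALTH, self.value + self.recovery_step)
        return self.value


ni = HealthValue()
ni.record_failure()   # 1000 -> 900
ni.record_failure()   # 900 -> 800
ni.record_success()   # 800 -> 801
print(ni.value)
```

With `sensitivity=0` the failure path leaves the value untouched, which matches the "0 disables" note in the parameter table.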
Parameters
| Parameter | Default | Description |
| --- | --- | --- |
| health_sensitivity | 100 | Decrement amount on failure; 0 disables. |
| recovery_interval | 1 | Seconds between recovery pings. |
| transaction_timeout | 30 | Message timeout in seconds; the default has differed between releases, so confirm with lnetctl global show. |
| retry_count | 2 | Retries for recoverable failures. |
| peer_timeout | 180 | Seconds before an aliveness query is sent to a peer (an LND tunable). |
| avoid_asym_router_failure | 1 | Requires healthy remote NI for route up. |
| alive_router_check_interval | 60 | Router ping interval (seconds). |
| check_routers_before_use | 0 | Enable pre-use router checks. |
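One way to keep the lnetctl-settable values from the table in a single place is to render them into `lnetctl set` commands. The sketch below assumes the defaults listed above; the class `HealthTunables` and its sanity check are hypothetical helpers, not part of Lustre, and the check is much simpler than whatever validation the kernel applies.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class HealthTunables:
    """Health settings accepted by `lnetctl set`, with the table's defaults."""
    health_sensitivity: int = 100   # 0 disables health decrements
    recovery_interval: int = 1      # seconds between recovery pings
    transaction_timeout: int = 30   # message timeout in seconds
    retry_count: int = 2            # retries for recoverable failures

    def __post_init__(self) -> None:
        # ASSUMPTION: a basic sanity check, not the kernel's own validation.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or value < 0:
                raise ValueError(f"{f.name} must be a non-negative integer")

    def to_commands(self) -> List[str]:
        # One `lnetctl set <name> <value>` line per tunable, in field order.
        return [f"lnetctl set {f.name} {getattr(self, f.name)}"
                for f in fields(self)]


for cmd in HealthTunables(retry_count=3).to_commands():
    print(cmd)
```

The router-related parameters are left out on purpose, because they are not set through `lnetctl set`.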
Commands
| Command | Purpose |
| --- | --- |
| lnetctl global show | View global settings like health_sensitivity. |
| lnetctl set health_sensitivity <value> | Set sensitivity. |
| lnetctl net show -v 3 | Show NI health (local). |
| lnetctl peer show -v 3 | Show peer NI health (remote). |
| lnetctl stats show | View failure stats (resend_count, etc.). |
| lnetctl discover <nid> | Force peer re-discovery. |
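| lnetctl net add --net <net> --if <interface> --peer-timeout <seconds> | Set peer timeout when adding an NI (peer_timeout is an LND tunable, not an lctl parameter). |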
Recovery Mechanisms
- Failover: Shifts traffic to healthy NIs in multi-rail.
- Queues: Unhealthy NIs pinged every recovery_interval; success increments health.
- Retries: Up to retry_count; local vs. remote handling differs.
- Router Checker: Pings live/dead routers; updates health.
- Discovery: PUSH notifies NI state changes.
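To get a feel for how long the recovery queue takes to restore an NI, the following sketch estimates the number of recovery pings and the elapsed time. It assumes every ping succeeds and that each success restores a fixed `recovery_step`; both the step size and the function names are assumptions made for illustration, so treat the result as a rough upper-level estimate rather than LNet's exact behavior.

```python
import math

MAX_HEALTH = 1000  # full health, per the 0-1000 range


def pings_to_full_health(current: int, recovery_step: int = 1) -> int:
    """Successful recovery pings needed to climb back to MAX_HEALTH."""
    if recovery_step <= 0:
        raise ValueError("recovery_step must be positive")
    deficit = max(0, MAX_HEALTH - current)
    return math.ceil(deficit / recovery_step)


def recovery_seconds(current: int, recovery_interval: int = 1,
                     recovery_step: int = 1) -> int:
    """Estimated seconds to full health with one ping per recovery_interval."""
    return pings_to_full_health(current, recovery_step) * recovery_interval


# One failure at health_sensitivity=100 leaves the NI at 900.
print(recovery_seconds(900))                       # assumed step of 1
print(recovery_seconds(900, recovery_interval=5))  # slower ping cadence
```

Under these assumptions a larger health_sensitivity means a deeper drop per failure and therefore a longer climb back, which is the trade-off to weigh when tuning it.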
Local vs. Remote Failures
| Type | Impact | Recovery |
| --- | --- | --- |
| Local NI | Outbound traffic; decrement local health. | Failover to other local NIs; local resend/no-resend. |
| Remote NI/Peer | Inbound/outbound; detected by timeout. | Remote resend/no-resend; no duplicates; discovery re-pings. |
Integration with Multi-Rail and Routing
- Multi-Rail: Selects highest health NI for messages.
- Routing: Route up if gateway has healthy remote NI; asymmetrical routes supported.
- Resiliency: Avoids down routers via health checks.
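The selection rules above can be sketched in a few lines. This is a deliberate simplification: real NI selection weighs more than health alone, and the `NI` type, the function names, the example NIDs, and the reading of "healthy" as "status up" are all assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class NI:
    """Hypothetical view of one network interface."""
    nid: str
    health: int      # 0-1000 health value
    up: bool = True  # status as reported by lnetctl


def select_ni(nis: Sequence[NI]) -> Optional[NI]:
    """Pick the up NI with the highest health; first one wins on ties."""
    candidates = [ni for ni in nis if ni.up]
    if not candidates:
        return None
    return max(candidates, key=lambda ni: ni.health)


def route_is_up(gateway_remote_nis: Sequence[NI]) -> bool:
    """ASSUMPTION: a route is usable if any remote NI on the gateway is up."""
    return any(ni.up for ni in gateway_remote_nis)


local = [
    NI("10.0.0.1@tcp", 900),
    NI("10.0.1.1@tcp", 1000),
    NI("10.0.2.1@tcp", 1000, up=False),
]
best = select_ni(local)
print(best.nid if best else "no usable NI")
```

Keeping selection and route status as separate functions mirrors the split in the list above between multi-rail NI choice and gateway health.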
Recent Enhancements
- 2.12: Core health feature.
- 2.13: Multi-rail routing with health; asymmetrical routes.
- 2.15+: No major changes; indirect benefits from wide striping, etc.
Examples
Basic Tuning
lnetctl set health_sensitivity 100
lnetctl set recovery_interval 1
lnetctl set retry_count 3
lnetctl global show
Monitoring
lnetctl net show -v 3 # Local NI health
lnetctl peer show -v 3 # Remote health
lnetctl stats show # Failure stats
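For scripted monitoring, the verbose output can be scanned for degraded interfaces. The sketch below parses an embedded sample instead of calling lnetctl; the sample text is an abbreviated, assumed approximation of `lnetctl net show -v 3` output, and the field names `nid` and `health value` should be checked against the output of your own release before relying on it. The function names are hypothetical.

```python
import re
from typing import Dict, List

# ASSUMPTION: abbreviated, illustrative output; real output has more fields.
SAMPLE = """\
net:
    - net type: tcp
      local NI(s):
        - nid: 192.168.1.10@tcp
          status: up
          health stats:
              health value: 1000
        - nid: 192.168.2.10@tcp
          status: up
          health stats:
              health value: 800
"""


def parse_health(text: str) -> Dict[str, int]:
    """Map each NID to the health value that follows it in the output."""
    health: Dict[str, int] = {}
    nid = None
    for line in text.splitlines():
        m = re.match(r"\s*-?\s*nid:\s*(\S+)", line)
        if m:
            nid = m.group(1)
            continue
        m = re.match(r"\s*health value:\s*(\d+)", line)
        if m and nid is not None:
            health[nid] = int(m.group(1))
    return health


def degraded(health: Dict[str, int], full: int = 1000) -> List[str]:
    """NIDs whose health value is below the maximum, sorted by name."""
    return sorted(nid for nid, value in health.items() if value < full)


print(degraded(parse_health(SAMPLE)))
```

In practice the text would come from running the command, for example with `subprocess.run(["lnetctl", "net", "show", "-v", "3"], capture_output=True, text=True)`, and a YAML parser would be more robust than line matching.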
Router Checker
# /etc/modprobe.d/lustre.conf (lnet module parameters, applied when the lnet module loads)
options lnet alive_router_check_interval=60 check_routers_before_use=1
Client vs. Server: Same mechanisms; servers benefit more from router health in large clusters.