Lustre Failover Examples

Lustre failover provides high availability by allowing a backup server to take over an MDT (Metadata Target) or OST (Object Storage Target) service when the primary fails. The examples below use shared storage (e.g., SAN/RAID) and assume HA tooling such as Pacemaker for automatic switching. They are based on Lustre 2.17.0 (January 2026), with principles from the Lustre Operations Manual (updated 2025). Failover does not provide data redundancy; use RAID or FLR (File Level Redundancy) for that. For beginners: failover is like carrying a spare tire. If the main one fails, you switch over and keep going. This is crucial in HPC environments, where downtime can cost thousands in lost compute time.

Prerequisites

Before setting up failover, ensure that the shared storage device is visible to both nodes in each server pair, that all servers run the same Lustre and kernel versions, that the pairs can reach each other over LNet, and that fencing hardware (e.g., IPMI) is available. These preparations avoid the most common pitfalls.

MDT Failover Examples

MDT handles metadata operations, so failover here ensures quick recovery of file listings and permissions. Active/active uses DNE (Distributed Namespace Environment) for scaling.

Active/Active MDT Failover (with DNE)

Two active MDS nodes, each serving one MDT and acting as the failover node for the other. On failure, the surviving node takes over both MDTs. This maximizes hardware use but requires careful load balancing.

# Format MDT0 with failover
# Explanation: --fsname sets the filesystem name, --mgs for management if combined, --mdt for metadata target, --index assigns ID, --servicenode specifies primary and failover NIDs.
mkfs.lustre --fsname=testfs --mgs --mdt --index=0 --servicenode=mds1@tcp --servicenode=mds2@tcp /dev/sdX

# Format MDT1 with separate failover
# Explanation: Similar, but index=1 for second MDT; nodes swapped for load balancing.
mkfs.lustre --fsname=testfs --mdt --index=1 --servicenode=mds2@tcp --servicenode=mds1@tcp /dev/sdb
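If a target was originally formatted without a failover node, one can be added later with tunefs.lustre; a sketch, assuming the target is unmounted (changing server NIDs generally also requires regenerating the configuration logs with --writeconf):

```shell
# Add mds2 as an additional service node to an existing, unmounted MDT.
tunefs.lustre --servicenode=mds2@tcp /dev/sdX

# If NIDs changed, regenerate the config logs (unmount all targets first).
tunefs.lustre --writeconf /dev/sdX
```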

# Mount on respective primaries
# Explanation: Creates mount point and mounts the target. Do this on the primary node for each.
mkdir /mnt/testfs-mdt0000
mount.lustre /dev/sdX /mnt/testfs-mdt0000  # On mds1
mkdir /mnt/testfs-mdt0001
mount.lustre /dev/sdb /mnt/testfs-mdt0001  # On mds2

# On failover node (passive, do not mount until failover)
# Pacemaker config example (simplified)
# Explanation: Creates filesystem resource for monitoring mount status.
pcs resource create mdt0_fs Filesystem device="/dev/sdX" directory="/mnt/testfs-mdt0000" fstype="lustre" op monitor interval=60s
# Explanation: Creates Lustre-specific resource.
pcs resource create mdt0 Lustre target="/mnt/testfs-mdt0000" op monitor interval=60s
# Explanation: Groups resources together for atomic failover.
pcs resource group add mdt0_group mdt0_fs mdt0
pcs resource meta mdt0_group target-role="Started"

# As above, create separate groups for each MDT
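For true active/active operation, give each group a location preference so mds1 normally hosts MDT0 and mds2 hosts MDT1; a sketch, assuming a second group mdt1_group has been created the same way:

```shell
# Scores express preference, not requirement: after a failure either node
# can still run both groups, and resources migrate back on recovery.
pcs constraint location mdt0_group prefers mds1=100
pcs constraint location mdt1_group prefers mds2=100
```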

# Client mount with failover NIDs
# Explanation: Clients specify multiple NIDs for automatic reconnection on failure.
mount -t lustre mds1@tcp:mds2@tcp:/testfs /mnt/testfs

After setup, verify with lfs df -h on clients. If issues, check logs with dmesg | grep Lustre.
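The check can be taken a step further from a client; lfs check servers reports the state of every MDT and OST connection (run as root):

```shell
# Confirm the mount and per-target capacity.
lfs df -h /mnt/testfs

# Confirm every server connection is active.
lfs check servers
```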

OST Failover Examples

OST failover focuses on data storage; active/active distributes load for better performance.

Active/Active OST Failover

Two OSS nodes share the OSTs, each serving half; on failure, the survivor takes over all of them. This is efficient for large clusters.

# Format OST with failover
# Explanation: --ost for object storage, --mgsnode points to MGS, --servicenode for failover.
mkfs.lustre --fsname=testfs --ost --index=0 --mgsnode=mgs@tcp --servicenode=oss1@tcp --servicenode=oss2@tcp /dev/sdc

# Mount on oss1 (active for this OST)
# Explanation: Mount on the primary OSS.
mkdir /mnt/testfs-ost0000
mount.lustre /dev/sdc /mnt/testfs-ost0000

# Pacemaker config
# Explanation: Similar to MDT, creates filesystem and Lustre resources.
pcs resource create ost0_fs Filesystem device="/dev/sdc" directory="/mnt/testfs-ost0000" fstype="lustre"
pcs resource create ost0 Lustre target="/mnt/testfs-ost0000"
pcs resource group add ost0_group ost0_fs ost0

Manual Failover Test

Simulate failure and switch. Useful for testing without HA software.

# On primary: Unmount and stop
# Explanation: Cleanly unmount to avoid errors.
umount /mnt/testfs-ost0000

# On failover: Mount
# Explanation: Mount on the backup node.
mount.lustre /dev/sdc /mnt/testfs-ost0000

# Mark degraded (optional during RAID rebuild)
# Explanation: Sets the degraded flag so the MDS deprioritizes this OST for new object allocations while the underlying storage rebuilds; existing objects remain fully accessible.
lctl set_param obdfilter.testfs-OST0000.degraded=1

Monitor recovery with lctl get_param obdfilter.*.recovery_status.
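The manual sequence above can be sketched as a small script driven from an admin host (the hostnames oss1/oss2 and passwordless ssh are assumptions):

```shell
#!/bin/sh
# Manual OST failover sketch: stop on the primary, start on the backup,
# then poll recovery_status until client recovery finishes.
DEV=/dev/sdc
MNT=/mnt/testfs-ost0000

ssh oss1 "umount $MNT"
ssh oss2 "mkdir -p $MNT && mount -t lustre $DEV $MNT"

# recovery_status reports "status: RECOVERING" until clients reconnect.
while ssh oss2 "lctl get_param -n obdfilter.testfs-OST0000.recovery_status" |
      grep -q "status: RECOVERING"; do
    sleep 5
done
echo "OST0000 is now served by oss2"
```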

Integration with Pacemaker

Pacemaker handles automatic failover. It's a cluster resource manager that detects failures and migrates resources.

# Install
# Explanation: Installs Pacemaker and dependencies on RHEL-like systems.
dnf install pacemaker corosync pcs fence-agents-all

# Configure cluster
# Explanation: Authenticates nodes and creates the cluster. This is pcs 0.10+
# (RHEL 8+) syntax; older pcs used "pcs cluster auth" and "pcs cluster setup --name".
pcs host auth node1 node2
pcs cluster setup lustre_ha node1 node2
pcs cluster start --all
pcs cluster enable --all

# Add STONITH (fencing)
# Explanation: Configures a fencing device (e.g., IPMI) to power off failed nodes, preventing corruption. In production, define one device per node's BMC; ipaddr/login/passwd are deprecated aliases for ip/username/password.
pcs stonith create fence_ipmi fence_ipmilan pcmk_host_list="node1 node2" ip="bmc_ip" username="admin" password="pass" lanplus=1

# Add Lustre resources (as above)

Monitor: pcs status. For SLES, use zypper instead of dnf. Test by killing processes or pulling cables.
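A gentler test than pulling cables is node standby, which makes Pacemaker drain a node and migrate its resources (pcs 0.10+ syntax; older releases used pcs cluster standby):

```shell
# Drain node1; its resource groups start on node2.
pcs node standby node1
pcs status

# Return node1 to service; groups with location preferences move back.
pcs node unstandby node1
```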

Best Practices

These practices help keep failover reliable: always configure STONITH fencing before enabling automatic failover, keep Lustre and kernel versions identical across each server pair, test failover regularly under realistic load, and check recovery_status after every switch.

For recovery tuning, see the manual sections on IR (Imperative Recovery) and VBR (Version-Based Recovery). There are no major failover changes in 2.17, but check the release notes for minor fixes.
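Recovery state and timing can be inspected per target; a sketch (the parameter names follow the obdfilter convention used above, and the 900-second value is only an example):

```shell
# Show recovery state for all local targets.
lctl get_param *.*.recovery_status

# Lengthen the hard recovery window (seconds) on one OST, e.g. for slow clients.
lctl set_param obdfilter.testfs-OST0000.recovery_time_hard=900
```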

Additional Tips

For large-scale deployments, consider active/active for all components to utilize hardware fully. Enable debug logging during tests with lctl set_param debug=+ha. Join Lustre community forums for real-world advice. In cloud setups, use managed shared storage such as EBS with Multi-Attach. Always benchmark post-failover performance with IOR to confirm there is no degradation.
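As a sketch of that benchmark, a minimal IOR run under MPI (the process count, sizes, and path are illustrative):

```shell
# Write then read 1 GiB per process in 1 MiB transfers, one file per process.
mpirun -np 8 ior -w -r -t 1m -b 1g -F -o /mnt/testfs/ior_test
```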