Lustre Startup Tutorial
Starting a Lustre file system in the correct order ensures clean recovery, proper module loading, and availability for clients. The correct order is: MGS/MDT first (metadata), then OSS/OSTs (data), then clients last. This guide is for Lustre 2.17.0 (January 2026), based on the Lustre Operations Manual (updated 2025). Use it at boot, after maintenance, or during recovery. For production, integrate with HA tools like Pacemaker for automatic startup.
Prerequisites
- All nodes powered on; storage devices available.
- Lustre packages installed; modules loadable.
- Check logs for prior issues: journalctl -u lustre.
- If using HA (e.g., Pacemaker), start cluster resources.
- Run e2fsck manually if corruption is suspected, and only while the target is unmounted (ldiskfs targets): e2fsck -fp /dev/sdX.
Correct Startup Order
| Step | Component | Reason |
|---|---|---|
| 1 | MGS/MDT/MDS | Provides configuration and metadata; required for OSTs/clients. |
| 2 | OSS/OSTs | Registers with MGS; provides data storage. |
| 3 | Clients | Mounts after servers are up; avoids timeouts. |
| 4 | Verify & Tune | Check recovery, adjust params. |
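
The ordering in the table can be sketched as a small shell helper that sorts an inventory of nodes into startup order. This is an illustrative sketch only: the role labels (mgs, mds, oss, client) and the "host role" input format are assumptions made for the example, not something Lustre defines.

```shell
#!/bin/sh
# Map a node role to its position in the startup order from the table.
# The role labels are hypothetical names chosen for this sketch.
start_rank() {
    case "$1" in
        mgs|mds|mdt) echo 1 ;;
        oss|ost)     echo 2 ;;
        client)      echo 3 ;;
        *)           return 1 ;;
    esac
}

# Read "host role" pairs on stdin and print the hosts in startup order.
# Lines with an unknown role are reported on stderr and skipped.
order_hosts() {
    while read -r host role; do
        [ -n "$host" ] || continue
        if rank=$(start_rank "$role"); then
            printf '%s %s\n' "$rank" "$host"
        else
            echo "skipping $host: unknown role '$role'" >&2
        fi
    done | sort -n -k1,1 | cut -d' ' -f2
}
```

For the three-node layout used later in this guide, `printf 'node3 client\nnode2 oss\nnode1 mds\n' | order_hosts` lists node1 first, then node2, then node3.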
Step-by-Step Startup
This walkthrough assumes a simple setup (e.g., from the 3-node tutorial: Node1 is the MDS, Node2 the OSS, Node3 a client).
1. Start MGS/MDT/MDS (Node1)
# Load modules
modprobe lustre
# Mount MDT
mkdir -p /mnt/testfs-mdt0
mount.lustre /dev/sdb /mnt/testfs-mdt0 # Adjust device
# Verify
lctl device_list
mount -t lustre # Lists mounted Lustre targets; lfs df only works on a node with the client mounted
2. Start OSS/OSTs (Node2)
# Load modules
modprobe lustre
# Mount OSTs
mkdir -p /mnt/testfs-ost0 /mnt/testfs-ost1
mount.lustre /dev/sdc /mnt/testfs-ost0 # Adjust devices
mount.lustre /dev/sdd /mnt/testfs-ost1
# Verify
lctl device_list # On Node2: lists the local OST devices
lfs df -h # Only from a node with the client mounted (after step 3); shows OSTs
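
Rerunning the mount commands for the MDT or OSTs on a node where a target is already mounted produces errors, which is awkward in boot scripts. A minimal sketch of an idempotent wrapper follows. The function names are hypothetical, and it assumes Lustre targets appear in /proc/mounts with file system type lustre.

```shell
#!/bin/sh
# Return 0 if MOUNTPOINT is listed as a Lustre mount in a /proc/mounts-style
# file (fields: device, mount point, fs type, ...). The file defaults to
# /proc/mounts; passing another path is mainly useful for testing.
is_lustre_mounted() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    awk -v mp="$mountpoint" \
        '$2 == mp && $3 == "lustre" { found = 1 } END { exit !found }' \
        "$mounts_file"
}

# Mount a server target only if it is not mounted yet.
mount_target_once() {
    device=$1
    mountpoint=$2
    if is_lustre_mounted "$mountpoint"; then
        echo "already mounted: $mountpoint"
        return 0
    fi
    mkdir -p "$mountpoint" && mount -t lustre "$device" "$mountpoint"
}
```

On Node2 this could be called as `mount_target_once /dev/sdc /mnt/testfs-ost0`, and on Node1 as `mount_target_once /dev/sdb /mnt/testfs-mdt0`, with the devices adjusted as above.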
3. Mount Clients (Node3)
# Load modules
modprobe lustre
# Mount
mkdir -p /mnt/testfs
mount -t lustre 192.168.1.1@tcp0:/testfs /mnt/testfs # Format: MGSNID@NET:/FSNAME MOUNTPOINT
# Verify
df -h /mnt/testfs
lfs df -h
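
A malformed mount source is a common cause of client mount failures. As a hedged sketch, the helper below assembles the MGSNID@NET:/FSNAME string and rejects obviously wrong input before mount is called. The function name is hypothetical and the checks are deliberately loose: it only requires an "@" in the NID and enforces the 8-character limit on Lustre file system names.

```shell
#!/bin/sh
# Build the client mount source "<MGS NID>:/<fsname>".
# A NID has the form "<address>@<LNet network>", e.g. 192.168.1.1@tcp0.
client_mount_source() {
    mgs_nid=$1
    fsname=$2
    case "$mgs_nid" in
        ?*@?*) ;;
        *)
            echo "MGS NID must look like address@network: $mgs_nid" >&2
            return 1
            ;;
    esac
    # Lustre file system names are limited to 8 characters.
    if [ -z "$fsname" ] || [ "${#fsname}" -gt 8 ]; then
        echo "fsname must be 1-8 characters: $fsname" >&2
        return 1
    fi
    printf '%s:/%s\n' "$mgs_nid" "$fsname"
}
```

A usage example on Node3: `src=$(client_mount_source 192.168.1.1@tcp0 testfs) && mount -t lustre "$src" /mnt/testfs`.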
4. Verify and Tune
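# Check recovery (on MDS/OSS)
lctl get_param mdt.*.recovery_status # On the MDS
lctl get_param obdfilter.*.recovery_status # On each OSS; wait for "status: COMPLETE" and "completed_clients" to match the total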
# Tune if needed (e.g., on clients)
lctl set_param osc.*.max_dirty_mb=32
# Check connections
lshowmount -v # From MDS
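
Reading recovery_status by eye does not scale past a few targets. The sketch below reduces the output to a single exit code. It assumes each target's output contains a line of the form "status: VALUE" and treats COMPLETE and INACTIVE (no recovery needed) as finished; the function name is hypothetical, and the parsing should be checked against the output of your Lustre version.

```shell
#!/bin/sh
# Read recovery_status output (one or more targets) on stdin.
# Exit 0 only if at least one "status:" line is present and every
# status is COMPLETE or INACTIVE; exit 1 otherwise.
recovery_done() {
    awk '
        $1 == "status:" {
            seen = 1
            if ($2 != "COMPLETE" && $2 != "INACTIVE") pending = 1
        }
        END { exit (seen && !pending) ? 0 : 1 }
    '
}
```

On the MDS, for example: `lctl get_param mdt.*.recovery_status | recovery_done && echo "MDT recovery finished"`.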
Best Practices for Friendly Startup
- Automate: Use systemd units or Pacemaker for production.
- Monitor Recovery: Set recovery_time_soft=300 for timeouts.
- Enable Features: lctl set_param mdt.*.enable_remote_dir=1 for DNE.
- Post-Start Checks: Run lfs check servers; verify quotas if enabled.
- HA: Start Pacemaker: pcs cluster start --all.
- Avoid Cold Starts: Warm up caches with reads if possible.
- Logging: Increase debug: lctl set_param debug=+ha.
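
When startup is automated, each stage usually has to wait for the previous one to become ready. A minimal polling helper, assuming plain POSIX sh and whole-second intervals, might look like this (the function name is hypothetical):

```shell
#!/bin/sh
# Poll a command until it succeeds or LIMIT seconds have elapsed.
# Usage: wait_until LIMIT INTERVAL COMMAND [ARGS...]
wait_until() {
    limit=$1
    interval=$2
    shift 2
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$limit" ]; then
            echo "timed out after ${elapsed}s: $*" >&2
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}
```

For instance, a boot script on the MDS could run `wait_until 300 10 sh -c 'lctl get_param -n mdt.*.recovery_status | grep -q "status: COMPLETE"'` before signalling that clients may mount. The 300-second limit mirrors the recovery_time_soft value above; the 10-second interval is an arbitrary choice.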
Common Issues
| Issue | Fix |
|---|---|
| Mount Fails (ETIMEDOUT) | Check network/LNet; start servers first. |
| Recovery Slow | Tune recovery_time_hard=600; evict stuck clients. |
| Modules Not Loaded | modprobe lustre; check deps. |
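
For the "Modules Not Loaded" case, a quick check before mounting can save a confusing error. The sketch below reads lsmod-style output (a header line, then one module per line with the name in the first column) and reports whether a module is present. The function name is hypothetical.

```shell
#!/bin/sh
# Return 0 if MODULE appears in lsmod-style output read from stdin.
# The first line is skipped because lsmod prints a column header.
module_listed() {
    awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}
```

Typical use: `lsmod | module_listed lustre || modprobe lustre`.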
For shutdown, reverse the order. See the Lustre Operations Manual for advanced configurations.