Lustre Performance Benchmarking
Lustre is a high-performance, parallel distributed file system commonly used in high-performance computing (HPC) environments for large-scale data storage and I/O. It consists of Metadata Servers (MDS), Object Storage Servers (OSS), and clients that access the filesystem. Benchmarking Lustre evaluates its performance in terms of I/O bandwidth (data transfer rates), IOPS (Input/Output Operations Per Second), and metadata operations (such as file creation and deletion).
This guide is written for users of all experience levels, including beginners. It covers key tools, detailed usage examples, tuning tips, and metrics for Lustre versions 2.17.0 and 2.15.x (as of January 2026). Always start by benchmarking the raw hardware (e.g., disks and network) to establish baselines—aim for 85-90% of that baseline when running through Lustre. Refer to recent updates from the Lustre Operations Manual and discussions from LUG (Lustre User Group) 2025 on advancements like DNE3 (Distributed Namespace Enhancement 3) for automated metadata scaling.
Introduction for Beginners
If you're new to Lustre benchmarking, understand that these tests simulate real-world workloads to identify bottlenecks in storage, network, or configuration. Key concepts:
- I/O Bandwidth: Measures how much data can be read or written per second (e.g., in GB/s). Useful for large file transfers.
- IOPS: Counts the number of small read/write operations per second. Important for applications with many small files.
- Metadata Operations: Involves file system actions like creating, listing, or deleting files, which can be a bottleneck in directories with millions of files.
- MPI (Message Passing Interface): Many tools use MPI for parallel execution across multiple nodes/clients.
- Prerequisites: Ensure you have administrative access, a mounted Lustre filesystem, and tools like Git and compilers installed. Tests should be run in a controlled environment to avoid impacting production systems.
Warning: Benchmarking can generate heavy load, potentially causing data loss if not careful. Always back up important data, run tests on non-production mounts, and monitor system health (e.g., CPU, memory, disk space) during runs.
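The prerequisites above can be sanity-checked with a short script before generating any load. This is a minimal sketch; the mount point /lustre and the tool list are assumptions to adapt to your site.

```shell
#!/bin/sh
# Sanity-check the benchmarking prerequisites. MOUNTPOINT is an assumption;
# override it with your site's actual Lustre mount.
MOUNTPOINT=${MOUNTPOINT:-/lustre}

check_tool() {
    # Print "ok" if the named command is on PATH, "missing" otherwise.
    if command -v "$1" >/dev/null 2>&1; then echo "ok"; else echo "missing"; fi
}

for tool in mpirun ior mdtest fio lfs lctl; do
    printf '%-8s %s\n' "$tool" "$(check_tool "$tool")"
done

# Confirm the Lustre client mount exists before generating load.
if mount -t lustre | grep -q "$MOUNTPOINT"; then
    echo "Lustre mounted at $MOUNTPOINT"
else
    echo "WARNING: no Lustre mount at $MOUNTPOINT" >&2
fi
```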
Benchmarking Tools
These tools are essential for testing different aspects of Lustre performance. For beginners, start with simple single-node tests before scaling to multiple clients.
| Tool | Purpose | Installation/Availability | Beginner Notes |
|---|---|---|---|
| IOR | I/O bandwidth testing (sequential/random reads/writes). Simulates large-scale data transfers. | Clone from github.com/hpc/ior; build with MPI (e.g., ./configure --with-lustre; make). | Easy to start with default options; focus on bandwidth metrics. Requires MPI for parallel runs. |
| mdtest | Metadata operations (create/stat/remove). Tests how quickly the filesystem handles file management. | Included with IOR; build together. | Ideal for metadata-heavy workloads like simulations with many small files. Watch for ops/sec rates. |
| fio | Flexible I/O patterns (bandwidth, IOPS, latency). Highly customizable for mixed workloads. | Install via package manager (e.g., dnf install fio on RHEL). | Beginner-friendly with job files; experiment with block sizes to mimic your application's I/O. |
| obdfilter-survey | OST performance survey (disk/network). Checks individual Object Storage Targets (OSTs). | Built-in Lustre tool; run via lctl. | Use to isolate slow OSTs; output shows per-thread performance. |
| sgpdd-survey | Raw hardware I/O. Benchmarks underlying disks without Lustre overhead. | Lustre utility script in /usr/lib64/lustre/tests/. | Run this first to set expectations; compare to Lustre results for efficiency. |
| IO500 | Runs IOR, mdtest, and find with standard parameters to show the best/worst performance envelope of a system. Provides a comprehensive score for ranking storage systems. | Clone from github.com/IO500/io500; build with MPI (e.g., ./prepare; make). | Great for standardized comparisons; submit results to the IO500 list for global rankings. |
| llstat / llobdstat | Simple local real-time stats monitoring utilities available with every Lustre installation. Tracks I/O and metadata stats. | Built-in; use lctl get_param. | Like 'top' for Lustre; monitor during tests to spot live issues. |
| jobstats | Per-job load monitoring for each client process/job. Helps identify resource-hungry jobs. | Enabled on MGS; query via lctl. | Enable globally; useful in shared clusters to enforce fair usage. |
| lljobstat | Local tool to monitor job stats on a server (2.15+). Top-like utility to monitor RPCs sent to server to isolate high load. | Built-in for recent versions. | Run on servers to debug client-induced overloads. |
Running Benchmarks
Prerequisites: Mount the Lustre filesystem (e.g., mount -t lustre mgsnode:/fsname /lustre), ensure NTP is synced across nodes for accurate timings, and temporarily disable firewalls/SELinux for tests (restore them afterwards). Run benchmarks across multiple clients/servers for aggregate results that reflect real HPC workloads. Clear client stats before each run (lctl set_param osc.*.stats=clear mdc.*.stats=clear) to ensure consistent measurements. For beginners, start with small-scale tests (e.g., 1-4 processes) and scale up gradually.
Best Practices: Run each test 3-5 times and average results to account for variability. Use dedicated test directories to avoid interfering with other data. Document your environment (Lustre version, hardware specs) for reproducibility.
Warnings: High-thread counts can overwhelm systems—monitor temperatures and logs (e.g., dmesg). Avoid running during peak hours in shared environments. If tests fail, check for errors in /var/log/messages or Lustre debug logs (lctl debug_daemon start).
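The repeat-and-average practice above can be scripted. This is a sketch, assuming IOR is installed and /lustre is your test mount; the run count and process count are illustrative, and the awk pattern parses IOR's "Max Write:" summary line.

```shell
#!/bin/sh
# Sketch of the repeat-and-average practice; the mpirun/ior line is
# illustrative -- substitute your own benchmark command and result parser.
RUNS=3
RESULTS=/tmp/ior_runs.txt
: > "$RESULTS"

for i in $(seq 1 $RUNS); do
    # Clear client-side stats so each run starts from a clean slate.
    lctl set_param osc.*.stats=clear mdc.*.stats=clear >/dev/null 2>&1
    # Capture the aggregate write rate (MiB/s) from IOR's summary line.
    mpirun -np 16 ior -a POSIX -t 1m -b 1g -F -e -o /lustre/testfile \
        | awk '/^Max Write:/ {print $3}' >> "$RESULTS"
    sleep 5   # let the system settle between runs
done

# Average the collected rates.
awk '{sum += $1} END {if (NR) printf "mean write: %.1f MiB/s over %d runs\n", sum/NR, NR}' "$RESULTS"
```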
IOR Example (Bandwidth)
IOR is great for testing large, contiguous I/O. For beginners: the -a POSIX API works well with Lustre; -F creates one file per process to avoid lock contention; -C reorders tasks so that each process reads back data written by a different process, defeating client-side caches.
# Install IOR (example on RHEL; ensure MPI is in PATH)
dnf install openmpi-devel git automake autoconf libtool make gcc
git clone https://github.com/hpc/ior.git
cd ior
./bootstrap
./configure --with-lustre
make install
# Run sequential write (adjust -np for processes, -t 1m for transfer size (1MB chunks), -b 1g for block size (1GB total per process), -F for file per process, -C to reorder tasks on read-back and defeat client caches, -e for fsync, -k keep file after test, -vv verbose, -o output file)
mpirun -np 16 ior -a POSIX -t 1m -b 1g -F -C -e -k -vv -o /lustre/testfile
Interpret: Output shows write/read rates in MB/s or GiB/s. Compare to theoretical limits (e.g., network speed * number of clients). If low, check stripe count (lfs getstripe /lustre) or network congestion.
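Checking and widening striping, as suggested above, looks like this; the directory /lustre/testdir is a placeholder for your own test directory.

```shell
# Inspect current striping on the test directory (new files inherit its layout).
lfs getstripe -d /lustre/testdir

# Stripe new files across all OSTs (-c -1) with a 1 MiB stripe size (-S 1m);
# wide striping usually helps large sequential I/O from many processes.
lfs setstripe -c -1 -S 1m /lustre/testdir

# Verify which OSTs a freshly created file actually lands on.
touch /lustre/testdir/probe
lfs getstripe /lustre/testdir/probe
```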
mdtest Example (Metadata)
mdtest focuses on non-data operations. Beginners: Use -u for unique directories per process to reduce contention; -i 3 repeats the test for averaging.
# Run with IOR build (mdtest binary should be in same path)
mpirun -np 32 mdtest -d /lustre/testdir -i 3 -b 10 -z 1 -n 10000 -u -C -T -R -r
Parameters: -d test directory, -i iterations, -b branching factor (children per tree node), -z tree depth, -n files per process, -u unique dirs, -C create phase, -T stat phase, -R stat in random order, -r remove phase. Results: ops/sec for each operation—higher is better. If slow, consider DNE for metadata distribution.
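If metadata rates are the bottleneck, DNE can be exercised directly with standard lfs commands. A sketch; the directory names are placeholders, and the MDT index/count assume a filesystem with multiple MDTs.

```shell
# Place a directory on a specific MDT (requires multiple MDTs / DNE):
lfs mkdir -i 1 /lustre/testdir_mdt1

# Stripe a directory's metadata across 4 MDTs (DNE2 striped directories):
lfs mkdir -c 4 /lustre/testdir_striped

# Confirm the layout before re-running mdtest against it:
lfs getdirstripe /lustre/testdir_striped
```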
fio Example (IOPS)
fio allows custom workloads. For beginners: The job file defines parameters; direct=1 bypasses OS cache for true hardware tests.
# Create fio job file (random read/write; bs=4k for small blocks typical of IOPS tests)
cat <<EOF > lustre.fio
[global]
ioengine=psync # POSIX sync I/O
direct=1 # Bypass page cache
bs=4k # Block size
size=1g # File size per job
numjobs=16 # Threads per client
directory=/lustre
runtime=60 # Run for 60 seconds
group_reporting=1 # Aggregate results
[randrw]
rw=randrw # Random read/write
rwmixread=50 # 50% reads
EOF
fio lustre.fio
Results: Shows IOPS, latency (in us/ms), bandwidth. Tune bs for your workload (e.g., 4k for databases, 1m for streaming). If latency is high, check queue depths or add more servers.
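For contrast with the small-block IOPS job above, a sequential-bandwidth job in the same job-file format might look like this; sizes and job counts are illustrative.

```shell
# Companion job for sequential bandwidth (contrast with the 4k random job);
# tune size/numjobs to your client count and memory.
cat <<EOF > lustre-seq.fio
[global]
ioengine=psync # POSIX sync I/O
direct=1 # Bypass page cache
bs=1m # Large blocks for streaming I/O
size=4g # File size per job
numjobs=4 # Threads per client
directory=/lustre
runtime=60 # Run for 60 seconds
group_reporting=1 # Aggregate results
[seqwrite]
rw=write # Sequential write
[seqread]
rw=read # Sequential read
stonewall # Wait for the write phase to finish first
EOF
fio lustre-seq.fio
```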
obdfilter-survey Example (OST Survey)
This surveys OST performance. Beginners: Run with low threads first; case=netdisk tests both network and disk.
# Run network+disk test (parameters are passed as environment variables: rslt=/tmp/survey names the results files, thrlo=1 thrhi=4 sets the thread range, size=2048 is the amount of test data per target in MB, case=netdisk exercises both network and disk; limit which OSTs are tested with targets="fsname-OST0000 ...")
rslt=/tmp/survey thrlo=1 thrhi=4 size=2048 case=netdisk sh /usr/lib64/lustre/tests/obdfilter-survey
Output: Aggregate bandwidth, per-OST min/max rates. Use to detect faulty disks (low min) or network issues (inconsistent rates).
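To establish the raw-hardware baseline that obdfilter-survey results should be compared against, sgpdd-survey (from the tools table) can be run first. A sketch; the device list and sizes are placeholders, and note that this test destroys data on the listed devices.

```shell
# Raw-disk baseline with sgpdd-survey (DESTROYS DATA on the listed devices --
# run only on scratch disks, never on formatted or mounted OSTs).
# Device names and parameter values here are illustrative.
size=1024 crghi=16 thrhi=32 scsidevs="/dev/sg2 /dev/sg3" \
    sh /usr/lib64/lustre/tests/sgpdd-survey
# Compare the reported MB/s to obdfilter-survey results: Lustre should reach
# roughly 85-90% of this raw figure (see the baseline note in the introduction).
```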
Monitoring During Tests
Monitoring helps correlate benchmarks with system behavior. For beginners: Start with simple commands; enable jobstats only if needed, as it adds slight overhead.
# Jobstats (enable first by selecting a JobID source, e.g. lctl set_param -P jobid_var=procname_uid)
lctl get_param mdt.*.job_stats obdfilter.*.job_stats # Shows per-JobID ops/bytes read/written
# Real-time stats (e.g., OST I/O every 5 seconds)
llstat -i 5 ost.OSS.ost_io
# Clear stats for fresh start
lctl set_param osc.*.stats=clear mdc.*.stats=clear
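Two more views that are often useful during a run, using standard lctl parameters; this is a sketch to adapt to the clients/servers you are monitoring.

```shell
# Client-side RPC statistics: the "pages per rpc" histogram shows whether
# I/O is being aggregated into large RPCs (good for bandwidth) or split
# into many small ones (a common cause of low throughput).
lctl get_param osc.*.rpc_stats

# On a server, refresh aggregate OST I/O stats every 5 seconds during a run:
watch -n 5 "lctl get_param ost.OSS.ost_io.stats"
```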
Tuning Tips
Tuning optimizes Lustre for your hardware and workload. Beginners: Change one parameter at a time and re-benchmark.
- Client: Increase readahead for sequential reads (lctl set_param llite.*.max_read_ahead_mb=2048); boost max RPCs in flight (lctl set_param mdc.*.max_rpcs_in_flight=64) for parallelism.
- Server: Raise thread limits (lctl set_param ost.OSS.ost_io.threads_max=512); enable read cache for HDDs (lctl set_param osd-ldiskfs.*.read_cache_enable=1).
- Network: Adjust LNet credits/buffers for high throughput; use NRS (Network Request Scheduler) policies like TBF (Token Bucket Filter) for rate limiting jobs.
- Best Practices: Use OST pools to group similar storage (e.g., SSD for metadata); enable JobID tracking for accountability; integrate monitoring with tools like CollectL or LMT (Lustre Monitoring Tool); always test interoperability after tuning (e.g., with different client OS versions).
- Recent Advancements (2025): DNE3 for automated metadata scaling (from LUG 2025); enhanced jobstats in 2.16+ for finer granularity; optimizations for flash/SSD, including better wear-leveling integration.
Warnings: Over-tuning can lead to instability—e.g., too many threads may cause OOM (Out of Memory) errors. Persist changes with -P flag carefully, as they affect all clients.
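The change-one-parameter-at-a-time advice can be made mechanical. A sketch using one of the client tunables above; the value 64 is an example, not a universal recommendation.

```shell
# Apply one client tunable at a time, record the old value, and re-benchmark.
PARAM=mdc.*.max_rpcs_in_flight
echo "before:"; lctl get_param $PARAM
lctl set_param $PARAM=64          # transient: reverts on remount
# lctl set_param -P $PARAM=64     # persistent: affects ALL clients -- use with care
echo "after:";  lctl get_param $PARAM
```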
Key Metrics
These are typical aggregate values for large-scale systems; scale down for smaller setups.
| Metric | Typical Values (Aggregate) | Source | Beginner Interpretation |
|---|---|---|---|
| Bandwidth | Up to 50 TB/s | IOR/fio; llobdstat | High values suit large data transfers; if low, check striping. |
| IOPS | Up to 225M | fio/mdtest | Essential for random access; aim for application needs (e.g., 10k+ for databases). |
| Metadata Ops | 1M creates/s, 2M stats/s | mdtest; rpc_stats | Slow metadata can bottleneck workflows; use DNE to distribute. |
| RPC Latency | Monitor timeouts/queue depth | llstat | High latency (>1ms) suggests overload; reduce with more servers. |
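Comparing your measured numbers against the raw-hardware baseline (the 85-90% target from the introduction) is simple arithmetic; the figures below are placeholders for your own sgpdd-survey and IOR results.

```shell
#!/bin/sh
# Compute Lustre efficiency against the raw-hardware baseline.
# The two figures below are placeholders -- substitute your own results.
raw_mbs=12000      # e.g. aggregate MB/s from sgpdd-survey
lustre_mbs=10200   # e.g. aggregate MB/s from IOR

efficiency() {
    # Print lustre/raw as a percentage with one decimal place.
    awk -v l="$1" -v r="$2" 'BEGIN { printf "%.1f\n", (l / r) * 100 }'
}

pct=$(efficiency "$lustre_mbs" "$raw_mbs")
echo "Lustre efficiency: ${pct}%"   # prints "Lustre efficiency: 85.0%"
```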
Additional Resources and Troubleshooting
For more details:
- Lustre Manual: doc.lustre.org (search for "performance tuning").
- JIRA Tickets: Check OpenSFS JIRA for LU- prefixed issues on performance.
- Community: Join Lustre mailing lists or attend LUG events (e.g., LUG 2025 slides on DNE3).
- Common Pitfalls: Forgetting to clear caches leads to inflated results; mismatched stripe sizes cause poor performance—use lfs setstripe -c -1 for full striping.
- Advanced Tools: Consider integrating with Grafana for visual monitoring or using IO500 for competitive benchmarking.
If issues arise, enable debug logging (lctl set_param debug=+io), reproduce the problem, and analyze with lctl debug_file. Always update to the latest patch level for bug fixes.
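The debug workflow in the paragraph above, spelled out as commands; file paths and the log size cap are examples.

```shell
# Debug-logging workflow sketched from the steps above.
lctl set_param debug=+io                            # add I/O tracing to the debug mask
lctl debug_daemon start /tmp/lustre-debug.log 100   # log to file, 100 MB cap
# ... reproduce the problem here ...
lctl debug_daemon stop
lctl debug_file /tmp/lustre-debug.log /tmp/lustre-debug.txt   # convert binary log to text
lctl set_param debug=-io                            # remove I/O tracing again
```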