Overview of Performance Monitoring Challenges in Fortran Development
In academic and scientific code development, Fortran is still the language of choice in many HPC and numerical computing scenarios, largely for historical reasons. When such code runs on a cluster, we need to monitor it so that we can build a good understanding of its performance. Here I want to address the most common issues that come up during monitoring, refine and expand on each point, and provide the most relevant columns to watch along with some concise solutions.
First, let's recall the basic tools used during monitoring; more detail can be found in the previous article.
Overview: `top`, `htop` / `atop`, `uptime`, `ps`, `sar`
CPU: `pidstat`, `mpstat`, `vmstat`
Memory: `free`, `vmstat`
Disk IO: `iostat`
Network: `netstat`
1. Bad Code - Intensive Computing (Algorithm/Coding Inefficiencies)
Issue: Inefficient algorithms, poorly structured loops, redundant calculations, lack of vectorization, suboptimal compiler optimization flags in Fortran code. This manifests as the code taking longer than expected to perform computations.
Key Columns to Check:
`%user` (`top`, `mpstat`, `vmstat`, `pidstat -u`): Primary indicator. High `%user` directly shows CPU time spent in the Fortran code.
`%CPU` (per-process in `top`, `pidstat`): For individual Fortran processes, shows their CPU usage percentage.
`load average` (`top`, `uptime`): A high load average, together with high `%user`, suggests CPU saturation.
Concise Solutions/Actions:
Profile Fortran Code: Use profilers (like `gprof`, `perf`, or the compiler's profiler).
Identify Hotspots: Pinpoint the most time-consuming subroutines/code sections from the profiler output.
Algorithm Optimization: Consider more efficient algorithms (if applicable).
Code Restructuring: Optimize loops, reduce redundant calculations, inline functions if needed.
Enable Compiler Optimizations: Use appropriate compiler flags (e.g., `-O3 -march=native -ffast-math`).
Vectorization: Write code in a vectorizable style (e.g., use array operations and contiguous memory access in loops); see the sketch below.
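As a quick illustration, here is a minimal sketch of a vectorization-friendly loop; the program and variable names are illustrative only, not taken from any particular code. The loop-invariant factor is computed once outside the loop, and the access is stride-1 over contiguous memory, so the compiler can vectorize it at `-O3` (with `gfortran`, `-fopt-info-vec` reports which loops were vectorized).

```fortran
! Minimal sketch of a vectorization-friendly loop (illustrative names only):
! the loop-invariant factor is hoisted out, and the loop walks contiguous
! memory with stride 1, which compilers vectorize readily at -O3.
program vectorize_demo
  implicit none
  integer, parameter :: n = 1000000
  double precision, allocatable :: x(:), y(:)
  double precision :: scale
  integer :: i

  allocate(x(n), y(n))
  call random_number(x)
  scale = 2.0d0 * acos(-1.0d0)       ! invariant computed once, outside the loop

  do i = 1, n
     y(i) = scale * x(i) + 1.0d0     ! stride-1 access over contiguous memory
  end do

  ! Equivalent array syntax, which compilers also vectorize well:
  ! y = scale * x + 1.0d0

  print *, sum(y) / n
end program vectorize_demo
```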
2. Parallel Communication Issues (MPI/OpenMP Imbalance)
Issue: In parallel Fortran codes using MPI or OpenMP, imbalances can occur when:
Load Imbalance: Some processes/threads have significantly more work than others, or finish their calculation much sooner, leaving them idle while they wait.
Communication Overhead: Excessive or inefficient communication between processes/threads.
Synchronization Bottlenecks: Waiting at barriers or synchronization points for other processes/threads to catch up.
Key Columns to Check:
`%system` (`top`, `mpstat`, `vmstat`): Elevated `%system` in parallel codes can indicate communication overhead within MPI/OpenMP libraries.
`%iowait` (`iostat`, `vmstat`, `sar`): In some cases of parallel I/O or network communication, we might see slightly elevated `%iowait` if processes are waiting for I/O or network operations to complete. But `%system` is usually more indicative of communication overhead itself.
MPI Profiling Metrics (from MPI profilers like `mpiP`, TAU): Crucial for parallel codes! These tools provide metrics like:
  Time in MPI Communication: Percentage of total time spent in MPI routines.
  Message Counts and Sizes: Volume of communication.
  Load Imbalance Metrics: Idle time of processes, time spent waiting in barriers.
Concise Solutions/Actions:
MPI Profiling (Essential): Use MPI profilers to analyze communication patterns.
Load Balancing: Redistribute workload to ensure even distribution across processes/threads.
Optimize Communication Algorithms: Use more efficient communication algorithms if possible.
Reduce Communication Frequency/Volume: Minimize data exchange. Aggregate messages if sending many small messages.
Non-Blocking Communication (MPI): Use non-blocking sends/receives to overlap computation and communication (see the sketch below).
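Below is a minimal sketch of the non-blocking pattern, assuming a hypothetical ring exchange between neighbouring ranks; buffer names and sizes are illustrative only. The point is to post `MPI_Irecv`/`MPI_Isend` early, do independent computation, and wait only when the received data is actually needed.

```fortran
! Minimal sketch of non-blocking MPI in Fortran (hypothetical ring exchange):
! post the receive and send first, do independent work, and only block when
! the received data is required.
program nonblocking_exchange
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, left, right
  integer :: requests(2), statuses(MPI_STATUS_SIZE, 2)
  double precision :: send_buf(1000), recv_buf(1000)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  left  = mod(rank - 1 + nprocs, nprocs)
  right = mod(rank + 1, nprocs)
  send_buf = real(rank, kind(1.0d0))

  ! Post communication first ...
  call MPI_Irecv(recv_buf, size(recv_buf), MPI_DOUBLE_PRECISION, left,  0, &
                 MPI_COMM_WORLD, requests(1), ierr)
  call MPI_Isend(send_buf, size(send_buf), MPI_DOUBLE_PRECISION, right, 0, &
                 MPI_COMM_WORLD, requests(2), ierr)

  ! ... then do computation that does not depend on recv_buf here ...

  call MPI_Waitall(2, requests, statuses, ierr)   ! block only when the data is needed
  call MPI_Finalize(ierr)
end program nonblocking_exchange
```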
3. Memory Related Issues (Fortran Code and Data Management)
Issue: Fortran codes, especially in scientific computing, often deal with very large datasets (arrays). Memory issues arise from:
Large Array Allocations: Declaring arrays that are excessively large or inefficiently sized.
Data Structure Choice: Using memory-inefficient data structures.
Memory Leaks: (Though less common in Fortran, possible with dynamic allocation).
Data Copying: Unnecessary creation and copying of large arrays, including hidden temporaries (see the sketch after this list).
Insufficient RAM: System simply runs out of physical memory.
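One easy-to-miss source of copying in Fortran is copy-in/copy-out of array temporaries. The sketch below (array and routine names are illustrative only) passes a non-contiguous row of a column-major array to an explicit-shape dummy argument, which typically forces the compiler to create a temporary copy; the contiguous column does not.

```fortran
! Sketch of hidden array copies (illustrative names): passing a non-contiguous
! section to an explicit-shape dummy usually triggers copy-in/copy-out.
program copy_demo
  implicit none
  integer, parameter :: n = 4000
  double precision, allocatable :: grid(:,:)

  allocate(grid(n, n))
  call random_number(grid)

  call process(grid(1, :), n)   ! row of a column-major array: non-contiguous, temporary copy likely
  call process(grid(:, 1), n)   ! column: contiguous, no copy needed

contains

  subroutine process(v, m)
    integer, intent(in) :: m
    double precision, intent(inout) :: v(m)   ! explicit-shape dummy expects contiguous storage
    v = v + 1.0d0
  end subroutine process

end program copy_demo
```

With `gfortran`, `-Warray-temporaries` warns where such temporaries are created.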
Key Columns to Check:
`swap used` (`free`, `vmstat`, `top`): Critical! High `swap used` is a major performance killer. Aim for near-zero swap.
`available` memory (`free`): Low `available` memory suggests memory pressure.
`RES` (Resident Set Size) (`top`, `ps`, `pidstat -r`): Observe `RES` of the Fortran process. Is it growing unexpectedly or reaching system limits?
`VIRT` (Virtual Memory Size) (`top`, `ps`): Less direct, but very high `VIRT` can sometimes be a warning.
`%MEM` (per-process in `top`, `pidstat`): Memory usage percentage for the Fortran process.
Cache Miss Rates (`L1-dcache-misses`, `LLC-load-misses` from `perf stat`): (More advanced) High cache miss rates often point to inefficient memory access patterns and can be related to poor data locality or large memory footprints.
Detailed Example and Column Interpretation (Memory Issue):
Let's say we are running a Fortran simulation, and we observe the following in `vmstat 1` and `top` over time.

`vmstat 1` output:

```
procs -----------memory---------- ---swap-- -----io---- -system-- --------cpu--------
 r  b   swpd   free  buff  cache   si   so    bi    bo    in    cs us sy id wa st
 1  0 123456  20000  5000  10000  100   50   200   150   500  1000 70 20 10  0  0
 1  0 123500  19500  5000  10000  110   60   220   160   520  1020 72 18  9  1  0
 1  0 124000  19000  5000  10000  120   70   240   170   540  1040 74 16  8  2  0
 1  0 125000  18000  5000  10000  130   80   260   180   560  1060 76 14  7  3  0
 1  0 126000  17000  5000  10000  140   90   280   190   580  1080 78 12  6  4  0
... (over time, swpd increases, free and available decrease) ...
```

`swpd` (swap used) is increasing over time (e.g., 123456, 123500, 124000...). This is a critical warning sign.
`free` memory (the `free` column in `vmstat`) is decreasing (e.g., 20000, 19500, 19000...). RAM is getting scarce.
`si` (swap in) and `so` (swap out) are non-zero (e.g., 100, 50, 110, 60...). The system is actively swapping memory to disk.
`cache` and `buff` memory might be relatively stable or decreasing slightly.
`top` output (process-specific):

```
  PID USER   PR  NI   VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
 1234 user   20   0  10.0g  9.5g  1024 R  98.5 99.0  1:30.50 fortran_program
```

`%MEM` for the Fortran process is very high (e.g., 99.0%). The process is consuming almost all available RAM.
`RES` is also very high (e.g., 9.5g). The process is using a large amount of physical RAM.
`VIRT` (10.0g) is also large.
Interpretation:
- Clear Memory Bottleneck: The increasing `swap used`, decreasing `free` memory, active swapping (`si`, `so`), and high `%MEM`/`RES` for the Fortran process all strongly indicate that the code is running out of physical RAM and resorting to slow swap space. This will severely degrade performance.
Concise Solutions/Actions (Memory Issues):
Memory Profiling: Use memory profilers (e.g., `valgrind --tool=massif`).
Reduce Array Sizes: Declare arrays only as large as needed; consider dynamic allocation (see the sketch after this list).
Optimize Data Structures: Use memory-efficient data structures.
Minimize Data Copying: Avoid unnecessary array copies. Use pointers/references carefully.
Algorithm Optimization: Redesign algorithms to be less memory-intensive.
Fix Memory Leaks: If found by profilers, deallocate dynamically allocated memory properly.
Increase RAM (If possible): Adding RAM is often the most direct solution for memory-bound Fortran codes.
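As a small illustration of sizing and releasing memory explicitly, here is a minimal sketch using an allocatable work array; the routine and array names are illustrative only.

```fortran
! Minimal sketch of allocating a work array to the size actually needed and
! releasing it promptly (illustrative names, not from the original article).
subroutine demo_step(n)
  implicit none
  integer, intent(in) :: n
  double precision, allocatable :: work(:,:)
  integer :: ierr

  allocate(work(n, n), stat=ierr)            ! allocate only what this step needs
  if (ierr /= 0) stop 'allocation of work array failed'

  work = 0.0d0
  ! ... use work for this step only ...

  deallocate(work)                           ! release it as soon as it is no longer needed
end subroutine demo_step
```

Allocatable local arrays are deallocated automatically when the subroutine returns, but deallocating explicitly (and checking `stat=`) makes the memory lifetime obvious and keeps long-running simulations from accumulating allocations.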
4. Fortran Code Itself - Cache Misses, Memory Allocation Patterns
Issue: While "bad code" (point 1 above) is about algorithmic inefficiency, this point is about how the Fortran code interacts with memory and CPU caches, which can significantly impact performance even with efficient algorithms.
Cache Misses: Poor data locality leading to frequent cache misses (CPU has to fetch data from slow RAM).
Inefficient Memory Allocation Patterns: Frequent allocations and deallocations, especially of small blocks, can add overhead. (Less of a primary bottleneck for most Fortran codes, but can be a factor in some cases).
Key Columns to Check:
Cache Miss Rates (`L1-dcache-misses`, `LLC-load-misses` from `perf stat`): Primary indicator. `perf stat` is the tool to use; high cache miss rates are a direct sign of poor memory access patterns.
`%user` (`top`, `mpstat`, `vmstat`, `pidstat -u`): High `%user` can be exacerbated by cache misses, as the CPU spends more cycles waiting for data from memory. High `%user` alone doesn't prove cache misses, but in conjunction with high cache miss rates from `perf`, it confirms the issue.
Instructions per Cycle (IPC) (from `perf stat`): Low IPC often correlates with memory bottlenecks, including cache misses. A lower IPC means the CPU executes fewer instructions per clock cycle, often because it is stalled waiting for memory.
`ps` and `strace` are NOT directly used for cache miss analysis; they are for process information and system call tracing, respectively.
Concise Solutions/Actions (Cache Misses, Memory Allocation):
Use `perf stat`: Run `perf stat -e L1-dcache-misses,LLC-load-misses,instructions,cycles ./fortran_executable`.
Analyze Cache Miss Rates: Look for high `L1-dcache-misses` and `LLC-load-misses` rates (relative to instructions executed).
Improve Data Locality (see the sketch after this list):
Loop Reordering/Blocking: Optimize loop structure for contiguous array access.
Array Layout Optimization: Consider array-of-structures vs. structure-of-arrays if relevant.
Data Alignment: Ensure data is properly aligned in memory.
Memory Allocation Optimization (If Profiling Shows Issue):
Reduce Frequent Allocations/Deallocations: Try to reuse memory, allocate larger chunks less frequently.
Use Static Allocation (if possible): For fixed-size data, static allocation can be more efficient than dynamic allocation in some cases.
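To make the data-locality point concrete, here is a minimal sketch (with illustrative names and sizes) of loop ordering for Fortran's column-major arrays: keeping the innermost loop over the first index gives contiguous, cache-friendly access.

```fortran
! Minimal sketch of loop ordering for column-major Fortran arrays
! (illustrative names/sizes): the inner loop over the first index is contiguous.
program loop_order
  implicit none
  integer, parameter :: n = 2000
  double precision, allocatable :: a(:,:), b(:,:)
  integer :: i, j

  allocate(a(n, n), b(n, n))
  call random_number(b)

  ! Cache-friendly: j outer, i inner, so consecutive iterations touch
  ! consecutive elements of a(:, j) and b(:, j).
  do j = 1, n
     do i = 1, n
        a(i, j) = 2.0d0 * b(i, j)
     end do
  end do

  ! Swapping the loops (i outer, j inner) strides through memory by n elements
  ! per iteration and typically shows a much higher miss rate in perf stat.
  print *, a(1, 1)
end program loop_order
```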
5. Disk I/O Issues (Data Read/Write Bottlenecks)
Issue: Fortran code performance limited by slow disk read/write operations, especially when dealing with large datasets stored on disk.
Key Columns to Check:
`%iowait` (`wa`) (`top`, `vmstat`, `mpstat`): Primary indicator of an I/O bottleneck at the CPU level. High `%iowait` means the CPU is idle, waiting for I/O.
`%util` (`iostat`): High `%util` on the disk device used by the Fortran code indicates disk saturation.
`await` (`iostat`): High `await` for disk I/O operations means long latency.
`svctm` (`iostat`): High `svctm` means the disk itself is taking longer to service requests.
`kB_read/s`, `kB_wrtn/s` (`iostat`): Low throughput (data read/written per second) might be the bottleneck.
`r`, `b` (`vmstat`): System-wide counts of runnable and blocked (I/O-waiting) processes.
`bi`, `bo` (`vmstat`): Blocks in and blocks out per second (system-wide).
Concise Solutions/Actions (Disk I/O):
I/O Profiling: Use `pidstat -d` to see I/O per process; `strace` can trace system calls, including I/O calls.
Reduce I/O Operations: Minimize disk reads/writes where possible.
Optimize I/O Patterns: Use buffered I/O, read/write in larger blocks, and use binary file formats (see the sketch after this list).
Asynchronous I/O: Consider asynchronous I/O for non-blocking operations.
Memory Mapping: Consider memory-mapping read-only input files.
Faster Storage: Use SSDs instead of HDDs.
Local Storage: If possible, use local disk storage instead of network file systems (NFS) for I/O-intensive tasks.
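As an illustration of the "larger blocks, binary format" advice, here is a minimal sketch (file and array names are illustrative only) that writes a whole array with one unformatted stream write instead of looping over formatted records.

```fortran
! Minimal sketch of block-oriented, binary output (illustrative names):
! one unformatted stream write avoids per-record text formatting and many
! small system calls.
program io_demo
  implicit none
  integer, parameter :: n = 1000000
  double precision, allocatable :: field(:)
  integer :: u

  allocate(field(n))
  call random_number(field)

  open(newunit=u, file='output.bin', form='unformatted', access='stream', &
       status='replace')
  write(u) field          ! one large binary write of the whole array
  close(u)
end program io_demo
```

The same idea applies to reads: a single `read(u) field` from an unformatted stream file is typically far faster than reading the values one by one from a text file.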
6. Network Issues in Calculation Clusters (Why relevant even for "just calculation")
Issue: In compute clusters, even if the primary purpose is calculation, network issues are still highly relevant because:
Parallel Computing (MPI): Distributed Fortran codes using MPI heavily rely on the network for inter-process communication. Network latency and bandwidth directly impact parallel performance.
Data Input/Output: Input data for calculations might be read from network storage (e.g., shared file systems). Output data might be written back to network storage.
Job Submission/Management: Cluster job schedulers (like Slurm, PBS, LSF) use the network for job submission, monitoring, and control. Network issues can affect job submission and overall cluster management.
Remote Access/Monitoring: We typically access and monitor compute nodes in a cluster remotely over the network. Network problems can hinder our ability to interact with and manage Fortran jobs.
Key Columns to Check (Network; these require network-specific tools):
`rxpck/s`, `txpck/s`, `rxkB/s`, `txkB/s`, `%ifutil` (`sar -n DEV`): Network traffic volume and interface utilization on compute nodes and on the network interfaces used for MPI or data transfer.
Network Latency (ping times): Measure latency between compute nodes. High latency impacts MPI performance.
Packet Loss, Errors, Dropped Packets (`netstat -i`, `ip -s link`): Indicate network problems that can severely degrade performance and reliability.
MPI Profiling (Network Communication Metrics): MPI profilers show detailed communication times, message sizes, and communication patterns, which are essential for diagnosing network-related issues in parallel Fortran.
Concise Solutions/Actions (Network Issues in Clusters):
Network Monitoring: Use network monitoring tools (like `sar -n DEV`, `netstat`, `iftop`, `tcpdump`, or cluster-level network monitoring systems).
Check Network Infrastructure: Verify network cables, switches, routers, and network card health.
Optimize MPI Communication (as in point 2): Efficient MPI communication patterns reduce network load.
Network Configuration: Ensure proper network configuration (MTU size, network interface settings).
Dedicated Network for MPI (e.g., InfiniBand): For high-performance MPI applications, using a dedicated high-speed interconnect like InfiniBand is often essential to minimize network bottlenecks.
Local Storage (If possible): For I/O, using local disk storage on compute nodes can reduce reliance on network file systems.
By systematically monitoring these columns and using the appropriate performance analysis tools, we can effectively diagnose and address a wide range of performance bottlenecks in the Fortran development workflow. Remember that context is key: interpret these metrics in relation to the expected behavior of the application and the system workload.
More tools
Essential Tools for Deeper Fortran Performance Analysis (Beyond System Metrics):
While system metrics provide the initial direction, for serious Fortran performance tuning, we must use profiling tools:
Fortran Compiler Profilers: Many Fortran compilers (like GNU `gfortran`, the Intel Fortran Compiler `ifort`, the PGI/NVIDIA HPC SDK Fortran compiler `nvfortran`) have built-in profiling capabilities or integrate with profilers. Use compiler flags to enable profiling (e.g., `-pg` with `gfortran`, `-profile` with `ifort`). These profilers can give us subroutine-level CPU time, call counts, etc.
`gprof` (GNU Profiler): A classic profiling tool that works with code compiled with `-pg`. Provides call graphs and function-level profiling.
`perf` (Linux Performance Events): A very powerful system-wide profiler for Linux. It can profile CPU cycles, cache misses, branch mispredictions, system calls, and much more. Use `perf top` for real-time profiling, or `perf record` and `perf report` for detailed analysis. `perf stat` is great for getting summary statistics.
`valgrind` (memory profiling with `massif`, cache profiling with `cachegrind`): Excellent for memory profiling (identifying memory leaks, excessive memory allocation) and cache profiling (identifying cache miss hotspots). `valgrind` tools can be computationally expensive to run but provide very detailed insights.
MPI Profilers (for parallel Fortran): Tools like `mpiP`, TAU (Tuning and Analysis Utilities), HPCToolkit, ParaProf, Intel Trace Analyzer and Collector, ARM MAP. Essential for analyzing communication performance, load balance, and scaling behavior of parallel Fortran codes.
In Daily Work:
Start with System Metrics: Use `top`, `vmstat`, `iostat`, `mpstat` to get a general overview of system resource usage while running the Fortran code. Look at `%user`, `%system`, `%iowait`, `swap used`, `load average`, disk `%util`, `await`.
Identify Potential Bottlenecks: Based on the system metrics, narrow down the likely bottleneck: CPU-bound (high `%user`), I/O-bound (high `%iowait`, disk metrics), memory-bound (high `swap used`), or network-bound (network metrics, MPI issues).
Use Profiling Tools: Once we have a suspected bottleneck, use the appropriate profilers to get deeper, code-level insights.
  For CPU: compiler profiler, `gprof`, `perf`.
  For Memory: `valgrind --tool=massif`, memory debuggers.
  For Cache: `valgrind --tool=cachegrind`, `perf stat`.
  For I/O: `pidstat -d`, `strace` (to trace system calls).
  For Parallelism: MPI profilers.
Iterate and Optimize: Based on the profiler output, identify the performance hotspots in the Fortran code. Optimize the code, recompile, rerun, and re-profile to measure the impact of each optimization. Repeat this cycle until performance is satisfactory.