Overview of Performance Monitoring Challenges in Fortran Development

When we do academic or scientific code development, we often use Fortran as the programming language, especially in HPC and numerical computing scenarios (largely for historical reasons). We also constantly need to monitor the code running on the cluster so that we can build a good understanding of its performance. Here I want to address the most common issues that arise during monitoring, refine and expand on these points, and provide the most relevant columns to monitor together with some concise solutions.

First, let's go over the basic tools to use during monitoring; more detail can be found in the previous article.

  • Overview

    • top, htop, or atop

    • uptime

    • ps

    • sar

  • CPU

    • pidstat

    • mpstat

    • vmstat

  • Memory

    • free

    • vmstat

  • Disk IO

    • iostat

  • Network

    • netstat

1. Bad Code - Intensive Computing (Algorithm/Coding Inefficiencies)

  • Issue: Inefficient algorithms, poorly structured loops, redundant calculations, lack of vectorization, suboptimal compiler optimization flags in Fortran code. This manifests as the code taking longer than expected to perform computations.

  • Key Columns to Check:

    • %user (top, mpstat, vmstat, pidstat -u): Primary indicator. High %user directly shows CPU time spent in Fortran code.

    • %CPU (per-process in top, pidstat): For individual Fortran processes, shows their CPU usage percentage.

    • load average (top, uptime): High load average, along with high %user, suggests CPU saturation.

  • Concise Solutions/Actions:

    • Profile Fortran Code: Use profilers (like gprof, perf, compiler profilers).

    • Identify Hotspots: Pinpoint the most time-consuming subroutines/code sections from profiler output.

    • Algorithm Optimization: Consider more efficient algorithms (if applicable).

    • Code Restructuring: Optimize loops, reduce redundant calculations, inline functions if needed.

    • Enable Compiler Optimizations: Use appropriate compiler flags (e.g., -O3 -march=native -ffast-math; note that -ffast-math relaxes IEEE floating-point semantics, so check that the results remain acceptable).

    • Vectorization: Write code in a vectorizable style (e.g., use array operations, contiguous memory access in loops).
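
As a minimal, hypothetical sketch of the restructuring and vectorization advice above (subroutine and variable names are purely illustrative), hoisting loop-invariant work and expressing the kernel as an array operation gives the compiler an easy vectorization target:

    ! Before: loop-invariant work repeated in every iteration
    subroutine scale_bad(a, b, n, x)
      integer, intent(in)  :: n
      real(8), intent(in)  :: b(n), x
      real(8), intent(out) :: a(n)
      integer :: i
      do i = 1, n
         a(i) = b(i) * (x**2 / 3.0d0)   ! x**2/3 is recomputed n times
      end do
    end subroutine scale_bad

    ! After: hoist the invariant once and use array syntax over contiguous memory
    subroutine scale_good(a, b, n, x)
      integer, intent(in)  :: n
      real(8), intent(in)  :: b(n), x
      real(8), intent(out) :: a(n)
      real(8) :: c
      c = x**2 / 3.0d0   ! computed once
      a = b * c          ! whole-array operation, readily vectorized
    end subroutine scale_good

Compiling with -O3 and checking the compiler's vectorization report (e.g., -fopt-info-vec with gfortran) is a quick way to confirm that the hot loops actually vectorize.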

2. Parallel Communication Issues (MPI/OpenMP Imbalance)

  • Issue: In parallel Fortran codes using MPI or OpenMP, imbalances can occur when:

    • Load Imbalance: Some processes/threads have significantly more work than others (or finish their share of the calculation much sooner), which leads to idle waiting time.

    • Communication Overhead: Excessive or inefficient communication between processes/threads.

    • Synchronization Bottlenecks: Waiting at barriers or synchronization points for other processes/threads to catch up.

  • Key Columns to Check:

    • %system (top, mpstat, vmstat): Elevated %system in parallel codes can indicate communication overhead within MPI/OpenMP libraries.

    • %iowait (iostat, vmstat, sar): With parallel I/O or network communication, we might see slightly elevated %iowait if processes are waiting for I/O or network operations to complete. But %system is usually more indicative of communication overhead itself.

    • MPI Profiling Metrics (from MPI profilers like mpiP, TAU): Crucial for parallel codes! These tools provide metrics like:

      • Time in MPI Communication: Percentage of total time spent in MPI routines.

      • Message Counts and Sizes: Volume of communication.

      • Load Imbalance Metrics: Idle time of processes, time spent waiting in barriers.

  • Concise Solutions/Actions:

    • MPI Profiling (Essential): Use MPI profilers to analyze communication patterns.

    • Load Balancing: Redistribute workload to ensure even distribution across processes/threads.

    • Optimize Communication Algorithms: Use more efficient communication algorithms if possible.

    • Reduce Communication Frequency/Volume: Minimize data exchange. Aggregate messages if sending many small messages.

    • Non-Blocking Communication (MPI): Use non-blocking sends/receives to overlap computation and communication.
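
As a minimal, hypothetical sketch of the non-blocking communication point above (ranks, tags, buffer sizes, and names are illustrative), the idea is to post MPI_Irecv/MPI_Isend early, do independent computation, and call MPI_Waitall only when the data is actually needed:

    program nonblocking_overlap
      use mpi
      implicit none
      integer :: ierr, rank, nprocs, left, right
      integer :: requests(2)
      real(8) :: send_buf(1000), recv_buf(1000)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      left  = mod(rank - 1 + nprocs, nprocs)   ! periodic neighbours
      right = mod(rank + 1, nprocs)
      send_buf = real(rank, 8)

      ! Post the communication first ...
      call MPI_Irecv(recv_buf, size(recv_buf), MPI_DOUBLE_PRECISION, left,  0, &
                     MPI_COMM_WORLD, requests(1), ierr)
      call MPI_Isend(send_buf, size(send_buf), MPI_DOUBLE_PRECISION, right, 0, &
                     MPI_COMM_WORLD, requests(2), ierr)

      ! ... overlap: do computation that does not need recv_buf here ...

      ! Wait only when the received data is actually required
      call MPI_Waitall(2, requests, MPI_STATUSES_IGNORE, ierr)
      call MPI_Finalize(ierr)
    end program nonblocking_overlap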

3. Memory Related Issues (Fortran Code and Data Management)

  • Issue: Fortran codes, especially in scientific computing, often deal with very large datasets (arrays). Memory issues arise from:

    • Large Array Allocations: Declaring arrays that are excessively large or inefficiently sized.

    • Data Structure Choice: Using memory-inefficient data structures.

    • Memory Leaks: (Though less common in Fortran, possible with dynamic allocation).

    • Data Copying: Unnecessary creation and copying of large arrays.

    • Insufficient RAM: System simply runs out of physical memory.

  • Key Columns to Check:

    • swap used (free, vmstat, top): Critical! High swap used is a major performance killer. Aim for near zero swap.

    • available memory (free): Low available memory suggests memory pressure.

    • RES (Resident Set Size) (top, ps, pidstat -r): Observe the RES of the Fortran process. Is it growing unexpectedly or reaching system limits?

    • VIRT (Virtual Memory Size) (top, ps): Less direct, but very high VIRT can sometimes be a warning.

    • %MEM (per-process in top, pidstat): Memory usage percentage for the Fortran process.

    • Cache Miss Rates (L1-dcache-misses, LLC-load-misses from perf stat): (More advanced) High cache miss rates often point to inefficient memory access patterns, which can be related to poor data locality or large memory footprints.

  • Detailed Example and Column Interpretation (Memory Issue):

    Let's say we are running a Fortran simulation, and we observe the following in vmstat 1 and top over time:

    • vmstat 1 Output:

        procs -----------memory---------- ---swap-- -----io---- -system-- --------cpu--------
         r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
         1  0   123456 20000  5000  10000  100  50   200   150  500 1000 70 20 10  0  0
         1  0   123500 19500  5000  10000  110  60   220   160  520 1020 72 18  9  1  0
         1  0   124000 19000  5000  10000  120  70   240   170  540 1040 74 16  8  2  0
         1  0   125000 18000  5000  10000  130  80   260   180  560 1060 76 14  7  3  0
         1  0   126000 17000  5000  10000  140  90   280   190  580 1080 78 12  6  4  0
         ... (over time, `swpd` increases and `free` decreases) ...
      
      • swpd (swap used) is increasing over time (e.g., 123456, 123500, 124000...). This is a critical warning sign.

      • free memory (free column in vmstat) is decreasing (e.g., 20000, 19500, 19000...). RAM is getting scarce.

      • si (swap in) and so (swap out) are non-zero (e.g., 100, 50, 110, 60...). The system is actively swapping memory to disk.

      • cache and buff memory might be relatively stable or decreasing slightly.

    • top Output (Process-Specific):

          PID USER      PR  NI    VIRT    RES    SHR  S  %CPU %MEM     TIME+   COMMAND
         1234 user      20   0    10.0g   9.5g   1024 R  98.5 99.0     1:30.50 fortran_program
      
      • %MEM for the Fortran process is very high (e.g., 99.0%). The process is consuming almost all available RAM.

      • RES is also very high (e.g., 9.5g). The process is using a large amount of physical RAM.

      • VIRT (10.0g) is also large.

Interpretation:

  • Clear Memory Bottleneck: The increasing swap used, decreasing free memory, active swapping (si, so), and high %MEM/RES for the Fortran process all strongly indicate that the code is running out of physical RAM and resorting to slow swap space. This will severely degrade performance.

Concise Solutions/Actions (Memory Issues):

  • Memory Profiling: Use memory profilers (e.g., valgrind --tool=massif).

  • Reduce Array Sizes: Declare arrays only as large as needed, consider dynamic allocation.

  • Optimize Data Structures: Use memory-efficient data structures.

  • Minimize Data Copying: Avoid unnecessary array copies. Use pointers/references carefully.

  • Algorithm Optimization: Redesign algorithms to be less memory-intensive.

  • Fix Memory Leaks: If found by profilers, deallocate dynamically allocated memory properly (see the sketch after this list).

  • Increase RAM (If possible): Adding RAM is often the most direct solution for memory-bound Fortran codes.
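
As a small, hypothetical sketch of the dynamic-allocation and leak-avoidance advice above (names are made up): allocatable arrays let us size storage to the problem at run time and are deallocated automatically when they go out of scope, whereas pointer-based allocations must be freed explicitly.

    subroutine work(n)
      implicit none
      integer, intent(in) :: n
      real(8), allocatable :: a(:)   ! sized at run time, only as large as needed
      real(8), pointer     :: p(:)

      allocate(a(n))                 ! deallocated automatically when work() returns
      allocate(p(n))                 ! pointer allocation: we own this memory

      ! ... use a and p ...

      deallocate(p)                  ! forgetting this leaks memory on every call
    end subroutine work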

4. Fortran Code Itself - Cache Misses, Memory Allocation Patterns

  • Issue: While "bad code" (point 1 above) is about algorithm inefficiency, this point is about how the Fortran code interacts with memory and CPU caches, which can significantly impact performance even with efficient algorithms.

    • Cache Misses: Poor data locality leading to frequent cache misses (CPU has to fetch data from slow RAM).

    • Inefficient Memory Allocation Patterns: Frequent allocations and deallocations, especially of small blocks, can add overhead. (Less of a primary bottleneck for most Fortran codes, but can be a factor in some cases).

  • Key Columns to Check:

    • Cache Miss Rates (L1-dcache-misses, LLC-load-misses from perf stat): Primary indicator; perf stat is the tool to use. High cache miss rates are a direct sign of poor memory access patterns.

    • %user (top, mpstat, vmstat, pidstat -u): High %user can be exacerbated by cache misses, as the CPU spends more cycles waiting for data from memory. High %user alone doesn't prove cache misses, but in conjunction with high cache miss rates from perf it confirms the issue.

    • Instructions per Cycle (IPC) (from perf stat): Low IPC often correlates with memory bottlenecks, including cache misses: the CPU completes fewer instructions per clock cycle, often because it is stalled waiting for memory.

  • ps and strace are NOT directly used for cache miss analysis. They are for process information and system call tracing, respectively.

  • Concise Solutions/Actions (Cache Misses, Memory Allocation):

    • Use perf stat: Run perf stat -e L1-dcache-misses,LLC-load-misses,instructions,cycles ./fortran_executable.

    • Analyze Cache Miss Rates: Look for high L1-dcache-miss and LLC-load-miss rates (relative to the number of instructions executed).

    • Improve Data Locality (see the sketch after this list):

      • Loop Reordering/Blocking: Optimize loop structure for contiguous array access.

      • Array Layout Optimization: Consider array-of-structures vs. structure-of-arrays if relevant.

      • Data Alignment: Ensure data is properly aligned in memory.

    • Memory Allocation Optimization (If Profiling Shows Issue):

      • Reduce Frequent Allocations/Deallocations: Try to reuse memory, allocate larger chunks less frequently.

      • Use Static Allocation (if possible): For fixed-size data, static allocation can be more efficient than dynamic allocation in some cases.
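
As a minimal sketch of the loop reordering advice above (array size and names are illustrative): Fortran stores arrays in column-major order, so the innermost loop should run over the first (leftmost) index to get contiguous memory access.

    program locality_demo
      implicit none
      integer, parameter :: n = 4000
      real(8), allocatable :: a(:,:)
      integer :: i, j

      allocate(a(n, n))
      a = 1.0d0

      ! Poor locality: the inner loop jumps across columns, so consecutive
      ! accesses are n elements apart in memory and miss the cache frequently
      do i = 1, n
         do j = 1, n
            a(i, j) = 2.0d0 * a(i, j)
         end do
      end do

      ! Good locality: the inner loop runs over the first index, walking
      ! through memory contiguously
      do j = 1, n
         do i = 1, n
            a(i, j) = 2.0d0 * a(i, j)
         end do
      end do
    end program locality_demo

Running both variants under perf stat (as above) and comparing the cache-miss counts is a quick way to see the effect of the loop order.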

5. Disk I/O Issues (Data Read/Write Bottlenecks)

  • Issue: Fortran code performance limited by slow disk read/write operations, especially when dealing with large datasets stored on disk.

  • Key Columns to Check:

    • %iowait (wa) (top, vmstat, mpstat): Primary indicator of I/O bottleneck at the CPU level. High %iowait means the CPU is idle, waiting for I/O.

    • %util (iostat): High %util on the disk device used by Fortran code indicates disk saturation.

    • await (iostat): High await for disk I/O operations means long latency.

    • svctm (iostat): High svctm means the disk itself is taking longer to service requests.

    • kB_read/s, kB_wrtn/s (iostat): Low throughput (data read/written per second) might indicate a bottleneck.

    • r, b (vmstat): Runnable and blocked process counts; a persistently non-zero b means processes are blocked waiting for I/O.

    • bi, bo (vmstat): Blocks in and blocks out per second (system-wide).

  • Concise Solutions/Actions (Disk I/O):

    • I/O Profiling: Use pidstat -d to see I/O per process. strace can trace system calls, including I/O calls.

    • Reduce I/O Operations: Minimize disk reads/writes where possible.

    • Optimize I/O Patterns: Use buffered I/O, read/write in larger blocks, and use binary file formats (see the sketch after this list).

    • Asynchronous I/O: Consider asynchronous I/O for non-blocking operations.

    • Memory Mapping: For read-only files.

    • Faster Storage: Use SSDs instead of HDDs.

    • Local Storage: If possible, use local disk storage instead of network file systems (NFS) for I/O-intensive tasks.
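
As a hypothetical sketch of the "larger blocks, binary format" advice above (file and variable names are made up), writing a whole array with a single unformatted stream write is far cheaper than formatted, element-by-element output:

    program write_results
      implicit none
      integer :: u, i
      real(8) :: field(1000000)

      field = 0.0d0

      ! Slow pattern: formatted text output, one element per record
      ! open(newunit=u, file='field.txt', status='replace')
      ! do i = 1, size(field)
      !    write(u, '(es23.15)') field(i)
      ! end do
      ! close(u)

      ! Faster pattern: one unformatted (binary) stream write of the whole array
      open(newunit=u, file='field.bin', form='unformatted', access='stream', &
           status='replace')
      write(u) field
      close(u)
    end program write_results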

6. Network Issues in Calculation Clusters (Why relevant even for "just calculation")

  • Issue: In compute clusters, even if the primary purpose is calculation, network issues are still highly relevant because:

    • Parallel Computing (MPI): Distributed Fortran codes using MPI heavily rely on the network for inter-process communication. Network latency and bandwidth directly impact parallel performance.

    • Data Input/Output: Input data for calculations might be read from network storage (e.g., shared file systems). Output data might be written back to network storage.

    • Job Submission/Management: Cluster job schedulers (like Slurm, PBS, LSF) use the network for job submission, monitoring, and control. Network issues can affect job submission and overall cluster management.

    • Remote Access/Monitoring: We typically access and monitor compute nodes in a cluster remotely via the network. Network problems can hinder our ability to interact with and manage Fortran jobs.

  • Key Columns to Check (Network - Require network-specific tools):

    • rxpck/s, txpck/s, rxkB/s, txkB/s, %ifutil (sar -n DEV): Network traffic volume and interface utilization on compute nodes and network interfaces used for MPI or data transfer.

    • Network Latency (ping times): Measure latency between compute nodes. High latency impacts MPI performance.

    • Packet Loss, Errors, Dropped Packets (netstat -i, ip -s link): Indicate network problems that can severely degrade performance and reliability.

    • MPI Profiling (Network Communication Metrics): MPI profilers will show detailed network communication times, message sizes, and communication patterns, which are essential for diagnosing network-related issues in parallel Fortran.

  • Concise Solutions/Actions (Network Issues in Clusters):

    • Network Monitoring: Use network monitoring tools (like sar -n DEV, netstat, iftop, tcpdump, cluster-level network monitoring systems).

    • Check Network Infrastructure: Verify network cables, switches, routers, and network card health.

    • Optimize MPI Communication (as in point 2): Efficient MPI communication patterns reduce network load.

    • Network Configuration: Ensure proper network configuration (MTU size, network interface settings).

    • Dedicated Network for MPI (e.g., InfiniBand): For high-performance MPI applications, using a dedicated high-speed interconnect like InfiniBand is often essential to minimize network bottlenecks.

    • Local Storage (If possible): For I/O, using local disk storage on compute nodes can reduce reliance on network file systems.
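
When a dedicated MPI profiler is not available, a rough first estimate of communication cost can be obtained directly in the code with MPI_Wtime. This is only a sketch (the collective, array size, and names are illustrative):

    program time_allreduce
      use mpi
      implicit none
      integer :: ierr, rank
      real(8) :: local(100000), global(100000)
      real(8) :: t0, t1

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      local = 1.0d0

      call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! start the measurement together
      t0 = MPI_Wtime()
      call MPI_Allreduce(local, global, size(local), MPI_DOUBLE_PRECISION, &
                         MPI_SUM, MPI_COMM_WORLD, ierr)
      t1 = MPI_Wtime()

      if (rank == 0) print '(a, f10.6, a)', 'Allreduce took ', t1 - t0, ' s'
      call MPI_Finalize(ierr)
    end program time_allreduce

If this time grows sharply with node count or varies widely between runs, it is worth checking the interconnect metrics listed above.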

By systematically monitoring these columns and using the appropriate performance analysis tools, we can effectively diagnose and address a wide range of performance bottlenecks in the Fortran development workflow. Remember that context is key: interpret these metrics in relation to the expected behavior of the application and the system workload.

More tools

Essential Tools for Deeper Fortran Performance Analysis (Beyond System Metrics):

While system metrics provide the initial direction, for serious Fortran performance tuning, we must use profiling tools:

  • Fortran Compiler Profilers: Many Fortran compilers (like GNU gfortran, Intel Fortran Compiler ifort, PGI/NVIDIA HPC SDK Fortran compiler nvfortran) have built-in profiling capabilities or integrate with profilers. Use compiler flags to enable profiling (e.g., -pg with gfortran; see the corresponding options for ifort and nvfortran). These profilers can give us subroutine-level CPU time, call counts, etc.

  • gprof (GNU Profiler): A classic profiling tool that works with code compiled with -pg. Provides call graphs and function-level profiling.

  • perf (Linux Performance Events): A very powerful system-wide profiler for Linux. Can profile CPU cycles, cache misses, branch mispredictions, system calls, and much more. Use perf top for real-time profiling or perf record and perf report for detailed analysis. perf stat is great for getting summary statistics.

  • valgrind (Memory Profiling - massif, Cache Profiling - cachegrind): Excellent for memory profiling (identifying memory leaks, excessive memory allocation) and cache profiling (identifying cache miss hotspots). valgrind tools can be more computationally expensive to run but provide very detailed insights.

  • MPI Profilers (for parallel Fortran): Tools like mpiP, TAU (Tuning and Analysis Utilities), HPCToolkit, ParaProf, Intel Trace Analyzer and Collector, ARM MAP. Essential for analyzing communication performance, load balance, and scaling behavior of parallel Fortran codes.

In Daily Work:

  1. Start with System Metrics: Use top, vmstat, iostat, mpstat to get a general overview of system resource usage while running Fortran code. Look at %user, %system, %iowait, swap used, load average, disk %util, await.

  2. Identify Potential Bottlenecks: Based on system metrics, narrow down the likely bottleneck: CPU-bound (high %user), I/O-bound (high %iowait, disk metrics), Memory-bound (high swap used), or Network-bound (network metrics, MPI issues).

  3. Use Profiling Tools: Once we have a suspected bottleneck, use appropriate profilers to get deeper, code-level insights.

    • For CPU: Compiler profiler, gprof, perf.

    • For Memory: valgrind --tool=massif, memory debuggers.

    • For Cache: valgrind --tool=cachegrind, perf stat.

    • For I/O: pidstat -d, strace (to trace system calls).

    • For Parallelism: MPI profilers.

  4. Iterate and Optimize: Based on profiler output, identify the performance hotspots in the Fortran code. Optimize the code, recompile, rerun, and re-profile to measure the impact of the optimizations. Repeat this cycle until the performance is satisfactory.