HiCLAS1 Technical Reports
HPC-2007-4: Why is AERMOD-HPCS faster than the U.S. EPA AERMOD Distribution?
Copyright © 2007 HiCLAS1
George Delic and Arnold R. Srackangast
1.0 INTRODUCTION

This is a performance report for the Air Quality Model (AQM) AERMOD on IA-32 commodity platforms. The study is intended to measure quantitatively how performance changes with hardware and programming environment, and to relate these differences to their underlying causes. New results are presented for AERMOD in two versions: the source code model released by the U.S. EPA (hereafter AERMOD-EPA/SRC) and the High Performance Computing (HPC) version developed by HiCLAS1 (AERMOD-HPCS). Both versions are designed to execute the U.S. EPA's regulatory AERMOD model on a single processor (or core). The previous reports in this series discuss the Quality Assurance process used in comparing the U.S. EPA's distribution of the executable model with AERMOD-HPCS, whereas in this report both the EPA and HPC versions of AERMOD were compiled from source with identical compiler options. Important performance bottlenecks are identified with the aid of proprietary software that collects and computes performance metrics using a publicly available hardware performance interface. These results give insight into the algorithm’s performance on commodity architectures and provide some answers as to why AERMOD-HPCS has superior performance.

2.0 CHOICE OF HARDWARE AND OPERATING SYSTEM

The hardware used for the results reported here is the Intel Pentium 4 Xeon (P4) and Pentium Xeon 64EMT (P4e) processors, on two separate platforms as summarized in Table 1. The two platforms ran 32-bit and 64-bit Linux™ operating systems, respectively. The operating system (OS) is HiPERiSM Consulting, LLC’s modification of the Linux™ 2.6.9 kernel to include a patch that enables access to hardware performance counters.
This modification allows the use of the Performance Application Programming Interface (PAPI) performance event library [see the PAPI, 2005 reference] to collect hardware performance counter values as the code executes. The performance metrics are defined with a view to giving insight into how the compiler (not named here) maps the application to the architectural resources.
The P4 and P4e architectures offer Streaming Single-Instruction-Multiple-Data Extensions (SSE) to enable vectorization of loops operating on multiple elements in a data set with a single operation. One important performance feature is that the issue of such vector instructions is enhanced in AERMOD-HPCS, which contributes to improved performance.

3.0 CHOICE OF BENCHMARKS

The AERMOD model describes pollutant dispersion and deposition and is now an approved regulatory model for new source reviews and other permitting applications. It is available in the AERMOD-EPA version from the U.S. EPA’s Support Center for Regulatory Air Models at the EPA URL portal [see EPA-SCRAM reference]. It is predominantly a Fortran 77 code developed over ten years ago that has since adopted (in small part) Fortran 90 features. As such, and typical of that generation of environmental models, AERMOD was developed on a PC platform, with a small memory size requirement, poor vector character, and I/O bound performance characteristics. The code has good potential for parallelism, but the conversion task is complicated by an elaborate call tree, which also inhibits vectorization because of multiple levels of procedure calls within loop structures.

The version used here is AERMOD 07026. It was designated AERMOD-EPA in the previous reports to denote the executable distributed by the U.S. EPA. In this report the U.S. EPA results were obtained with a version compiled from the U.S. EPA source code distributed at the URL cited above. To distinguish these results from those of the previous reports, the U.S. EPA version used in this report is designated AERMOD-EPA/SRC. To create the High Performance Computing (HPC) version of AERMOD, the source code for the U.S. EPA distribution was progressively modified to enhance performance. The resulting code is designated AERMOD-HPCS, and is now at v1.8 (July, 2007). For performance testing in this report the four Cases listed in Table 2 were used as benchmarks.
4.0 HARDWARE PERFORMANCE EVENTS

This report presents results for AERMOD-EPA/SRC and AERMOD-HPCS on Linux™ operating systems for both 32-bit and 64-bit platforms, with a detailed analysis of hardware performance metrics, to give insight into why AERMOD-HPCS gives enhanced performance over AERMOD-EPA/SRC. The PAPI interface defines over a hundred hardware performance events, but not all of these events are available on all platforms. For the Intel hardware under discussion the number of hardware events that can be collected is 28 (P4) and 25 (P4e), respectively. Not all events can be collected in a single execution because the number of hardware counters is small (typically four). Thus, multiple executions are needed to collect all available events on any given platform. Performance metrics are defined using the PAPI events and measured in the expectation that they will give insight into how resource utilization differs between the two versions of AERMOD.

5.0 PERFORMANCE METRICS

5.1 Rate performance metrics

Rate metrics are generally measured in units per second, and often the unit is millions per second. They are obtained from the PAPI defined events by normalization to a unit of time (typically 1 second). The following discussion will use only those rate metrics of relevance in identifying bottlenecks in AERMOD. A simple example of a rate metric is Mflops, or million floating point operations (flops) per second. Note that operations and instructions are distinguished, because one instruction may not correspond to one operation in a pipelined vector architecture.

5.2 Ratio performance metrics

Ratio metrics are generally measured as a ratio of two absolute values derived from PAPI events. The following discussion will use only those ratio metrics of relevance in identifying bottlenecks in AERMOD. One example of a ratio metric is the number of memory instructions per floating point instruction.
Ratio metrics are valuable in determining load balance and resource utilization efficiency.

6.0 PERFORMANCE COMPARISON OF AERMOD-HPCS AND AERMOD-EPA/SRC

In the following discussion of Figs. 1 to 9, results are shown in two frames for 32-bit and 64-bit Linux, respectively. Each frame shows (from left to right) first the AERMOD-EPA/SRC (epa) result followed by that for AERMOD-HPCS (hpc) for all four cases of Table 2.

6.1 Operations and instructions

Fig. 1: Mflops for AERMOD-HPCS (hpc) compared to that for the U.S. EPA’s distribution (epa).

This shows that for all four cases of Table 2, Mflops have increased when AERMOD-HPCS is compared with AERMOD-EPA/SRC. The number of floating point operations is the same, so a higher rate corresponds to higher efficiency and completion in less time. This result is true for both 32-bit and 64-bit Linux platforms. As described above, SSE optimizations allow a compiler to use the enhanced SSE instruction set. However, this option gives little performance gain for AERMOD because of the lack of vector loop structure and the predominance of control transfer instructions, which stall the vector pipeline and lose cycles to the loading of new instructions. Nevertheless, AERMOD-HPCS shows an advantage over AERMOD-EPA/SRC.

Fig. 2: Million vector/SSE instructions per second for AERMOD-HPCS (hpc) compared to that for the U.S. EPA’s distribution (epa).

This shows that the vector instruction rate increased most significantly with AERMOD-HPCS on the 64-bit platform and less so on the 32-bit platform. This is due to the availability of considerably more hardware resources on the P4e compared to the P4. On the latter platform cases 4 and 5 show a reduction in vector instruction rate, but still have faster performance in AERMOD-HPCS when compared to AERMOD-EPA/SRC. This suggests that enhanced vector instruction issue is not the only cause of better performance for AERMOD-HPCS.
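The quantities plotted in Figs. 1 and 2 are rate metrics in the sense of Section 5.1, and Section 5.2 defines ratio metrics in the same spirit. As a minimal sketch in Python (not the proprietary software used for this report), both kinds of metric can be computed from raw event counts and an elapsed time. The function names and the counts below are hypothetical and chosen only for illustration; they are not measurements.

```python
def rate_metric(event_count, seconds, unit=1.0e6):
    """Return events per second in multiples of `unit` (default: millions)."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return event_count / seconds / unit


def ratio_metric(numerator_count, denominator_count):
    """Return the ratio of two raw event counts."""
    if denominator_count == 0:
        raise ValueError("denominator event count is zero")
    return numerator_count / denominator_count


# Hypothetical counts for a single run (illustrative values only).
flops = 3.0e9        # floating point operations
fp_instr = 3.0e9     # floating point instructions
mem_instr = 4.5e10   # memory instructions
elapsed = 60.0       # seconds

mflops = rate_metric(flops, elapsed)
mem_per_fp = ratio_metric(mem_instr, fp_instr)
print(f"{mflops:.1f} Mflops")
print(f"{mem_per_fp:.1f} memory instructions per floating point instruction")
```

Because only a few counters can be read per execution, the two counts in a ratio may come from different runs of the same case; the sketch assumes such runs are repeatable enough for the ratio to be meaningful.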
6.2 Memory footprint

In comparing performance the memory behavior is of special interest and, for AERMOD in general, the rate of total memory instructions issued is very high. This is the case for both 32-bit and 64-bit platforms.

Fig. 3: Million total memory instructions per second for AERMOD-HPCS (hpc) compared to that for the U.S. EPA’s distribution (epa).

It is interesting to observe that the memory instruction rate is higher for AERMOD-HPCS than for AERMOD-EPA/SRC. This confirms that a high rate of memory instruction issue need not indicate a performance bottleneck. Benchmarks with good vector character that deliver of the order of 1 Gflop on a P4 can also show high memory access rates. Nevertheless, a memory intensive application without a dominant vector code character (such as AERMOD) is performance constricted on commodity architectures, where memory bandwidth is limited by the front-side bus (FSB) and cache design. The consequence of AERMOD’s memory footprint is that the path to memory can become a critical performance bottleneck. This bottleneck is somewhat ameliorated in AERMOD-HPCS compared to AERMOD-EPA/SRC, as described in the following.

Fig. 4: Memory instructions per floating point instruction for AERMOD-HPCS (hpc) compared to that for the U.S. EPA’s distribution (epa).

This shows the load balance of memory versus floating point instructions and demonstrates the extent to which AERMOD is a memory-bound application. As a consequence, AERMOD is extremely sensitive to any inefficiency in memory access. It is notable that AERMOD-HPCS reduces this load imbalance somewhat, but it is still critically out of balance.

Fig. 5: Memory instructions per flop for AERMOD-HPCS (hpc) compared to that for the U.S. EPA’s distribution (epa).

To confirm the previous result, this shows the memory instructions per flop. Clearly, for each flop there are more than 14 memory instructions in all cases on both the P4 and P4e platforms.
This is a gross imbalance suggesting that the CPU is starved of data and spends excessive cycles in an idle state.

6.3 Branching instructions

Branch instructions are one type of control transfer instruction. AERMOD reports branch instruction rates that are more than an order of magnitude larger than those shown by good vector code on the same platform.

Fig. 6: Logarithm of total number of mispredicted branch instructions for AERMOD-HPCS (hpc) compared to that for the U.S. EPA’s distribution (epa).

In all cases, on both 32-bit and 64-bit platforms, AERMOD-HPCS has a reduction of the branch instruction rates, and this reduction coincides with higher Mflops. It may be presumed that a reduction of this type of control transfer instruction is due to a more efficient use of hardware resources in AERMOD-HPCS.

Fig. 7: Million branch instructions per second for AERMOD-HPCS (hpc) compared to that for the U.S. EPA’s distribution (epa).

This shows that the underlying cause of reduced mispredicted branches in AERMOD-HPCS relative to AERMOD-EPA/SRC is that the rates of branches taken (TKN) and not taken (NTK) are brought into balance, thereby reducing the risk of lost cycles and flushed pipelines in the CPU.

6.4 TLB Cache usage

Between the processor and the first level of cache (L1) there is the translation lookaside buffer (TLB). The TLB is a small buffer (or cache) to which the processor presents a virtual memory address in order to look up its translation to a physical memory address. If the address is found in the TLB then there is a hit (no translation is computed) and the processor continues. The TLB is usually small, and efficiency depends on hit rates as high as 98%. If the translation is not found (a TLB miss) then several cycles are lost while the physical address is translated. Therefore TLB misses degrade performance. PAPI offers counters for TLB miss events for both instructions and data.
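The hit and miss behavior described above can be illustrated with a toy model. The following Python sketch simulates a fully associative TLB with least-recently-used replacement and reports the hit rate for two address streams. The function name, the 64-entry size, the 4096-byte page size, and the replacement policy are all assumptions made for illustration; they do not describe the P4 or P4e hardware.

```python
from collections import OrderedDict


def tlb_hit_rate(addresses, entries=64, page_size=4096):
    """Simulate a fully associative LRU TLB and return the hit rate."""
    tlb = OrderedDict()  # virtual page number -> placeholder translation
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        page = addr // page_size
        if page in tlb:
            hits += 1
            tlb.move_to_end(page)        # mark as most recently used
        else:
            if len(tlb) >= entries:
                tlb.popitem(last=False)  # evict the least recently used page
            tlb[page] = True             # install the new translation
    return hits / total if total else 0.0


# Sequential sweep over 16 pages in 64-byte steps: one miss per page.
sequential = range(0, 16 * 4096, 64)
# Cycling through 128 distinct pages with only 64 entries: every access misses.
scattered = [(i % 128) * 4096 for i in range(1024)]

print(f"sequential hit rate: {tlb_hit_rate(sequential):.3f}")
print(f"scattered hit rate:  {tlb_hit_rate(scattered):.3f}")
```

In this toy model the sequential stream hits on 1008 of 1024 accesses, while the scattered stream never hits because each page is evicted before it is reused. Code that jumps between many pages, as a deep call tree can do for instructions, behaves more like the second stream.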
In the case of AERMOD it is the instruction TLB misses that are critical, because of the very high incidence of control transfer instructions due to procedure calls. Higher instruction TLB miss rates suggest that the processor pipeline stalls more frequently because of a higher rate of control transfer instructions.

Fig. 8: TLB misses per second for AERMOD-HPCS (hpc) compared to that for the U.S. EPA’s distribution (epa).

While the TLB data miss rates (DM) have increased in AERMOD-HPCS relative to the EPA version, performance has improved, suggesting that it is the TLB instruction miss rates that are important for performance in AERMOD. The AERMOD-HPCS version is more efficient in reducing instruction TLB miss rates (IM) through optimization and resource allocation compared to the EPA version. The most dramatic reduction is in Case 2 for the 32-bit platform. Note that the units in Fig. 8 are misses per second, not millions per second, but the rates are nevertheless large.

6.5 Cache usage

Both the P4 and P4e platforms discussed here have L1 and L2 caches. A cache miss on either of these occurs when data or instructions are not found in the cache and an excursion to a higher level cache, or to memory, is necessitated. Cache misses result in lost performance because of increasing latency in the memory hierarchy. Memory latency is smallest at the register level and increases by an order of magnitude for an L1 cache reference, and by another order of magnitude to access L2 cache. In the case of AERMOD this analysis will focus on the L1 cache behavior.

Fig. 9: Million L1 cache misses per second for AERMOD-HPCS (hpc) compared to that for the U.S. EPA’s distribution (epa).

L1 instruction cache miss rates are significantly smaller than the data cache miss rates (as shown by the scaling factor used for the former in this figure). However, the instruction cache is more important in determining CPU efficiency.
Because of the very high rate of memory instructions in AERMOD, such miss rates are amplified in their consequences for CPU utilization efficiency. The important performance feature to observe is that AERMOD-HPCS significantly reduces L1 instruction cache miss rates.

Fig. 10: For instructions this shows the correlation of L1 cache misses (million per second) and TLB misses (number per second) for AERMOD-HPCS (hpc) compared to that for the U.S. EPA’s distribution (epa).

This is another view of the penalties associated with excursions to cache by the processor. Although there are only four data points, there is a clear trend showing that increased instruction TLB miss rates (horizontal axis) go together with increased L1 instruction cache miss rates (vertical axis). Thus there is a double impact here due to memory latency. Note also the dramatic difference in scale between the AERMOD-EPA/SRC version (left frame) and the AERMOD-HPCS version (right frame).

7.0 SPEEDUP OF AERMOD-HPCS OVER AERMOD-EPA/SRC
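Speedup in this section is the wall clock time of the version compiled from the U.S. EPA source code divided by the wall clock time of the HPC version, so a value above 1 means the HPC version is faster. A minimal Python sketch of that ratio follows. The function name, case labels, and timings are hypothetical and are not results from this report.

```python
def speedup(wall_time_reference, wall_time_optimized):
    """Return the reference wall clock time divided by the optimized one."""
    if wall_time_reference <= 0 or wall_time_optimized <= 0:
        raise ValueError("wall clock times must be positive")
    return wall_time_reference / wall_time_optimized


# Hypothetical wall clock times in seconds: (EPA source build, HPC build).
timings = {
    "case A": (1200.0, 800.0),
    "case B": (300.0, 240.0),
}

for case, (t_epa, t_hpc) in timings.items():
    print(f"{case}: speedup = {speedup(t_epa, t_hpc):.2f}")
```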
Fig. 11: Speedup of AERMOD-HPCS as measured by the ratio of the wall clock time for the U.S. EPA AERMOD version compiled from source code divided by the wall clock time for AERMOD-HPCS. The left hand frame is for 32-bit hardware and the right hand frame is for 64EMT hardware, with 32-bit and 64-bit Linux operating systems, respectively.

The code transformations applied in AERMOD-HPCS take account of procedures occurring at the leaves of a deep calling tree. Such procedures invariably have no loop structure but consist of simple arithmetic statements and conditional code blocks. These are the reasons for the lack of vectorizable loops and the high rates of branching instructions in the U.S. EPA version of AERMOD. As a result the extremely high instruction TLB misses for AERMOD are a critical source of performance limitations. They lead to very high memory instruction rates, due in part to high instruction TLB miss rates and in part to correlated L1 instruction cache miss rates. The AERMOD-HPCS version ameliorates this behavior by reducing its performance consequences. AERMOD-HPCS is faster than the U.S. EPA version AERMOD-EPA/SRC because it delivers:
8.0 CONCLUSIONS

This performance analysis of the U.S. EPA version of AERMOD shows that it is a memory intensive application with large rates of control transfer instructions, such as branching logic, and high procedure calling overhead. These features result in large observed rates for branching instructions and instruction TLB misses. These, in turn, result in stalled pipelines and lost cycles that would otherwise be available for arithmetic operations. In combination these two characteristics of the AERMOD code place a limit on the optimal performance possible from AERMOD on commodity platforms. This is because, by design, commodity hardware solutions offer a cost effective compromise between processor clock rates, cache size, and bandwidth (or latency) to memory. The AERMOD-HPCS version goes some way to ameliorate these performance limitations and as a result gains in computational efficiency, which translates into reduced wall clock time. However, there is still scope for further improvements, and progress will be reported in subsequent reports in this series.

REFERENCES

EPA-SCRAM: U.S. EPA, Technology Transfer Network, Support Center for Regulatory Air Models, http://www.epa.gov/scram001/.

PAPI, 2005: Performance Application Programming Interface, http://icl.cs.utk.edu/papi. Note that the use of PAPI requires a Linux kernel patch (as described in the distribution).