HiCLAS1 Technical Reports
HPC-2009-3: AERMOD-HPCS for Linux™ (Part 1)
Copyright © 2009 HiCLAS1

PERFORMANCE ANALYSIS OF AERMOD-HPCS (1.8, Build 2) ON Linux™ PLATFORMS

George Delic and Arnold R. Srackangast

1.0 INTRODUCTION

This is a performance report for the Air Quality Model (AQM) AERMOD on commodity platforms running Linux™, and is a companion to the previous reports for Windows™ versions. New results are presented for the High Performance Computing (HPC) version of AERMOD developed by HiCLAS1 (AERMOD-HPCS), released as Build 2 of v1.8 (hereafter v1.8.2). The purpose of this report is to demonstrate the superior performance of a serial version of AERMOD-HPCS on commodity architectures with 32-bit and 64-bit Linux™ operating systems. No release for Linux™ is made available by the U.S. EPA and therefore no comparisons with a U.S. EPA release are discussed here. However, it is informative to compare AERMOD-HPCS performance on a Windows™ 32-bit operating system with the 32-bit and 64-bit Linux™ results discussed here. A variety of older and newer processors from two vendors, Intel Corporation (Intel) and Advanced Micro Devices (AMD), have been used. A discussion of hardware features is presented to give insight into AERMOD performance, because processor features such as clock speed, bus speed, cache size, and memory architecture were found to be significant determinants of runtime performance. Also of interest is the issue of numerical differences when concentration results of both versions of AERMOD are compared on different platforms. For this purpose a companion report (HPC-2009-4) details the Quality Assurance process, with a discussion of numerical precision in AERMOD EPA releases and AERMOD-HPCS on these Linux™ platforms, to address the requirement of a Model Equivalence Demonstration (MED). A fundamental goal of these studies is to discover to what extent portability of performance and numerical precision is possible at all, and some surprising results were uncovered. Consideration is also given to questions of Windows™ versus Linux™ releases in response to frequent requests from the AERMOD end-user community.
For this purpose some analysis is presented on the trade-off for workload throughput with either Intel or AMD processors.

2.0 CHOICE OF HARDWARE AND OPERATING SYSTEM

The hardware used for the results reported here includes four Intel® Pentium 4™ (Intel), one Advanced Micro Devices (AMD) processor, and the Intel Itanium processor (for further information visit the Web addresses given in the References section). Both older and newer generations of CPUs are included in this study. The goal was to survey a variety of processors in common use for AERMOD simulations and assess their suitability. All platforms discussed in this report used either a 32-bit or a 64-bit Linux™ operating system. Because of the scope of this study, extensive tabulations of results are not reproduced here but are available as a downloadable PDF file: HPC-2009-3-Table1.pdf. The following table is an abbreviated version showing only platform-specific information to facilitate the comparisons made in the following discussion. Note that platforms 3 and 5 have processors with more than one core. Hardware architecture varies in complexity from left to right in this table, as does the richness of the instruction sets available to modern compilers.

Table 1: Platforms used in this analysis with their attributes

3.0 CHOICE OF COMPILERS

The source code for AERMOD is distributed by the U.S. EPA at http://www.epa.gov/scram001 and was compiled for all the results designated here as AERMOD-EPA. Other results, designated here as AERMOD-HPCS, were obtained by compiling AERMOD-HPCS source code that was modified from the U.S. EPA source distribution available at the above-named U.S. EPA SCRAM Web portal. The compiler used for AERMOD-HPCS in this analysis (and distribution) is unnamed but was chosen after testing the most popular compilers currently available. Considerable effort has been invested in exhaustively testing multiple compiler options to enable the best performance consistent with the code structure changes implemented at HiCLAS1, especially with a view to enhancing vector performance in a predominantly scalar code. With this release, AERMOD-HPCS v1.8 (Build 2, hereafter v1.8.2), significant new optimization levels were applied that were not available two years ago when the previous build for Windows™ was prepared. For this reason only Build 2 is offered for the Linux™ release. The new optimization options represent highly significant advances in compiler technology over what was available even a few years ago. As a result of the proliferation of instruction sets and processors, it has become more difficult to produce compiled code with truly portable performance and precision across multiple CPUs that have different pipeline and memory architectures. Thus, while Build 2 of AERMOD-HPCS v1.8 has been compiled to target multiple CPUs of the modern generation to offer portable performance, it may not execute on older generations of processors. More discussion on this subject is presented below. There is still ample scope to improve performance, and subsequent releases of AERMOD-HPCS will be made available once they pass QA and MED testing.

4.0 CHOICE OF BENCHMARKS

The AERMOD model describes pollutant dispersion and deposition and is now an approved regulatory model for new source reviews and other permitting applications. It is available in the AERMOD-EPA version at the U.S. EPA's Support Center for Regulatory Air Models at the URL portal named above, but only as a Windows™ executable application, without a corresponding Linux™ release. The version used here is AERMOD 07026 and is designated as AERMOD-EPA to indicate compilation from U.S. EPA source. To create the High Performance Computing (HPC) version of AERMOD, the source code for the U.S. EPA distribution was progressively modified to enhance performance. The resulting code is designated AERMOD-HPCS, and at v1.8 (the current release is Build 2) it was deemed to be a sufficient improvement over AERMOD-EPA to warrant exhaustive Quality Assurance (QA) for the purposes of a Model Equivalence Demonstration. For QA testing the four Cases listed in Table 2 of HPC-2007-1 were used as benchmarks. These benchmarks are considered to be representative of actual applications for AERMOD. Input and output files for Case 2 are included in the distribution for the purpose of testing the installation after download of the AERMOD-HPCS executable model.

5.0 BENCHMARK RESULTS

For the platforms listed in Table 1 the following sub-sections show selected performance results in graphical form, with the comparison of AERMOD-HPCS and AERMOD-EPA grouped in categories such as speedup, workload time, and workload throughput. The next section explores the architectural factors that influence performance and how they relate to AERMOD code structure and cost-benefit issues.

5.1 Speedup of AERMOD-HPCS v1.8.2 on Windows versus Linux

It was noted that the same build of AERMOD-HPCS executes faster on Linux™ than it does on Windows™. Tables 2 and 3 show the results of a comparison with 32-bit and 64-bit Linux™, respectively, using the Windows™ results of report HPC-2009-1.
For Table 2 the platforms are nearly identical but ran their respective operating systems, whereas for Table 3 the same dual-boot platform was used. The speedup results of Tables 2 and 3 are summarized in Fig. 1 as the workload time for all four cases. It is clear from Tables 2 and 3 that the smallest workload time is for the 64-bit Linux™ platform.

Table 2: Comparison of results for workload time on 32-bit Windows XP and 32-bit Linux operating systems with two nearly identical Intel™ processors.
Table 3: Comparison of results for workload time on 32-bit Windows XP and 64-bit Linux operating systems with identical Intel™ processors (node17 dual boot system).
Fig. 1: Speedup as the ratio of runtimes for AERMOD-HPCS v1.8 (Build 2) with each of the four Cases listed in Table 2 of report HPC-2007-1. The left-hand pair is for two 32-bit operating systems and the right-hand pair is for 32-bit Windows™ versus 64-bit Linux™.

5.2 Workload throughput of AERMOD-HPCS versus AERMOD-EPA

In this study the four cases constitute the workload and, when executed sequentially in serial mode on all platforms, the total time on each platform is as shown in Fig. 2. However, a more suitable metric for workload throughput is the inverse of the workload time, which is akin to a process frequency, or rate. Fig. 3 shows that the highest value of the workload throughput is for AERMOD-HPCS Build 2 on node17 (the dual quad core Intel platform) with native 64-bit Linux™ compilation and execution. This last result should be compared to the Windows™ version of AERMOD-HPCS on Platform 5 in Fig. 6 of report HPC-2009-1 (these results are for the same platform with a dual boot capability).
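
The two metrics used in this section, speedup as a ratio of runtimes and throughput as the inverse of the workload time, can be sketched in a few lines. The sketch below is illustrative only: the function names are hypothetical and the runtimes are placeholders invented for the example, not measurements from this report. Python is used purely for convenience.

```python
from typing import Dict


def workload_time(case_runtimes_s: Dict[str, float]) -> float:
    """Total time (seconds) to run all cases sequentially in serial mode."""
    return sum(case_runtimes_s.values())


def throughput_per_hour(case_runtimes_s: Dict[str, float]) -> float:
    """Workloads completed per hour: the inverse of the workload time."""
    return 3600.0 / workload_time(case_runtimes_s)


def speedup(reference_s: float, candidate_s: float) -> float:
    """Ratio of runtimes; a value above 1 means the candidate is faster."""
    return reference_s / candidate_s


# Placeholder runtimes in seconds (assumed values, NOT results from this report).
windows_32 = {"case1": 600.0, "case2": 1200.0, "case3": 1800.0, "case4": 3600.0}
linux_64 = {"case1": 500.0, "case2": 1000.0, "case3": 1500.0, "case4": 3000.0}

if __name__ == "__main__":
    t_win = workload_time(windows_32)
    t_lin = workload_time(linux_64)
    print(f"workload time: {t_win / 3600.0:.2f} h vs {t_lin / 3600.0:.2f} h")
    print(f"throughput:    {throughput_per_hour(windows_32):.2f} vs "
          f"{throughput_per_hour(linux_64):.2f} workloads per hour")
    print(f"speedup:       {speedup(t_win, t_lin):.2f}")
```

Because throughput is a rate, it is the natural quantity to compare when asking how many workloads a platform can complete in a fixed period, whereas workload time answers how long a single workload takes.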
Fig. 2: The time (in hours) to complete a workload of all four cases (see Table 2 of HPC-2007-1) with AERMOD-HPCS Build 2 on each of the six platforms; smaller values are better.
Fig. 3: Throughput for a workload of all four cases (see Table 2 of HPC-2007-1) with AERMOD-HPCS Build 2 on each of the six platforms; larger values are better.

6.0 PLATFORM BENEFITS FOR AERMOD

6.1 Architecture issues

All the discussion of architectural issues raised in Section 6 of report HPC-2009-1 applies irrespective of the operating system and should be studied in conjunction with this report. Below, some commentary is added on instruction sets and the variability of performance results across this group of processors.

6.2 SSE instruction sets

Streaming Single-Instruction-Multiple-Data Extensions (SSE) instruction sets first appeared with the Intel® Pentium 3™ processor (SSE), and as SSE2 with the Xeon generation and contemporary AMD processors. These instruction sets have now progressed into a fourth generation (SSE4) that modern compilers use in allocating resources to compiled code on modern CPUs such as the quad core processors used here. Table 4 sketches a road map of the SSE instruction set evolution path to date. Because of this multiplicity of processors and instruction sets it is a challenge to develop code that has truly portable performance. For example, in this study it was discovered that, while the AMD Opteron processor supports SSE3 instruction sets, AERMOD-HPCS in a 64-bit Linux™ build for this option fails to execute on this processor (for an explanation of the cause see AMDvsIntel in the References section). However, the 32-bit build with an earlier generation of SSE2 support does execute and produced the results presented above for nodeD3 (platform 3). For this reason, whichever build of AERMOD-HPCS is chosen (32-bit or 64-bit) should be tested at the local installation. Future work at HiCLAS1 will develop a 64-bit build of AERMOD-HPCS v1.8.2 that does execute on AMD64 platforms with 64-bit Linux™.

Table 4: Generations of SSE support and applicability of AERMOD-HPCS Builds.
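
A first check of the SSE generations reported by a local processor can be made by reading the "flags" line of /proc/cpuinfo on an x86 Linux™ system. The sketch below is an illustration and not part of the AERMOD-HPCS distribution; the function names are hypothetical. Note that Linux reports SSE3 under the flag name "pni". As the Opteron example above shows, a reported flag does not guarantee that a particular build will execute, so running the distributed test case remains the decisive check.

```python
from typing import Dict, Set

# Flag names as they appear on the "flags" line of /proc/cpuinfo on x86 Linux.
# SSE3 is reported as "pni" (Prescott New Instructions).
SSE_FLAGS = {
    "SSE": "sse",
    "SSE2": "sse2",
    "SSE3": "pni",
    "SSSE3": "ssse3",
    "SSE4.1": "sse4_1",
    "SSE4.2": "sse4_2",
}


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags of the first processor listed in the text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def sse_support(cpuinfo_text: str) -> Dict[str, bool]:
    """Map each SSE generation to whether its flag is reported."""
    flags = cpu_flags(cpuinfo_text)
    return {name: flag in flags for name, flag in SSE_FLAGS.items()}


if __name__ == "__main__":
    # Reads the live system description; only meaningful on x86 Linux.
    with open("/proc/cpuinfo") as handle:
        report = sse_support(handle.read())
    for name, present in report.items():
        print(f"{name:7s} {'yes' if present else 'no'}")
```

Parsing is kept separate from file access so that the same functions can be applied to a saved copy of /proc/cpuinfo collected from another node.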

6.3 Variation across platforms

Inspection of Figs. 2 and 3 shows a wide range of variability in performance across the platforms used in this study. In the case of 64-bit Linux™ operating systems, the legacy technology of the Intel Itanium™ processor (platform 6) has been outpaced by the newer quad core x86 processors (platforms 3 and 5), and is also outperformed by the 64EMT architecture (platform 4). In the past, the Itanium platform was preferred in scientific applications because of the availability of a native 64-bit word length and the ability to address more than 4GB of global memory (a limitation of Linux™ on x86_32 systems). However, these features are now available with the newer x86_64 Linux™ kernels on x86 architectures (AMD64, 64EMT, and later). A comparison of results on 32-bit versus 64-bit Linux™ shows only a modest gain for the latter on the 64EMT generation (platform 4), but a very sharp gain of more than a factor of two on the Intel X5450 processor (platform 5). The AMD processor used in this study (platform 3) did not support the 64-bit version of the AERMOD-HPCS Build 2 executable when it was loaded on a 64-bit Linux™ operating system. For this reason, the results shown for platform 3 are from the 32-bit Linux™ build of AERMOD-HPCS executed on a 64-bit Linux™ operating system (for an explanation of the cause see AMDvsIntel in the References section).

7.0 CONCLUSIONS

This performance analysis of AERMOD-HPCS on Linux™ platforms shows that it delivers a solution that is as much as 22% faster (in the mean) than the AERMOD-HPCS model built on a Windows™ operating system for the same processor. With quad core processors the Intel CPU provided superior workload throughput when compared to the AMD Opteron™ processor used in this study. The discussion of SSE generations suggests that processors at the local site need to be examined for the level of SSE support they offer (irrespective of vendor specification claims).
The next report in this series discusses details of numerical results on these Linux™ platforms based on a comparison of individual concentrations produced by AERMOD-HPCS and the compiled version of U.S. EPA (unmodified) source code.

REFERENCES

AMD: http://products.amd.com/en-us/default.aspx
Intel: http://www.intel.com/products/processor_number/chart/xeon5000.htm
AMDvsIntel: http://developer.amd.com/documentation/articles/pages/4292005119.aspx