HiCLAS1 Technical Reports
HPC-2009-1: AERMOD-HPCS for Microsoft Windows™ (Part 3)
Copyright © 2009 HiCLAS1

PERFORMANCE ANALYSIS OF AERMOD-HPCS (1.8, Build 2) ON Microsoft Windows™ PLATFORMS

George Delic and Arnold R. Srackangast

1.0 INTRODUCTION

This is a performance report for commodity platforms applied to the Air Quality Model (AQM) AERMOD. New results are presented for AERMOD in two versions: the executable model released by the U.S. EPA (hereafter AERMOD-EPA) and the High Performance Computing (HPC) version developed by HiCLAS1 (AERMOD-HPCS), released as Build 2 of v1.8 (hereafter v1.8.2), with the previous release designated as Build 1 (hereafter v1.8.1). Both versions are designed to execute the U.S. EPA's regulatory AERMOD model on a single processor CPU (or core); no parallel version is studied in this report. The purpose of this report is to demonstrate the superior performance of a serial version of AERMOD-HPCS on commodity architectures with the 32-bit Windows™ operating system and to explore the performance differences between the two AERMOD-HPCS builds. Subsequent reports present results of AERMOD-HPCS on Linux™ operating systems for both 32-bit and 64-bit platforms.

A variety of older and newer processors from two vendors, Intel Corporation (Intel) and Advanced Micro Devices (AMD), have been used. A discussion of hardware features is presented to give insight into AERMOD performance, because processor features such as clock speed, bus speed, cache size, and memory architecture were found to be significant determinants of runtime performance. Also of interest is the issue of numerical differences when concentration results of both versions of AERMOD are compared on different platforms. For this purpose a companion report (HPC-2009-2) details the Quality Assurance process, with a discussion of numerical precision in AERMOD-EPA and AERMOD-HPCS on these Windows™ platforms, to address the requirement of a Model Equivalence Demonstration (MED). A fundamental goal of these studies is to discover to what extent portability of performance and numerical precision is possible with either variant of AERMOD, and some surprising results were uncovered.
Consideration is also given to questions of cost versus benefit in response to frequent requests from the AERMOD end-user community. For this purpose some analysis is presented on the trade-off in workload throughput between more costly, higher performance processors and less costly, lower performance ones.

2.0 CHOICE OF HARDWARE AND OPERATING SYSTEM

The hardware used for the results reported here includes three Intel® Pentium 4 Xeon™ (Intel) and two Advanced Micro Devices (AMD) processors (for further information visit the Web addresses given in the References Section). In both cases older and newer generations of CPUs are included, from laptops to quad core servers. The goal was to survey a variety of processors in common use for AERMOD simulations and assess their suitability. All platforms discussed in this report used a 32-bit Microsoft Windows™ operating system on these architectures.

Because of the scope of this study, extensive tabulations of results are not reproduced here but are available as downloadable PDF files: HPC-2009-1-Table1.pdf and HPC-2009-1-Table2.pdf. The following table is an abbreviated version of the header information from these tables, showing only platform-specific data to facilitate the comparisons made in the following discussion. Note that platforms 3, 4 and 5 have processors with more than one core, with nodeD2 the laptop and node17 a true multi-core server. Hardware architecture increases in complexity from left to right in this table, as does the richness of instruction sets available to modern compilers. More detail on this subject is given in a subsequent report (HPC-2009-3). Finally, as is clear from Table 1, all discussion in this report refers to a 32-bit Windows™ operating system.

Table 1: Platforms used in this analysis with their attributes

3.0 CHOICE OF COMPILERS

The compiler used for the AERMOD-EPA executable distributed by the U.S. EPA is not known, but is assumed to be the Compaq Visual Fortran compiler (CVF); for details see report HPC-2007-2. The executable distributed by the U.S. EPA was obtained from the distribution center at http://www.epa.gov/scram001 and was used for all the results designated here as AERMOD-EPA (U.S. EPA's AERMOD.EXE). Other results, designated here as AERMOD-HPCS, were obtained from a compilation of AERMOD-HPCS source code that was modified from the U.S. EPA source distribution available at the above named U.S. EPA SCRAM Web portal.

The compiler used for AERMOD-HPCS in this analysis (and distribution) is not named, but was chosen after testing the most popular compilers currently available. Considerable effort has been invested in exhaustively testing multiple compiler options to enable the best performance consistent with the code structure changes implemented at HiCLAS1, especially with a view to enhancing vector performance in a predominantly scalar code. With this release, AERMOD-HPCS v1.8 (Build 2, hereafter v1.8.2), significant new optimization levels were applied that were not available two years ago when the previous build was prepared. These represent significant advances in compiler technology over what was available for CVF a decade ago. As one example, the CVF compiler pre-dates the advent of the Streaming Single-Instruction-Multiple-Data Extensions (SSE) instruction sets. These first appeared with the Pentium 3 processor (SSE), and as SSE2 with the Xeon generation. They have now progressed into a fourth generation (SSE4) that modern compilers use in allocating resources to compiled code on modern CPUs such as that of node17.
As a result of this proliferation of instruction sets and processors, it has become more difficult to produce compiled code with truly portable performance and precision across multiple CPUs that have different pipeline and memory architectures. Thus, while Build 2 of AERMOD-HPCS v1.8 has been compiled to target multiple CPUs of the modern generation to offer portable performance, it may not execute on older generations of processors, and in such cases the previous Build of AERMOD-HPCS will continue to be available from HiCLAS1. There is still ample scope to improve performance, and subsequent releases of AERMOD-HPCS will be made available once they pass QA and MED testing.

4.0 CHOICE OF BENCHMARKS

The AERMOD model describes pollutant dispersion and deposition and is now an approved regulatory model for new source reviews and other permitting applications. It is available in the AERMOD-EPA version at the U.S. EPA’s Support Center for Regulatory Air Models at the URL portal named above. The version used here is AERMOD 07026 and is designated as AERMOD-EPA or AERMOD.EXE. To create the High Performance Computing (HPC) version of AERMOD, the source code of the U.S. EPA distribution was progressively modified to enhance performance. The resulting code is designated AERMOD-HPCS, and at v1.8 (the current release is Build 2) it was deemed a sufficient improvement over AERMOD-EPA to warrant exhaustive Quality Assurance (QA) for the purposes of a Model Equivalence Demonstration. For QA testing the four Cases listed in Table 2 of HPC-2007-1 were used as benchmarks. These benchmarks are considered to be representative of actual applications of AERMOD. Input and output files for Case 2 are included in the distribution so that the installation can be tested after download of the AERMOD-HPCS executable model.

5.0 BENCHMARK RESULTS

For the platforms listed in Table 1 the following sub-sections show selected performance results in graphical form. The comparison of AERMOD-HPCS and AERMOD-EPA is grouped in categories such as speedup and workload throughput. The next section explores the architectural factors that influence performance and how they relate to AERMOD code structure and cost benefit issues.

5.1 Speedup of AERMOD-HPCS v1.8 versus AERMOD-EPA

Speedup of AERMOD-HPCS over AERMOD-EPA is shown in Fig. 1 (Build 1) and Fig. 2 (Build 2); the new results are those in Fig. 2. For all four test cases Build 2 gives better performance on the Intel platforms (1, 2, 5), whereas Build 1 gives better performance on the AMD platforms (3, 4).
Fig. 1: Speedup as the ratio of runtimes for AERMOD-HPCS v1.8.1 (Build 1) and AERMOD-EPA with each of the four Cases listed in Table 2 of report HPC-2007-1 for the five platforms of Table 1. Performance enhancement ranges from 1.6 to 2.8 times faster than AERMOD-EPA (depending on the platform and case used in the benchmark).
Fig. 2: Speedup as the ratio of runtimes for AERMOD-HPCS v1.8.2 (Build 2) and AERMOD-EPA with each of the four Cases listed in Table 2 of report HPC-2007-1 for the five platforms of Table 1. Performance enhancement ranges from 1.6 to 3.3 times faster than AERMOD-EPA (depending on the platform and case used in the benchmark).

The individual speedups from Figs. 1 and 2 for all cases may be aggregated into a mean speedup on each platform. This is shown in Fig. 3, where it is easier to discern the performance differences of (a) Build 2 versus Build 1, and (b) AMD versus Intel platforms. Of specific interest here is the question: how much performance gain does Build 2 of AERMOD-HPCS offer over Build 1? To answer this question, Fig. 4 shows the difference in mean speedup between the two AERMOD-HPCS builds as a percentage. On issue (a), Build 2 of AERMOD-HPCS shows improved mean performance on the Intel platforms. However, on issue (b), both builds show a reduced speedup on AMD platforms compared to the Intel results. The reason for this is that AERMOD.EXE gains a performance boost from the specific memory architecture of the AMD CPUs. Nevertheless, AERMOD-HPCS v1.8 (Build 1) still has a mean speedup over AERMOD.EXE that is close to a factor of two.
Fig. 3: The mean speedup over the four cases for AERMOD-HPCS versus AERMOD-EPA on the five platforms of Table 1. Performance enhancement ranges from 1.6 to 3.1 times faster than AERMOD-EPA, depending on the platform and the AERMOD-HPCS build used in the benchmark.
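
The metrics behind Figs. 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the script used for this report. The runtimes are made-up placeholder values (the measured data are in the downloadable tables), and the function names are hypothetical. The sketch assumes that speedup is the AERMOD-EPA runtime divided by the AERMOD-HPCS runtime, that the mean is an arithmetic mean over the four cases, and that the percentage gain is the relative difference of the two mean speedups.

```python
from statistics import mean

def speedup(t_epa, t_hpcs):
    """Speedup of AERMOD-HPCS over AERMOD-EPA as a ratio of runtimes."""
    return t_epa / t_hpcs

def mean_speedup(epa_times, hpcs_times):
    """Arithmetic mean of the per-case speedups (assumed form of the mean)."""
    return mean(speedup(e, h) for e, h in zip(epa_times, hpcs_times))

def percent_gain(mean_b2, mean_b1):
    """Percentage gain in mean speedup of Build 2 relative to Build 1 (assumed form)."""
    return 100.0 * (mean_b2 - mean_b1) / mean_b1

# Placeholder runtimes in seconds for four cases on one platform (NOT measured data).
epa = [100.0, 200.0, 300.0, 400.0]
build1 = [50.0, 100.0, 150.0, 200.0]
build2 = [40.0, 80.0, 120.0, 160.0]

s1 = mean_speedup(epa, build1)
s2 = mean_speedup(epa, build2)
gain = percent_gain(s2, s1)

print(f"Build 1 mean speedup: {s1:.2f}")
print(f"Build 2 mean speedup: {s2:.2f}")
print(f"Gain of Build 2 over Build 1: {gain:.1f}%")
```

With these definitions a negative gain means that Build 1 delivers more speedup than Build 2, which is the situation reported for the AMD platforms.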
Fig. 4: The percentage gain in mean speedup over AERMOD-EPA by AERMOD-HPCS Build 2 relative to Build 1 for the five platforms of Table 1. The mean here is for speedup over the four cases on each platform. The negative percentages show that on the AMD platforms (3 and 4) AERMOD-HPCS Build 1 gives more speedup over AERMOD-EPA than does Build 2.

5.2 Workload throughput of AERMOD-HPCS versus AERMOD-EPA

Of more relevance than individual timing of separate cases is the answer to the question: how much time is required to complete all cases? This question is equivalent to asking: what is the workload throughput? In this study the four cases constitute the workload and, when executed sequentially in serial mode on all platforms, the total time is as shown in Fig. 5. However, a more suitable metric for workload throughput is the inverse of the workload time, which is akin to a process frequency, or rate. Fig. 6 shows values of this workload throughput metric on the five platforms of Table 1. The lowest value corresponds to AERMOD-EPA on platform 3 (node16, the AMD laptop), and the highest value of the workload throughput is for AERMOD-HPCS Build 2 on node17 (the dual quad core Intel platform).
Fig. 5: The time (in hours) to complete a workload of all four cases (see Table 2 of HPC-2007-1), with AERMOD-HPCS Build 2, Build 1, and AERMOD-EPA, respectively, on the five platforms of Table 1. Smaller values signal better results.
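
The workload throughput metric, defined in the text as the inverse of the total workload time, can be sketched as follows. The case times are placeholder values chosen for easy arithmetic, not measured results, and the function names are hypothetical.

```python
def workload_time(case_times_hours):
    """Total time to run all cases sequentially in serial mode."""
    return sum(case_times_hours)

def throughput(case_times_hours):
    """Workload throughput as the inverse of workload time (workloads per hour)."""
    return 1.0 / workload_time(case_times_hours)

# Placeholder per-case times in hours on two hypothetical platforms (NOT measured data).
slow_platform = [1.0, 2.0, 3.0, 4.0]    # 10 hours in total
fast_platform = [0.25, 0.5, 0.5, 0.75]  # 2 hours in total

ratio = throughput(fast_platform) / throughput(slow_platform)
print(f"Slow platform: {throughput(slow_platform):.2f} workloads/hour")
print(f"Fast platform: {throughput(fast_platform):.2f} workloads/hour")
print(f"Throughput ratio: {ratio:.1f}")
```

Because the metric is a rate, the ratio of two throughputs equals the inverse ratio of the corresponding workload times.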
Fig. 6: Throughput on the five platforms of Table 1 for a workload of all four cases (see Table 2 of HPC-2007-1), with AERMOD-HPCS Build 2, Build 1, and AERMOD-EPA, respectively. Larger values signal better results. The lowest throughput is for AERMOD-EPA on Platform 3 (the AMD laptop), and the highest is for AERMOD-HPCS Build 2 on node17.

6.0 COST BENEFIT ANALYSIS OF PLATFORMS FOR AERMOD

6.1 Clock Speed

It is a commonly held belief that a higher clock rate will give higher performance throughput. However, with the emergence of multi-core CPUs, clock speeds have not risen in the last two years, but have tended to decrease. So it is meaningful to ask: what are the consequences for AERMOD applications? A comparison of the results on platform 1 (node10) and platform 5 (node17) shows that the latter offers approximately two times more throughput for any of the three versions of AERMOD. These two platforms have virtually the same clock speed (3 GHz), so it may be concluded that factors other than clock speed determine efficiency for AERMOD throughput.

6.2 Bus Speed

Inspection of Table 1 for bus speed shows that node17 has a rate on the Front Side Bus (FSB) that is 2.5 times higher than that of node10. For AERMOD this is important because of its memory footprint (AERMOD is a memory-bound application; for a discussion see Section 6.2 of HPC-2007-4). On the X5450 quad core Intel Architecture of node17 the bus rate corresponds to a maximum transfer rate of 10.66 Gbytes per second on a quad-pumped bus running off a 333 MHz clock. Thus, this processor transfers data four times per bus clock, and in addition the address bus can deliver addresses two times per bus clock. Also, while not important for the serial benchmarks here, the X5450 CPU supports a Dual Independent Bus (DIB) with one CPU on each bus. Thus memory bandwidth is significantly enhanced on the new architecture. This is in contrast to node10, where two CPUs share the same FSB.
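
The quoted maximum transfer rate for node17 can be checked with a short calculation. The clock rate and the four transfers per clock are taken from the text above. The 8-byte (64-bit) width of the FSB data path is not stated in this report and is supplied here as an assumption. The function name is hypothetical.

```python
BUS_CLOCK_HZ = 333e6      # bus clock as quoted in the text (333 MHz)
TRANSFERS_PER_CLOCK = 4   # quad-pumped: four data transfers per bus clock
BUS_WIDTH_BYTES = 8       # ASSUMPTION: 64-bit data path, not stated in this report

def peak_bandwidth(clock_hz, transfers_per_clock, width_bytes):
    """Peak transfer rate in bytes per second for a multiply-pumped bus."""
    return clock_hz * transfers_per_clock * width_bytes

peak = peak_bandwidth(BUS_CLOCK_HZ, TRANSFERS_PER_CLOCK, BUS_WIDTH_BYTES)
print(f"Peak FSB transfer rate: {peak / 1e9:.2f} Gbytes per second")
```

This is a theoretical peak only. The sustained bandwidth seen by a memory-bound code such as AERMOD will be lower.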
The AMD architecture does not have an FSB, but memory bandwidth and latency are nevertheless an issue.

6.3 Cache size

Cache comes in either two or three levels, with Level 1 (L1) closest to the processor. Usually each core has its own L1 cache, split between instructions and application data. Level 2 (L2) cache is usually shared between cores on a multi core CPU, but may be dedicated if only one core is active. Thus on node17, with the serial benchmarks used in this report, the L2 cache is massive when compared to the other platforms. Further, the much larger L1 cache on node17 compared to node10 is a clear indicator of where the performance boost could originate. However, it is interesting to note that, while both AMD platforms (node16 and nodeD2) have a much larger L1 cache size per core, they do not outperform either node10 or node17.

6.4 Memory

As the preceding discussion shows, memory and memory architecture are critical to performance for a memory bound application such as AERMOD (for the reasons see report HPC-2007-4). It is in the memory architecture that Intel and AMD processors differ, and hence performance differs, even for the same executable code. Cache size and placement are other important factors (as discussed above). AERMOD is a small memory model compared to other AQMs, but it should be observed that, as higher levels of optimization are applied in compilation of AERMOD-HPCS, the resulting executable is larger for Build 2 than for Build 1. On the subject of memory, it is also noted that the Windows operating systems used in this report are 32-bit operating systems and cannot address more than approximately 3.5 GB in the user space. Therefore, while eight cores are visible, eight concurrent executions of AERMOD or AERMOD-HPCS could result in swapping and degraded memory performance (although tests with HiPERiSM servers have not shown such performance degradation, even when all eight cores execute AERMOD-HPCS workloads).

6.5 Compiler Maturity

Compiler development tends to lag processor technology; thus one speaks of "compiler maturity" as the extent to which compilers track architectural developments. A measure of this is the comparison of Build 1 and Build 2 of AERMOD-HPCS, which are based, respectively, on two released versions of the compiler that are two steps and two years apart. Both builds use the same source code, and over those two years the compiler's ability to extract performance has improved by as much as 20% (see platform 2 in Fig. 4). However, when comparing AMD and Intel platforms it becomes clear that this performance is not necessarily portable across architectures. With modern compiler technology performance may be gained, but at the cost of some portability. The latter is probably the result of the proliferation of architectures and instruction sets with more modern CPUs; for more discussion of this issue see report HPC-2009-3.

6.6 Cost-benefit and Risk Assessment

AERMOD users persist in asking "what is the best processor/platform configuration for AERMOD throughput?" A response to such a question also requires the issues of cost and risk to be addressed. Cost refers to both the Total Cost of Ownership (TCO) and the cost of risk (lost production due to resource failure or downtime), cost of replacement, or cost of redundancy. As an example of cost and risk assessment, some comments are made here on the use of a simple laptop (node16) versus a dual quad-core server (node17). The performance difference is a factor in the range 5 to 6 (see Fig. 6), and the cost differential is also a factor of 6, with the laptop cost at $620 (in 2008). For this comparison the performance difference means a longer wait on throughput with the laptop than with the server. However, the risk level is higher with the server because, if it fails, production is halted while the server is repaired. Furthermore, if the server needs to be replaced, the TCO doubles.
On the other hand, for the cost of one server, a small task farm of six laptops offers a lower risk because failure of one laptop does not greatly cripple production throughput. With the server, a plan for redundancy implies that a second server must be available to guarantee uninterrupted production throughput, whereas with the laptop farm a redundant laptop is a considerably less costly redundancy plan. Table 3 summarizes some of these issues. Other issues in a complete capacity plan that are not discussed here include power and cooling requirements, software costs, costs of securing data against loss, backup and restore plans, and administrative costs.

Table 3: Cost and risk comparison of a dual quad core server and a dual core laptop in application to AERMOD.
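
The cost and risk comparison of this section can be put into numbers with a short sketch. The laptop cost, the cost factor of 6, and the performance ratio of 5 to 6 are taken from the text. Everything else is a simplifying assumption: one serial job per machine (concurrent jobs on the server's eight cores are ignored), throughput measured in units of one laptop, and risk reduced to the fraction of capacity lost when a single machine fails. The function name is hypothetical.

```python
LAPTOP_COST_USD = 620.0         # laptop cost in 2008, from the text
COST_FACTOR = 6.0               # server cost / laptop cost, from the text
PERF_RATIO_RANGE = (5.0, 6.0)   # server / laptop serial throughput, from the text

def capacity_lost_on_one_failure(n_machines):
    """Fraction of production capacity lost when one of n equal machines fails."""
    return 1.0 / n_machines

server_cost = LAPTOP_COST_USD * COST_FACTOR
n_laptops = int(server_cost // LAPTOP_COST_USD)  # laptops bought for one server's cost

print(f"Server cost: ${server_cost:.0f}, laptops for the same cost: {n_laptops}")
for perf_ratio in PERF_RATIO_RANGE:
    farm_throughput = float(n_laptops)  # laptop units, one serial job per laptop
    server_throughput = perf_ratio      # laptop units, one serial job on the server
    print(f"Performance ratio {perf_ratio:.0f}: farm = {farm_throughput:.0f}, "
          f"server = {server_throughput:.0f} (laptop units)")

print(f"Capacity lost on one failure, server: {capacity_lost_on_one_failure(1):.0%}")
print(f"Capacity lost on one failure, farm: {capacity_lost_on_one_failure(n_laptops):.0%}")
```

Under these assumptions the two options deliver comparable aggregate serial throughput for the same outlay. The difference lies in the capacity lost on a single failure and in the wait time for any individual case, which remains longer on a laptop.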

7.0 CONCLUSIONS

This performance analysis of AERMOD-HPCS on Microsoft Windows™ platforms shows that it delivers a solution that is 1.6 to 3.3 times faster than the U.S. EPA's distribution of the AERMOD executable for a range of platforms and cases. On Intel platforms the new Build 2 release of AERMOD-HPCS offers as much as 20% more speedup over the U.S. EPA's executable than the previous release. However, on AMD processors the earlier release of AERMOD-HPCS provides the better workload throughput. When different platforms running the Microsoft Windows™ operating system are compared, the best workload throughput is observed on an Intel dual quad core server. A discussion of the cost and risk of a server versus a simple laptop farm suggests that either is suitable for AERMOD usage, but the choice between them must depend on the production criteria at the local site.

The next report in this series discusses details of numerical results on these Windows™ platforms based on a comparison of individual concentrations produced by AERMOD-HPCS and the U.S. EPA version. While these two reports are limited to a 32-bit operating system, subsequent reports extend the survey to both 32-bit and 64-bit Linux operating systems across multiple platforms.

REFERENCES

AMD: http://products.amd.com/en-us/default.aspx
Intel: http://www.intel.com/products/processor_number/chart/xeon5000.htm