
By Dan Stanzione and Tommy Minyard, Texas
Advanced Computing Center
For a long time now,
we've known that sustained performance on clusters is a complicated
thing to measure, especially when talking about parallel jobs. In
columns like this one, and others too numerous to count, it has been
stressed that sustained performance for real applications is as much
about balance (processor, interconnect, filesystem, memory) as it is
about clock frequency. But despite that, the speed of the processor has
still been the driving force for many when making decisions about
clusters. Look no further than the
Top 500
list to see that this is true. The Top 500 uses the
High Performance Linpack (HPL) benchmark,
which, while somewhat sensitive to interconnect performance, generally
delivers a pretty high fraction of the peak performance of the processor
(in the last few years, that means 60-70% on poorly balanced clusters,
and 75-85% on well balanced ones... you can see this yourself by working
out the ratio of Rmax to Rpeak on the reported numbers in the list,
particularly for the older systems).
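That ratio check is nothing more than a division; here is a minimal C sketch, using made-up Rmax and Rpeak values rather than figures from any actual list entry:

#include <stdio.h>

/* HPL efficiency is the ratio of measured (Rmax) to theoretical (Rpeak)
 * performance. The values below are made-up illustrations, not entries
 * from any actual Top 500 list. */
int main(void)
{
    double rpeak_tflops = 100.0;   /* theoretical peak of the machine */
    double rmax_tflops  = 78.0;    /* measured HPL result */

    printf("HPL efficiency: %.1f%%\n", 100.0 * rmax_tflops / rpeak_tflops);
    return 0;
}

On a well balanced system you would expect that percentage to land toward the top of the ranges quoted above.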
Given that HPL delivers a
pretty good fraction of peak performance on most processors, not
surprisingly, higher clock rate has meant higher HPL, higher Top 500
number, and the impression that your new cluster is "faster." The big
gotcha here is the not-so-well-kept secret in the HPC community that *peak* performance and *real* application performance don't really have that much to do with one another, and that HPL performance doesn't reflect the ability of a cluster to get work done. This has become especially true with the last generation of new quad-core processors.
The fact of the matter is, while processors have been following the
Moore's Law curve, most of our real applications have been increasingly
starved for memory bandwidth (i.e. the ability to get data from main
memory into those increasingly fast processors). HPL doesn't really
suffer too much from inadequate memory bandwidth, so the magnitude of
the problem hasn't been quite as obvious.
Intel has been well aware of this, however, and has taken a quantum leap forward in memory bandwidth with the Intel Xeon processor 5500 sequence, the “Nehalem” series of processors (and continued into the current Intel Xeon processor 5600 sequence, the “Westmere” processors, and beyond). If you've been out shopping for cluster
processors, it might appear on the surface that things have been pretty
stagnant. Two, three or even four years ago, you could get a quad-core processor, issuing four floating point instructions per cycle, running somewhere between 2 and 3GHz. If you looked recently, you could get a quad-core Nehalem processor, issuing four floating point instructions per cycle, running between 2 and 3GHz. So what happened to Moore's Law
performance doubling, you may ask. Well, it happened, but it's primarily
in memory bandwidth. If you look at peak performance, things look about
the same. Let's say two to three years ago you were looking at some Intel Xeon “Harpertown” processors, and let's assume for the sake of round numbers that they ran at exactly 2.5GHz. The peak performance of one of these chips would be 2.5GHz * (4 floating point instructions per cycle) * (4 cores per chip) = 40 GigaFLOPS or so. If you looked at the quad-core versions of the Nehalem at 2.5GHz, the math would be the same. But that's the clean theory world of peak performance.
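To make that arithmetic concrete, here is a minimal C sketch of the peak-performance calculation, using the same round numbers as the example above:

#include <stdio.h>

/* Peak performance = clock rate x floating point instructions per cycle x cores per chip.
 * Same round numbers as in the text: a 2.5GHz quad-core chip issuing
 * 4 floating point instructions per cycle. */
int main(void)
{
    double clock_ghz       = 2.5;
    double flops_per_cycle = 4.0;
    double cores_per_chip  = 4.0;

    double peak_gflops = clock_ghz * flops_per_cycle * cores_per_chip;
    printf("Peak: %.0f GigaFLOPS per chip\n", peak_gflops);   /* prints 40 */
    return 0;
}

Plug in the numbers for a 2.5GHz quad-core Nehalem and you get the same 40 GigaFLOPS, which is exactly the point: on paper, nothing moved.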
Let's look at some real performance instead. The figure below shows the performance of the Weather Research and Forecasting (WRF) model, version 3.1.1, an application typical of climate and weather models, running on 3.0GHz Intel Xeon processor E5450 “Harpertown” processors, compared to 2.66GHz Intel Xeon processor X5550 “Nehalem” processors in a 2-socket Dell blade configuration. On the
Harpertown processors, the performance flattens out with just four cores
on a node in use, stays constant up to eight cores and then scales
almost linearly when going from one to two nodes, as expected. In contrast, the Nehalem processor continues to increase in performance up to eight cores, with single-core performance 40% better than the Harpertown's. With all eight cores in use on a node, the Nehalem beats the Harpertown by almost 4 to 1! The reason for this is clear when
looking at the memory bandwidth available on a node -- see the second
figure below. In the case of the Harpertown, the memory bandwidth
flattens out to a maximum at just two cores, while for the Nehalem,
memory bandwidth continues increasing out to all eight cores.
[Figure: WRF V3.1.1 performance vs. core count, Harpertown and Nehalem nodes]
[Figure: Memory bandwidth per node vs. core count, Harpertown and Nehalem]
As you can see, on a real application when using all cores on a node,
the Nehalem outpaces the older architectures by better than 2:1. So,
your performance did double... just not if you judge by HPL numbers. That means a cluster with a peak performance of 100 TeraFLOPS today is a whole lot more productive than a peak 100 TeraFLOPS cluster of a couple of years ago. In fact, a modern 50TF cluster may be faster for many workloads
than a 100TF cluster of 2008; but the 100TF one will still rank higher
on the Top 500 list.
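If you want a rough read on this effect for your own nodes, a STREAM-style triad loop is a reasonable way to approximate sustained memory bandwidth. The sketch below is a simplified, illustrative stand-in for a real STREAM run (the array size, thread counts, and single unwarmed measurement are all simplifying assumptions), using OpenMP to scale from one to eight cores:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* A simplified STREAM-style triad: a[i] = b[i] + scalar * c[i].
 * Bandwidth grows with thread count only as long as the memory subsystem
 * can keep up -- the effect shown in the bandwidth figure above.
 * Array size and timing here are illustrative, not the official STREAM setup. */
#define N (20 * 1000 * 1000)

int main(void)
{
    double *a = malloc((size_t)N * sizeof *a);
    double *b = malloc((size_t)N * sizeof *b);
    double *c = malloc((size_t)N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    for (int threads = 1; threads <= 8; threads *= 2) {
        omp_set_num_threads(threads);
        double start = omp_get_wtime();

        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];   /* triad kernel */

        double seconds = omp_get_wtime() - start;
        /* The triad touches three arrays of N doubles: two reads and one write. */
        double gbytes = 3.0 * N * sizeof(double) / 1e9;
        printf("%d threads: %.1f GB/s\n", threads, gbytes / seconds);
    }

    free(a); free(b); free(c);
    return 0;
}

Compile it with OpenMP enabled (for example, gcc -O2 -fopenmp) and watch whether the reported bandwidth keeps climbing as you add threads, or flattens out early the way the Harpertown curve does.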
The boost in memory bandwidth that Intel
introduced with the Nehalem architecture has been a real game changer in
overall system performance, but it's really thrown a wrench in the way
we look at the HPL benchmark and things like peak performance. These
were never perfect measures, but they were the best we had. Our grain of
salt has gotten a whole lot bigger lately. And while the Nehalem architecture gave us a quantum leap in memory bandwidth, increasing core counts and clock speeds creeping back up -- well beyond 3GHz -- will make it difficult to maintain this fantastic bandwidth per core in future products. Further muddying the waters, we're seeing lots of
other products that will claim high peak numbers, including the introduction of GPU-based systems. All of these new products will first claim new heights in peak performance, and sometime later claim fantastic HPL performance. But keep in mind that the real cost-benefit analysis should focus on your particular application workload; no matter what architecture you have, speedy floating point units aren't useful if you can't keep them fed with data. So, keep an eye on benchmarks for your workloads, not just eye-popping peak numbers.
The good news, however, is that recent architectures deliver significantly more memory bandwidth to each socket, which has done a great deal to close the gap between peak performance and delivered performance, giving us huge productivity gains at the same clock rate. Just keep in mind this won't always show up on the Top 500.
Dan Stanzione, Ph.D.
Deputy Director, Texas Advanced Computing Center, The University of Texas at Austin
Dr.
Stanzione is the deputy director of the Texas Advanced Computing Center
(TACC) at The University of Texas at Austin. He is the principal investigator (PI) for several projects including “World Class Science through World Leadership in High Performance Computing;” “Digital Scanning and Archive of Apollo Metric, Panoramic, and Handheld Photography;” “CLUE: Cloud Computing vs. Supercomputing—A Systematic Evaluation for Health Informatics Applications;” and “GDBase: An Engine for Scalable Offline Debugging.”
In addition, Dr. Stanzione
serves as Co-PI for “The iPlant Collaborative: A
Cyberinfrastructure-Centered Community for a New Plant Biology,” an
ambitious endeavor to build a multidisciplinary community of scientists,
teachers and students who will develop cyberinfrastructure and apply
computational approaches to make significant advances in plant science.
He is also a Co-PI for TACC’s Ranger supercomputer, the first of the “Path to Petascale” systems supported by the National Science Foundation (NSF), deployed in February 2008.
Prior to joining
TACC, Dr. Stanzione was the founding director of the Fulton High
Performance Computing Institute (HPCI) at Arizona State University
(ASU). Before ASU, he served as an AAAS Science Policy Fellow in the Division of Graduate Education at NSF. Dr. Stanzione began his career at
Clemson University, his alma mater, where he directed the supercomputing
laboratory and served as an assistant research professor of electrical
and computer engineering.
Dr. Stanzione's research focuses on
such diverse topics as parallel programming, scientific computing,
Beowulf clusters, scheduling in computational grids, alternative
architectures for computational grids, reconfigurable/adaptive
computing, and algorithms for high performance bioinformatics. He is a strong advocate of engineering education, facilitates student research, and teaches specialized computation engineering courses.
Education: Ph.D., Computer Engineering, 2000; M.S., Computer Engineering, 1993; B.S., Electrical Engineering, 1991, Clemson University.
Tommy Minyard, Ph.D.
Director of Advanced Computing Systems, Texas Advanced Computing Center, The University of Texas at Austin
Dr. Minyard is the director of
the Advanced Computing Systems group at the Texas Advanced Computing
Center (TACC) at The University of Texas at Austin. His group is
responsible for operating and maintaining the center’s production
systems and infrastructure; ensuring world-class science through HPC
leadership; enhancing HPC research using clusters; providing fault tolerance for large-scale cluster environments; and conducting system
performance measurement and benchmarking.
Dr. Minyard holds a
doctorate in Aerospace Engineering from The University of Texas at
Austin where he specialized in developing parallel algorithms for
simulating high-speed turbulent flows with adaptive, unstructured
meshes. While completing his doctoral research in aerospace
engineering, Dr. Minyard worked at the NASA Ames Research Center and
the Institute for Computer Applications in Science and Engineering.
After continuing his research at UT Austin as a postdoctoral research assistant, he joined CD-Adapco as a software development specialist to
continue his career in computational fluid dynamics. Dr. Minyard
returned to UT Austin in 2003 to join the Texas Advanced Computing
Center.
Education: Ph.D., Aerospace Engineering, 1997; M.S., Aerospace Engineering, 1993; B.S., Aerospace Engineering, 1991, The University of Texas at Austin.