
By Dan Stanzione and Tommy Minyard, Texas
Advanced Computing Center
For a long time now,
we've known that sustained performance on clusters is a complicated
thing to measure, especially when talking about parallel jobs. In
columns like this one, and others too numerous to count, it has been
stressed that sustained performance for real applications is as much
about balance (processor, interconnect, filesystem, memory) as it is
about clock frequency. But despite that, the speed of the processor has
still been the driving force for many when making decisions about
clusters. Look no further than the
Top 500
list to see that this is true. The Top 500 uses the
High Performance Linpack (HPL) benchmark,
which, while somewhat sensitive to interconnect performance, generally
delivers a pretty high fraction of the peak performance of the processor
(in the last few years, that means 60-70% on poorly balanced clusters,
and 75-85% on well balanced ones... you can see this yourself by working
out the ratio of Rmax to Rpeak on the reported numbers in the list,
particularly for the older systems).
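That ratio check is nothing more than a division; here is a minimal C sketch, using made-up Rmax and Rpeak values rather than figures from any actual list entry:

#include <stdio.h>

/* HPL efficiency is the ratio of measured (Rmax) to theoretical (Rpeak)
 * performance. The values below are made-up illustrations, not entries
 * from any actual Top 500 list. */
int main(void)
{
    double rpeak_tflops = 100.0;   /* theoretical peak of the machine */
    double rmax_tflops  = 78.0;    /* measured HPL result */

    printf("HPL efficiency: %.1f%%\n", 100.0 * rmax_tflops / rpeak_tflops);
    return 0;
}

On a well balanced system you would expect that percentage to land toward the top of the ranges quoted above.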
Given that HPL delivers a
pretty good fraction of peak performance on most processors, not
surprisingly, higher clock rate has meant higher HPL, higher Top 500
number, and the impression that your new cluster is "faster." The big
gotcha here is the not-so-well-kept secret in the HPC community that *peak* performance and *real* application performance don't really have that much to do with one another, and that HPL performance doesn't reflect the ability of a cluster to get work done. This has become especially true with the last generation of new quad-core processors.
The fact of the matter is, while processors have been following the
Moore's Law curve, most of our real applications have been increasingly
starved for memory bandwidth (i.e. the ability to get data from main
memory into those increasingly fast processors). HPL doesn't really
suffer too much from inadequate memory bandwidth, so the magnitude of
the problem hasn't been quite as obvious.
Intel has been well aware of this, however, and has taken a quantum leap forward in memory bandwidth with the Intel Xeon processor 5500 sequence, the “Nehalem” series of processors (and continued into the current Intel Xeon processor 5600 sequence, the “Westmere” processors, and beyond). If you've been out shopping for cluster
processors, it might appear on the surface that things have been pretty
stagnant. Two, three or even four years ago, you could get a quad-core processor, issuing four floating point instructions per cycle, running somewhere between 2 and 3GHz. If you looked recently, you could get a quad-core Nehalem processor, issuing four floating point instructions per cycle, running between 2 and 3GHz. So what happened to Moore's Law
performance doubling, you may ask. Well, it happened, but it's primarily
in memory bandwidth. If you look at peak performance, things look about
the same. Let's say two to three years ago you were looking at some Intel Xeon “Harpertown” processors, and let's assume for the sake of round numbers that they ran at exactly 2.5GHz. The peak performance of one of these chips would be 2.5GHz * (4 floating point instructions per cycle) * (4 cores per chip) = 40 GigaFLOPS or so. If you looked at the quad-core versions of the Nehalem at 2.5GHz, the math would be the same. But that's the clean theory world of peak performance.
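To make that arithmetic concrete, here is a minimal C sketch of the peak-performance calculation, using the same round numbers as the example above:

#include <stdio.h>

/* Peak performance = clock rate x floating point instructions per cycle x cores per chip.
 * Same round numbers as in the text: a 2.5GHz quad-core chip issuing
 * 4 floating point instructions per cycle. */
int main(void)
{
    double clock_ghz       = 2.5;
    double flops_per_cycle = 4.0;
    double cores_per_chip  = 4.0;

    double peak_gflops = clock_ghz * flops_per_cycle * cores_per_chip;
    printf("Peak: %.0f GigaFLOPS per chip\n", peak_gflops);   /* prints 40 */
    return 0;
}

Plug in the numbers for a 2.5GHz quad-core Nehalem and you get the same 40 GigaFLOPS, which is exactly the point: on paper, nothing moved.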
Let's look at some real performance instead. The figure below shows the performance of the Weather Research and Forecasting (WRF) model, version 3.1.1, an application typical of climate and weather models, running on 3.0GHz Intel Xeon processor E5450 “Harpertown” processors, compared to 2.66GHz Intel Xeon processor X5550 “Nehalem” processors in a 2-socket Dell blade configuration. On the
Harpertown processors, the performance flattens out with just four cores
on a node in use, stays constant up to eight cores and then scales
almost linearly when going from one to two nodes, as expected. In contrast, the Nehalem processor continues to increase in performance up to eight cores, with single-core performance 40% better than the Harpertown's. With all eight cores in use on a node, the Nehalem beats the Harpertown by almost 4 to 1! The reason for this is clear when
looking at the memory bandwidth available on a node -- see the second
figure below. In the case of the Harpertown, the memory bandwidth
flattens out to a maximum at just two cores, while for the Nehalem,
memory bandwidth continues increasing out to all eight cores.
[Figure: WRF V3.1.1 performance vs. core count, Harpertown and Nehalem nodes]
[Figure: Memory bandwidth per node vs. core count, Harpertown and Nehalem]
As you can see, on a real application when using all cores on a node,
the Nehalem outpaces the older architectures by better than 2:1. So,
your performance did double... just not if you judge by HPL numbers. That means a cluster with a peak performance of 100 TeraFLOPS today is a whole lot more productive than a peak 100 TeraFLOPS cluster of a couple of years ago. In fact, a modern 50TF cluster may be faster for many workloads
than a 100TF cluster of 2008; but the 100TF one will still rank higher
on the Top 500 list.
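If you want a rough read on this effect for your own nodes, a STREAM-style triad loop is a reasonable way to approximate sustained memory bandwidth. The sketch below is a simplified, illustrative stand-in for a real STREAM run (the array size, thread counts, and single unwarmed measurement are all simplifying assumptions), using OpenMP to scale from one to eight cores:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* A simplified STREAM-style triad: a[i] = b[i] + scalar * c[i].
 * Bandwidth grows with thread count only as long as the memory subsystem
 * can keep up -- the effect shown in the bandwidth figure above.
 * Array size and timing here are illustrative, not the official STREAM setup. */
#define N (20 * 1000 * 1000)

int main(void)
{
    double *a = malloc((size_t)N * sizeof *a);
    double *b = malloc((size_t)N * sizeof *b);
    double *c = malloc((size_t)N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    for (int threads = 1; threads <= 8; threads *= 2) {
        omp_set_num_threads(threads);
        double start = omp_get_wtime();

        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];   /* triad kernel */

        double seconds = omp_get_wtime() - start;
        /* The triad touches three arrays of N doubles: two reads and one write. */
        double gbytes = 3.0 * N * sizeof(double) / 1e9;
        printf("%d threads: %.1f GB/s\n", threads, gbytes / seconds);
    }

    free(a); free(b); free(c);
    return 0;
}

Compile it with OpenMP enabled (for example, gcc -O2 -fopenmp) and watch whether the reported bandwidth keeps climbing as you add threads, or flattens out early the way the Harpertown curve does.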
The boost in memory bandwidth that Intel
introduced with the Nehalem architecture has been a real game changer in
overall system performance, but it's really thrown a wrench in the way
we look at the HPL benchmark and things like peak performance. These
were never perfect measures, but they were the best we had. Our grain of
salt has gotten a whole lot bigger lately. And while the Nehalem architecture gave us a quantum leap in memory bandwidth, increasing core counts and clock speeds creeping back up -- well beyond 3GHz -- will make it difficult to maintain this fantastic bandwidth per core in future products. Further muddying the waters, we're seeing lots of
other products that will claim high peak numbers, including the introduction of GPU-based systems. All of these new products will first claim new heights in peak performance, and sometime later claim fantastic HPL performance. But keep in mind that the real cost-benefit analysis should focus on your particular application workload; no matter what architecture you have, speedy floating point units aren't useful if you can't keep them fed with data. So, keep an eye on benchmarks for your workloads, not just eye-popping peak numbers.
The good news, however, is that recent architectures deliver significantly more memory bandwidth to each socket, which has done a great deal to close the gap between peak performance and delivered performance, giving us huge productivity gains at the same clock rate. Just keep in mind this won't always show up on the Top 500.
Dan Stanzione, Ph.D.
Deputy Director, Texas Advanced Computing Center, The University of Texas at Austin
Dr.
Stanzione is the deputy director of the Texas Advanced Computing Center
(TACC) at The University of Texas at Austin. He is the principal investigator (PI) for several projects including “World Class Science through World Leadership in High Performance Computing;” “Digital Scanning and Archive of Apollo Metric, Panoramic, and Handheld Photography;” “CLUE: Cloud Computing vs. Supercomputing—A Systematic Evaluation for Health Informatics Applications;” and “GDBase: An Engine for Scalable Offline Debugging.”
In addition, Dr. Stanzione
serves as Co-PI for “The iPlant Collaborative: A
Cyberinfrastructure-Centered Community for a New Plant Biology,” an
ambitious endeavor to build a multidisciplinary community of scientists,
teachers and students who will develop cyberinfrastructure and apply
computational approaches to make significant advances in plant science.
He is also a Co-PI for TACC’s Ranger supercomputer, the first of the “Path to Petascale” systems supported by the National Science Foundation (NSF), deployed in February 2008.
Prior to joining
TACC, Dr. Stanzione was the founding director of the Fulton High
Performance Computing Institute (HPCI) at Arizona State University
(ASU). Before ASU, he served as an AAAS Science Policy Fellow in the Division of Graduate Education at NSF. Dr. Stanzione began his career at
Clemson University, his alma mater, where he directed the supercomputing
laboratory and served as an assistant research professor of electrical
and computer engineering.
Dr. Stanzione's research focuses on
such diverse topics as parallel programming, scientific computing,
Beowulf clusters, scheduling in computational grids, alternative
architectures for computational grids, reconfigurable/adaptive
computing, and algorithms for high performance bioinformatics. He is a strong advocate of engineering education, facilitates student research, and teaches specialized computation engineering courses.
Education: Ph.D., Computer Engineering, 2000; M.S., Computer Engineering, 1993; B.S., Electrical Engineering, 1991, Clemson University.
Tommy Minyard, Ph.D.
Director of Advanced Computing Systems, Texas Advanced Computing Center, The University of Texas at Austin
Dr. Minyard is the director of
the Advanced Computing Systems group at the Texas Advanced Computing
Center (TACC) at The University of Texas at Austin. His group is
responsible for operating and maintaining the center’s production
systems and infrastructure; ensuring world-class science through HPC
leadership; enhancing HPC research using clusters; providing fault tolerance for large-scale cluster environments; and conducting system
performance measurement and benchmarking.
Dr. Minyard holds a
doctorate in Aerospace Engineering from The University of Texas at
Austin where he specialized in developing parallel algorithms for
simulating high-speed turbulent flows with adaptive, unstructured
meshes. While completing his doctoral research in aerospace
engineering, Dr. Minyard worked at the NASA Ames Research Center and
the Institute for Computer Applications in Science and Engineering.
After continuing his research at UT Austin as a postdoctoral research assistant, he joined CD-Adapco as a software development specialist to
continue his career in computational fluid dynamics. Dr. Minyard
returned to UT Austin in 2003 to join the Texas Advanced Computing
Center.
Education: Ph.D., Aerospace Engineering, 1997; M.S., Aerospace Engineering, 1993; B.S., Aerospace Engineering, 1991, The University of Texas at Austin.