US Dept. of Energy Announces Frontier Supercomputer: Cray and AMD to Build 1.5 Exaflop Machine
by Ryan Smith on May 7, 2019 7:40 AM EST
The history of the computing industry is one of constant progress. Processors get faster, storage gets cheaper, and memory gets denser. We see the repercussions of this advancement through all aspects of society, and that extends to the top as well, where national governments continue to invest in bigger and better supercomputers. One part technological necessity and one part technological race, the exascale era of supercomputers is about to begin, as orders for the first exaFLOP-capable systems are now going out. It’s only fitting then that this morning the United States Department of Energy is announcing the contract for their fastest supercomputer yet, the Frontier system, which will be built by Cray and AMD.
Frontier is planned for delivery in 2021, and when it’s activated it will become the second and most powerful of the US DOE’s two planned 2021 exascale systems, with performance expected to reach 1.5 exaFLOPS. The ambitious system won’t come cheaply, however; with a price tag of over 500 million dollars for the system alone – and another 100 million dollars for R&D – Frontier is among the most expensive supercomputers ever ordered by the US Department of Energy.
The new supercomputer is being built as part of the US DOE’s CORAL-2 program for supercomputers, with Frontier scheduled to replace Oak Ridge National Laboratory’s current Summit supercomputer. Summit is the current reigning champion in the supercomputer world, with 200 petaFLOPS of performance, and accordingly the US DOE and Oak Ridge are aiming to significantly improve on its performance for the new computer. All told, Frontier should be able to deliver over 7x the performance of Summit, and is expected to be the fastest supercomputer in the world once it’s activated.
Like Summit (and Titan before it), Frontier is an open science system, meaning that it’s available to academic researchers to run simulations and experiments on. Accordingly, the lab is expecting the supercomputer to be used for a wide range of projects across numerous disciplines, including not only traditional modeling and simulation tasks, but also more data-driven techniques for artificial intelligence and data analytics. In fact the latter is a bit of new ground for the lab and the system’s eventual users; just as we’ve seen in the enterprise space over the past few years, neural network-based AI is becoming an increasingly popular technique to solve problems and extract analysis from large datasets, and now researchers are looking at how to refine those techniques from the current-generation systems and apply them to exascale-level projects.
Table: US Department of Energy Supercomputers
Frontier: Powered by Cray & AMD
Officially, the prime contractor for Frontier will be Cray. But looking at the specifications, you could be excused for thinking it was AMD. Cray for its part is partnering with the chipmaker for the system, and as a result AMD is providing most of the core hardware for the new supercomputer. Frontier is designed as a next-generation CPU + accelerator system, with a mix of CPUs and GPUs doing the heavy compute work, and AMD will be supplying both the CPUs and GPUs. And as the principal processor provider, AMD will also be taking on a lot of the responsibility for developing the software stack, with the company working with Cray to develop an enhanced version of their ROCm environment to best extract performance from the massive cluster of CPUs and GPUs.
On the CPU side of matters, AMD will be supplying a customized next-generation EPYC CPU. AMD has confirmed that it’s going to be using a future generation of their Zen CPU cores, and given the timing of the project, we’re almost certainly looking at a Zen 3 or Zen 4 design here. Just how custom AMD’s CPU is remains to be seen, but their announcement has revealed that Frontier’s CPUs will include new instructions for the optimization of AI and supercomputing workloads.
Meanwhile on the GPU side of matters, AMD and Cray are holding their cards a little closer. Rather than naming any architecture or architectural generation, AMD is only saying that the GPUs are “based on the Radeon Instinct family” and have “yet to be announced.” AMD’s current public roadmap goes out to “Next Gen” in 2020, and with GPU development cycles averaging 2 years, this may be the architecture we see. But with the particular needs for a supercomputer, AMD may have something slightly more bespoke.
What the company is confirming for now is that they aren’t holding back on features. The HPC-focused GPU is being designed with Frontier in mind and will incorporate mixed precision compute support. Feeding the beast will be HBM memory, and AMD will be tapping a version of Infinity Fabric to connect the CPUs and GPUs.
In fact while AMD has kept the details on the technology light, it sounds like this version of IF will be the most advanced version yet. AMD is specifically noting that it’s an “incredibly” coherent fabric, calling it the first fully optimized CPU + GPU design for supercomputing. AMD’s GPUs and CPUs will be arranged in a 4-to-1 ratio, with 4 GPUs for each EPYC CPU. It’s worth noting that AMD’s slide shows a mesh with every GPU connected to the CPU and two other GPUs, but I’m not reading too much into this quite yet, as AMD hasn’t disclosed any other details on the IF setup.
With AMD going up to the blade level, tying together all of these nodes will be Cray’s job. For Frontier the supercomputer vendor is launching their new Slingshot interconnect, an equally ambitious interconnect that will support adaptive routing, congestion management, and quality-of-service features. Slingshot is capable of 200Gb/sec per port, with individual blades incorporating a port for each GPU in the blade so that other nodes can directly read and write data to a GPU’s memory. As a result Frontier will have a significant amount of interconnect bandwidth, which is all but necessary in order to allow the system to scale to exaFLOP levels.
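For a rough sense of scale, the per-port figure can be turned into aggregate bandwidth for one CPU and its four GPUs. The sketch below assumes, hypothetically, that such a group has exactly four Slingshot ports (one per GPU) and ignores protocol overhead; the function name and the grouping are illustrative, not anything Cray or AMD has disclosed.

```python
# Figures from the article; the "one CPU + four GPUs, four ports" grouping
# is an assumption made for illustration only.
PORT_GBITS_PER_SEC = 200   # Slingshot rate per port
GPUS_PER_CPU = 4           # one port per GPU


def group_bandwidth_gbytes(ports: int, gbits_per_port: float = PORT_GBITS_PER_SEC) -> float:
    """Aggregate raw bandwidth in gigabytes per second for `ports` ports."""
    return ports * gbits_per_port / 8  # 8 bits per byte


if __name__ == "__main__":
    total = group_bandwidth_gbytes(GPUS_PER_CPU)
    print(f"{GPUS_PER_CPU} ports x {PORT_GBITS_PER_SEC} Gb/sec = {total:.0f} GB/sec")
```

Under those assumptions, each four-GPU group would have roughly 100GB/sec of raw network bandwidth available to it.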
Overall, Frontier will be organized into over 100 Cray Shasta cabinets. And while Cray has not announced a specific power consumption figure for Frontier, with each cabinet rated for 300kW, this would put the complete system at over 30MW. To put that in context, this is over twice the power consumption of the 13MW Summit. So while Frontier is a significantly faster system than the supercomputer it replaces, Cray, AMD, and the US DOE are all feeling the pinch of Dennard scaling slowing down, as power efficiency gains get harder to achieve. All told, in a passing comment made in the press briefing, it sounds like Oak Ridge will be installing a total of 40MW of capacity for Frontier, which is a significant amount of power to say the least.
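The power math here is simple enough to check by hand, but a short sketch makes the efficiency comparison explicit. It uses only figures quoted in this article (1.5 exaFLOPS and roughly 30MW for Frontier, 200 petaFLOPS and 13MW for Summit); the cabinet count is a lower bound and the power numbers are ratings rather than measurements, so treat the results as estimates. The function names are illustrative.

```python
# Back-of-the-envelope figures taken from the article. "Over 100" cabinets is
# treated as exactly 100, so the Frontier power figure is a lower bound.
CABINETS = 100
KW_PER_CABINET = 300
FRONTIER_PETAFLOPS = 1500   # 1.5 exaFLOPS
SUMMIT_PETAFLOPS = 200
SUMMIT_MW = 13


def system_power_mw(cabinets: int, kw_per_cabinet: float) -> float:
    """Total rated power in megawatts."""
    return cabinets * kw_per_cabinet / 1000


def gflops_per_watt(petaflops: float, megawatts: float) -> float:
    """Efficiency in GFLOPS per watt.

    1 PFLOPS = 1e6 GFLOPS and 1 MW = 1e6 W, so the scale factors cancel.
    """
    return petaflops / megawatts


if __name__ == "__main__":
    frontier_mw = system_power_mw(CABINETS, KW_PER_CABINET)
    frontier_eff = gflops_per_watt(FRONTIER_PETAFLOPS, frontier_mw)
    summit_eff = gflops_per_watt(SUMMIT_PETAFLOPS, SUMMIT_MW)
    print(f"Frontier power:   {frontier_mw:.0f} MW")
    print(f"Performance gain: {FRONTIER_PETAFLOPS / SUMMIT_PETAFLOPS:.1f}x")
    print(f"Power increase:   {frontier_mw / SUMMIT_MW:.1f}x")
    print(f"Efficiency:       {frontier_eff:.1f} vs {summit_eff:.1f} GFLOPS/W")
```

On these numbers Frontier would deliver about 7.5x the performance of Summit for about 2.3x the power, or roughly 50 GFLOPS per watt against Summit's 15. Any cabinets beyond 100, or a draw closer to the 40MW of installed capacity, would pull the efficiency figure down.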
Along with furthering the US’s own supercomputing leadership goals, securing the Frontier contract also represents big wins for Cray and AMD. Cray is now involved in both 2021 exascale systems, reinforcing their own place in the supercomputing world. Meanwhile AMD, which is spending this current generation on the outside looking in, has now secured a major and prestigious win for both its CPU and GPU divisions.
In fact it’s interesting to note that of the two 2021 exascale systems being ordered, both are coming from full-service processor vendors that supply both CPUs and GPUs. Current-generation systems like Summit use mixed vendors – e.g. IBM + NVIDIA – so the move to integrated vendors is a big shift for these CPU + accelerator systems. Clearly there are technological and procurement benefits to using a single vendor for all of the processors, which works in favor of both AMD and Intel. It’s also worth noting that the CORAL-2 program requires the DOE to buy systems based on two different architectures, so if the future is integrated systems, then AMD and Intel are the logical choices.
At any rate, with the contract placed for Frontier, the job is only half-done. AMD and Cray will need to continue developing their hardware and software for the system, not to mention locking down the final specifications for the finished supercomputer. So expect to continue to hear news about Frontier trickle out over the next couple of years, leading up to its installation in 2021.