The drive to putting Arm into the server space has had its ups and downs. We’ve seen the likes of AppliedMicro/Ampere, Calxeda, Broadcom/Cavium/Marvell, Qualcomm, Huawei, Fujitsu, Annapurna/Amazon, and even AMD, deal with Arm-based silicon in the server market. Some of these designs have successful, others not so much, but Arm is pushing its new Neoverse N1 roadmap of cores into this space, aiming for high performance and for scale. We’ve already seen Amazon come into the market with its N1-based Graviton2 for its cloud services, but there’s going to be a counter product for every other cloud provider, with the new N1-based Next-Gen Ampere CPU, codenamed QuickSilver. We have some details ahead of the official release announcement in Q1 2020.

Ampere, technically Ampere Computing, is currently in the market with its eMAG processors. Using custom Arm cores derived from its acquisition of AppliedMicro, it has achieved mild success in second tier cloud providers (such as Packet) as well as Android in the cloud-type services for a number of Chinese smartphone game providers. There’s also some success with telecommunications at the edge, but ultimately nothing with extreme demands. This is where the Next-Gen Ampere product comes in.


Updated Roadmap for Jan 2020

 

New Product, No (Marketing) Name Yet

The new product doesn’t yet have a marketing name – we asked and they preferred it to be called ‘Next-Gen Ampere’ for now, or to use its SoC codename ‘QuickSilver’. What we were told is that the new product is a brand new ground-up design from Ampere, separate from the AppliedMicro IP acquisition. It plans to compete in the same space that Amazon’s Graviton2 currently sits at AWS, but as the main alternative to the other cloud providers that won’t have access to Graviton2.

What we are getting today is some rough details of the new chip. Other features, such as exact SKUs to be launched, exact TDPs, exact frequencies, and pricing, are going to be disclosed at the official release announcement in 2020. Nonetheless, Ampere has exposed a lot of details.

Next-Gen Ampere will be a monolithic chip built on TSMC’s 7nm process and featuring 80 cores. These cores are not custom like eMAG, but are built on Arm’s Neoverse N1 design, using paired clusters of cores connected by an Arm mesh IP (CMN-600). This is the same core as Graviton2, and as expected Ampere is keen to promote that their design is optimized for power, performance, latency, and throughput, as well as offering more cores and more of other things as well.

N1 cores are single threaded (not multi-threaded), and the QuickSilver SoC is built with containers in mind: chip will support container level QoS commands such that shared resources are not taken by one particular customer, and additional RAS features will be in play to ensure consistent performance. Ampere actually hammered on this point a fair bit, stating that QuickSilver is built to ensure consistency in a multi-tenant environment to the point where deterministic performance is required, which suggests a container based frequency and cache control. Ampere says that the turbo frequency of its new chip will be consistent in order to enable this.

Aside from 80 cores, QuickSilver will also have over 128 PCIe 4.0 lanes. Ampere wouldn’t state at this point exactly how many (details to come in Q1), but did state that it would have more on the market than any other server chip, x86 or Arm, will have in 2020. The current leader in this race is AMD, whose Rome EPYC processors have 128, and we confirmed that Quicksilver will have more than 128. This suddenly got very interesting, as this opens up a lot of possibilities for connectivity in specific markets, such as accelerators and storage. Ampere is a key partner in NVIDIA’s CUDA-on-Arm strategy in 2020, and so expect to see cloud instances using Quicksilver with access to lots of NVIDIA GPUs.

Also on the new chip, there will be eight DDR4 memory channels, although an exact frequency support was not stated. Ampere said it would be more than the current eMAG support, which is DDR4-2666, and that the IP they are using for the DDR4 memory controller allows them to push up to the limits of the JEDEC DDR4 specification, so realistically here we are looking at DDR4-3200 memory out of the gate. It will be interesting if this includes 2 DIMMs per channel support as well.

QuickSilver will support up to dual socket configurations, and Ampere states that it will use the CCIX protocol over PCIe 4.0 to enable socket-to-socket communications. Beyond this, the new chips will also support off-chip CCIX, for coherent accelerator attachments, or to add coherent storage class memory tiers. Ampere stated that their desire to use standardized interconnect protocols is helping drive their time-to-market, and with use of tried-and-tested IP blocks (like the DDR4 memory controller), that helps as well, along with taking advantage of the software and infrastructure that already circulates these standards. We were also told that all the work done to improve the performance on eMAG was transferable to QuickSilver, and that going forward Ampere’s strategy of building on the previous will be a key element for its customer base.

At this point in time Ampere did not see the need to add-in any specific acceleration units to the design – aside from the limited work they could do on the core, they decided that there was no need for specific cryptography or AI acceleration engines at this time, and they will wait until such compute blocks become standardized. The amount of connectivity and coherency, according to Ampere, will assist any customer needing additional acceleration.

On performance, Ampere is targeting a big leap over eMAG. The N1 cores from Arm are A76-like in their performance, and offer some configurability, such as a 512 KB to 1 MB L2 option. Ampere didn’t state exactly what size L2 they are using, but did state that they optimized for performance over die area in order to reduce memory/cache misses, which would imply that the full 1 MB L2 configuration is in play here. Ampere currently has silicon in house and early silicon sampling with key customers, and will provide a more rounded vision of performance with its full launch announcement in 2020.

We were told that the TDP of Next-Gen Ampere will range from 200W+ for the full 80-core model for servers, down to 45W with silicon that will offer an aggressive core count and performance per watt for particular use cases at the edge, for fanless designs. We were able to confirm that Ampere is building a monolithic die with 80 cores, so the halo chip will be a fully enabled part, with other chips being cut down versions of this. Ampere wouldn’t commit to any smaller silicon designs at this stage, stating that their customers are mostly requesting high-core count and high-performance parts.

Ampere’s Jeff Wittich, SVP of Products and ex-Intel, did state that the company is very thorough in its silicon design work, stating that he’s never seen so much pre-silicon design work for a product before. He cites an extensive amount of emulation for QuickSilver, especially when it comes to features of the SoC, and the company also does a lot of test chip designs as well to make sure what comes out at the right time ends up being what the company needs in a product with as few stepping adjustments as possible. I was surprised to hear this, given the relative size of Ampere and Wittich’s history, but given the recent praise on Graviton2 with its launch, QuickSilver does seem best poised to offer a direct competitor with higher-performance to other cloud providers in 2020.

Arm-based Server CPU Offerings
AnandTech Ampere

QuickSilver
Amazon

Graviton2
Marvell

ThunderX2
Huawei

Kunpeng 920
Ampere

eMAG
Launched Q1 2020 Q4 2019 Q2 2018 Q3 2019 Q3 2018
Arm arch

µarch
v8.2

Neoverse N1
v8.2

Neoverse N1
v8.1

Vulkan
v8.4

TSV110
v8.0

Skylark
Cores 80 64 32 64 32
Node TSMC
7nm
TSMC
7nm
TSMC
16nm
TSMC
7nm
TSMC
16nm
Freq ? 3.1 GHz 2.5 GHz 2.6 GHz 3.3 GHz
TDP 200W+ 100W ? ? 180 W 125 W
Memory 8x
DDR4-3200?
8x
DDR4-3200
8x
DDR4-2666
8x
DDR4-2933
8x
DDR4-2666
PCIe >128 4.0 ? 56x 3.0 40x 4.0 32x 3.0
CCIX Yes ? - Yes -
Multi
Socket
2 ? 2 4 1

Ampere’s Future and Roadmap

We were able to also ask about Ampere’s future in our briefing. It’s no secret that questions are being asked as to Ampere’s future, having done two rounds of funding but not making a serious dent in the Arm server space while also being suspiciously quiet. We were told that ultimately Ampere has had its head down recently driving the new Next-Gen Ampere product, hence the relative quietness, even when workstation products came to market but not a peep was heard from the company.

As part of our discussions, Ampere did state that it is very secure in its funding. We were told that Ampere is working on an annual cadence with its processor portfolio, with products coming out in 2020, 2021, and 2022. High-volume manufacturing with Quicksilver will start at the end of Q1/Q2 2020, and the company actually has the next two generations of hardware completely funded. This allows the company to be candid with old and new customers as to where the product is going, but also allows them to work with customers to examine future workloads and determine the requirements for future products.

Ampere's Roadmap
Ampere
eMag
Ampere
Next-Gen
NG+1 NG+2
Shipped Sampling In Development Defined
16nm 7nm 7nm 5nm
Skylark QuickSilver QS+1 QS+2
ARM v8.0 ARM v8.2    
Up to 32 Cores Up to 80 Cores    
8 x DDR4-2667 8 x DDR4-3200?    
42x PCIe 3.0 >128 PCIe 4.0    
75 W to 125 W 45 W to 200+ W    
3.3 GHz Turbo ? Frequency    
  More IO    
  CCIX Attach    
  Dual Socket    
  Improved IPC    

As to future products, Ampere did state that as they are now using LGA socketed processors for QuickSilver, then the next generation after this (which I’ll call QuickSilver+1) will be socket compatible and offer drop in support. Ampere sees this as a benefit for a quicker time-to-market, which is true, but also means that we should expect parity with memory and PCIe support for two generations, which is often welcomed in the server space. The generation after this, QuickSilver+2, is funded and is already in the definition phase. Beyond this Ampere is doing research and pathfinding, but based on the need to be Agile in an aggressive market space, Ampere doesn’t feel the need to start defining specifications in stone for products 3+ years away just yet.

From an AnandTech point of view, we’re glad that Ampere reached out to us to talk about this so far in advance of the official launch in Q1: it looks like only a couple of other media were offered briefings. We’ve seen workstation versions of eMAG hit the market, so we’re hoping to be sampled one of those ahead of the launch of the Next-Gen Product, if only to act as a reference point for performance claims. Then hopefully we can put it against the other competition in the market. Arm servers just got interesting.

The carousel image for this article is the current eMAG product as part of the Packet bare-metal cloud offering

Related Reading

POST A COMMENT

56 Comments

View All Comments

  • milli - Tuesday, December 24, 2019 - link

    As I said, I'll believe it when I see it. At the moment, it's all talk and no proof.
    N1 will be closely related to A76, which definitely has lower IPC than Zen 2. Even A77 has lower IPC. So I don't get where you get your 'facts'.
    Reply
  • Wilco1 - Wednesday, December 25, 2019 - link

    Cortex-A77 most definitely has higher IPC than Skylake and Zen 2, on SPECINT as well as SPECFP: https://images.anandtech.com/doci/15207/SPEC2006_S...

    That's despite being a small phone SoC. Servers have much larger caches and memory system so Neoverse N1 performs much better: https://www.anandtech.com/show/13959/arm-announces...
    Reply
  • milli - Wednesday, December 25, 2019 - link

    Don't tell me that you normalize that SPEC score? If CPU architectures have thought you anything, you would have known that performance doesn't scale linearly with clock frequency.
    Not only does it not scale linearly, an architecture designed to run sub-3GHz will often perform better normalized to an architecture designed to run over 4GHz (11-stage vs 19-stage). ARM cores just can't run at such high clocks.
    Reply
  • Wilco1 - Thursday, December 26, 2019 - link

    You do understand what IPC means, right? It normalises on clock frequency since that is its definition! If you understand microarchitecture you would know why IPC doesn't vary much with frequency.

    We're talking about servers here - high core count parts have base frequencies near 2GHz. Using a 5GHz core into a server is a bad idea, it's much better to design a server core for 3GHz operation.

    The question is not whether QuickSilver beats Rome, but whether it beats next-gen Milan too.
    Reply
  • rahvin - Friday, December 27, 2019 - link

    It's not going to beat Rome. Like every past ARM server iteration it's going to be a profound disappointment. Maybe good as an nginx dispatcher but worthless for anything involving computation. That's my prediction based on the history of ARM server cores. Reply
  • Wilco1 - Saturday, December 28, 2019 - link

    It surely is going to beat Rome. As for serious computation, you don't even know that there are now several Arm servers in the TOP500 list (as well as #1 in the GREEN500 list)?

    You're clearly completely ignorant about Arm servers...
    Reply
  • deltaFx2 - Saturday, December 28, 2019 - link

    @Wilco1: "It normalises on clock frequency since that is its definition" True but incomplete. Memory latency does not scale with frequency. 70ns is 70ns @1GHz or 5GHz. In the case of Zen's architecture, the fabric also does not get faster with core frequency. So, running Zen with a faster/overclocked memory is also incorrect. If you really want to do an honest IPC comparison, you should lock the core frequency AND memory latency/timings and then do your comparison. Spec score/Ghz is a ratio of workload runtime per unit frequency not IPC. Reply
  • Wilco1 - Sunday, December 29, 2019 - link

    We're running CPUs at 5GHz while DRAM barely runs at 10MHz (100ns). So how is that even possible? By making the CPU completely independent of this slow memory by adding huge caches, 3-4 levels of caches, automatic prefetching, out of order execution, and supporting ~50 outstanding cache misses.

    As a result performance almost perfectly scales with frequency on most applications, particularly SPEC which is very compute intensive (and which fits in caches when run single-threaded). So IPC doesn't vary with frequency.
    Reply
  • deltaFx2 - Sunday, December 29, 2019 - link

    @ Wilco1: "By making the CPU completely independent of this slow memory " Except it is not independent, is it? Prefetching is not magic. It doesn't get every pattern: pointer chases cannot be prefetched, data dependent accesses like binary tree traversals the same (also a form of pointer chase), other forms of irregular data accesses. Also, the prefetch distance @5GHz has to be much higher than at 2.5GHz to cover for the extra latency in terms of cycles. LLC miss rates even on spec are low double digit MPKI. Not all applications have MLP to exploit in the instruction windows that are currently implemented. So no, caches work, but are not perfect. If they were, you would be right, but they are not and will never be. If we had perfect scaling, speed demons would be the obvious implementation choice and we would have 10GHz CPUs, maybe even higher. Memory latency was and is the stumbling block. I suggest trying it out before making such statements because you clearly are mistaken. Reply
  • Ian Cutress - Tuesday, December 24, 2019 - link

    Epyc is x86. This is Arm. Reply

Log in

Don't have an account? Sign up now