The Snapdragon 865 Performance Preview: Setting the Stage for Flagship Android 2020

Name: The Snapdragon 865 Performance Preview: Setting the Stage for Flagship Android 2020
Item: The Snapdragon 865 Performance Preview: Setting the Stage for Flagship Android 2020
Author: Andrei Frumusanu

by Andrei Frumusanu on December 16, 2019 7:30 AM EST

178 Comments | Add A Comment

178 Comments

Earlier this month we had the pleasure to attend Qualcomm’s Maui launch event of the new Snapdragon 865 and 765 mobile platforms. The new chipsets promise to bring a lot of new upgrades in terms of performance and features, and undoubtedly will be the silicon upon which the vast majority of 2020 flagship devices will base their designs on. We’ve covered the new improvements and changes of the new chipset in our dedicated launch article, so be sure to read that piece if you’re not yet familiar with the Snapdragon 865.

As has seemingly become a tradition with Qualcomm, following the launch event we’ve been given the opportunity to have some hands-on time with the company’s reference devices, and had the chance to run the phones through our benchmark suite. The QRD865 is a reference phone made by Qualcomm and integrates the new flagship chip. The device offers insight into what we should be expecting from commercial devices in 2020, and today’s piece particularly focuses on the performance improvements of the new generation.

A quick recap of the Snapdragon 865 if you haven’t read the more thorough examination of the changes:

Qualcomm Snapdragon Flagship SoCs 2019-2020
SoC	Snapdragon 865	Snapdragon 855
CPU	1x Cortex A77 @ 2.84GHz 1x512KB pL2 3x Cortex A77 @ 2.42GHz 3x256KB pL2 4x Cortex A55 @ 1.80GHz 4x128KB pL2 4MB sL3 @ ?MHz	1x Kryo 485 Gold (A76 derivative) @ 2.84GHz 1x512KB pL2 3x Kryo 485 Gold (A76 derivative) @ 2.42GHz 3x256KB pL2 4x Kryo 485 Silver (A55 derivative) @ 1.80GHz 4x128KB pL2 2MB sL3 @ 1612MHz
GPU	Adreno 650 @ 587 MHz +25% perf +50% ALUs +50% pixel/clock +0% texels/clock	Adreno 640 @ 585 MHz
DSP / NPU	Hexagon 698 15 TOPS AI (Total CPU+GPU+HVX+Tensor)	Hexagon 690 7 TOPS AI (Total CPU+GPU+HVX+Tensor)
Memory Controller	4x 16-bit CH @ 2133MHz LPDDR4X / 33.4GB/s or @ 2750MHz LPDDR5 / 44.0GB/s 3MB system level cache	4x 16-bit CH @ 1866MHz LPDDR4X 29.9GB/s 3MB system level cache
ISP/Camera	Dual 14-bit Spectra 480 ISP 1x 200MP 64MP ZSL or 2x 25MP ZSL 4K video & 64MP burst capture	Dual 14-bit Spectra 380 ISP 1x 192MP 1x 48MP ZSL or 2x 22MP ZSL
Encode/ Decode	8K30 / 4K120 10-bit H.265 Dolby Vision, HDR10+, HDR10, HLG 720p960 infinite recording	4K60 10-bit H.265 HDR10, HDR10+, HLG 720p480
Integrated Modem	none (Paired with external X55 only) (LTE Category 24/22) DL = 2500 Mbps 7x20MHz CA, 1024-QAM UL = 316 Mbps 3x20MHz CA, 256-QAM (5G NR Sub-6 + mmWave) DL = 7000 Mbps UL = 3000 Mbps	Snapdragon X24 LTE (Category 20) DL = 2000Mbps 7x20MHz CA, 256-QAM, 4x4 UL = 316Mbps 3x20MHz CA, 256-QAM
Mfc. Process	TSMC 7nm (N7P)	TSMC 7nm (N7)

The Snapdragon 865 is a successor to the Snapdragon 855 last year, and thus represents Qualcomm’s latest flagship chipset offering the newest IP and technologies. On the CPU side, Qualcomm has integrated Arm’s newest Cortex-A77 CPU cores, replacing the A76-based IP from last year. This year Qualcomm has decided against requesting any microarchitectural changes to the IP, so unlike the semi-custom Kryo 485 / A76-based CPUs which had some differing aspects to the design, the new A77 in the Snapdragon 865 represents the default IP configuration that Arm offers.

Clock frequencies and core cache configurations haven’t changed this year – there’s still a single “Prime” A77 CPU core with 512KB cache running at a higher 2.84GHz and three “Performance” or “Gold” cores with reduced 256KB caches at a lower 2.42GHz. The four little cores remain A55s, and also the same cache configuration as well as the 1.8GHz clock. The L3 cache of the CPU cluster has been doubled from 2 to 4MB. In general, Qualcomm’s advertised 25% performance uplift on the CPU side solely comes from the IPC increases of the new A77 cores.

The GPU this year features an updates Adreno 650 design which increases ALU and pixel rendering units by 50%. The end-result in terms of performance is a promised 25% upgrade – it’s likely that the company is running the new block at a lower frequency than what we’ve seen on the Snapdragon 855, although we won’t be able to confirm this until we have access to commercial devices early next year.

A big performance upgrade on the new chip is the quadrupling of the processing power of the new Tensor cores in the Hexagon 698. Qualcomm advertises 15 TOPS throughput for all computing blocks on the SoC and we estimate that the new Tensor cores roughly represent 10 TOPS out of that figure.

In general, the Snapdragon 865 promises to be a very versatile chip and comes with a lot of new improvements – particularly 5G connectivity and new camera capabilities are promised to be the key features of the new SoC. Today’s focus lies solely on the performance of the chip, so let’s move on to our first test results and analysis.

New Memory Controllers & LPDDR5: A Big Improvement

One of the larger changes in the SoC this generation was the integration of a new hybrid LPDDR5 and LPDDR4X memory controller. On the QRD865 device we’ve tested the chip was naturally equipped with the new LP5 standard. Qualcomm was actually downplaying the importance of LP5 itself: the new standard does bring higher memory speeds providing better bandwidth, however latency should be the same, and power efficiency benefits, while there, shouldn’t be overplayed. Nevertheless, Qualcomm did claim they focused more on improving their memory controllers, and this year we’re finally seeing the new chip address some of the weaknesses exhibited by the past two generations; memory latency.

We had criticised Qualcomm’s Snapdragon 845 and 855 for having quite bad memory latency – ever since the company had introduced their system level cache architecture to the designs, this aspect of the memory subsystem had seen some rather mediocre characteristics. There’s been a lot of arguments in regards to how much this actually affected performance, with Qualcomm themselves naturally downplaying the differences. Arm generally notes a 1% performance difference for each 5ns of latency to DRAM, if the differences are big, it can sum up to a noticeable difference.

( )

Looking at the new Snapdragon 865, the first thing that pops up when comparing the two latency charts is the doubled L3 cache of the new chip. It’s to be noted that it does look that there’s still some sort of logical partitioning going on and 512KB of the cache may be dedicated to the little cores, as random-access latencies start going up at 1.5MB for the S855 and 3.5MB for the S865.

Further down in the deeper memory regions, we’re seeing some very big changes in latency. Qualcomm has been able to shave off around 35ns in the full random-access test, and we’re estimating that the structural latency of the chip now falls in at ~109ns – a 20ns improvements over its predecessor. While it’s a very good improvements in itself, it’s still a slightly behind the designs of HiSilicon, Apple and Samsung. So, while Qualcomm still is the last of the bunch in regards to its memory subsystem, it’s no longer trailing behind by such a large margin. Keep in mind the results of the Kirin 990 here as we go into more detailed analysis of memory-intensive workloads in SPEC on the next page.

Furthermore, what’s very interesting about Qualcomm’s results in the DRAM region is the behaviour of the TLB+CLR Trash test. This test is always hitting the same cache-line within a page across different, forcing a cache line replacement. The oddity here is that the Snapdragon 865 here behaves very differently to the 855, with the results showcasing a separate “step” in the results between 4MB and ~32MB. This result is more of an artefact of the test only hitting a single cache line per page rather than the chip actually having some sort of 32MB hidden cache. My theory is that Qualcomm has done some sort of optimisation to the cache-line replacement policy at the memory controller level, and instead the test hitting DRAM, it’s actually residing at on the SLC cache. It’s a very interesting result and so far, it’s the first and only chipset to exhibit such behaviour. If it’s indeed the SLC, the latency would fall in at around 25-35ns, with the non-uniform latency likely being a result of the four cache slices dedicated to the four memory controllers.

Overall, it looks like Qualcomm has made rather big changes to the memory subsystem this year, and we’re looking forward to see the impact on performance.

CPU Performance & Efficiency: SPEC2006

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

178 Comments

View All Comments

Bulat Ziganshin - Monday, December 16, 2019 - link
The Spec2006 tables show that A13 has performance similar to x86 desktop chips, which may be considered as revolution. Can you please add frequencies of the chips (both x86 and Apple) too, at least some estimations? Also, what are the memory configs (freq/CAS/...)? It will be also interesting to see x86 chips in individual SPEC benchmarks so we can analyze what are the weak and string points of Apple architecture.
Andrei Frumusanu - Monday, December 16, 2019 - link
The Apple chips are running near their peak frequencies, with some subtests being slightly throttled due to power. The 9900K was at 5GHz, the 3950X at 4.6-4.65GHz, 3200CL16 on the desktop parts.

I added the detailed overview over all chips; here's it again: https://images.anandtech.com/doci/15207/SPEC2006_o...
unclevagz - Monday, December 16, 2019 - link
It would be nice if some contemporary x86 laptop chips could be added to that list (Ryzen/Ice Lake/Coffee Lake...) just for ease of comparison between ARM and x86 mobile chips.
sam_ - Monday, December 16, 2019 - link
Any strong reason for these tests being compiled with -mcpu=cortex-a53 on Android/Linux?

One might expect for SoCs with 8.2 on all cores there may be some uplift from at least targeting cortex-a55, if not cortex-a75?

When you're expecting to run on a big core, forcing the compiler to target a in-order core which can only execute one ASIMD instruction per cycle seems likely to restrict the perf (unrolling insufficiently etc.). Certainly seems a bit unfair for aarch64 vs. x64 comparison, and probably makes the apple SoCs look better too (assuming XCode isn't targeting a LITTLE core by default). It also likely makes newer bigger cores look worse than they should vs. older cores with smaller OoO windows.

I get not wanting to target compilation to every CPU individually, but would be interesting to know how much of an effect this has; perhaps this could contribute to the expected IPC gains for FP not being achieved?
Andrei Frumusanu - Monday, December 16, 2019 - link
The tuning models only have very minor impact on the performance results. Whilst using the respective models for each µarch can give another 1-1.5% boost in some tests, as an overall average across all micro-architectures I found that giving the A53 model results in the highest performance. This is compared to not supplying any model at all, or using the common A57 model.

The A55 model just points to the A53 scheduling model, so they're the same.
sam_ - Monday, December 16, 2019 - link
Hmm, I took a look at LLVM and the scheduling model is indeed the same for A53 and A55, but A55 should enable instruction generation for the various extensions introduced since v8.0. I can believe that for spec 2006 8.1 atomics/SQRDMLAH/fp16/dot product/etc. instructions don't get generated.

It looks like not much attention has been paid to tweaking the LLVM backend for more recent big cores than A57, beyond getting the features right for instruction generation, so I can believe cortex-a53 still ends up within a couple of percent of more specific tuning. Probably means there's more work to be done on LLVM.

If it is easy to test I think it would be interesting to try cortex-a57, or maybe exynos-m4 tuning on a77 because these targets do seem to unroll more aggressively than other cortex-X targets with the current LLVM backend.
I made a toy example on godbolt: https://godbolt.org/z/8i9U5- , though for this particular loop I think a77 would have the vector integer MLA unit saturated with unroll by 2 (and is probably memory bound!), still the other targets would seem more predisposed to exposing instruction level parallelism.
Andrei Frumusanu - Tuesday, December 17, 2019 - link
I pointed out to Arm that there's not much optimisations going on in terms of the models, but they said that they're not putting a lot of effort into that, and instead trying to optimise the general Arm64 target.

I tested the A57 targets in the past, I'll have a look again on things like the M4 tuning over the coming months as I finally get to port SPEC2017.
Quantumz0d - Monday, December 16, 2019 - link
Sigh another comment on the x86 vs A series. Why dont people understand running an x86 code on ARM will have a massive impact in performance ? How do people think a fanless BGA processor with sub 10W design beat an x86 in realworld just because it has Muh Benchwarrior ? There are so many possible workloads from SIMD, HT/SMT, ALU.

Having scalability is also the key. Look at x86 AMD and Intel how they do it by making a Large Wafer and having multi SKUs with LGA/PGA (AM4) sockets allowing for maximum robustness.

ARM is all about efficiency and economical bandwidth and it won't scale like x86 for all workloads. If you add AVX its dead. And Freq scaling with HT/SMT. Add the TSMC N7 which is only fit for mobile SoCs. Ryzen don't scale much into clocks because of this limitation.

ARM is always Custom if you see as per Vendor. Its bad. Look at MediaTek trash no GPL policy. Huawei as well. Except QcommCAF and Exynos. Its a shame that TI OMAP left.
Andrei Frumusanu - Monday, December 16, 2019 - link
> Why dont people understand running an x86 code on ARM will have a massive impact in performance ?

Nobody even mentioned anything regarding this, you're going off on a nonsensical rant yet again. For once, please keep the comments section level-headed.
Quantumz0d - Monday, December 16, 2019 - link
What ? Its a genuine point. ARM based 8c processors Windows machines like Surface Pro X can only emulate 32bit x86 code. 64bit isnt here and running both emulation will have am impact (slow) That's what I mean. They need native code to run and rival.

Rant ? Benches = Realworld right. How come a user is able to see an OP7 Pro breeze through and not lag and offer shitty performance vs an iPhone ? I saw with my own OP3 downclocked on Sultan ROM due to the high clockspeed bug on 82x platform not just me, So many other users. GB score and benches do not only mean performance esp in ARM arena.

Except for bragging rights, This is pure Whiteknighting.

The Snapdragon 865 Performance Preview: Setting the Stage for Flagship Android 2020

Snapdragon 865

New Memory Controllers & LPDDR5: A Big Improvement

Post Your Comment

178 Comments

View All Comments

Bulat Ziganshin - Monday, December 16, 2019 - link

Andrei Frumusanu - Monday, December 16, 2019 - link

unclevagz - Monday, December 16, 2019 - link

sam_ - Monday, December 16, 2019 - link

Andrei Frumusanu - Monday, December 16, 2019 - link

sam_ - Monday, December 16, 2019 - link

Andrei Frumusanu - Tuesday, December 17, 2019 - link

Quantumz0d - Monday, December 16, 2019 - link

Andrei Frumusanu - Monday, December 16, 2019 - link

Quantumz0d - Monday, December 16, 2019 - link

Log in

Don't have an account? Sign up now