With the launch of Kaveri, some people have been wondering whether the platform is suitable for HPC applications. Peak floating-point performance of the CPU and GPU on both fp32 and fp64 datatypes is one of the considerations. At launch, the fp64 performance of Kaveri's GPU was not clear to us, but we now have official confirmation from AMD that it is 1/16th the fp32 rate (similar to most GCN-based GPUs except the flagships), and we have verified this on our 7850K by running FlopsCL.

I am taking this opportunity to summarize the floating-point capabilities of Kaveri, Trinity, Llano and Intel's competing platforms, Haswell and Ivy Bridge, on both the CPU and GPU side. We provide a per-cycle estimate for each chip as well as the calculated peak in gflops. The estimates are chip-wide, i.e. they already take into account the number of cores or modules. Because of turbo boost, it was difficult to decide what frequency to use for the peak calculations. For CPUs we use the base frequency and for GPUs we use the boost frequency, because in multithreaded and/or heterogeneous scenarios the CPU is less likely to turbo. In any case, we believe our readers are smart enough to calculate peaks at any frequency they want, given that we already supply per-cycle peaks :)
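For readers who want to redo the peak calculation at a different clock, here is a minimal sketch of the arithmetic: peak gflops is the chip-wide flops per cycle multiplied by the frequency in GHz. The function name `peak_gflops` is only illustrative, and the 4.0 GHz figure is an assumed clock picked for the example, not a measured one.

```python
def peak_gflops(flops_per_cycle: int, freq_ghz: float) -> float:
    """Chip-wide peak in gflops: flops per cycle times clock in GHz."""
    return flops_per_cycle * freq_ghz


# Kaveri 7850K CPU, AVX with FMA, fp32: 32 flops/cycle at the 3.7 GHz base clock.
print(round(peak_gflops(32, 3.7), 1))

# Same per-cycle figure at an assumed 4.0 GHz clock (illustrative only).
print(round(peak_gflops(32, 4.0), 1))

# Kaveri 7850K GPU, fp32: 1024 flops/cycle at the 720 MHz boost clock.
print(round(peak_gflops(1024, 0.72), 1))
```

The same function applies to any row of per-cycle figures, CPU or GPU, as long as the frequency is expressed in GHz.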

The peak CPU performance will depend on the SIMD ISA that your code was written and compiled for. We consider three cases: SSE, AVX (without FMA) and AVX with FMA (either FMA3 or FMA4).

 

CPU floating-point peak performance

| Platform | Kaveri | Trinity | Llano | Haswell | Ivy Bridge |
|---|---|---|---|---|---|
| Chip | 7850K | 5800K | 3870K | 4770K | 3770K |
| CPU frequency | 3.7 GHz | 3.8 GHz | 3.0 GHz | 3.5 GHz | 3.5 GHz |
| SSE fp32 (/cycle) | 16 | 16 | 32 | 32 | 32 |
| SSE fp64 (/cycle) | 8 | 8 | 16 | 16 | 16 |
| AVX fp32 (/cycle) | 16 | 16 | - | 64 | 64 |
| AVX fp64 (/cycle) | 8 | 8 | - | 32 | 32 |
| AVX FMA fp32 (/cycle) | 32 | 32 | - | 128 | - |
| AVX FMA fp64 (/cycle) | 16 | 16 | - | 64 | - |
| SSE fp32 (gflops) | 59.2 | 60.8 | 96 | 112 | 112 |
| SSE fp64 (gflops) | 29.6 | 30.4 | 48 | 56 | 56 |
| AVX fp32 (gflops) | 59.2 | 60.8 | - | 224 | 224 |
| AVX fp64 (gflops) | 29.6 | 30.4 | - | 112 | 112 |
| AVX FMA fp32 (gflops) | 118.4 | 121.6 | - | 448 | - |
| AVX FMA fp64 (gflops) | 59.2 | 60.8 | - | 224 | - |

It is no secret that AMD's Bulldozer family cores (Steamroller in Kaveri and Piledriver in Trinity) are no match for recent Intel cores in FP performance due to the shared FP unit in each module. As a comparison point, one core in Haswell has the same floating point performance per cycle as two modules (or four cores) in Steamroller.
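The per-cycle comparison can be reproduced with a rough model, sketched below. It uses the unit counts the author gives in the comment thread (two 256-bit units per Haswell core, two 128-bit units per Steamroller module) and assumes each unit retires one operation per SIMD lane per cycle, or two when the operation is a fused multiply-add. The function `flops_per_cycle` and its parameter names are illustrative, and the model is a simplification that ignores everything except unit count and width.

```python
def flops_per_cycle(blocks: int, units_per_block: int, unit_bits: int,
                    elem_bits: int, fma: bool) -> int:
    """Peak flops per cycle for `blocks` cores (Intel) or modules (AMD).

    Assumes one operation per SIMD lane per unit per cycle, counted as
    two flops when the operation is a fused multiply-add.
    """
    lanes = unit_bits // elem_bits
    flops_per_lane = 2 if fma else 1
    return blocks * units_per_block * lanes * flops_per_lane


# Quad-core Haswell: 4 cores, two 256-bit units each, fp32 with FMA.
haswell_chip = flops_per_cycle(4, 2, 256, 32, fma=True)

# Two-module Steamroller (Kaveri): two 128-bit units per module, fp32 with FMA.
steamroller_chip = flops_per_cycle(2, 2, 128, 32, fma=True)

# A single Haswell core under the same assumptions.
haswell_core = flops_per_cycle(1, 2, 256, 32, fma=True)

print(haswell_chip, steamroller_chip, haswell_core)
```

Under these assumptions a single Haswell core and the whole two-module Steamroller chip come out equal, and the full Haswell chip is four times higher, which matches the AVX FMA rows of the CPU table.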

Now onto GPU peaks. Here, for Haswell, we chose to include both GT2 and GT3e variants.

GPU floating-point peak performance

| Platform | Kaveri | Trinity | Llano | Haswell GT3e | Haswell GT2 | Ivy Bridge |
|---|---|---|---|---|---|---|
| Chip | 7850K | 5800K | 3870K | 4770R | 4770K | 3770K |
| GPU frequency | 720 MHz | 800 MHz | 600 MHz | 1.3 GHz | 1.25 GHz | 1.15 GHz |
| fp32/cycle | 1024 | 768 | 800 | 640 | 320 | 256 |
| fp64/cycle (OpenCL) | 64 | 48** | 0 | 0 | 0 | 0 |
| fp64/cycle (Direct3D) | 64 | 0? | 0 | 160 | 80 | 64 |
| fp32 gflops | 737.3 | 614 | 480 | 832 | 400 | 294.4 |
| fp64 gflops (OpenCL) | 46.1 | 38.4** | 0 | 0 | 0 | 0 |
| fp64 gflops (Direct3D) | 46.1 | 0? | 0 | 208 | 100 | 73.6 |

The fp64 support situation is a bit of a mess because some GPUs only support fp64 under some APIs. The fp64 rate of Intel's GPUs does not appear to be published, but David Kanter provides an estimate of 1/4 the fp32 rate. However, Intel enables fp64 only under DirectCompute and not under OpenCL for any of its GPUs.

The situation on AMD's Trinity/Richland is even more complicated. fp64 support under OpenCL is not standards-compliant and depends upon a proprietary extension (cl_amd_fp64). From what I can tell, Trinity/Richland do not support fp64 under DirectCompute (or under Microsoft's C++ AMP implementation). From an API standpoint, Kaveri's GCN GPU should work fine for fp64 under all APIs.
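The fp64 figures in the GPU table follow from the fp32 figures and the fp64:fp32 rate: 1/16 for Kaveri as confirmed by AMD, and the estimated 1/4 for Intel's GPUs. A small sketch of that arithmetic is below; the helper name `fp64_per_cycle` and the dictionary layout are illustrative choices, and the Intel ratio remains an estimate rather than a published number.

```python
from fractions import Fraction


def fp64_per_cycle(fp32_per_cycle: int, ratio: Fraction) -> int:
    """fp64 flops per cycle from the fp32 figure and the fp64:fp32 rate."""
    return int(fp32_per_cycle * ratio)


# (fp32 flops/cycle, fp64:fp32 rate, GPU clock in GHz), taken from the article.
gpus = {
    "Kaveri 7850K": (1024, Fraction(1, 16), 0.72),
    "Haswell GT3e 4770R": (640, Fraction(1, 4), 1.3),  # 1/4 is an estimate
}

for name, (fp32, ratio, ghz) in gpus.items():
    per_cycle = fp64_per_cycle(fp32, ratio)
    print(f"{name}: {per_cycle} fp64 flops/cycle, "
          f"{round(per_cycle * ghz, 1)} gflops")
```

Whether a given figure is usable in practice still depends on the API, since the rate only matters where the driver exposes fp64 at all.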

Some of you might be wondering whether Kaveri is good for HPC applications. Applications that are already ported to and work well on discrete GPUs will continue to be best run on discrete GPUs. However, Kaveri and HSA will enable many more applications to be GPU accelerated.

Now we compare Kaveri against Haswell. In applications that depend on fp64 performance, conditions are generally not favorable to Kaveri. Kaveri's fp64 peak including both the CPU and GPU is only about 105 gflops (59.2 from the CPU with FMA plus 46.1 from the GPU). You will generally be better off first optimizing your code for AVX and FMA instructions and running on Haswell's CPU cores. If you are using Windows 8, you might also want to explore using Iris Pro through C++ AMP in conjunction with the CPU. Overall, I doubt we will see Kaveri being used for fp64 workloads.
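As a rough illustration of the fp64 comparison, the sketch below adds the CPU and GPU peaks from the tables. Treat the sum as an optimistic upper bound: it assumes both devices sustain their peak at the same time, which real heterogeneous applications rarely achieve. The function name `combined_peak` is illustrative.

```python
def combined_peak(cpu_gflops: float, gpu_gflops: float) -> float:
    """Optimistic combined peak: assumes CPU and GPU both run at peak."""
    return cpu_gflops + gpu_gflops


# Kaveri 7850K: CPU fp64 with FMA plus GPU fp64 (gflops, from the tables).
kaveri_fp64 = combined_peak(59.2, 46.1)

# Haswell 4770K: CPU fp64 with AVX and FMA, no GPU contribution.
haswell_cpu_fp64 = 224.0

# Haswell 4770K with GT2, where the GPU figure is reachable only via Direct3D.
haswell_gt2_fp64 = combined_peak(224.0, 100.0)

print(round(kaveri_fp64, 1), haswell_cpu_fp64, round(haswell_gt2_fp64, 1))
```

Even with the GPU included, Kaveri's combined figure stays below what Haswell's CPU cores reach on their own, which is the basis for the recommendation to optimize for AVX and FMA first.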

For heterogeneous fp32 applications, Kaveri should outperform Haswell GT2 and Ivy Bridge. Haswell GT3e will again be a strong contender on Windows given the extremely capable Haswell CPU cores and Iris Pro graphics. Intel's GPUs do not currently support OpenCL under Linux, but a driver is being worked on. Thus, on Linux, Kaveri will simply win out on fp32 heterogeneous applications. However, even on Windows, Haswell GT3e will get strong competition from Kaveri. While AMD has advantages such as the excellent GCN architecture and the HSA software stack (when ready) enabling many more applications to take advantage of the GPU, Iris Pro will have the eDRAM to potentially provide much improved bandwidth, along with the backing of strong CPU cores.

I hope I have provided a fair overview of the FP capabilities of each platform. Application performance will of course depend on many more factors. Your questions and comments are welcome.

101 Comments

  • BMNify - Friday, January 24, 2014 - link

    yeah , that kernel thing that runs on all the 1.81 billion mobile phone sales for all of 2013 not counting all of the other android devices today OC.
  • BMNify - Friday, January 24, 2014 - link

    and you are aware that the AMD linux Radeon closed source driver as used here is considered to be on par with the windows driver as they use the same code base, and did you forget that kaveri and its slightly slower brothers are supposed to be found in the mobile android devices running that kernel etc some day, if they manage to get actual orders there to offset their lower windows PC sales today.
  • moozoo - Wednesday, January 22, 2014 - link

    Thank you for this article.

    The reason the Intel GPUs don't have fp64 under OpenCL is that the math instruction that includes intrinsics and division doesn't support fp64. See page 134 of Intel Open Source Graphics Programmer's Reference Manual for the 2013 Intel Core Processor Family...: Volume 2b.

    From what I can tell, GPUs have a larger number of intrinsics with greater numerical accuracy than AVX. Intel isn't correcting this until AVX-512 (see chapter 7.2 of the "Intel Architecture Instruction Set Extensions Programming Reference" and note the "less than 2^-23 relative error"). I believe the normal accuracy is 2^-14.
    AVX does not have a native fp64 rsqrt.
    The native log and exp for Hawaii is precise to 1 ULP (http://semiaccurate.com/2013/10/23/long-look-amds-...)

    The Intel OpenCL will not generate AVX2 FMA instructions.
    http://software.intel.com/en-us/forums/topic/40116...
    I assume the native AVX2 FMA is not compliant with OpenCL requirements in some way.

    There may be a Workstation version of Kaveri on the way. This might have a better fp64:fp32 ratio than 1:16 (http://semiaccurate.com/2013/06/18/a-glimpse-of-fu...)
  • kantian - Thursday, January 23, 2014 - link

    Why don't you specify that CPU fpu64 numbers of Intel are for AVX2 instructions, but not for AVX? In this way you give unjust performance advantage to Intel! Intel CPU fpu64 has about 2x performance advantage over AMD fpu64 only with AVX2 instructions. That's why, your following statement seems quite untrue:

    "As a comparison point, one core in Haswell has the same floating point performance per cycle as two modules (or four cores) in Steamroller."
  • kantian - Thursday, January 23, 2014 - link

    You can look at the following chart - http://images.hardwarecanucks.com/image//skymtl/CP... for some comparison numbers and examples. As you can see the FPU (VP8) results of the Haswell i3-4330 are about 2x than that of Kaveri A10-7850k. However the older FPU Ivy Bridge i3-3225 results are similar to that of A10-7850k. That's because the new Haswell processors have AVX2 instructions, but not the Ivy Bridge ones. You can also see that, if you compare the VP8 i7-4770K results to i7-3770K ones. That's why, i7-4770 has twice more performance than i7-3770K.
  • rahulgarg - Thursday, January 23, 2014 - link

    From a floating point perspective, the only difference between AVX and AVX2 is that AVX2 contains FMA instructions while AVX does not. Kaveri/Steamroller do not support full AVX2 but do support FMA instructions. So, from a floating point perspective, Kaveri/Steamroller and Haswell support almost the same instruction set. If you look at the AVX with FMA entries in the table, we already cover this case.
  • kantian - Thursday, January 23, 2014 - link

    Thank you for your clarification! But as far as I know, Intel Haswell architecture has FMA 256 bit units compared to Ivy Bridge and Kaveri, etc., which have 128 bit FMA ones. That's Haswell's only big FPU architectural advantage over the others. That can explain the double performance per FPU module that we can observe on the chart I have posted. And as you say, AVX2 includes FMA instructions, which is where the big performance advantage is. However, I cannot understand your table, where the regular AVX instructions have a 4x advantage over Kaveri. As we can see on the chart (http://images.hardwarecanucks.com/image//skymtl/CP...), the practical results show a different picture. Haswell's FPU advantage over Kaveri (counting the same number of FPUs) is about 50% - 60%, but not more.
  • rahulgarg - Thursday, January 23, 2014 - link

    Yes, well, our coverage is more about the theoretical peaks. In practical applications, differences will be smaller.
    About the 4x advantage of AVX over Kaveri, the difference is that each Haswell core has two 256-bit units. Thus, quad-core Haswell has total of eight 256-bit units.
    Steamroller modules only have two 128-bit units per module. Thus, quad-core Steamroller only has four 128-bit units. Thus, Haswell has twice the number of SIMD units and each unit is double the width, hence the 4x difference.
  • kantian - Thursday, January 23, 2014 - link

    Thank you! I can absolutely agree with your calculations. However, I always thought that it is more accurate to compare the quad-core two-module Steamroller or Piledriver with i3 2 core 4 thread processors. Because, as we know, the AMD quad-core processors have only 2 FPU and 4 integer units. So they are only 2 core regarding the FPU and quad-core regarding integer. I think the AMD definition of quad-core (or any other number of cores) is not quite correct. But that is another story...
  • kantian - Thursday, January 23, 2014 - link

    And I think that your comment "About the 4x advantage of AVX over Kaveri, the difference is that each Haswell core has two 256-bit units. Thus, quad-core Haswell has total of eight 256-bit units." is just partly correct. Because those units are 256-bit FMA units. And FMA instructions are part of AVX2, but not AVX. That was the subject of my initial comment.
