02:04PM EDT - Final talk of this session is IBM z15. This is big iron mainframe stuff. Prepare to be shocked about these chips, and wonder why you don't have them.

02:04PM EDT - Mainframes are still releveant - 220+ billion lines of COBOL still in deployment today. 70% of all business transactions still use COBOL

02:05PM EDT - Programs built in 1964 on IBM mainframes still work today

02:05PM EDT - 70k searches on google per second, vs. 1.3 million transactions per second on mainframes

02:06PM EDT - Deep pipeline high frequency z-series

02:07PM EDT - z13 introduced SMT, z14 introduced pervasive encryption

02:07PM EDT - These CPUs are ground up built by IBM and not used by anyone else

02:07PM EDT - Two bits of silicon - Storage Controller with 960 MB L4 cache, four Complte Chips

02:08PM EDT - 5 drawers are fully connected through the SC chips

02:08PM EDT - Large 14nm SOI designs

02:08PM EDT - 700mm2+ each

02:08PM EDT - ~700mm2

02:08PM EDT - Each SC chip has 12 cores. 8 MB L2 cache

02:08PM EDT - Max config supports 240 cores. 190 cores are customer available, others are for management or recovery

02:09PM EDT - 60 PCIe 4 x16 connections

02:09PM EDT - 40 TB RAIM memory supported

02:09PM EDT - Two CP chips create a single logical cluster

02:10PM EDT - CP chip

02:10PM EDT - 12 cores, 5.2 GHz

02:10PM EDT - 9.1B transistors

02:10PM EDT - 128 KB L1-I cache, 128 KB L1-D cache

02:10PM EDT - L2 4MB cache private

02:10PM EDT - 256 MB L3 cache shared

02:11PM EDT - Secure exectuion - 38 new instructions of vector performance

02:11PM EDT - speed up of common instructions in a smart way

02:11PM EDT - on-chip accelerators such as gzip, elliptic curve crypto, on-core sort/merge

02:11PM EDT - Here are the comparisons to z14

02:11PM EDT - 14% ST perf over z14

02:12PM EDT - Deep pipeline, CISC architecture

02:12PM EDT - branch is async

02:12PM EDT - two copies of almost everything shown

02:12PM EDT - recovering unit for when errors are detected - processor rolls back

02:13PM EDT - This allows for transient recovery from hardware errors

02:13PM EDT - Known good state can be transferred to a new core if non-transient error happens

02:13PM EDT - The goal of these cores is to be recoverable, even when blasted with high-energy proton beams

02:15PM EDT - NXU is syncrhonous and runs in real time

02:16PM EDT - Two main ways for compression - IBM uses both depending on the size to get the best results

02:17PM EDT - Elliptic curve cryptography acceleration unit in each core, along with enhanced modulo unit which it relies on

02:17PM EDT - 'MA unit' has its own instruction set and 'core'

02:17PM EDT - sign and verify is implemented as firmware and hardware

02:17PM EDT - Acts as a blueprint for future accelerators

02:17PM EDT - Attached to back-end of the pipeline

02:18PM EDT - All execution is in order and non-speculative

02:18PM EDT - No pipeline pain points

02:18PM EDT - Results are passed to the core

02:18PM EDT - physically these accelerators could be placed far away from core logic as needed

02:18PM EDT - dozens to hundreds of modulo ops with a couple of doubleword instructions

02:19PM EDT - Core is called millicode ?

02:19PM EDT - Here are the internal instruction set

02:19PM EDT - speed ups vs external attached PCIe accelerator on z14

02:19PM EDT - Secure Execution for z15 with vertical isolation

02:20PM EDT - specialized mode in the CPU, IO, and memory subsystem

02:20PM EDT - ultravisor sits between the hypervisor and OS

02:21PM EDT - integrity hash and input/output counts to stop malicious guests

02:22PM EDT - controlled code environment

02:23PM EDT - 5.2 GHz water cooled

02:23PM EDT - 4 CP and 1 SC chip per drawer

02:23PM EDT - 14% ST and 25% cap vs z14

02:24PM EDT - Q&A time

02:25PM EDT - Q: Pipeline depth? A: It's long! Long front end and back end and extends with the recovery

02:25PM EDT - Around 30

02:25PM EDT - Q: L1 / L2 load to use latency? A: L1 4-cycle loop, 8-cycle for L1 miss/L2 hit

02:26PM EDT - Q: For secure execution, what is the isolation from firmware? A: Validated trusted firmware / ultravisor. That's an integral part of our secure guest security

02:27PM EDT - Q: Async branch prediction? What happens if it's behind i-fetch A: It's lossless, and i-fetch is in-order. After a pipeline restart, if i-fetch is ahead, the pipeline will react and throw away as required. There are syncs - there's a hard sync at dispatch, so no predictions are dropped

02:27PM EDT - Q: Core IPC vs Power10? A: ask power!

02:28PM EDT - AES-256 for page encryption, integrity hash is SHA-512

02:29PM EDT - Q: 5.2 GHz? How? A: Deep pipelining and focus on gate design. A lot of work. Deep pipelining is table steaks, but a lot fo other things are needed

02:30PM EDT - Q: Is power/eff an important focus? A: It does consume less power than z14 that's similar configured. From a chip perspective, the focus wasn't to reduce overall power - the focus was on performance and throughput. That was done to put two more cores in and double caches - we burned the power budget to add in more performance. This is the sort of product this is. We stuff more hardware and acceleration.

02:30PM EDT - That's a wrap for the first session. Come back in 30 minutes for the next session, where we start on AMD's 4000-series Renoir

02:31PM EDT - .

POST A COMMENT

20 Comments

View All Comments

  • Ian Cutress - Monday, August 17, 2020 - link

    They eventually said 5.2 GHz base in the Q&A Reply
  • ikjadoon - Monday, August 17, 2020 - link

    >The goal of these cores is to be recoverable, even when blasted with high-energy proton beams

    Well, I personally now feel underprepared for any incoming high-energy proton beams.
    Reply
  • Mdarrish - Monday, August 17, 2020 - link

    They or similar high energy radiation come in all the time - cosmic rays switch the state of gates in processors. The smaller the transistors, the easier it is to switch the state. Modern CPUs have redundant circuits to ‘vote’ on results when that happens. Reply
  • octavus - Monday, August 17, 2020 - link

    Too bad that 940GB L4 cache chip can't be an option for X86 systems. Reply
  • Lord of the Bored - Monday, August 17, 2020 - link

    "02:05PM EDT - Programs built in 1964 on IBM mainframes still work today"

    I am torn. On the one hand, I strongly believe that sort of backwards-compatibility is something to be applauded.
    On the other hand, the surrounding context makes me concerned that they know this because some ancient program that hasn't been maintained in fifty years is a critical part of everyday financial transactions.
    Reply
  • Mdarrish - Monday, August 17, 2020 - link

    They were updated for Y2K, so only 20 years. Seriously, there are usually updates along the way. The point is that critical business systems don’t have to be rewritten or need emulators. Reply
  • Zizy - Tuesday, August 18, 2020 - link

    If it works, why break it?
    While I doubt there is any meaningfully large piece of software from that era still in use, some core algorithms likely have been written back then and are still used today completely unmodified in over 50 years.
    Reply
  • quadibloc - Tuesday, August 18, 2020 - link

    The z15 has a very deep pipeline. So it's not surprising that it can have a 5.2 GHz clock. It's like a Pentium 4 (or even a Bulldozer); and it's kept cool by a fancy water-cooling system. So IBM hasn't worked miracles, it's just attained 5.2 GHz in ways that Intel and AMD reject for good reason. Reply
  • melgross - Tuesday, August 18, 2020 - link

    AMD sand Intel can’t do this because this is a very large system. The cost to do it likely costs more than the totality of most server rooms. Reply
  • Dolda2000 - Wednesday, August 19, 2020 - link

    To be fair though, a lot of the pipeline depth seems to be contributed by the retire/commit side of things, with all the extra stages for checkpointing and verification, which shouldn't leave any effect on branch mispredict penalties.

    Granted, the front-end pipeline isn't exactly short, but it doesn't look completely out of line compared to Intel designs. If anything, it surprises me a bit that they haven't implemented a micro-op cache to bypass all those myriad of steps they seem to have for decoding. It looks like something the design would benefit quite a lot from.
    Reply

Log in

Don't have an account? Sign up now