Arm Cortex-X925: Leading the Way in Single-Threaded IPC

The Arm Cortex-X925, codenamed "Blackhawk," is, by Arm's own account, its new leader in single-threaded instructions-per-clock (IPC) performance, promising substantial gains in both performance and efficiency. The core is a pivotal part of Arm's move to the 3 nm process node and integrates into the second-generation Armv9.2 architecture. If Arm's claims hold up, the Cortex-X925 positions the company as a leader in high-performance mobile computing and exemplifies the focus on power, performance, and area (PPA) that drives Arm's 2024 CPU core cluster.

The Cortex-X925 is built on architectural improvements designed to maximize IPC. One of the standout features is its 10-wide decode and dispatch width, significantly increasing the number of instructions processed per cycle. This enhancement allows the core to execute more instructions simultaneously, leading to better utilization of execution units and higher overall throughput.
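
A wide decoder only pays off when the instruction stream contains independent work. The rough C sketch below (a generic illustration with made-up function names, not Arm code) makes the point: a reduction written as one serial dependency chain caps IPC near one regardless of decode width, while the same reduction split across independent accumulators gives a 10-wide front end parallel work to dispatch.

```c
#include <stddef.h>

/* Serial chain: every add depends on the previous one, so issue width
 * beyond one is wasted on this loop no matter how wide the decoder is. */
long sum_serial(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Independent accumulators: the four adds per iteration have no
 * dependences on each other, so a wide decode/dispatch core can issue
 * several of them in the same cycle. */
long sum_wide(const long *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover tail */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

(In practice an optimizing compiler may perform this transformation itself; the sketch simply shows the dependency structure a wide core needs to see.)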

Arm has doubled the instruction window size to support this wide instruction path, allowing more instructions to be held in flight at any given time. This reduces stalls and improves the efficiency of the execution pipeline. Additionally, the core delivers a 2x increase in L1 instruction cache (I$) bandwidth and a similar increase in L1 instruction translation lookaside buffer (iTLB) size. These enhancements ensure that the core can quickly fetch and decode instructions, minimizing delays and maximizing performance.

The Cortex-X925 also features a highly advanced branch prediction unit, which reduces the number of mispredicted branches. By incorporating techniques such as folded-out unconditional direct branches, Arm has removed several architectural roadblocks, enabling a more streamlined and efficient execution path. This leads to fewer pipeline flushes and higher sustained IPC.
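
Branch prediction matters because a mispredicted branch flushes the pipeline and discards dozens of cycles of fetched work. A classic way to observe this on any modern core (a generic demonstration, not X925-specific) is to time the same branchy loop over unsorted and then sorted data; sorting makes the branch nearly perfectly predictable:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

/* Sums elements >= 128. On random data the branch mispredicts roughly
 * half the time; on sorted data it becomes almost perfectly predictable. */
static long branchy_sum(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (data[i] >= 128)
            sum += data[i];
    return sum;
}

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int *data = malloc(N * sizeof *data);
    if (data == NULL)
        return 1;
    for (size_t i = 0; i < N; i++)
        data[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = branchy_sum(data, N);        /* unpredictable branch */
    clock_t t1 = clock();

    qsort(data, N, sizeof *data, cmp_int); /* make the branch predictable */
    clock_t t2 = clock();
    long s2 = branchy_sum(data, N);
    clock_t t3 = clock();

    printf("unsorted: %ld in %.3fs | sorted: %ld in %.3fs\n",
           s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
           s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
    free(data);
    return 0;
}
```

The better a core's predictor, the smaller the gap between the two passes; claims like Arm's are ultimately about shrinking exactly this kind of penalty.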

Taken together, these changes attack the classic front-end bottleneck: a wide machine is only as fast as its ability to keep the decoders fed. The 10-wide decode path raises the ceiling on instruction-level parallelism, while the enlarged instruction window, doubled I$ bandwidth, and expanded iTLB exist precisely so that the front end rarely starves waiting on instruction fetch.

The backend of the Cortex-X925 has seen significant growth in out-of-order (OoO) execution capabilities, with the key OoO structures growing by 25-40%. This growth allows the core to schedule instructions more flexibly around stalled ones, reducing idle execution slots and improving overall performance. The register file structure has also been enhanced, and the reorder buffer and instruction issue queues have been enlarged, contributing to smoother, faster instruction execution.
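
One concrete benefit of a larger OoO window is memory-level parallelism. In the generic C sketch below (illustrative names, not X925-specific code), a pointer chase serializes every load because each address depends on the previous load, while an index-based walk over the same records lets the reorder buffer keep many cache misses in flight at once:

```c
#include <stddef.h>

typedef struct Node { struct Node *next; long val; } Node;

/* Pointer chase: the address of each load comes from the previous load,
 * so even a huge reorder buffer cannot overlap the cache misses. */
long sum_list(const Node *head) {
    long s = 0;
    for (const Node *p = head; p != NULL; p = p->next)
        s += p->val;
    return s;
}

/* Index walk over the same records: all addresses are computable up
 * front, so the OoO engine can keep many loads in flight at once and
 * hide most of the miss latency (memory-level parallelism). */
long sum_array(const Node *nodes, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += nodes[i].val;
    return s;
}
```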

Despite its high performance, the Cortex-X925 is designed to be power efficient. The 3 nm process technology is crucial, enabling better power efficiency than previous generations. The core's design includes features such as dynamic voltage and frequency scaling (DVFS), which allows it to adjust power and performance levels based on the workload. This ensures energy is used efficiently, extending battery life and reducing thermal output.
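
On Linux, DVFS is exposed through the kernel's cpufreq subsystem, so its behavior is easy to observe from user space. Below is a minimal sketch, assuming the standard cpufreq sysfs layout (paths can vary by kernel and platform, and the helper name is just illustrative):

```c
#include <stdio.h>

/* Reads a frequency value (in kHz) from a cpufreq sysfs file. */
static long read_khz(const char *path) {
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    long khz = -1;
    if (fscanf(f, "%ld", &khz) != 1)
        khz = -1;
    fclose(f);
    return khz;
}

int main(void) {
    /* Standard cpufreq sysfs paths for CPU 0; a big core on a phone SoC
     * is typically a higher-numbered CPU. */
    long cur = read_khz("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    long max = read_khz("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq");
    if (cur < 0 || max < 0) {
        fprintf(stderr, "cpufreq sysfs not available on this system\n");
        return 1;
    }
    printf("cpu0: %.0f MHz now, %.0f MHz max\n", cur / 1000.0, max / 1000.0);
    return 0;
}
```

Running this repeatedly under varying load shows the governor walking the frequency up and down, which is DVFS doing exactly what the paragraph above describes.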

The Cortex-X925 also incorporates advanced power management features, such as per-core DVFS and improved voltage regulation. These features help manage power consumption more effectively, ensuring the core delivers high performance without compromising energy efficiency. This balance is particularly beneficial for mobile devices requiring sustained performance and long battery life.

The Cortex-X925 is also designed and optimized for AI-based workloads, with microarchitectural and software optimizations that enhance on-CPU AI processing efficiency. With up to 80 TOPS (trillion operations per second) of quoted AI throughput, the platform can handle complex AI tasks, from natural language processing to computer vision. These capabilities are further supported by Arm's KleidiAI and KleidiCV libraries, which give developers the tools needed to build advanced AI applications.
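
Libraries such as KleidiAI ultimately lower onto the CPU's vector extensions (Neon and SVE2). As a hedged illustration of that style of code, and emphatically not Kleidi's actual API, here is a vector-length-agnostic dot product written with SVE ACLE intrinsics; it assumes an SVE-capable toolchain and target (e.g., compiled with -march=armv9-a+sve):

```c
#include <arm_sve.h>
#include <stddef.h>

/* Vector-length-agnostic dot product using SVE ACLE intrinsics.
 * The same binary runs on any SVE vector width; svcntw() reports
 * how many 32-bit lanes the hardware provides. */
float dot_sve(const float *a, const float *b, size_t n) {
    svfloat32_t acc = svdup_n_f32(0.0f);
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_u64(i, n);   /* predicate masks the tail */
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        acc = svmla_f32_m(pg, acc, va, vb);      /* acc += va * vb (active lanes) */
    }
    return svaddv_f32(svptrue_b32(), acc);       /* horizontal reduction */
}
```

The predicate-driven loop is the point: no scalar tail code and no recompilation per vector width, which is what makes such kernels portable across different SVE implementations.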

Interestingly, Arm hasn't moved into the realm of NPUs or AI accelerators itself. Instead, it leaves those to partners such as MediaTek, ensuring that the core cluster provides the necessary support and integration hooks for their accelerators. With its reference software stack and optimized libraries, the CSS platform provides a robust foundation for developers, and the included Arm Performance Studio offers an advanced tooling environment that helps developers optimize their applications for the new architecture.

The CSS platform's integration with operating systems such as Android, Linux variants, and Windows (through the reinvigorated Windows on Arm effort) ensures broad compatibility and ease of development. This cross-OS support lets developers quickly and efficiently build applications that leverage the Cortex-X925 and the rest of the updated Armv9.2 core cluster, which not only accelerates time-to-market but also ensures compatibility across multiple devices.

Comments

  • eastcoast_pete - Wednesday, May 29, 2024 - link

    Speaking of SVE and SME: are there any applications (for Android, Windows-on-Arm, or Apple devices) available to the general public that use either or both of them? SVE was originally co-developed by Arm and Fujitsu for the core that powers Fugaku, RIKEN's supercomputer. There are reports (rumors) that SVE is painful to implement, and someone wrote that Qualcomm elected not to enable SVE in its 8 Gen 3 SoC, even though it's in the big cores. Anyone here know and can comment? Right now, outside of 1-2 benchmarks, which applications actually use SVE, never mind SME?
  • name99 - Wednesday, May 29, 2024 - link

    Presumably Arm's Kleidi AI libraries (and various MS equivalents) use SVE and SME if present.
    And that's really what matters. This functionality is envisaged (for now) as "built-in".
    Obviously they want developer buy-in over time, but that's not what matters right now; what matters is what's in the OS and the APIs. Same as with AMX: its availability to developers via Accelerate was great, but the primary user was Apple's ML APIs.
  • Marlin1975 - Thursday, May 30, 2024 - link

    What do you mean? It's all there. They went over the optimized design that will take advantage of the synergies of the new nm tech from a leading-edge lithography manufacturer and lead them to greater performance. It's a win-win for everyone; are you not onboard?

    :)
  • syxbit - Wednesday, May 29, 2024 - link

    I suspect this will still be worse than the A17 and the Nuvia chips.
  • GC2:CS - Wednesday, May 29, 2024 - link

    The A17, M3, and M4 did not show much benefit from going to 3 nm. If Arm can do better, then good for them.
  • BGQ-qbf-tqf-n6n - Wednesday, May 29, 2024 - link

    The A17 was already 30% faster than the S8G3 in single-core scores. In the same Geekbench tests Arm is referring to, the M4 is 27% faster still.

    Presuming the X925's "36 percent faster" claim is relative to the X4, they'll still be behind the M3, much less the M4.
  • OreoCookie - Saturday, June 1, 2024 - link

    The speed-ups in single- and multi-core were significant. To my knowledge, the 10-core M4 is the fastest stock CPU tested in single-core performance (about 13% faster than Intel's Core i9-14900KS, which clocks up to 6.2 GHz stock). The M3 is about 6% behind the 14900KS. (I am unaware of, e.g., SPECmark results for the M4.)
  • mode_13h - Saturday, June 1, 2024 - link

    > I am unaware of, e.g., SPECmark results for the M4.

    I'm pretty sure nobody is testing that, since AnandTech stopped doing it (i.e., after Andrei left).
  • OreoCookie - Sunday, June 2, 2024 - link

    Yeah, and it seems nobody is doing it consistently across several generations. The best dissection of the M3 architecture I remember was by a Chinese YouTube channel, but nobody is carrying the baton. Maybe Ian and Andrei are doing this as part of their work for clients. (Andrei, I think, is working for Qualcomm now, isn't he?)
  • mode_13h - Monday, June 3, 2024 - link

    name99 would know what M3 analysis is out there. He wrote/compiled the Apple M1 explainer, a 300-page PDF with all the details, which you can find at:

    https://github.com/name99-org/AArch64-Explore/
