Intel Lunar Lake: New P-Core, Enter Lion Cove

Diving straight into the Performance, or P-Core commonly referred to, has had major architectural updates to increase power efficiency and performance. Bigger of these updates, Intel needed to comprehensively update its classic P-core cache hierarchy.

Key among these improvements is a significant overhaul of Intel's traditional P-core cache hierarchy. The fresh design for Lion Cove uses a multi-tier data cache containing a 48KB L0D cache with 4-cycle load-to-use latency, a 192KB L1D cache with 9-cycle latency, and an extended L2 cache that gets up to 3MB with 17-cycle latency. In total, this puts 240KB of cache within 9 cycles' latency of the CPU cores, whereas Redwood Cove before it could only reach 48KB of cache in the same period of time.

The data translation lookaside buffer (DTLB) has also been revised, increasing its depth from 96 to 128 pages to improve its hit rate.

Intel has also added a third Address Generation Unit (AGU)/Store Unit pair to further boost the performance of data write operations. Intel has also thrown more cache at the problem, and as CPU complexity grows, so does the reliance on the cache subsystems to keep them fed. Intel has also reworked the core-level cache subsystem by adding an intermediate data cache (IDC) between the 48 KB L1 and the L2 level. The original L1D cache is now called the L0 D-cache internally and retires to a 192 KB L1 D-cache.

The latest Lion Cove P-core design also includes a new front-end for handling instructions. The prediction block is 8x larger, fetch is wider, decode bandwidth is higher than on Raptor Cove, and there has been an enormous increase in Uops cache capacity and read bandwidth. The change in Uop queue capacity is designed to enhance the overall performance throughput.

The out-of-order engine in Lion Cove is partitioned in the footprint for Integer (INT) and Vector (VEC) domains Execution Domain with Independent renaming and scheduling. This type of partitioning allows for expandability in the future, independent growth of each domain, and benefits toward reduced power consumption for a domain-specific workload. The out-of-order engine is also improved, going from 6 to 8-wide allocation/rename and 8 to 12-wide retirement, with the deep instruction window increased from 512 to 576 entries and from 12 to 18 execution ports.

Lion Cove's integer execution units have also been improved over Raptor Cove, with execution resources grown from 5 to 6 integer ALUs, 2 to 3 jump units, and 2 to 3 shift units. Scaling from 1 to 3 units, these multiply 64x64 units to 64, which takes 3 units and gives even more compute power for the harder part of computation. Another significant development is transforming the P-core database from a 'sea of fubs' to a 'sea of cells.' This process of migrating the sub-organization of the P-cores structure from fubs to more organized cells essentially increases the density.

Intel has removed Hyper-Threading (HT) from their Lunar Lake SoC, with one potential reason being to enhance power efficiency and single-thread performance. By eliminating HT, Intel reduces power consumption and simplifies thermal management, which should extend battery life in ultra-thin notebooks. Intel does make a couple of claims regarding the Lion Cove P-cores, which are set to offer approximately 15% better performance-to-power and performance-to-area ratios than cores with HT. Intel's hybrid architecture, which effectively utilizes E-cores for multi-threaded tasks, reduces the need for HT, allowing workloads to be distributed more efficiently by the Intel Thread Director.

Power management has also been refined by including AI self-tuning controllers to replace the static thermal guard bands. This lets the system respond dynamically to real-time operating conditions in an adaptive way to achieve higher sustained performance. Intel also implements Lion Cove P-Core clock speeds at tighter 16.67MHz intervals rather than the traditional 100MHz. This means more accurate power management and finer tuning to squeeze as much from the power budget as possible.

Intel's Lion Cove P-Core microarchitecture looks like a nice upgrade over Golden Cove. Lion Cove incorporates improved memory and cache subsystems and better power management while not relying solely on opting for faster P-core frequencies to boost the IPC performance.

Intel Unveils Lunar Lake Architecture: Overview Intel Lunar Lake: New E-Core, Skymont Takes Flight For Peak Efficiency
Comments Locked

91 Comments

View All Comments

  • kwohlt - Tuesday, June 4, 2024 - link

    20A is best thought of as an internal only, early sampling of 18A for use on the Compute Tile.

    But LNL differs from ARL in that its compute tile also contains the iGPU and NPU, making 20A not an appropriate choice. 18A would've been the node Intel would've needed, but that's not until next year (coincidently, LNL's direct successor, PNL, will use 18A for it's unified compute tile instead of TSMC)
  • Blastdoor - Wednesday, June 5, 2024 - link

    Or we could take it to mean that intel reserved a lot of N3B capacity and so figured they might as well use it. Like Apple, they will probably be looking to get off of N3B ASAP. While Apple moves to N3E, Intel will leap ahead to A18.
  • The Hardcard - Wednesday, June 5, 2024 - link

    Barring newly announced delays, TSMC will hit volume on N2 in the same timeframe as volume on Intel A18. Apple’s move to N3E has happened. N2 in 2025.
  • rgreen1983 - Tuesday, June 4, 2024 - link

    "This uplift is noticed, especially in the betterment of its hyper-threading, whereby improved IPC by 30%, dynamic power efficiency improved by 20%, and previous technologies, in balancing, without increasing the core area, in a commitment of Intel to better performance, within existing physical constraints."

    So hyper threading is bother present and improved, yet they disabled it? This seems non sensical
  • meacupla - Tuesday, June 4, 2024 - link

    From what I have read and seen from other tech sites, Intel disabled HT because it wasn't working properly with E-cores.

    Disabling HT improves performance and efficiency, because the E-cores get utilized, instead of sitting idle on low power loads.
  • rgreen1983 - Tuesday, June 4, 2024 - link

    I'm not asking why they disabled HT, we've known they were going to disable HT for some time. Disabling HT out of the box doesn't do anything because we've always been able to disable HT ourselves. I'm asking why they improved it if they are going to disable it, why waste a bunch of transistors and die area on a disabled feature? And if maybe the decision came too late to be removed, why brag about a thing that isn't even enabled?
  • Drumsticks - Tuesday, June 4, 2024 - link

    Unfortunately, this feels like word salad from Anandtech. I won’t speculate how or why it was left in, or why Anandtech is quoting a 30% gain in IPC that is nowhere in Intel’s slides or on other tech website coverage.

    They didn’t improve hyperthreading and then disable it. They removed the feature completely, and netted the die area and power savings from doing so. They probably also took a MT loss, but the die area and power savings could have been redirected to either better usage of the area for more performance, or just direct cost and efficiency savings. Intel’s hyperthreading was always a really inefficient way to gain a small amount of performance anyways. The actual side, published on Techpowerup’s dive, says removing hyperthreading saved them 5% perf/power, 15% perf/area, and 15% perf/power/area. That slide doesn’t appear to be published on Anandtech.

    Essentially, they didn’t waste a bunch of transistors on a disabled feature - they did the obvious thing and physically removed the feature from the die. The description here is Anandtech’s fault, not Intel’s.
  • rgreen1983 - Tuesday, June 4, 2024 - link

    Thank you for your reply and the suggestion to check the techpowerup article. I would expect you are correct like the techpowerup article that HT was removed from the design and silicon, but I've also just read the pcworld lunar lake article which seems to suggest otherwise and amazingly has a slide not found in the techpowerup or anandtech articles.

    What I think might be going on is that lion cove still has HT in the design because Intel wants it for server chips, although I'd argue it's not necessary there either and by the looks of their recent all E core xeons the thread count sensitive clients should be running those anyways. That would explain why they might improve HT. If that is the case is there 2 lion cove designs, one with HT and another without? I just read the wccftech article which suggests this is the case, mentioning "variants" of lion cove.

    Since this lion cove core for lunar lake is being made at tsmc, it makes sense they had to make a new design for their fab anyways so maybe they did remove HT, and wccftech says they removed TSX and AMX also. So the lion cove for Intel fab coming to arrow lake/xeon might have HT, will definitely have TSX and AMX, but they might still turn HT off and only enable for xeon.

    Regardless yeah the anandtech mention of HT improvements here in relation to lunar lake seems off base. But I still think there is more Intel could clear up on HT status on die and whether there are multiple lion cove designs.
  • Drumsticks - Tuesday, June 4, 2024 - link

    I think they (techpowerup and pcworld) are both right. Per Tomshardware, commenting on Intel removing HT:

    "As such, Intel architected two versions of the Lion Cove core, one with and one without hyperthreading, so that the threaded Lion Cove core can be used in other applications, like we see in the forthcoming Xeon 6 processors."

    I expect the LNL physical design lacks HT, as that's the only way to actually get the performance/area and performance/watt savings. But we'll probably see the version of Lion Cove with hyper threading show up in the Xeon world (although, to be honest, I'm not sure if it's worth it there given the efficiency losses), as well as on Arrow Lake, where higher performance in exchange for an efficiency loss tends to be an acceptable tradeoff for PC Enthusiasts.

    The Tomshardware article also points out to me where Anandtech's article summary gets the 30% number: "Intel’s architects concluded that hyperthreading, which boosts IPC by ~30% in heavily threaded workloads, isn’t as relevant in a hybrid design that leverages the more power- and area-efficient E-cores for threaded workloads." - this is coupled with yet another slide that shows Intel quoting hyperthreading as a +30% throughput for +20% Cdyn.

    In other words, I think *lunar lake* does not feature hyperthreading - it's physically non-present in the design. Lion Cove the P-Core microarchitecture, on the other hand, has two designs - one with HT physically present (in Arrow Lake and any Xeon SKUs - speculation), and one without (in Lunar Lake only).

    On that note, it also implies two different Modules for e-core as well - one with the e-cores not present on the ring bus (in Lunar Lake) and one where it's connected to the ring bus like "normal"
    - this being the config in Alder Lake and Raptor Lake (and this is presumably coming in Arrow Lake higher power laptop SKUs and the desktop)
  • rgreen1983 - Wednesday, June 5, 2024 - link

    Thank you for indulging me in this detailed discussion. I think you are right there are 2 lion cove designs. I don't think all the news outlets are aware of it.

Log in

Don't have an account? Sign up now