Each Telum package consists of two 7 nm processors with eight cores and sixteen threads that run at a base clock frequency of over 5 GHz. A typical system has a total of sixteen of these chips, arranged in four-socket "drawers."
From the perspective of a traditional x86 computing enthusiast – or professional – mainframes are strange, archaic beasts. They are physically enormous, power hungry, and expensive compared to commodity data center equipment, and they generally offer less compute per rack at a higher cost.
This raises the question, "Why keep using mainframes, then?" Once you get past the cynical answers like "because we've always done it this way," the practical answers largely come down to reliability and consistency. As AnandTech's Ian Cutress points out in a speculative article focusing on Telum's redesigned cache, downtime on these [IBM Z] systems "is measured in milliseconds per year." (If that's accurate, it's at least seven nines.)
IBM’s own announcement of the Telum indicates how different the priorities of mainframe and commodity computing are. It casually describes Telum’s memory interface as “capable of tolerating complete channel or DIMM failures and designed to transparently recover data without affecting response time”.
If you pull a DIMM from an active x86 server, that server does not "transparently recover data" – it simply crashes.
IBM Z-series architecture
Telum is designed to be something of a one-chip-to-rule-them-all for mainframes, replacing a much more heterogeneous arrangement found in earlier IBM mainframes.
The 14 nm IBM z15 CPU that Telum replaces features five processors in total – two pairs of 12-core Compute Processors and one System Controller. Each Compute Processor hosts 256 MiB of L3 cache shared among its 12 cores, while the System Controller hosts a whopping 960 MiB of L4 cache shared among the four Compute Processors.
Five of these z15 processors – four Compute Processors and one System Controller – constitute a "drawer." Four drawers come together in a single z15-powered mainframe.
Although the concept of multiple processors in one drawer and multiple drawers in one system remains, the architecture within Telum itself is radically different – and vastly simplified.
At first glance, Telum is quite a bit simpler than z15 – it's an eight-core processor built on Samsung's 7 nm process, with two processors combined in each package (similar to AMD's chiplet approach for Ryzen). There is no separate System Controller processor – all Telum processors are identical.
From there, four Telum CPU packages combine into a four-socket "drawer," and four of those drawers go into a single mainframe system. That works out to a total of 256 cores on 32 chips. Each core runs at a base clock above 5 GHz – providing more predictable and consistent latency for real-time transactions than a lower base with a higher turbo rate would.
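The core and chip totals above follow directly from the per-package figures. A quick sketch of the arithmetic (the figures come from the article itself, not from an IBM spec sheet):

```python
# Purely illustrative arithmetic -- figures as stated in the article.
cores_per_chip = 8
chips_per_package = 2      # two 7 nm dies in each Telum package
packages_per_drawer = 4    # four sockets per "drawer"
drawers_per_system = 4

chips_per_system = chips_per_package * packages_per_drawer * drawers_per_system
cores_per_system = chips_per_system * cores_per_chip

print(chips_per_system, cores_per_system)  # 32 256
```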
Pockets full of cache
Eliminating the central System Controller from each package also meant redesigning Telum's cache – the huge 960 MiB L4 cache is gone, as is the per-chip shared L3. In Telum, each individual core has a private 32 MiB L2 cache – and that's it. There is no hardware L3 or L4 cache at all.
This is where things get deeply weird – while each Telum core's 32 MiB L2 cache is technically private, it's really only virtually private. When a line is evicted from one core's L2 cache, the processor looks for empty space in the other cores' L2s. If it finds some, the L2 cache line evicted from core x is tagged as an L3 cache line and stored in core y's L2.
OK, so we have a virtual, shared L3 cache of up to 256 MiB on each Telum processor, composed of the 32 MiB "private" L2 caches on each of its eight cores. From there, things go one step further – that 256 MiB of shared "virtual L3" on each processor can in turn be used as shared "virtual L4" by all processors in a system.
Telum's "virtual L4" works much the same way as its "virtual L3" – L3 cache lines evicted from one processor look for a home on a different processor. If another processor in the same Telum system has spare room, the evicted L3 cache line is re-tagged as L4 and lives in the virtual L3 on that other processor (which is composed of the "private" L2s of its eight cores).
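The two-step eviction flow described above can be sketched as a toy model: an evicted L2 line first tries sibling cores on the same chip (becoming "virtual L3"), then cores on other chips (becoming "virtual L4"). Every name, capacity, and policy here is invented for illustration; the real hardware's replacement and coherence machinery is vastly more sophisticated.

```python
# Toy model of Telum's virtual L3/L4 eviction scheme (illustrative only).
L2_SLOTS = 2  # tiny capacity for demonstration; real L2s are 32 MiB

class Core:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> tag: "L2", "L3", or "L4"

    def has_room(self):
        return len(self.lines) < L2_SLOTS

def relocate_evicted_line(addr, source, chip, system):
    """Find a new home for a line evicted from `source`'s L2."""
    # Same chip, different core: the line becomes a virtual L3 line.
    for core in chip:
        if core is not source and core.has_room():
            core.lines[addr] = "L3"
            return core
    # Another chip in the system: the line becomes a virtual L4 line.
    for other in system:
        if other is chip:
            continue
        for core in other:
            if core.has_room():
                core.lines[addr] = "L4"
                return core
    return None  # no room anywhere; the line falls back to main memory

chip0 = [Core("c0"), Core("c1")]
chip1 = [Core("c2"), Core("c3")]
system = [chip0, chip1]

home = relocate_evicted_line(0x100, chip0[0], chip0, system)
print(home.name, home.lines[0x100])  # c1 L3

chip0[1].lines = {0x200: "L2", 0x300: "L2"}  # sibling core is now full
home = relocate_evicted_line(0x400, chip0[0], chip0, system)
print(home.name, home.lines[0x400])  # c2 L4
```

The point of the sketch is simply that the "level" of a cache line is a tag that changes as it migrates, rather than a property of dedicated L3 or L4 silicon.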
AnandTech's Ian Cutress takes a deeper look at Telum's cache mechanisms. He eventually sums them up by answering "How is this possible?" with a simple "magic."
AI inference acceleration
IBM's Christian Jacobi briefly outlines Telum's AI acceleration in this two-minute clip.
Telum also introduces a new on-chip inference accelerator rated at 6 TFLOPS. Among other things, it's intended to enable real-time fraud detection during financial transactions (as opposed to shortly after the transaction).
In its quest for maximum performance and minimum latency, IBM is threading several needles. The new inference accelerator sits on-chip, allowing lower-latency connections between the accelerator and the CPU cores – but it isn't built into the cores themselves, as Intel's AVX-512 instruction set is.
The problem with in-core inference acceleration like Intel's is that it typically limits the AI processing power available to any single core. A Xeon core executing an AVX-512 instruction has only its own core's hardware available, so larger inference jobs must be split across several Xeon cores to extract the full available performance.
Telum's accelerator is on-die but off-core. This allows a single core to run inference workloads with the might of the entire on-die accelerator – not just a fraction of it residing within that core.
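The shared-versus-per-core trade-off can be captured in a back-of-the-envelope model. Only the 6 TFLOPS accelerator figure comes from the article; the hypothetical per-core throughput and the flat coordination penalty are invented purely to illustrate the shape of the comparison:

```python
# Illustrative latency model, not real benchmark data.
ACCEL_TFLOPS = 6.0       # Telum's shared on-die accelerator (per the article)
PER_CORE_TFLOPS = 0.75   # hypothetical in-core unit (6.0 / 8 cores, assumed)

def time_shared(work_tflop):
    # On Telum, a single core can drive the entire on-die accelerator.
    return work_tflop / ACCEL_TFLOPS

def time_in_core(work_tflop, cores_used):
    # In-core units must split larger jobs across cores to scale, which
    # adds coordination cost (modeled here as a flat 20% penalty).
    overhead = 1.2 if cores_used > 1 else 1.0
    return work_tflop / (PER_CORE_TFLOPS * cores_used) * overhead

print(time_shared(6.0))      # 1.0 -- one core, full accelerator
print(time_in_core(6.0, 1))  # 8.0 -- one core, one in-core unit
print(time_in_core(6.0, 8))  # 1.2 -- eight cores, plus coordination cost
```

Under these made-up numbers, a single Telum core matches what eight in-core units could only approach by coordinating across cores – which is the latency argument the article is making.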
Listing image by IBM