Nehalem Architecture
Fundamentally, Nehalem is designed to be scalable. In Core i7 form, the chip has four processing cores, a triple-channel memory controller, a bi-directional Quick Path Interconnect (QPI) delivering up to 25.6GB/sec of bandwidth (12.8GB/sec in each direction), and 8MB of L3 cache. Server variants of Nehalem could have more cores, a larger L3 cache, and more QPI links (desktop chips feature one link), while mobile variants could have fewer cores with less cache and a dual-channel (rather than triple-channel) memory controller. Intel has indicated that it will even add graphics to the equation at some point, moving yet another feature off the system chipset and onto the CPU itself.
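The 25.6GB/sec figure is simple arithmetic. The sketch below assumes a link running at 6.4 gigatransfers per second and carrying 2 bytes of data per transfer in each direction; neither number is stated above, and the function name is purely illustrative.

```python
def qpi_bandwidth_gb_per_sec(giga_transfers_per_sec=6.4, bytes_per_transfer=2, directions=2):
    """Peak bandwidth of one QPI link in GB/sec (parameter values are assumptions)."""
    per_direction = giga_transfers_per_sec * bytes_per_transfer
    total = per_direction * directions
    return per_direction, total


per_direction, total = qpi_bandwidth_gb_per_sec()
print(f"{per_direction:.1f} GB/sec each way, {total:.1f} GB/sec total")
```

A slower link scales the same way: passing a lower transfer rate reduces both numbers proportionally.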
This modular design helps to reduce power consumption. Blocks such as the memory controller and QPI each run at their own voltage, independent of the others.
Intel has incorporated a number of changes into Nehalem that are designed to improve IPC (instructions per clock). For instance, the number of micro-ops (microinstructions) in flight has increased from 96 in Conroe/Penryn to 128 in Nehalem. Intel also increased the size of the load and store buffers to ensure that they wouldn’t become a limiting factor.
Intel also improved Nehalem’s branch prediction. A new second-level branch target buffer has been added to improve branch prediction in applications that have large code footprints, such as databases. This second predictor has a much larger history table, which should allow it to predict branches more accurately than the first-level predictor. Intel has also added a new renamed return stack buffer (RSB). RSBs store forward and return pointers associated with call and return instructions. The RSB should help Nehalem avoid return instruction mispredictions.
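To make the return stack idea concrete, here is a toy model of a plain return stack buffer. It does not model the renaming Intel added, and the 16-entry depth, the addresses, and all names are illustrative assumptions rather than details from Intel.

```python
class ReturnStackBuffer:
    """Toy fixed-depth return predictor (depth is an illustrative assumption)."""

    def __init__(self, depth=16):
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        # A call pushes the address of the instruction that follows it.
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # overflow: the oldest entry is lost
        self.entries.append(return_address)

    def predict_return(self):
        # A return pops the most recent entry; None means no prediction.
        return self.entries.pop() if self.entries else None


def count_mispredictions(call_depth, rsb_depth=16):
    """Nest call_depth calls, then unwind, counting wrong return predictions."""
    rsb = ReturnStackBuffer(rsb_depth)
    actual_stack = []
    for level in range(call_depth):
        address = 0x1000 + 4 * level  # made-up return addresses
        actual_stack.append(address)
        rsb.on_call(address)
    misses = 0
    while actual_stack:
        if rsb.predict_return() != actual_stack.pop():
            misses += 1
    return misses


print(count_mispredictions(8))   # nesting fits in the buffer
print(count_mispredictions(20))  # nesting deeper than the buffer
```

In this model every return is predicted correctly until the call nesting exceeds the buffer depth, at which point the oldest return addresses have been overwritten.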
Nehalem has also been tweaked to handle threaded software better, thanks to faster synchronization primitives.
Speaking of threading, with Nehalem we see the resurgence of simultaneous multi-threading (Hyper-Threading). With Hyper-Threading, one processing core can run two threads at the same time. With four processing cores inside Core i7, the OS “sees” eight logical processors and can schedule eight threads on the CPU at once, effectively doubling the number of threads that Nehalem can run simultaneously compared with a conventional quad-core CPU.
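The count the OS reports is just cores multiplied by hardware threads per core. A minimal sketch follows; the helper name is hypothetical, while os.cpu_count() is the standard-library call that returns the number of logical processors on whatever machine runs the script.

```python
import os


def logical_processors(physical_cores, threads_per_core=2):
    """Logical processors the OS can schedule onto."""
    return physical_cores * threads_per_core


print(logical_processors(4))  # a four-core chip with two threads per core
print(os.cpu_count())         # what this machine's OS actually reports
```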
Whereas Hyper-Threading (HT) never really took off on the Pentium 4, Intel feels that Nehalem has a distinct HT advantage thanks to its larger cache and greater memory bandwidth, both of which should allow it to deliver better HT performance. There are also more apps capable of taking advantage of HT than there were a few years ago. As you’ll see in our Lost Planet, Cinebench, and Valve benchmarks, Nehalem delivers a significant performance increase in HT-aware apps.
New cache subsystem
While Nehalem has the same 32KB instruction/32KB data L1 cache configuration as previous Core 2 CPUs, Intel has totally revamped the L2 cache and added a new L3 cache.
Nehalem’s L2 cache is much smaller than Penryn’s. Each core has its own 256KB L2 cache for data and instructions. While this is significantly less than in previous processors, Nehalem’s L2 has lower latency than its predecessors’.
In addition to the L1 and L2 caches, Nehalem, like AMD’s Phenom, also features an L3 cache that is shared across all the cores. Unlike Phenom’s, however, Nehalem’s L3 is inclusive rather than exclusive. Intel feels that this inclusive architecture gives it an advantage over AMD, because an exclusive L3 doesn’t hold copies of the data in the lower-level L1 and L2 caches. As a result, if a data request misses in an exclusive L3 cache, each processor core must be snooped (searched) in case its L1 or L2 cache has the requested data. This increases latency and snoop traffic between the cores.
With Nehalem these snoops are unnecessary: if a request misses in the L3, the CPU already knows that the data doesn’t reside in any core’s L1 or L2. This reduces latency, which improves performance, and also cuts power consumption.
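The difference can be reduced to a few lines. This is a deliberately simplified sketch of the reasoning above, not a description of either vendor’s coherence protocol; the function name and the four-core default are assumptions for illustration.

```python
def cores_to_snoop_on_l3_miss(l3_is_inclusive, requesting_core, core_count=4):
    """Cores whose private L1/L2 caches must be checked after an L3 miss."""
    if l3_is_inclusive:
        # An inclusive L3 holds a copy of every line in any L1 or L2,
        # so an L3 miss proves that no core has the line.
        return []
    # With an exclusive L3 the line may live only in another core's L1/L2.
    # The requesting core has already missed in its own caches.
    return [core for core in range(core_count) if core != requesting_core]


print(cores_to_snoop_on_l3_miss(True, requesting_core=0))
print(cores_to_snoop_on_l3_miss(False, requesting_core=0))
```

The trade-off is capacity: an inclusive L3 spends part of its space duplicating lines that already sit in the smaller caches.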
Like its two-level branch prediction, Nehalem features a two-level translation lookaside buffer (TLB), with a new 512-entry second level backing the first-level TLB. This is another change Intel has made to improve Nehalem’s performance with server apps like large databases.
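A two-level TLB behaves much like a two-level cache for address translations. The following toy model shows the lookup order; only the 512-entry second level comes from the text above, while the 64-entry first level, the LRU replacement policy, and the dictionary standing in for a page table are assumptions for illustration.

```python
from collections import OrderedDict


class TwoLevelTLB:
    """Toy two-level TLB with LRU replacement (sizes partly assumed)."""

    def __init__(self, l1_entries=64, l2_entries=512):
        self.l1 = OrderedDict()
        self.l2 = OrderedDict()
        self.l1_entries = l1_entries
        self.l2_entries = l2_entries

    @staticmethod
    def _insert(level, capacity, page, frame):
        level[page] = frame
        level.move_to_end(page)
        if len(level) > capacity:
            level.popitem(last=False)  # evict the least recently used entry

    def translate(self, page, page_table):
        if page in self.l1:
            self.l1.move_to_end(page)
            return self.l1[page], "L1 hit"
        if page in self.l2:
            frame = self.l2[page]
            self.l2.move_to_end(page)
            self._insert(self.l1, self.l1_entries, page, frame)
            return frame, "L2 hit"
        frame = page_table[page]  # slow path: walk the page table
        self._insert(self.l2, self.l2_entries, page, frame)
        self._insert(self.l1, self.l1_entries, page, frame)
        return frame, "walk"


page_table = {page: page + 1000 for page in range(2048)}
tlb = TwoLevelTLB()
for page in list(range(100)) + [0]:
    frame, outcome = tlb.translate(page, page_table)
print(outcome)  # page 0 fell out of the first level but is still in the second
```

The second level pays off when a working set is too large for the first level but still fits in 512 entries, which is the situation large server applications tend to create.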
SSE4
Nehalem is Intel’s first CPU to offer SSE4.2 support. The new instruction set adds seven application-targeted accelerators. Four of them improve performance in string and text processing operations; one example Intel provides is parsing XML files at a much higher speed. Two others are focused on accelerated searching and pattern recognition in large data sets (useful for voice/handwriting recognition), and the seventh is a CRC instruction aimed at communications workloads such as accelerated network attached storage.
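For a sense of what the hardware replaces, here is a software sketch of two of these operations. It assumes the CRC instruction computes CRC-32C (the Castagnoli polynomial) and that population count (POPCNT) is one of the searching instructions; the Python functions only model the results and do not invoke the instructions themselves.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C using the reflected Castagnoli polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def popcount(value: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")


print(hex(crc32c(b"123456789")))  # standard CRC-32C check value: 0xe3069283
print(popcount(0b10110010))       # 4
```

The software loop spends eight iterations per input byte, which is the kind of per-byte work a dedicated instruction is meant to eliminate.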