James discussed the Pentium 4 micro-architecture in his Pentium 4 notes article. For those of you who missed it, we'll go over the highlights of the micro-architecture once again.
The Pentium 4 is based on Intel's NetBurst micro-architecture. NetBurst doesn't mean anything special; it's just a marketing name given to the processor's collection of features. Here are the highlights of the Pentium 4:
Hyper Pipelined Technology
- Very deep pipeline to enable breakthrough clock rates
- Performance scalability and frequency headroom for the future
Advanced Dynamic Execution
- Can handle more than 100 instructions in flight
- Enables speculative execution with enhanced branch prediction algorithm
- Offers 128-byte cache lines
- Builds upon basic features found in the P6 core
Rapid Execution Engine
- An integer ALU (arithmetic logic unit) clocked at twice the frequency of the Pentium 4 processor decreases the latency and increases the throughput of basic integer operations
Execution Trace Cache
- Execution Trace Cache feeds the fast execution engine
- Removes IA-32 decoder from main loop and in turn removes the decoder pipeline latency
Enhanced Floating-Point/Multimedia Unit
- 128-bit Floating-Point/Multimedia execution port
- Separate 128-bit Floating-Point move and data store port (in addition to the integer store port)
New Streaming SIMD Extensions 2 Instructions
- Offer 144 new instructions - SIMD double-precision floating-point instructions, SIMD 128-bit integer instructions, conversion instructions between floating-point and integer data, and cacheability instructions
- Enable excellent performance of next-generation broadband services, such as interactive digital TV
A New System Bus Architecture With High Bandwidth For Performance Headroom
- 3.2GB/sec data transfer rate
- Split transaction, deeply pipelined bus
- 128-byte lines with 64-byte access
Compatibility with existing IA-32 applications and operating systems
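The 3.2GB/sec system bus figure in the list above can be reproduced with simple arithmetic. The sketch below assumes a 100MHz base bus clock, four transfers per clock and a 64-bit data path. None of those three parameters are stated in the list, so treat them as assumptions rather than as part of Intel's feature summary.

```python
# Assumed bus parameters (not given in the feature list above):
# a 100MHz base clock, four data transfers per clock, and a
# 64-bit (8-byte) wide data path.
BASE_CLOCK_HZ = 100_000_000
TRANSFERS_PER_CLOCK = 4
BUS_WIDTH_BYTES = 8


def bus_bandwidth_bytes_per_sec(clock_hz, transfers_per_clock, width_bytes):
    """Peak bandwidth = clock rate x transfers per clock x bytes per transfer."""
    return clock_hz * transfers_per_clock * width_bytes


bandwidth = bus_bandwidth_bytes_per_sec(
    BASE_CLOCK_HZ, TRANSFERS_PER_CLOCK, BUS_WIDTH_BYTES
)
print(f"{bandwidth / 1e9:.1f}GB/sec")  # prints 3.2GB/sec
```

This is a peak figure; sustained transfer rates in real workloads will be lower.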
Let's go over these features in a bit more depth, starting with the increased number of pipeline stages.
Hyper Pipelined Technology
As we mentioned in the introduction, each new processor generation has brought with it an increased number of pipeline stages - 20 in the Pentium 4 versus 10 in the Pentium III.
This is to be expected, as more stages allow the processor to reach higher clock frequencies. The downside is that less work is performed per stage. Quite simply, if a Pentium III and a Pentium 4 were available at the same clock speed, all else being equal, the Pentium III would be faster because it's performing more work per clock than the Pentium 4.
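To make the tradeoff concrete, here is a rough sketch that treats performance as work per clock multiplied by clock speed. The two work-per-clock values are hypothetical numbers picked only for illustration; they are not measurements of either processor.

```python
def instructions_per_second(work_per_clock, clock_mhz):
    """Throughput = instructions completed per clock x clocks per second."""
    return work_per_clock * clock_mhz * 1e6


# Hypothetical work-per-clock figures, chosen only to illustrate the idea.
SHORT_PIPE_WORK_PER_CLOCK = 1.0  # stand-in for a 10-stage design
DEEP_PIPE_WORK_PER_CLOCK = 0.8   # stand-in for a 20-stage design

SAME_CLOCK_MHZ = 1000

short_pipe = instructions_per_second(SHORT_PIPE_WORK_PER_CLOCK, SAME_CLOCK_MHZ)
deep_pipe = instructions_per_second(DEEP_PIPE_WORK_PER_CLOCK, SAME_CLOCK_MHZ)
print(short_pipe > deep_pipe)  # prints True: the shorter pipeline wins at equal clocks

# Clock speed the deeper pipeline needs just to match the shorter one.
break_even_mhz = (
    SAME_CLOCK_MHZ * SHORT_PIPE_WORK_PER_CLOCK / DEEP_PIPE_WORK_PER_CLOCK
)
print(f"{break_even_mhz:.0f}MHz")  # prints 1250MHz
```

Under these made-up numbers, the deeper pipeline only pulls ahead once it is clocked past the break-even point, which is the whole bet behind a design like this.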
Another disadvantage of the hyper pipelined micro-architecture is the performance penalty of mis-predicted branch instructions. When a branch instruction is mis-predicted, the processor has to throw away the work it did speculatively and start again at the beginning of the pipeline. The ten additional stages present in the Pentium 4 pipeline add to the amount of time it takes the processor to recover and get the correct instructions through to execution. Intel has come up with two clever ways to reduce this problem: an enhanced branch prediction algorithm (which lowers the number of mis-predicted branches) and the execution trace cache. We'll discuss both of these on the next page.
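A simple way to see why pipeline depth and prediction accuracy are tied together is to model the average cycles per instruction with a mis-prediction penalty added on. The sketch below assumes the penalty equals the full pipeline depth, and the base CPI, branch fraction and mis-prediction rate are all hypothetical values rather than Pentium III or Pentium 4 measurements.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction including branch mis-prediction stalls."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload parameters, for illustration only.
BASE_CPI = 1.0          # cycles per instruction with perfect prediction
BRANCH_FRACTION = 0.2   # share of instructions that are branches
MISPREDICT_RATE = 0.1   # share of branches that are mis-predicted

# Assumption: a mis-prediction costs one full trip through the pipeline.
cpi_10_stage = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 10)
cpi_20_stage = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, 20)

# Same 20-stage pipeline, but with a predictor that is wrong half as often.
cpi_20_stage_better = effective_cpi(
    BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE / 2, 20
)

print(f"{cpi_10_stage:.2f} {cpi_20_stage:.2f} {cpi_20_stage_better:.2f}")
# prints 1.20 1.40 1.20
```

In this toy model, doubling the pipeline depth doubles the cost of each mis-prediction, and halving the mis-prediction rate is what it takes to get back to where the shorter pipeline started.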