Introduction

I’ll never forget when Intel first announced that the name for the successor to the 486 would be “Pentium.” I and most of my fellow computer nerds thought the name was silly and not suitably geeky. Everyone knew that computer components were supposed to have names with numbers in them; after all, Star Wars droids, Star Trek ships, software versions, CompuServe e-mail addresses, and every other kind of computer-related thing you could think of had a moniker consisting of some mix of numbers and letters. So what’s with a name that vaguely suggests the concept of “fiveness,” but would be more appropriate for an element or a compound? To this day, I still have no idea who or what was responsible for the name “Pentium,” but I suppose it no longer matters.

A question that’s still worth asking, though, is why the Pentium name has stuck around as the brand name for Intel’s main processor product line through no fewer than four major architectural changes. In a nutshell, the answer is that the Pentium brand name, having somehow made the transition from the original Pentium architecture to the radically different Pentium Pro (or P6) architecture, became synonymous with the most successful desktop microprocessor architecture of all time. In fact, in its heyday “Pentium” became virtually synonymous with “PC.”

This series of articles takes a look at the consumer desktop processors that have borne the Pentium name, beginning with the original Pentium up through today’s Pentium 4 (Prescott) and Pentium M divisions. The overview is general enough that for the most part it should be accessible to the nonspecialist, and it should give you a sense of the major differences between each generation of Pentiums. In keeping with the Ars tag line, the article does not attempt to tell you everything about every iteration of the Pentium; instead, it covers only what you need to know.

The original Pentium

Pentium Vitals Summary Table
Introduction date: March 22, 1993
Process: 0.8 micron
Transistor count: 3.1 million
Clock speed at introduction: 60 and 66 MHz
Cache sizes: L1: 8K instruction, 8K data
Features: MMX added in 1997

The original Pentium is an extremely modest design by today’s standards, and when it was introduced in 1993 it wasn’t exactly a blockbuster by the standards of its RISC contemporaries, either. While its superscalar design (Intel’s first) certainly improved on the performance of its predecessor, the 486, the main thing that the Pentium had going for it was x86 compatibility. In fact, Intel’s decision to make enormous sacrifices of performance, power consumption, and cost for the sake of maintaining the Pentium’s backwards compatibility with legacy x86 code was probably the most strategically important decision that the company has ever made.

The choice to continue along the x86 path inflicted some serious short- and medium-term pain on Intel, and a certain amount of long-term pain on the industry as a whole (how much pain depends on who you talk to), but as we’ll see, the negative impact of this critical move has gradually lessened over time.

The Pentium’s two-issue superscalar architecture was fairly straightforward. It had two five-stage integer pipelines, which Intel designated U and V, and one six-stage floating-point pipeline. The chip’s front-end could do dynamic branch prediction, but as we’ll learn in a moment, most of its front-end resources were spent on maintaining backwards compatibility with the x86 architecture.

Figure 1: Pentium architecture

The Pentium’s U and V integer pipes were not fully symmetric. U, as the default pipe, was slightly more capable and contained a shifter, which V lacked. The two pipelines weren’t fully independent, either; there was a set of restrictions, which I won’t waste anyone’s time outlining, that placed limits on which combinations of integer instructions could be issued in parallel. All told, though, the Pentium’s two integer pipes provided solid enough integer performance to be competitive, especially for integer-intensive office apps.

Floating-point, however, simply went from awful on the 486 to just mediocre with the Pentium: an improvement, to be sure, but not enough to make it even remotely competitive with comparable RISC chips on the market at that time. First off, you could only issue both a floating-point and an integer operation simultaneously under extremely restrictive circumstances. This wasn’t too bad, because floating-point and integer code are rarely mixed. The killer, though, was the unfortunate design of the x87 stack-based floating-point architecture.
I’ve covered in detail the problems related to x87 before, so I won’t repeat that here. Modern x86 architectures have workarounds, like rename registers and a “free” FXCH instruction, for alleviating, but not eliminating, the performance disadvantages of x87’s register-starved (only eight architectural registers) and stack-based architecture. The Pentium, however, had none of these, so it suffered mightily compared to its RISC competitors. In the days before the rise of PC gaming, though, when most Pentium purchasers just wanted to run DOS spreadsheet and word-processing applications, this didn’t really matter too much. It simply kept the Pentium out of the scientific/workstation market, and relegated it to the growing home and business markets.

The Pentium’s pipeline

The Pentium’s basic integer pipeline is five stages long, with the stages broken down as follows:

1. Prefetch/Fetch: Instructions are fetched from the instruction cache and aligned in prefetch buffers for decoding.
2. Decode1: Instructions are decoded into the Pentium’s internal instruction format. Branch prediction also takes place at this stage.
3. Decode2: Decoding continues as in the previous stage, and the microcode ROM kicks in here if necessary. Address computations also take place at this stage.
4. Execute: The integer hardware executes the instruction.
5. Write-back: The results of the computation are written back to the register file.

The main difference between the Pentium’s five-stage pipeline and the four-stage pipelines prevalent at the time lies in the second decode stage. RISC ISAs support only simple addressing modes, but x86’s multiple complex addressing modes (which were originally designed to make assembly language programmers’ lives easier but ended up making everyone’s lives more difficult) require extra address computations. These computations are relegated to the second decode stage, where dedicated address computation hardware handles them before dispatching the instruction to the execution units.
The take-home message here is that if it weren’t for the vagaries of x86, the second decode stage would not be necessary and the pipeline length would be reduced by a fifth.

x86 overhead on the Pentium

The second decode stage isn’t the only place where legacy x86 support added significant overhead to the Pentium design. According to an MPR article published at the time (see bibliography), Intel estimated that a whopping 30% of the Pentium’s transistors were dedicated solely to providing x86 legacy support. When you consider the fact that the Pentium’s RISC competitors with comparable transistor counts could spend those transistors on performance-enhancing hardware like execution units and cache, it’s no wonder that the Pentium lagged behind.

A large chunk of the Pentium’s legacy-supporting transistors were eaten up by the Pentium’s microcode ROM. If you read my old RISC vs. CISC article, then you know that one of the big benefits of RISC processors was that they didn’t need the microcode ROMs that CISC designs required for decoding large, complex instructions.

The front-end of the Pentium also suffered from x86-related bloat in that its prefetch logic had to take account of the fact that x86 instructions are not a uniform size and hence could straddle cache lines. The Pentium’s decode logic also had to support x86’s segmented memory model, which meant checking for and enforcing code segment limits; such checking required its own dedicated address calculation hardware, in addition to the Pentium’s other dedicated address hardware. Furthermore, all of the Pentium’s dedicated address hardware needed four input ports, which again spelled more transistors spent.

So to summarize, the Pentium’s entire front-end was bloated and distended with hardware that was there solely to support x86 (mis)features which were rapidly falling out of use.
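
To make the address-calculation overhead concrete, here is a minimal Python sketch of the arithmetic such hardware has to perform for a single memory operand. It is an illustration only. Reading the four adder inputs as segment base, base register, scaled index, and displacement is an assumption (the article does not spell them out), the limit check is simplified to the first byte of the access, and the names `Segment`, `SegmentLimitError`, `effective_offset`, and `linear_address` are hypothetical.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF  # addresses wrap at 32 bits in this sketch


class SegmentLimitError(Exception):
    """Raised when an offset falls outside its segment (hypothetical name)."""


@dataclass
class Segment:
    base: int   # linear address at which the segment starts
    limit: int  # highest valid offset inside the segment


def effective_offset(base: int, index: int = 0, scale: int = 1, disp: int = 0) -> int:
    """Offset within the segment: base + index * scale + displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factors are 1, 2, 4, or 8")
    return (base + index * scale + disp) & MASK32


def linear_address(seg: Segment, base: int, index: int = 0,
                   scale: int = 1, disp: int = 0) -> int:
    """Check the offset against the segment limit, then add the segment base.

    Assumption: this models the "four inputs" as segment base, base register,
    scaled index, and displacement.
    """
    offset = effective_offset(base, index, scale, disp)
    if offset > seg.limit:
        raise SegmentLimitError(f"offset {offset:#x} exceeds limit {seg.limit:#x}")
    return (seg.base + offset) & MASK32


if __name__ == "__main__":
    data = Segment(base=0x10000, limit=0xFFFF)
    # Roughly what an operand like [ebx + esi*4 + 8] asks the hardware to do.
    print(hex(linear_address(data, base=0x100, index=4, scale=4, disp=8)))
```

A load/store RISC design of the period typically needed only a two-input add (register plus immediate or register) for the same job, which is the contrast the article is drawing.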
With transistor budgets as tight as they were, each of those extra address adders and prefetch buffers (not to mention the microcode ROM) represented a painful expenditure of scarce resources that did nothing to enhance performance.

Fortunately for Intel, this wasn’t the end of the story. There were a few facts and trends working in the favor of Intel and the x86 ISA. If we momentarily forget about ISA extensions like MMX, SSE, etc. and the odd handful of special-purpose instructions like CPUID that get added to the x86 ISA every so often, the core legacy x86 ISA is fixed in size and has not grown over the years; similarly, with one exception (the P6, covered below), the amount of hardware that it takes to support such instructions has not tended to grow either. Transistors, on the other hand, have shrunk rapidly since the Pentium was introduced. When you put these two facts together, this means that the relative cost (in transistors) of x86 support, a cost that is mostly concentrated in an x86 CPU’s front-end, has dropped as CPU transistor counts have increased.

Today, x86 support accounts for well under 10% of the transistors on the Pentium 4, a drastic improvement over the original Pentium, and one that has contributed significantly to the ability of x86 hardware to catch up to and even surpass its RISC competitors in both integer and floating-point performance. In other words, Moore’s Curves have been extremely kind to the x86 ISA.
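
The arithmetic behind this trend can be sketched in a few lines of Python. The 3.1 million transistor count and the 30% estimate for the original Pentium are the figures quoted above. The larger transistor budgets are made-up round numbers chosen for illustration, not actual product figures. Holding the absolute cost flat is the simplification the argument itself makes, with the P6 as the stated exception.

```python
# Figures quoted in the article for the original Pentium.
PENTIUM_TRANSISTORS = 3_100_000
PENTIUM_LEGACY_SHARE = 0.30

# Hypothetical total transistor budgets, for illustration only.
HYPOTHETICAL_BUDGETS = (3_100_000, 10_000_000, 50_000_000, 100_000_000)


def legacy_share(legacy_transistors: float, total_transistors: float) -> float:
    """Fraction of a chip's transistors consumed by x86 legacy support."""
    return legacy_transistors / total_transistors


# Assumption: the absolute cost of legacy support stays at the Pentium's level.
legacy_cost = PENTIUM_TRANSISTORS * PENTIUM_LEGACY_SHARE  # about 930,000

if __name__ == "__main__":
    for total in HYPOTHETICAL_BUDGETS:
        share = legacy_share(legacy_cost, total)
        print(f"{total:>11,} transistors: {share:6.1%} spent on x86 support")
```

Under these assumptions the share drops below 10% once the total budget passes roughly 9.3 million transistors, which is the shape of the trend described above even though the real numbers differ from chip to chip.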

The P6 architecture

| | Pentium Pro vitals | Pentium II vitals | Pentium III vitals |
|---|---|---|---|
| Introduction date | November 1, 1995 | May 7, 1997 | February 26, 1999 |
| Process | 0.60/0.35 micron | 0.35 micron | 0.25 micron |
| Transistor count | 5.5 million | 7.5 million | 9.5 million |
| Clock speed at introduction | 150, 166, 180, and 200 MHz | 233, 266, and 300 MHz | 450 and 500 MHz |
| L1 cache size | 8K instruction, 8K data | 16K instruction, 16K data | 16K instruction, 16K data |
| L2 cache size | 256K or 512K (on-die) | 512K (off-die) | 512K (on-die) |
| Features | No MMX | MMX | MMX, SSE, processor serial number |

Intel’s P6 architecture, first instantiated in the Pentium Pro, was by any reasonable metric a resounding success. Its performance was significantly better than that of the Pentium, and the market rewarded Intel handsomely for it. The architecture also proved extremely scalable, furnishing Intel with a good half-decade of desktop dominance and paving the way for x86 systems to compete with RISC in the workstation and server markets.

Figure 2: Pentium Pro architecture

What was the P6’s secret, and how did it offer such a quantum leap in performance? The answer is complex and involves the contribution of numerous technologies and techniques, the most important of which had already been introduced into the x86 world by Intel’s smaller x86 competitors (most notably, AMD’s K5): the decoupling of the front-end’s fetching and decoding functions from the back-end’s execution function, by means of an instruction window.

Decoupling the front end from the back end

In the Pentium and its predecessors, instructions traveled directly from the decoding hardware to the execution hardware. As noted above, the Pentium had some hardwired rules for dictating which instructions could go to which execution units and in what combinations, so once the instructions were decoded then the rules took over and the dispatch logic shuffled them off to the proper execution unit. In fact, you probably noticed the box marked “Control Unit” in my original Pentium diagram. The control unit is responsible for implementing and executing the rules that decide which instructions go where, and in what combinations.

This static, rules-based approach is rigid and simplistic, and it has two major drawbacks, both stemming from the fact that though the code stream is inherently sequential, a superscalar processor attempts to execute parts of it in parallel:

1. It adapts poorly to the dynamic and ever-changing code stream, and
2. It would make poor use of wider superscalar hardware.

See, since the Pentium is a two-issue machine (i.e., it can issue at most two operations simultaneously from its decode hardware to its execution hardware on each clock cycle), its dispatch rules look at only two instructions at a time to see if they can or cannot be dispatched simultaneously. If more execution hardware were added, and the issue width were increased to three instructions per cycle (as it is in the P6), then the rules determining which instructions go where would need to be able to account for various possible combinations of two and three instructions at a time, in order to get those instructions to the right execution unit at the right time. Furthermore, such rules would inevitably be difficult for coders to optimize for, and if they weren’t to be overly complex then there would necessarily exist many common instruction sequences that would perform suboptimally under the default rule set.
Or, in plain English, the makeup of the code stream would change from application to application and from moment to moment, but the rules responsible for scheduling the code stream’s execution would be forever fixed.

The reservation station

The solution to the above dilemma is to place the newly decoded instructions in a buffer, and then issue them to the execution core whenever they’re ready to be executed, even if that means executing them not just in parallel but in reverse order. This way, the current context in which a particular instruction finds itself executing can have much more of an impact on when and how it’s executed. In replacing the control unit with a buffer, the P6 core replaces fixed rules with flexibility.

There are a number of factors that can prevent an instruction from executing out-of-order in this manner: the instruction may depend for input on the results of an as-yet unexecuted instruction, or it may be waiting on data to be loaded from memory, or it may be waiting for a busy execution unit to become available, or any one of a number of other conditions may need to be met before the decoded instruction is ready to be sent off to the proper execution unit. The P6 architecture feeds each decoded instruction into a buffer called the reservation station (RS), where it waits until all of its execution requirements are met. Once they’re met, the instruction then moves out of the reservation station into the proper execution unit, where it executes.

A glance at the P6 core diagram will show that up to three instructions per cycle can move from the decoders into the reservation station. And as we’ll see below, up to five instructions per cycle can move from the reservation station into the execution units. Thus the Pentium’s original superscalar design, in which two instructions per cycle could move from the decoders into the execution core, has been replaced with a buffered design in which three instructions can move into the buffer and five instructions can move out of it on any given cycle. This buffering action, and the decoupling of the front end’s fetch/decode bandwidth from the back end’s execution bandwidth that it enables, are at the heart of the P6’s performance gains.

The reorder buffer

Those of you who have closely examined the P6 diagram above have probably noticed that I left out a step in the preceding paragraph. After the instructions are decoded, they must travel through the reorder buffer (ROB) before flowing into the reservation station. The ROB is like a large log book in which the P6 can record all the essential information about each instruction that enters the execution core.
The primary function of the ROB is to ensure that instructions come out one end of the out-of-order execution core in the same order in which they entered it. In other words, it’s the reservation station’s job to see that instructions are executed in the best possible order, even if that means executing them out of program order, and it’s the reorder buffer’s job to ensure that the finished instructions get put back in program order and that their results are written to the architectural register file in the proper sequence. To this end, the ROB stores data about each instruction’s status, operands, register needs, original place in the program, etc.

So newly decoded instructions flow into the ROB, where their relevant information is logged in one of 40 available entries. From there, they pass on to the reservation station, and then on to the execution core. Once they’re done executing, their results go back to the ROB where they’re stored until they’re ready to be written back to the architectural registers. This final write-back, which is called retirement and which permanently alters the programmer-visible machine state, cannot happen until all of the instructions prior to the newly finished instruction have written back their results, a requirement which is necessary for maintaining the appearance of sequential execution.

The reservation station and the reorder buffer together make up the brains of the P6’s out-of-order execution core, and they account for its drastic clock-for-clock performance advantage over the original Pentium. These two buffers, the one for reshuffling and optimizing the code stream (the RS) and the other for unshuffling and reordering the code stream (the ROB), enable the P6 processor to dynamically and intelligently adapt its operation to fit the needs of the ever-changing code stream.
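
The interplay between the two buffers can be illustrated with a toy Python simulation. This is a deliberately simplified sketch, not a model of the real P6. The widths (three in, five out) and buffer sizes (40 ROB entries, 20 RS entries) are the figures given in this article, but everything else is an assumption made for illustration: the `Instr` record, the `simulate` function, the single-number latencies, the absence of per-port restrictions, and unlimited retirement bandwidth. Each instruction's `deps` must name instructions that appear earlier in the program.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    deps: tuple = ()   # names of earlier instructions whose results are needed
    latency: int = 1   # cycles spent in an execution unit (illustrative)


def simulate(program, decode_width=3, issue_width=5, rob_size=40, rs_size=20):
    """Return (issue_order, retire_order, cycles) for a list of Instr."""
    pending = deque(program)  # decoded instructions, still in program order
    rob = deque()             # program-order log of in-flight instructions
    rs = []                   # instructions waiting for operands
    executing = {}            # name -> cycle on which its result is available
    done = set()              # names whose results are available
    issue_order, retire_order = [], []
    cycle = 0
    while pending or rob:
        cycle += 1
        # 1. Results that finish this cycle become visible.
        for name, finish in list(executing.items()):
            if finish <= cycle:
                done.add(name)
                del executing[name]
        # 2. Retire strictly in program order: only the oldest entry may leave.
        while rob and rob[0].name in done:
            retire_order.append(rob.popleft().name)
        # 3. Issue ready instructions, oldest first, ignoring program order.
        ready = [i for i in rs if all(d in done for d in i.deps)][:issue_width]
        for instr in ready:
            rs.remove(instr)
            executing[instr.name] = cycle + instr.latency
            issue_order.append(instr.name)
        # 4. Log newly decoded instructions in the ROB and park them in the RS.
        for _ in range(decode_width):
            if not pending or len(rob) >= rob_size or len(rs) >= rs_size:
                break
            instr = pending.popleft()
            rob.append(instr)
            rs.append(instr)
    return issue_order, retire_order, cycle


program = [
    Instr("load", latency=3),    # slow: think of a load that takes a while
    Instr("add", deps=("load",)),  # must wait for the load
    Instr("xor"),                # independent
    Instr("sub"),                # independent
]

if __name__ == "__main__":
    issued, retired, cycles = simulate(program)
    print("issue order: ", issued)
    print("retire order:", retired)
```

In this run the independent `xor` and `sub` issue ahead of the stalled `add`, yet all four instructions retire in their original program order, which is the division of labor between the RS and the ROB described above.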

The instruction window

A common metaphor for thinking about and talking about the P6’s RS + ROB combination, or analogous structures on other processors, is that of an instruction window. The P6’s ROB can track up to 40 instructions in various stages of execution, and its reservation station can hold and examine up to 20 instructions to determine the optimal time for them to execute. You can think of the reservation station’s 20-instruction buffer as a window that moves along the sequentially ordered code stream; on any given cycle, the P6 is looking through this window at that visible segment of the code stream and thinking about how its hardware can optimally execute the 20 or so instructions that it sees there.

A good analogy for this is the game of Tetris. When you’re playing Tetris, you can see the next piece that will come your way while you’re deciding how best to place the currently falling piece. Thus at any given moment, you can see a total of two Tetris pieces and think about how those two should fit with the pieces that have gone before and those that might come after. The P6 core’s job is a little harder than the average Tetris player’s, because it must maneuver and optimally place as many as three falling pieces at a time; hence it needs to be able to see further ahead into the future in order to make the best decisions about what to place where and when. The P6’s wider instruction window allows the processor to look further ahead in the code stream and to juggle its instructions so that they fit together with the currently available resources in the optimal manner.

Register renaming

One function of the P6’s ROB that’s not covered above is register renaming. I’ve covered this technique numerous times before, so I won’t do any more than a summary here.
Register renaming does for the data stream what the instruction window does for the code stream: it allows the processor some flexibility in adapting its resources to fit the needs of the currently executing program. The x86 ISA has only eight general-purpose registers (GPRs) and eight floating-point registers (FPRs), a paltry number by today’s standards (e.g., the PowerPC ISA specifies 32 of each register type), and a half to a quarter of what many of the P6’s RISC contemporaries had. Register renaming allows a processor to have a larger number of actual registers than the ISA specifies, thereby enabling the chip to do more computations simultaneously without running out of registers. Of course, there’s some sleight-of-hand involved in fooling the program into thinking that it’s using only eight registers, when it’s really using up to 40, but this isn’t conceptually much different than the sleight-of-hand that fools the program into thinking that it’s running sequentially when it’s really running out-of-order.

Each of the P6 core’s 40 ROB entries has a data field, which holds program data just like an x86 register. These fields give the P6’s execution core 40 microarchitectural registers to work with, and they’re used in combination with the P6’s register allocation table (RAT) to implement register renaming in the P6 core.

It’s fair to say that register renaming counts among the costs of x86 legacy support in the P6 core. Of course, most modern microprocessors, both RISC and CISC, implement register renaming, but the need for it is acute in an x86-based architecture. While a RISC ISA like PowerPC has 32 general-purpose registers (for integers and addresses) and 32 floating-point registers, the x86 ISA specifies only eight. This means that x86 code has many more loads and stores in it, and thus gives rise to more memory traffic, than RISC code. Register renaming is a way of reducing such load-store traffic by keeping more data on-chip.
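
As a rough illustration of the bookkeeping involved, the following Python sketch renames a short sequence of register writes using a table that maps each architectural register to the most recently allocated physical slot. It is a teaching sketch under stated assumptions, not a description of the P6's actual RAT. The `rename` function, the `p0`, `p1`, ... slot names, and the tuple encoding of instructions are all hypothetical, and slots are never freed, whereas a real ROB recycles entries as instructions retire.

```python
def rename(program, num_physical=40):
    """Rename (dest, sources) pairs onto physical slots p0, p1, ...

    `num_physical` defaults to 40 to echo the 40 ROB data fields mentioned in
    the article. Slots are never recycled in this sketch (an assumption).
    """
    rat = {}        # architectural register -> physical slot holding its value
    next_free = 0
    renamed = []
    for dest, sources in program:
        # Sources read whichever slot currently holds the architectural value;
        # a register not yet written in this snippet keeps its own name.
        phys_sources = tuple(rat.get(src, src) for src in sources)
        if next_free >= num_physical:
            raise RuntimeError("out of physical registers in this toy model")
        # Every write gets a fresh slot, which removes write-after-read and
        # write-after-write conflicts on the architectural name.
        phys_dest = f"p{next_free}"
        next_free += 1
        rat[dest] = phys_dest
        renamed.append((phys_dest, phys_sources))
    return renamed


program = [
    ("eax", ("ebx", "ecx")),  # eax <- f(ebx, ecx)
    ("edx", ("eax",)),        # edx <- f(eax): needs the first result
    ("eax", ("esi",)),        # eax reused for an unrelated value
    ("edi", ("eax",)),        # edi <- f(eax): needs the second eax
]

if __name__ == "__main__":
    for before, after in zip(program, rename(program)):
        print(before, "->", after)
```

After renaming, the two writes to `eax` land in different slots, so the third instruction no longer has to wait for the second one to finish reading the old value. Only the true data dependencies remain.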

The P6 execution core

The P6’s execution core is significantly wider than that of the Pentium. Like the Pentium, it contains two asymmetrical integer ALUs and a separate floating-point unit, but its load-store capabilities have been beefed up to include three execution units devoted solely to memory accesses: a load address unit, a store address unit, and a store data unit. The load address and store address units each contain a pair of four-input adders for calculating addresses and checking segment limits; these are the same kind of adders found in the decode stage of the original Pentium.

The asymmetrical integer ALUs on the P6 have single-cycle throughput and latency for most operations, with multiplies and divides having single-cycle throughput but four-cycle latency. The floating-point unit is improved over the original Pentium in a couple of ways, the most notable of which is that the common FXCH instruction is now handled by a register rename in the ROB, which means that it effectively takes “zero” cycles to execute. Compilers that are aware of this trick can use it to simulate a flat register file, thereby getting around some of the limitations of x87’s stack-based design.

From the present overview’s perspective, the most noteworthy feature of the P6’s execution core is that it’s attached to the reservation station via five issue ports. (To see what I’m talking about, refer back to the P6 core diagram from earlier in this article.) This means that up to five instructions per cycle can pass from the reservation station through the issue ports and into the execution units. This five issue-port structure is one of the most recognizable features of the P6 core, and when later designs (like the P-II) added execution units to the core (like MMX), they had to be added on one of the existing five issue ports.
If you looked closely at the Pentium Pro diagram, then you probably noticed that there were already two units that shared a single port in the original PPro core: the simple integer unit and the floating-point unit. This means that there are some restrictions on issuing a second integer computation and a floating-point computation, but these restrictions rarely affect performance.

The P6 pipeline

The P6 has a 12-stage pipeline, considerably longer than the Pentium’s 5-stage pipeline. I won’t enumerate and describe all twelve stages individually, but I will give a general overview of the phases that the P6’s pipeline passes through.

BTB access and instruction fetch: The first three and a half pipeline stages are dedicated to accessing the branch target buffer and fetching the next instruction. The P6’s two-cycle instruction fetch phase is longer than the Pentium’s 1-cycle fetch, but it keeps the L1 cache access latency from holding back the clock speed of the processor as a whole.

Decode: The next two and a half stages are dedicated to decoding x86 instructions and breaking them down into the P6’s internal, RISC-like instruction format. A full-blown article dedicated solely to the P6 would cover this instruction set translation, which takes place in all modern x86 processors and even in some RISC processors, but my discussion of this topic will be limited to what I cover in the section on the Pentium 4.

Register rename: This stage takes care of register renaming and logging instructions in the ROB.

Write to RS: Moving instructions from the ROB to the RS takes one cycle, and occurs here.

Read from RS: It takes another cycle to move instructions out of the RS, through the issue ports, and into the execution units.

Execute: Instruction execution can take one cycle, as in the case of simple integer instructions, or multiple cycles, as in the case of floating-point instructions.

Retire: These two final cycles are dedicated to writing the results of the instruction execution back into the ROB, and then retiring the instructions by writing their results from the ROB into the architectural register file.

Lengthening the P6’s pipeline as described above has two primary beneficial effects. First, it allows Intel to crank up the processor’s clock speed, since each of the stages is shorter, simpler, and can be completed quicker; but this is fairly common knowledge. The second effect is a little more subtle and less widely appreciated. The P6’s longer pipeline, when combined with its buffered decoupling of fetch/decode bandwidth from issue bandwidth, allows the processor to hide hiccups in the fetch and decode stages. In short, the nine pipeline stages that lie ahead of the execute stage combine with the RS to form a deep buffer for instructions, and this buffer can hide gaps and hang-ups in the flow of instructions in much the same way that a large UPS can hide fluctuations and gaps in the flow of electricity to a device or a large water reservoir can hide interruptions in the flow of water to a facility. But on the downside (to continue the water reservoir example), when one dead animal is spotted floating in the reservoir the whole thing has to be flushed. This is sort of like the case with the P6 and a branch mispredict.

Branch prediction on the P6

A more comprehensive article devoted solely to the P6 would cover in detail the P6’s improved branch prediction. Here, suffice it to say that the P6 expended considerably more resources than its predecessor on branch prediction, and managed to boost dynamic branch prediction accuracy from the Pentium’s ~75% rate to upwards of 90%. As we’ll see when we look at the P4, branch prediction gets more important as pipelines get longer, because a pipeline flush due to a mispredict means more lost cycles.
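
A back-of-the-envelope calculation shows why the accuracy improvement mattered. The Python sketch below multiplies the mispredict rate by the flush cost, using the article's own figures: roughly 75% and 90% accuracy, and the three-cycle and nine-cycle flush costs this section gives for the Pentium and the P6. Treating every branch as costing exactly that many cycles on a mispredict is a simplifying assumption, and the function name `expected_flush_cycles` is hypothetical.

```python
def expected_flush_cycles(accuracy: float, flush_penalty_cycles: int) -> float:
    """Average cycles lost per branch: P(mispredict) * cycles flushed."""
    return (1.0 - accuracy) * flush_penalty_cycles


# Figures taken from the article; the model itself is a simplification.
pentium = expected_flush_cycles(accuracy=0.75, flush_penalty_cycles=3)
p6 = expected_flush_cycles(accuracy=0.90, flush_penalty_cycles=9)

# Hypothetical case: the P6's longer pipeline with the Pentium's predictor.
p6_with_old_predictor = expected_flush_cycles(accuracy=0.75, flush_penalty_cycles=9)

if __name__ == "__main__":
    print(f"Pentium:               {pentium:.2f} cycles lost per branch")
    print(f"P6:                    {p6:.2f} cycles lost per branch")
    print(f"P6, 75% accurate BPU:  {p6_with_old_predictor:.2f} cycles lost per branch")
```

On these numbers the P6 loses about 0.9 cycles per branch against about 0.75 for the Pentium. Had the P6 kept a 75%-accurate predictor, the figure would have been about 2.25, which is why the deeper pipeline had to be paired with better prediction.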
Consider the case of a conditional branch whose outcome depends on the result of an integer calculation. On the original Pentium, the calculation happens in the fourth pipeline stage, and if the branch prediction unit (BPU) has guessed wrongly then only three cycles’ worth of work would be lost in the pipeline flush. On the P6, though, the conditional calculation isn’t performed until stage 10, which means nine cycles’ worth of work gets flushed if the BPU guesses wrongly.

The cost of x86 legacy compatibility on the P6

Like the original Pentium, the P6 still spends extra time in the decode phase, but this time the extra cycle and a half goes not to address calculations but to ISA translation. This is slightly offset by the fact that the pipeline as a whole is much longer than on the original Pentium, so adding one more stage makes less of a difference. And as I also noted above, register renaming can reasonably be counted as a cost of legacy x86 support, although to what degree is debatable since the technique is popular with architectures of all types.

The real cost of legacy x86 support is felt not in the extra decode pipeline stage, however, but in the sheer number of transistors that the P6’s decoding logic and ISA translation take to implement. Though I didn’t go into detail on the P6’s decoding hardware, the design actually contains three separate decoders: two simple/fast decoders, which handle simple x86 instructions, and one complex/slow decoder, which handles the more complex x86 instructions. (Note that the complex/slow decoder works in conjunction with the microcode ROM to handle the really complex legacy instructions, which are translated into multiple uops.) Together, these three decoders are capable of passing up to three instructions per cycle into the ROB.

All of this decoding and translation hardware takes up a lot of transistors. MDR estimates that close to 40% of the P6’s transistor budget is spent on x86 legacy support.
If correct, then that’s even higher than the astonishing 30% estimate for the original Pentium, and even if incorrect it still suggests that the cost of legacy support is quite high.

At this point, you’re probably thinking back to the conclusion of the first part of this article, in which I suggested that the relative cost of x86 support has gone down with successive generations of the Pentium. This is true, but the trend didn’t hold for the first instantiation of the P6 core: the original 133MHz Pentium Pro. The PPro’s L1 cache was a modest 16K, which was small even by 1995 standards. The chip’s designers had to skimp on on-die cache because they’d spent so much of their transistor budget on the decoding and translation hardware. Comparable RISC processors had two to four times that amount of cache, because less of the die was taken up with core logic so they could use the space for cache.

It’s also important to keep things in perspective by considering the fact that transistor counts were still relatively low by today’s standards. As Moore’s Curves marched on, microprocessor designers went from thinking “how do we squeeze all the hardware that we’d like to put on the chip into our transistor budget?” to “now our generous transistor budget will let us do some really nice things!” to “how on earth do we get this many transistors to do useful, performance-enhancing work?”. What really drove the decrease in subsequent generations’ costs for x86 support was the increase in L1 cache sizes and the L2 cache’s move onto the die, because the answer to that last question has, until recently, been “let’s add cache.”

The P6 core’s incarnations

This concluding section provides an overview of the P6 core in its various incarnations. The main focus here is on fitting everything together and giving you a sense of the big picture of how the P6 core evolved.
The historical narrative outlined below seems in retrospect to have unfolded over a much longer length of time than the seven years that it actually took to go from the Pentium Pro to the Pentium 4, but seven years is an eternity in computer time.

The Pentium Pro

The processor described above under the name “P6” is the original, 133 MHz Pentium Pro. The original PPro was such a great x86 chip that it prompted our own Caesar to write this paean to the Pentium Pro as one of our first Ars articles. As you can see from the previous processor comparison table, the PPro was relatively short on transistors, short on cache, and short on features. In fact, the original Pentium eventually got rudimentary SIMD support in the form of MMX, but the PPro didn’t have enough room for that so it got jettisoned in favor of all that fancy decoding logic described above.

In spite of all its shortcomings, though, the PPro did manage to raise the x86 performance bar significantly. Its out-of-order execution engine, dual integer pipelines, and improved floating-point unit gave it enough oomph to get x86 into the commodity server market.

The Pentium II

MMX didn’t make a return to the Intel product line until the Pentium II. Introduced in 1997, this next generation of the P6 core debuted at speeds ranging from 233 to 300 MHz and sported a number of performance-enhancing improvements over its predecessor. First among these improvements was an on-die, split L1 cache that was doubled in size to 32K. This larger L1 helped boost performance across the board by keeping the P-II’s lengthy pipeline full of code and data.

Perhaps what most of us will remember about the PII, though, was its introduction of the single-edge contact (SEC) CPU cartridge. The SEC cartridge was essentially a daughtercard which allowed Intel to more tightly integrate the PII’s 512K L2 cache with the CPU core by means of a fast backside bus.
This backside bus/SEC cartridge design was a hack intended to make up for the fact that Intel couldn’t fit the L2 cache onto the CPU die itself. I think it’s safe to say that without all of the extra legacy x86 overhead in the P6 core, a respectably-sized L2 could’ve been put directly on the die. So the SEC cartridge, which didn’t really make anybody very happy when it was introduced, was in part the Pentium II’s way of paying the x86 support tax. Fortunately for Intel, the P-II was the last of the Pentium line to pay that tax in as visible and painful a way.

The P6’s basic pipeline stayed the same in the P-II, but Intel widened the execution core by adding the aforementioned MMX support in the form of two new MMX execution units: one on port 0 and the other on port 1. MMX provided SIMD support for integers only, though. It wasn’t until the introduction of SSE with the P-III that the P6 core got support for floating-point SIMD.

The Pentium II’s integer and floating-point performance was relatively good compared to its CISC competitors, and it helped further the trend, started by the PPro, of x86 commodity hardware’s migration into the server and workstation realms. However, the P-II still couldn’t stand up to RISC designs built on the same process with similar transistor counts. Its main advantage was in bang for the buck, whereas the more expensive RISC chips specialized in pure bang.

The Pentium III

Many will remember two main things about the P-III: the gigahertz race with AMD, and Intel’s ad campaigns. Regarding the latter, the P-III brought us both the bunny suit guys and the deliberately misleading “the P-III makes your Internet faster!!” commercials. And regarding the former, the competitive pressure piled on by AMD affected the PC market in ways that are clear only in hindsight.

The whole “P-III speeds up the Internet” marketing blitz is telling in its own way about the place that Intel had arrived at with the P-III.
Right around the time that AMD won the GHz race, a consensus had begun to build that a 1 GHz P-III or Athlon was about all you needed to do the kinds of things that most people were doing with their computers, i.e., surf the Internet, do e-mail, and run word processing and spreadsheet software. So the more advanced iterations of the P-III brought enough power to the average desktop that it looked like consumers might run out of compelling reasons to upgrade. All told, the Pentium line was looking like it might become a victim of its own success, and this put Intel in a bind. How, Intel wondered, do you sell people more processing horsepower than they realistically need to do everyday sorts of tasks?

One aspect of this conundrum which I’ve commented on often was Intel’s reluctance to acknowledge PC gaming as a major factor promoting performance upgrades. Intel wanted to pretend that everyone needed a 1GHz+ Pentium III for “serious” work, like surfing the Internet, and not for blowing up mutated aliens or exploring dungeons.

Since the heady days of the heated AMD-Intel competition, a competition that’s still going on but now seems more subdued, you don’t hear too many folks talking about the uselessness of additional processor muscle. Everyone is back to wanting more, and no one suggests anymore that anything over 1GHz is pointless. So what happened? Why did everyone gradually settle back into the upgrade cycle?

There was no one event or factor, no “killer app,” that came on the scene and required a quantum leap in computing power. Instead, the gradual return of the status quo was more the product of a few subtler trends, all of them having to do with the GHz race.

Prior to the rise of AMD as a serious competitor in the performance x86 space, Intel had the luxury of setting the pace of progress in the x86 world. Products were released when Intel was ready to release them, and clock speeds climbed when Intel was ready for them to climb.
It’s also important to remember that during this period, Intel controlled the PC chipset market as well. The result of all this is that Intel could dictate the pace of innovation in the PC space to a remarkable degree, which meant that there was little competitive pressure forcing the company to push new products out the door at an accelerated pace.

When it became clear that there was a real race on between Intel and AMD, Intel kicked things up a few notches. All of a sudden, the pace of processor performance started to move well ahead of the curves along which the other parts of the PC system (and that includes both OS and application software) were traveling. The processor was maturing rapidly, and the rest of the PC just wasn’t keeping up. It was this phenomenon that led to the doom-and-gloom scenarios in which people just stopped buying processors once they got around the 1.2GHz mark. The rest of the industry had no idea how to absorb all of the performance gains coming out of the GHz race, and there seemed to be no end in sight to the ride.

In this respect, the computer industry, and by this I mean everything from hardware vendors to software vendors to customers at all levels, is very much like an ecosystem, with multiple species engaged in a mix of competition and cooperation and achieving a continuously-evolving equilibrium. The GHz race upset that equilibrium, and it took the rest of the system a little while to absorb the shock and to incorporate it into a new type of equilibrium.

Of course, the rest of the PC industry ecosystem had a little help in the form of a slowdown in progress brought on by external events. There was a stock market crash, an economic downturn, and a group of fanatics who flew some airplanes right into the centers of American economic and military power… but you know the rest of that story.
And now, four years later, we’ve all sort of collectively caught our breath and caught up with the microprocessor, which has still been trucking along, but not at the same white-hot pace of the late 90s.

Today’s PC market is much more segmented and diverse than the market into which the P-III emerged in early 1999. As the rapid rise in home networking attests, many homes have multiple PCs, with each one falling into a different price/performance niche. So there’s relatively less room at the very high end, but the fallout from the shrinking of the high-end slice is mitigated by the fact that the overall pie has gotten substantially bigger and continues to grow.

To sum up, the P6 architecture carried the personal computer from the first awkward changes of adolescence up through young adulthood, and just as certain parts of a young person’s body or mind will mature much more rapidly than others, the P6 core grew too much, too soon.

The GHz race also had another, much more profound effect on the Pentium itself, insofar as the next chip to bear the Pentium name, the Pentium 4, also bore the marks of the GHz race stamped into its very architecture. If Intel learned anything in those last few years of the P6 core’s life, it learned that clock speed sells, and it kept that lesson foremost in its mind when it designed the P4’s Netburst architecture.

The next article in this series will cover Netburst, as well as the resurrection of the P6 core in the form of the Pentium M. So stick around, because the story of the Pentium only gets more interesting.