The Advanced Micro Devices Athlon Processor

Chris Wyman
Fall '99, CS 6810 Research Paper

Introduction

For many years, the Intel x86 architecture has been the predominant processor for personal computers. The only other competitor, the Motorola and IBM PowerPC architecture, has been losing market share for years. However, for a number of years, companies like Cyrix and Advanced Micro Devices (AMD) have tried to beat Intel at its own game: delivering high performance, backward compatible, x86 processors at prices affordable for mass markets.

Until recently, however, AMD's processors have been relegated to the low end of the market due to design shortcuts and oversights that caused them to perform worse than the latest Intel offerings. With the introduction of the new Athlon processor, formerly codenamed the K7, AMD has turned the tables and released a processor that beats the best Intel can offer for less money. Of course, considering the speed of the microprocessor market this may change tomorrow, but until then the Athlon deserves praise and, as seen below, has some quite ingenious aspects to its design.

Instruction Set Architecture

The AMD Athlon processor is a 100 percent Intel-compatible x86 processor. Like the Intel Pentium architecture, the Athlon processor has a register-memory type instruction set. However, the core of the processor does not execute these register-memory instructions; instead, x86 instructions are decoded into MacroOPs (also known as MOPs), which are register-register type instructions. This load-store, RISC-like architecture actually comprises the heart of the processor.

Since AMD strives to keep its processor compatible with Intel processors, the Athlon has the same addressing modes as other x86 processors. Basically, an instruction operand may be located in the instruction (an immediate value), a register, memory, or an I/O port.

Immediate values can have different lengths depending on the instruction, and may never have values greater than 2^32. Just about all arithmetic instructions allow the use of an immediate value as a source.

The x86 architecture has a complex method of accessing memory. Both a segment and an offset are required to determine a correct memory location. Multiple segment values are usually stored in registers on the processor to allow for implicit segment determination. For example, when fetching the next instruction to execute, the segment is assumed to be the value located in the code segment (CS) register. Offsets can be calculated in numerous ways, combining a displacement encoded in the instruction, the contents of a base register, and the contents of an index register scaled by 1, 2, 4, or 8, as sketched below.
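
To make the addressing scheme concrete, the following small C sketch shows how an offset (effective address) and a linear address could be formed from these components. The function names and the simple flat addition of a segment base are illustrative assumptions, not a description of the Athlon's internal hardware.

#include <stdint.h>

/* Minimal sketch of how an x86 offset is formed from a base register, a
 * scaled index register, and a displacement; the segment base is then
 * added to produce a linear address. */
uint32_t effective_address(uint32_t base, uint32_t index, uint32_t scale,
                           uint32_t displacement)
{
    /* scale is restricted to 1, 2, 4, or 8 in the x86 encoding */
    return base + index * scale + displacement;
}

uint32_t linear_address(uint32_t segment_base, uint32_t offset)
{
    /* In protected mode the segment register selects a descriptor whose
     * base is added to the offset computed above. */
    return segment_base + offset;
}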

The Athlon processor contains all the standard x86 instructions, including those Intel introduced through the Pentium processor. Because of increased emphasis on and development of multimedia applications, both Intel and AMD have introduced new instructions to handle large amounts of data efficiently. Thus, AMD includes the multimedia extensions (MMX) introduced by Intel, as well as the 3DNow! extension devised by a coalition of Intel competitors. In addition to the MMX and 3DNow! extensions, AMD also introduced 24 new proprietary instructions in the Athlon. These include 12 instructions to speed up integer math used in speech and video processing, 7 instructions to accelerate data movement, and 5 instructions to improve performance of digital signal processing such as software modems and sound processing.

Athlon Processor Architecture


Figure 1: High-level overview of the Athlon Architecture

The first step in the execution of any instruction involves fetching the instruction from memory and "predecoding" it to locate instruction boundaries and branch instructions (to improve branch prediction). The predecoded result is stored into the level 1 cache to be accessed by the instruction decoders.

The next step of execution is the decode stage. Instructions must be decoded quickly into the MacroOPs used by the internal execution pipeline, otherwise the pipeline will be forced to stall. To this end, AMD engineers included three x86 hardware instruction decoders. Each of these three decoders can interpret all but the most complex of x86 instructions. Almost all instructions less than 16 bytes long go through this quick hardware decode path. The few instructions greater than 15 bytes in length, or otherwise too complex to decode in hardware, go to a ROM vector lookup to decode, as sketched below. However, since the maximum length of an x86 instruction is 17 bytes, the vast majority of instructions can be handled in hardware.
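
The split between the hardware decoders and the ROM-based vector path can be pictured with a small C sketch; the types, the is_complex flag, and the exact threshold are hypothetical, chosen only to illustrate the decision described above.

#include <stdbool.h>

/* Hypothetical sketch of the decode-path decision: common instructions go
 * to one of the three hardware decoders, while long or complex instructions
 * are sent to the on-chip ROM ("vector path"). */
typedef struct { int length_bytes; bool is_complex; } X86Insn;

typedef enum { HARDWARE_DECODER, VECTOR_ROM } DecodePath;

DecodePath choose_decode_path(const X86Insn *insn)
{
    if (insn->length_bytes > 15 || insn->is_complex)
        return VECTOR_ROM;      /* slower microcode lookup */
    return HARDWARE_DECODER;    /* one of the three symmetric decoders */
}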

An additional feature of the Athlon processor is that all three decoders are equivalent. Other x86 processors only allow one register-memory operation per clock cycle, and the remaining decoders must receive register-register operations. Having three symmetric decoders completely eliminates decoder stalls caused by this kind of hardware limitation.

Once the instructions have been decoded into the RISC-like MacroOPs, the MOPs are stored in the Instruction Control Unit (ICU). This unit stores up to 72 MOPs. Since an x86 instruction is decoded into at most 2 MOPs, the Instruction Control Unit can store between 36 and 72 waiting instructions. Essentially, the ICU is a reorder buffer, tracking each instruction from issue until retirement. Like all reorder buffers, the ICU gives the processor the opportunity to execute instructions speculatively.
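
A reorder buffer of this kind can be sketched as a simple ring buffer in C. The entry layout, function names, and allocation/retirement policy below are illustrative assumptions rather than AMD's actual implementation, though the 72-entry capacity matches the figure given above.

#include <stdbool.h>

#define ICU_ENTRIES 72

typedef struct { bool valid; bool done; } MacroOp;

typedef struct {
    MacroOp entries[ICU_ENTRIES];
    int head, tail, count;          /* head = oldest, tail = next free slot */
} ReorderBuffer;

bool icu_issue(ReorderBuffer *rob)
{
    if (rob->count == ICU_ENTRIES) return false;   /* ICU full: stall decode */
    rob->entries[rob->tail] = (MacroOp){ .valid = true, .done = false };
    rob->tail = (rob->tail + 1) % ICU_ENTRIES;
    rob->count++;
    return true;
}

void icu_retire(ReorderBuffer *rob)
{
    /* Retire completed MacroOPs in program order from the head. */
    while (rob->count > 0 && rob->entries[rob->head].done) {
        rob->entries[rob->head].valid = false;
        rob->head = (rob->head + 1) % ICU_ENTRIES;
        rob->count--;
    }
}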

From the ICU, instructions move on to one of two smaller scheduling units, which schedule operations onto the execution units. Once there is space available in the appropriate scheduler, a MOP is sent to either the Integer Scheduler or the FPU Scheduler. The Integer Scheduler holds up to 18 MOPs, and the FPU Scheduler holds up to 36 MOPs.

Notice that the FPU register file is invisible to integer instructions due to the stack nature of the x86 floating point instruction set. The x86 floating point instructions all use stack operands. In order to allow pipelining of the floating point execution units, however, these operands need to be mapped onto one of the 88 internal FPU registers. Since these internal registers are invisible to both the programmer and the compiler writer, they are also invisible to the integer execution units.

Finally, execution occurs. The Athlon can issue up to 9 MacroOPs to execution units per cycle. There are 3 integer execution units, 3 "address generation units", and 3 floating point units. All are fully pipelined. One noticeable difference between the integer and floating point pipelines should be mentioned. All of the integer execution units are identical, and so are the address generation units. However, the floating point units have different purposes. The first floating point unit does floating point stores, the second unit does adds and both MMX and 3DNow! operations, and the third unit does multiplies and both MMX and 3DNow! operations. While this may seem odd, any resource conflicts this may cause are counterbalanced by the benefits of the Athlon's entirely pipelined floating point units -- the first in the x86 world.


Figure 2: The Athlon pipeline stages.

As seen in Figure 2, the Athlon integer pipeline is 10 stages long and the floating point pipeline is 15 stages long. Note that stage 1 is the fetch from memory, stage 2 is the predecode stage, stages 3 and 4 align the instruction for easy access by the three x86 instruction decoders, and stages 5 and 6 are used by the instruction decoders.

The integer pipeline looks similar to most other integer pipelines: one cycle to schedule which execution unit to use, one cycle to execute, one cycle to resolve addresses, and one cycle to write back results.

The floating point pipeline is slightly more complex. First, the stack needs to be remapped into an easier-to-use register file. The next cycle renames registers again to avoid any dependencies. Stages 9 and 10 schedule the instruction onto one of the execution units; note that this takes two pipeline stages due to the non-symmetric FP units and their variable latencies. In the 11th stage, operands are read from the register file, and in stage 12, execution begins. Because the FP units are fully pipelined, they all have a throughput of 1 instruction per cycle. The FP add and multiply both have 4 cycle latencies, the MMX instructions have 2 cycle latencies, and the 3DNow! instructions also have 4 cycle latencies.
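
The benefit of full pipelining is easiest to see with a small latency/throughput calculation; the sketch below assumes fully independent operations and uses only the 4 cycle latency quoted above.

/* With a fully pipelined unit, a new operation can start every cycle even
 * though each one takes several cycles to finish.  For n independent FP
 * adds, the total time is latency + (n - 1) cycles rather than 4 * n. */
int pipelined_cycles(int n_ops, int latency)
{
    if (n_ops <= 0) return 0;
    return latency + (n_ops - 1);
}
/* Example: 100 independent FP adds take 4 + 99 = 103 cycles instead of 400. */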

One problem with the Athlon pipeline is the long penalty for a mispredicted branch. A mispredicted branch is not caught until the tenth stage of the (integer) pipeline, and thus causes a 10 cycle penalty.

Luckily, AMD has incorporated an excellent branch prediction scheme onto the Athlon. The processor has a 2,048 entry branch history table and a 2,048 entry branch target address cache. The BHT uses a simple 2-bit Smith prediction scheme. The history table is accessed in the fetch stage, and a prediction is made during the scan stage of the pipeline. On a misprediction, the target address is computed and stored in the branch target address cache for future reference. Additionally, a 12 entry return address stack is also incorporated onto the CPU to optimize CALL/RET pairs. With these schemes, the Athlon reportedly accurately predicts branches 95% of the time.
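
A 2-bit Smith predictor is simple enough to sketch in a few lines of C; the indexing of the table by low address bits below is a common textbook simplification and an assumption here, not a statement of the Athlon's exact indexing function.

#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 2048

static uint8_t bht[BHT_ENTRIES];   /* 0,1 = predict not taken; 2,3 = predict taken */

bool predict_taken(uint32_t branch_address)
{
    return bht[branch_address % BHT_ENTRIES] >= 2;
}

void update_predictor(uint32_t branch_address, bool taken)
{
    uint8_t *counter = &bht[branch_address % BHT_ENTRIES];
    if (taken  && *counter < 3) (*counter)++;   /* saturate at strongly taken */
    if (!taken && *counter > 0) (*counter)--;   /* saturate at strongly not taken */
}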

Athlon Memory Structure


Figure 3: High level overview of the Athlon memory subsystem

So what's a high performance chip without a blazingly fast memory subsystem? Well, not only has the Athlon chip been tuned for quick computations, but a lot of effort has gone into making sure the memory can keep the processor supplied with instructions and data.

The first level cache is split into two 64K caches, one for instructions and one for data. Both caches are 2-way set associative. The data cache has two 64-bit load/store ports, and the cache is multi-banked to allow for multiple concurrent memory operations. The cache has 3 sets of tags, allowing for concurrent access by the two load/store ports as well as a snoop from the system bus.

Loads and stores go through the 44 entry Load/Store Queue. As in many systems, loads take precedence over stores, unless a store needs to be performed to free room in the ICU. If a load follows a store to the same memory location that is still waiting in the queue, the data from the store is forwarded to the load instruction without the need for a cache access.
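
Store-to-load forwarding can be illustrated with a short C sketch that searches the queue for a matching older store; the entry layout and function names are hypothetical, though the 44-entry size matches the figure above.

#include <stdbool.h>
#include <stdint.h>

#define LSQ_ENTRIES 44

typedef struct { bool is_store; bool pending; uint32_t address; uint64_t data; } LSQEntry;

/* Entries 0 .. n_older-1 are assumed to hold the operations older than the
 * load, ordered oldest to youngest. */
bool try_forward(const LSQEntry lsq[], int n_older, uint32_t load_address,
                 uint64_t *out_data)
{
    /* Scan from the youngest older entry backward so the most recent store wins. */
    for (int i = n_older - 1; i >= 0; i--) {
        if (lsq[i].is_store && lsq[i].pending && lsq[i].address == load_address) {
            *out_data = lsq[i].data;   /* forwarded: no cache access needed */
            return true;
        }
    }
    return false;                      /* no match: go to the data cache */
}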

The data cache has physical address tags, and address translation is performed in parallel with the tag lookup via a translation look-aside buffer (TLB). The TLB has a fully associative 32 entry first level, which is backed by a 256 entry, 4-way set associative second level TLB. Both the 4K and 4M page sizes implemented by the Intel 36-bit address space extension are supported. Additionally, to speed the fetching of instructions, the instruction cache has its own fully associative, 24 entry TLB, backed by a 256 entry (presumably also 4-way) second level TLB.
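
The two-level lookup can be sketched as follows. The linear search of the second level is a simplification (the real second level is 4-way set associative), the page-walk stub is a placeholder, and the structure names are illustrative only; the sizes follow the data-side figures above.

#include <stdbool.h>
#include <stdint.h>

#define L1_TLB_ENTRIES 32
#define L2_TLB_ENTRIES 256

typedef struct { bool valid; uint32_t virtual_page; uint32_t physical_page; } TLBEntry;

static TLBEntry l1_tlb[L1_TLB_ENTRIES];
static TLBEntry l2_tlb[L2_TLB_ENTRIES];

static uint32_t walk_page_tables(uint32_t virtual_page)
{
    /* Placeholder for the slow path: a real implementation would walk the
     * page tables in memory and refill the TLBs. */
    return virtual_page;   /* identity mapping stands in for a real walk */
}

uint32_t translate(uint32_t virtual_page)
{
    for (int i = 0; i < L1_TLB_ENTRIES; i++)              /* fully associative L1 */
        if (l1_tlb[i].valid && l1_tlb[i].virtual_page == virtual_page)
            return l1_tlb[i].physical_page;

    for (int i = 0; i < L2_TLB_ENTRIES; i++)              /* simplified L2 search */
        if (l2_tlb[i].valid && l2_tlb[i].virtual_page == virtual_page)
            return l2_tlb[i].physical_page;

    return walk_page_tables(virtual_page);                /* miss in both levels */
}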

The second level cache can range from 512 kilobytes to 8 megabytes in size, and can be clocked at 1/3, 2/5, 1/2, 2/3, or 1 times the processor speed. Furthermore, the processor has tags for 512K on chip to improve hit/miss detection for the external cache. However, even with 8M of cache, the on chip tags provide an early miss detection, preventing the access of external tags on most misses. This reduces the average L2 miss penalty.

The Athlon System Bus

After accesses to both the L1 and L2 caches have missed, main memory is queried for the data. The Athlon uses the EV6 bus designed at Digital Equipment Corporation for the Alpha processor. The EV6 bus is a point-to-point bus (not shared like the Pentium bus), so each processor has a direct link to the memory controller and every other processor in the system. Furthermore, this means that the bus bandwidth is not shared between processors.

The EV6 bus in the Athlon currently runs at 100 MHz, and is scalable to at least 200 MHz. Furthermore, the bus transmits data on both the rising and falling clock edges, effectively running at 200 MHz, scalable to 400 MHz. The bus supports a 43 bit address space, and supports up to 14 processors.

Each link of the bus has 3 ports: address in, address out, and data. The data port is a bi-directional, 72 bit wide path allowing for 64 bits of data and 8 bits of error correcting code, ensuring high integrity data transfers. Additionally, with the address lines 43 bits wide, the bus allows access to significantly more memory than other x86 processors -- 2^43 bytes = 8 terabytes of addressable memory. Between the three bus ports, the Athlon supports up to 24 outstanding bus transactions per processor at any given time.
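
The headline numbers above follow from simple arithmetic, shown here as a small C program. The 8 bytes per transfer and the peak-bandwidth assumption (a transfer on every clock edge) are idealizations; real sustained rates would be lower.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    double transfers_per_second = 100e6 * 2;             /* double data rate      */
    double peak_bandwidth = transfers_per_second * 8;    /* 8 bytes per transfer  */
    uint64_t address_space = 1ULL << 43;                 /* 2^43 bytes            */

    printf("Peak bandwidth: %.1f GB/s\n", peak_bandwidth / 1e9);       /* 1.6 GB/s */
    printf("Addressable memory: %llu TB\n",
           (unsigned long long)(address_space >> 40));                 /* 8 TB     */
    return 0;
}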

In multiprocessor systems, cache coherency is maintained via snooping, as is common in today's microprocessors. Because of the dual address in/out ports on the bus, a snoop and a data request can occur simultaneously. Interestingly, the Athlon cache coherency protocol is not the standard MESI protocol, but is rather a five state MOESI protocol. Thus, the five states are the standard modified, exclusive, shared, and invalid, plus the additional "owned" state, essentially a shared-modified state.
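
The extra "owned" state is easiest to see in a sketch of one representative snoop transition; the C enum and transition function below are a simplified illustration, not AMD's complete protocol table.

/* When another processor's read snoop hits a line held in Modified, the
 * line is supplied to the requester and kept locally in the Owned state
 * instead of being written back and invalidated. */
typedef enum { MODIFIED, OWNED, EXCLUSIVE, SHARED, INVALID } CacheLineState;

CacheLineState on_snoop_read(CacheLineState current)
{
    switch (current) {
    case MODIFIED:  return OWNED;    /* keep dirty data, share it with requester */
    case EXCLUSIVE: return SHARED;   /* clean line is now shared */
    case OWNED:
    case SHARED:    return current;  /* already shared */
    case INVALID:   return INVALID;  /* nothing to do */
    }
    return current;
}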

I/O Busses

The system bus not only connects to the memory controller, but also to peripheral busses. As with other x86 processors, specific processor instructions are required to access the I/O devices available via these busses. Notice that the Alpha processor, which uses the same EV6 system bus, maps half of its address space to I/O devices (and thus does not need specific instructions to access them). However, the Alpha actually supports a 44 bit address space (with half devoted to I/O devices), whereas the Athlon only supports a 43 bit address space.

The AMD-751 chipset (currently the only chipset available for the Athlon) connects to the memory, a synchronous 33 MHz, 32 bit PCI bus supporting up to 6 master devices, as well as a synchronous 66 MHz AGP bus supporting 2x speed data transfers. Additionally, by using another chip (either the AMD-756 or the equivalent Via chipset), the Athlon can utilize the older, asynchronous ISA bus, a USB bus, as well as IDE hard drives. To relieve the strain on the processor, direct memory transfers (via DMA) are allowed from the I/O devices.

Athlon Hardware Specifics

Athlon processors were originally manufactured on a 0.25 micron CMOS process. On this process, the 22 million transistors required a die of 184 mm^2. However, the newest Athlon processors have been switched to a 0.18 micron CMOS process, which reduces the die size to 102 mm^2. The processor requires a power supply of 1.6 volts, and dissipates between 35 and 50 watts of power. In order to attach the L2 cache and allow for easier heat dissipation, the Athlon is distributed in a "Slot-A" cartridge mechanically equivalent (though electrically different) to the Pentium II and III Slot-1 cartridge.

Analysis of Athlon ISA Decisions

The AMD design team made a number of good design choices when creating the Athlon, though there are also areas they could improve upon.

The first point to notice about the Athlon is that it is compatible with the Intel x86 line. Because of this decision by AMD, the Athlon is stuck with an awkward, difficult to understand, and complex instruction set architecture. While this may sound like a tremendously unsound decision, the fact is that most PC consumers want backwards compatibility. Without offering the ability to run x86 programs on the Athlon, most people would stick with the tried and tested Intel processors. Thus, while choosing the x86 ISA may make very little sense architecturally, it ensures a market for the processor. However, this decision also has impacts on other areas of the chip's design, specifically the need for instruction decoders to allow for pipelining and emulating a stack for floating point operands.

A number of years ago, AMD and other companies came up with the 3DNow! multimedia extension standard, now implemented in all the non-Intel x86 chips. Interestingly, with the Athlon AMD added new, proprietary 3DNow! instructions. This may end up being a mistake, for a simple reason: Intel still has much more control over the compiler, operating system, and games vendors than AMD does, which means that these new 3DNow! instructions may never be widely used to optimize the applications for which they were designed. Unfortunately, like many new instructions on the x86 platform, once they have been incorporated into the ISA, it is unlikely AMD will be able to remove them, because the few people who depend on these instructions will pressure AMD to keep them. The transistors used in implementing these new instructions might have been better spent on instructions compatible with the new Intel streaming instructions or on something as simple as additional on-die cache.

Analysis of Athlon Architecture Design and Decisions

Another interesting choice by the Athlon's design team was the inclusion of symmetric instruction decoders. These allow decoding of almost any three instructions in parallel. Once again, this is unique in the x86 processor world. All other processors, including earlier Intel and AMD processors, allow only one "complex" register-memory operation to be decoded per clock cycle. Any other operations must be register-register instructions or wait until the next clock. While most code is optimized for the non-symmetric decoders in the Pentium, having symmetric decoders eases a compiler writer's job and speeds up decoding (important for keeping the processor fed at high clock speeds).

Another note on the instruction decoding: as mentioned above, complex operations are not decoded in hardware, but are rather relegated to the "vector path" decoding through a ROM on the processor. While this slows the decoding of these instructions, it avoids slowing the decoding of other, more common instructions. This idea keeps with the RISC motto of "keep the common case fast, and the uncommon case correct." Since the goal is to make fast processors, keeping the common decoders fast while relegating older and slower instructions to ROM speeds up most code while retaining compatibility with older code -- a wise design decision.

One of the areas AMD engineers put much time into was the design of the floating point system. With their completely pipelined FP unit, the Athlon can issue 3 floating point instructions each clock, which should keep floating point intensive programs such as games and scientific calculations very happy.

AMD has always trailed Intel in floating point performance, but with the Athlon's fully pipelined floating point units, they have turned the tables. In order to keep the FP pipelines fed, however, the standard x86 stack needed remapping into a more usable format. Sticking with the stack would have required either implementing a stack on chip with registers or looking to memory for operands. Instead, the Athlon employs a register renaming technique with 88 registers. While this may seem like a waste of transistors, 88 registers allow a complete remapping of all the destination registers of every instruction in the ICU, avoiding memory accesses for stack entries that were swapped out and eliminating conflict stalls due to lack of registers. Thus, once again AMD retains compatibility with the awkward x86 ISA while improving performance using a more modern technique.

One might ask how the Athlon can fill 9 execution units every cycle. After all, the rule of thumb is there's only enough instruction level parallelism to make about 4-way superscalar processors worthwhile. However, because the Athlon decodes each x86 instruction into multiple MacroOPs, ILP can be found not only at the instruction level, but also at the MacroOP level, allowing a 9-way superscalar processor to be useful.

Another interesting tidbit of information comes from comparing the Athlon's pipeline to other processors' pipelines. For example, the Athlon has a 10 stage integer pipeline and a 15 stage floating point pipeline. Comparatively, the Pentium III has somewhere between a 12 and 17 stage integer pipeline and over 20 "stages" in the floating point unit. Somehow, the engineers at AMD were able to accomplish the same amount of work in fewer pipeline stages at a higher clock frequency!

One of the more amazing features of the Athlon is its branch prediction. Despite having a horrible 10 cycle misprediction penalty, the 95% correct prediction rate achieved by the Athlon ensures that this penalty occurs as infrequently as possible. As a comparison, Intel chips only have around a 90% correct prediction rate. Oddly enough, AMD reduced the branch history table from a complex 8,192 entry table in the K6 series of processors to a simple, 2,048 entry Smith prediction scheme. Fortunately, this did not significantly reduce the prediction rate (both the K6 series and the Athlon predict correctly about 95% of the time), and saved numerous transistors to use elsewhere on the processor.

Analysis of Athlon Memory Decisions

Overall, the memory system has been tuned to keep instructions and data flowing fast enough to keep pace with the processor, though there are some interesting points which could stand some improvement.

First, with two dual level TLBs, one each for the instruction and data caches, accesses to the page tables in memory are kept to a minimum. Interestingly, the Athlon designers provided a smaller first level TLB for the I-cache than for the D-cache, while the second level TLBs are the same size. Presumably, the smaller first level allows quicker access to the instruction TLB, which is paramount to keeping instruction fetches fast. Yet keeping the second level I-cache TLB the same size is still important: it is usually not accessed, but when it is, a larger second level TLB can prevent a walk of the page tables in main memory.

The L1 caches, at 64 kilobytes apiece, are four times larger than the L1 caches on Intel's Pentium series of processors. While this certainly improves the performance of the processor, it requires many additional transistors, which explains the size and transistor count difference between the two processors. Another interesting point: the level 1 data cache is "multi-ported," meaning there are multiple banks in the cache, allowing up to two load/store accesses per cycle as long as they are to separate banks. Certainly, there are numerous times when one of the two ports stands empty while the other is overbooked. It would be nice to have a true multi-ported data cache.

The L2 cache is located off the CPU die on a separate chip. Certainly, this decision allows smaller CPU dies, lower transistor counts, and thus higher yields and lower prices. However, all new x86 chips by Intel have the L2 cache integrated onto the die, increasing performance. With integrated L2 caches, the Intel caches can achieve the same performance as the Athlon caches with about half the storage capacity.

Still, one advantage to having an off die cache is that creating multiple versions of the processor with a variety of L2 cache sizes is much easier. The Athlon supports between 512K and 8M of L2 cache, and processors with these cache sizes can be manufactured as easily as plugging the correct sized cache chip onto the processor board. For Intel to increase the cache size, the whole die layout has to be modified to accommodate the change.

Another wise decision by the AMD Athlon design team was the incorporation of L2 cache tags on chip. This allows early miss detection, so memory can be queried immediately without checking the off board L2 chip. While this does not improve the hit time, it does reduce the miss penalty, and any improvement is welcome.
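
The effect on average memory access time can be estimated with the standard AMAT formula. Every cycle count in the sketch below is a made-up placeholder used only to show the direction of the improvement, not a measured Athlon figure.

#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty.  If an L2 miss can skip the
 * external tag check, the miss penalty shrinks by that check's latency. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    double l2_hit = 20.0, l2_miss_rate = 0.05;          /* placeholder values */
    double mem_latency = 100.0, external_tag_check = 10.0;

    printf("Without on-chip tags: %.2f cycles\n",
           amat(l2_hit, l2_miss_rate, external_tag_check + mem_latency));
    printf("With on-chip tags:    %.2f cycles\n",
           amat(l2_hit, l2_miss_rate, mem_latency));
    return 0;
}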

However, these on die cache tags only cover 512 kilobytes of the cache. This forces any cache chips over 512K in size to have secondary tags incorporated onto the L2 controller, which is the reason only 512K cache Athlons are currently available: AMD has not had time to redesign and test the cache controller that contains the secondary tags. Since most consumers will buy the cheapest 512K cache Athlon, this move was wise on the part of AMD. Why spend extra money and transistors on a feature almost nobody will use?

Another problem the Athlon runs into is the speed of modern memory chips. An on die cache can more easily be tweaked to run at the core processor speed, or at least at half speed. However, with an off die cache, AMD has struggled with the cache not keeping up. In fact, the latest Athlon processor, running at 750 MHz, has an L2 cache to processor speed ratio of 2/5 (or 1/2.5) compared to previous ratios of 1/2. This is one more reason AMD may wish to reconsider the decision to keep the L2 cache off the core CPU die.

Analysis of Athlon System Bus Decisions

The shared bus used by the Intel Pentium series of processors has been pushed far, but AMD was wise to opt for a more advanced bus. The EV6 bus, developed for the Alpha processor at Digital Equipment Corporation, has many advantages over the Pentium bus.

First, the EV6 bus is scalable to higher speeds today. Unfortunately, due to current memory speeds there was no reason to run the bus above 200 megahertz. Memory able to take advantage of even that speed has yet to reach mass distribution. Still, it is nice to know that the bus is scalable when faster memory becomes available.

Second, with the EV6's dedicated point-to-point connections, bus conflicts are a thing of the past (at least between multiple processors). While this won't affect single processor mass market computers, reducing these conflicts helps improve performance on high-end multiprocessor systems (which AMD wants to target for higher profits). Furthermore, the Pentium bus reaches a limit after four (or, with some stretching, eight) processors on the bus. With the EV6 bus, the Athlon can be used in systems with up to 14 processors (at which point, presumably, the number of point-to-point connections just becomes too much).

Another nice property of the Athlon EV6 bus implementation is the 72 bit wide data transfers. This allows the bus to not only transfer the 64 bits of data, but also 8 bits of error correcting code. Checking the ECC to guarantee that the data remains uncorrupted is vital for users of servers and other critical applications.

The address lines of the EV6 bus are 43 bits wide in the Athlon implementation. AMD could have reduced these address lines to the standard 36 bits used on the Pentium bus, but instead added the logic to deal with larger address spaces, allowing up to 8 terabytes of memory to be addressed. Considering that high end consumer PCs today contain 256 megabytes of memory and it is not uncommon to see servers with 512 MB to 1 GB of memory, the 4 GB limit imposed by 32 bit systems, and even the 64 GB allowed by a 36 bit address space, will shortly not be enough. Here, AMD made a wise investment.

Oddly, the cache coherency protocol used on the Athlon bus is a non-standard 5 state MOESI protocol. Supposedly, this allows reduced write-invalidation traffic on the bus. But considering the vastly improved bus bandwidth, and the fact that write-invalidate traffic does not block all transactions on a point-to-point bus, one wonders why AMD went to such trouble to implement and test a less common cache coherency protocol.

Analysis of Athlon I/O Decisions

One of the slightly disappointing, though not unexpected, aspects of the Athlon is the AMD chipset which allows the processor to talk to the rest of the system.

First, AMD designed the AMD-751 chipset for yesterday's high end protocols. For example, the chipset only supports a 66 MHz AGP bus at 2x speed, while Intel chipsets are already on the market with support for the new 4x AGP bus. Additionally, only 100 MHz memory chips are supported even though the bus and the processor could benefit from the newer 133 MHz memory currently on the market.

While unfortunate, this is not surprising. AMD has stated all along that it wants to avoid the chipset business and concentrate on building good CPUs. Hopefully, other vendors will soon provide chipsets which will allow the Athlon processor to talk to the rest of the system at the highest speeds available today.

Conclusions

The AMD Athlon processor has had an enormous impact on the personal computer microprocessor market. For the first time, due to some ingenious designs, Advanced Micro Devices has a microprocessor more powerful than any other x86 processor on the market. Features like its larger caches, faster system bus, fully pipelined floating point units, and very good branch prediction set the Athlon apart from its competitors. Still, there is room for improvement, and in today's quickly moving microprocessor market Intel, or some other company, may take advantage of the weaker areas of the Athlon to design a better microprocessor.


Bibliography

[1] K7 Challenges Intel. Microprocessor Report, Vol. 12 Num. 14, October 26, 1998.

[2] AMD Athlon™ Processor Technical Brief. AMD Technical Documents, November 1999, available at http://www.amd.com/products/cpg/athlon/techdocs/index.html

[3] AMD Athlon™ Processor Data Sheet. AMD Technical Documents, November 1999, available at http://www.amd.com/products/cpg/athlon/techdocs/index.html

[4] AMD-750™ Chipset Overview. AMD Technical Documents, August 1999, available at http://www.amd.com/products/cpg/athlon/techdocs/index.html

[5] Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture. Intel Pentium III Processor Manuals, available at http://www.intel.com/design/PentiumIII/manuals/

[6] CPU Guide - The New Athlon Processor. Tom's Hardware, August 9, 1999. Homepage: http://www.tomshardware.com, Article: http://www5.tomshardware.com/cpu/99q3/990809/index.html

[7] AMD Athlon™ Technical Overview. AMD Technical Documents. Available at http://www.amd.com/products/cpg/athlon/overview.html

[8] Slides from AMD Microprocessor Forum 1998 Talk. Given by Dirk Meyer. Available at http://www6.tomshardware.com/cpu/98q4/981015/sld001.html