Background Material

There's a truckload of background material for this paper ... I'll keep updating this page and re-organizing it. Some sections are marked as "Under construction". I've read a lot, but need to organize the information, so don't be dismayed if those sections are sparse. A lot of this material is PC/Intel-specific.

A lot of my info for legacy systems comes from "The Indispensable PC Hardware Book", Second Edition, by Hans-Peter Messmer. (There is a copy in the flux office. I have it right now.)

Part 1: Before Interrupts

Before interrupts, there was polling. The CPU would periodically check devices like keyboards to see if they needed attention. Devices were slow back then, so polling led to a lot of wasted cycles.

Interrupts were introduced in the 1950s, according to this site.

Part 2: Primitive Interrupts and DMA on the IBM PC/XT/AT

Wikipedia pages: IBM PC/XT/AT

These are from the 1980s and used Intel 8086/8088/80286 CPUs. They had two address spaces: memory space and I/O space. Memory is accessed via mov instructions, I/O space via in/out instructions to certain port addresses. (While this is still true on current Intel chips, the I/O space is ultra legacy.)
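
To make the two access styles concrete, here's a minimal C sketch (GCC inline assembly; it needs I/O privilege, e.g. ring 0 or ioperm() on Linux, so treat it as illustrative):

    #include <stdint.h>

    /* I/O space access: must use in/out instructions. */
    static inline void outb(uint16_t port, uint8_t val)
    {
        __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
    }

    static inline uint8_t inb(uint16_t port)
    {
        uint8_t val;
        __asm__ volatile ("inb %1, %0" : "=a"(val) : "Nd"(port));
        return val;
    }

    int main(void)
    {
        /* Memory space access: an ordinary store, compiled to a mov. */
        volatile uint8_t *mmio = (volatile uint8_t *)0xB8000; /* VGA text buffer */
        *mmio = 'A';

        /* I/O space access: 0x60 is the legacy keyboard data port. */
        uint8_t scancode = inb(0x60);
        (void)scancode;
        return 0;
    }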

Here's a block diagram of the IBM PC/XT computer. I'll use it to explain the interrupt technology from that time.


IBM PC/XT Block Diagram (full size image). From Messmer's book, page 450.

Diagram notes:

  • 8284: clock generator
    • This clock is used for the whole thing (local bus, control, address, and data buses).
    • The clock controls how fast the buses are (number of transactions per second, data throughput, etc.), because the bus protocol uses the clock period to stipulate when events can occur.
    • You can see it's hooked to the CPUs and to the 8288 bus controller.
  • 8086/8088: cpu
    • Connected to two address buffers (one for the memory space, one for I/O space).
    • 20-bit parallel address bus (addresses up to a whopping 1 MB of memory)
    • 16-bit parallel data bus (8 bits externally on the 8088)
    • I'm emphasizing "parallel" because PCI Express and a lot of modern interconnect technologies are serial.
    • Roughly: "Parallel" means 8 bits are sent on 8 wires. "Serial" means 8 bits are sent on 1 wire, in sequence.
  • 8087: math coprocessor
  • 8288: bus controller
    • Translates cpu operations to bus operations. (I think this means it translates combinations of pin settings on the cpu to the correct signals on the bus.)
  • 8259A: programmable interrupt controller (PIC)
    • Multiplexes "interrupt requests" (IRQs) from other system components onto the single INTR pin on the cpu. (Explained in a bit more detail below.)
    • It supports 8 IRQ lines.
    • Some of these are hardwired (channel 0 of the 8253 programmable interval timer, the internal system clock, is hardwired to IRQ 0).
    • The PIC is programmable, so it supports some configuration and runtime commands. For example, you can configure the "base" vector value.
    • See here for the full documentation. The important things to note from this document:
      • Page 1: Note the incoming IRQ pins and the data (D0-D7) pins. The PIC sends out the interrupt vector on the D0-D7 pins.
      • The IRR/ISR/IMR registers will come up again when we look at APICs, so I wouldn't bother understanding those in depth yet.
  • 8237: Direct Memory Access (DMA) chip
    • Supports 4 "channels".
    • Each channel (except channel 0) basically describes a region of memory that is used to transfer data to/from a peripheral.
    • For example, the floppy disk usually uses channel 2.
    • Peripherals can choose a channel by asserting some pins.
    • Channel 0 is wired to the 8253 programmable interval timer and triggers a memory refresh. (Required for RAM to keep it from losing data.)
  • 8253: programmable interval timer
    • Also has 3 "channels" that represent different timer types.
    • Channel 0 is the internal system clock and is wired to IRQ 0 on the PIC.
    • Channel 1 is for memory refresh and is hard wired to channel 0 on the DMA chip.
    • Channel 2 drives the PC speaker.

Most if not all of these components are configured using in and out instructions in the I/O space.
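
For example, here's a minimal sketch of (re)programming the 8259A's base vector (the "base" vector configuration mentioned above), using the outb() helper sketched earlier and the standard PC port numbers:

    #define PIC_CMD  0x20   /* 8259A command port */
    #define PIC_DATA 0x21   /* 8259A data port    */

    /* Re-initialize a single 8259A so that IRQ n delivers vector base + n. */
    static void pic_set_base_vector(uint8_t base)
    {
        outb(PIC_CMD,  0x13);  /* ICW1: edge-triggered, single PIC, ICW4 follows */
        outb(PIC_DATA, base);  /* ICW2: the base interrupt vector */
        outb(PIC_DATA, 0x01);  /* ICW4: 8086/8088 mode */
    }

The BIOS mapped IRQ 0-7 to vectors 0x08-0x0F; that range collides with the first 32 vectors Intel reserves for exceptions, which is why protected-mode OSes remap the PIC (e.g. pic_set_base_vector(0x20)).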

Interrupts on IBM PC/XT

On the peripheral card, you manually configure which IRQ the card uses with jumpers (little lego-like pieces you push onto two prongs on the card). The card will use the corresponding IRQ pin and the rest will remain unused. (One of the benefits of legacy PCI is that fewer IRQ pins go to waste. More on that later.)

You also need to configure the PIC (using out instructions to the correct I/O addresses) to assign interrupt vectors to IRQs. You have to honor certain restrictions: for example, the internal system clock is hardwired to IRQ 0, and Intel reserves the first 32 vectors.

Next, you need to set up your interrupt pointer table (that's what Intel called it back in the day) with your service routines. It lived at a fixed location: the first 1 KB of memory. 256 entries (just as today), each 4 bytes.
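
Here's a sketch of an entry's layout and a hypothetical helper for installing a routine (real-mode segment:offset; real code would also disable interrupts around the update):

    #include <stdint.h>

    /* One interrupt pointer table entry: 4 bytes, offset then segment.
     * Vector n lives at physical address n * 4; 256 entries fill 1 KB. */
    struct ivt_entry {
        uint16_t offset;   /* offset of the service routine */
        uint16_t segment;  /* code segment of the routine   */
    };

    /* Hypothetical helper: install a service routine for vector n. */
    static void set_vector(uint8_t n, uint16_t segment, uint16_t offset)
    {
        struct ivt_entry *ivt = (struct ivt_entry *)0;  /* table starts at 0 */
        ivt[n].offset  = offset;
        ivt[n].segment = segment;
    }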

Alright, action! The peripheral card asserts its IRQ pin. This goes to the PIC. The PIC uses some internal logic (which I'll skip over for now; we'll see it in depth when we talk about APICs) and asserts the cpu's INTR pin. The cpu acknowledges with the INTA pin. There's some back and forth, and then the PIC delivers the interrupt vector that corresponds to that IRQ. The cpu should signal an End of Interrupt (EOI) when it's done. (Again, we'll see this in APICs.)

What is a "bus master"?

Not to be confused with "bus arbiter".

A bus master is a system component that can request a bus transaction, typically DMA. The bus master is given exclusive access to the address and data buses for a certain number of bus cycles.

On the IBM PC/XT, only the CPU and the DMA chip could do that; the expansion cards could not. For example, a LAN card would assert DRQ3 to say it had data ready to store into the host's memory, but it would be up to the DMA controller to ack the request and initiate the transaction.

Later bus architectures allow peripherals to be bus masters, including a limited form on the IBM PC/AT (the peripheral would still request access through the DRQx lines, and would assert the MASTER line after an ack from the DMA chip).

[Abhi] Earlier systems had a DMA controller that sat on the motherboard to perform DMA to and from memory. The DMA controller would arbitrate between the devices connected to it, but this offered little performance improvement, since bandwidth was limited by the bus width of the DMA controller.

With the advent of the PCI/PCIe standards, the DMA functionality moved into the I/O device that resides on the PCI/PCIe bus. Any PCI device can become a bus master by sending a request to the PCI controller, which claims the reads/writes from the device and forwards them to the memory controller to be written to memory.

To give a sense of timing: a write/read from the I/O device lands on the PCI controller (called the South bridge in earlier architectures, now replaced with the PCH), which forwards the transactions over the system bus (HyperTransport has now been replaced with DMI (Direct Media Interface), which shares the same clock as the PCI/PCIe bus) to the memory controller. The memory controller in turn converts the transactions into the DDR3 format that the memory chips understand.

See also the Wikipedia article.

Part 3: Level- versus Edge-triggered Interrupts

Wikipedia has a great explanation here on interrupt triggering.

In older systems, all the way up through PCI (PCI requires level-triggered), the system required certain triggering types. The operating system needed to know whether the interrupt was level- or edge-triggered because this sometimes required extra interrupt handling steps. For example, on an older system with level-triggered interrupts, the interrupt controller would not automatically mask the IRQ line; the CPU had to tell it to. If the CPU didn't, the asserted IRQ line would keep triggering interrupts.

Early interrupts were edge-triggered. EISA, MCA, and PCI are level-triggered. (EISA allows edge-triggered for backward compatibility with ISA.) This allowed devices to share an IRQ line (see Wikipedia article). Devices that want service assert their IRQ line and hold it until they are acknowledged. The OS cycles through all of the devices (via device drivers) on an IRQ line, so eventually, a device will be serviced. When a device is serviced, it stops asserting the IRQ line. When all devices are serviced, the IRQ line is inactive.
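
To make the extra handling concrete, here's a hedged pseudo-C sketch of generic level-triggered flow handling on a shared line; every name in it is hypothetical:

    struct handler {
        struct handler *next;
        void *dev;
        void (*service)(void *dev);
    };

    extern struct handler *irq_handlers[16];
    extern void mask_irq(int irq), unmask_irq(int irq), send_eoi(int irq);

    void handle_level_irq(int irq)
    {
        mask_irq(irq);   /* keep the still-asserted line from re-triggering */
        send_eoi(irq);   /* tell the interrupt controller we've seen it */

        /* The line may be shared: poll every driver on this IRQ. Each one
         * checks its own device; a serviced device deasserts the line. */
        for (struct handler *h = irq_handlers[irq]; h != NULL; h = h->next)
            h->service(h->dev);

        unmask_irq(irq); /* if a device still asserts the line, we take the
                            interrupt again immediately */
    }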

Charlie/Abhi speculation: On modern Intel platforms with APICs, level- and edge-triggered interrupts are used by devices to convey different information / operating modes (need an example). But the APICs take care of proper IRQ masking. For example, a device running in mode A uses edge-triggered interrupts (triggering on, say, the rising and falling edges of the interrupt line); in mode B, it uses level-triggered interrupts.

See also Linux Kernel Architecture by Wolfgang Mauerer, 14.1, "Interrupts". This explains the guts of Linux 2.6's IRQ handling. 14.1.5 "Interrupt Flow Handling" talks about the different steps needed to correctly handle an edge- versus a level-triggered interrupt in an architecture-independent manner.

Part 4: ISA, EISA, and MCA

There are a few more architectures before PCI appeared.

ISA: Industry Standard Architecture

According to Messmer, this is basically the IBM PC/AT architecture, which is similar to the PC/XT I walked through above. (Some differences: larger buses to accommodate Intel 80286, more IRQs, more integration of components onto single chips.)

ISA requires edge-triggered interrupts.

EISA: Extended ISA

Here's a block diagram:


EISA Bus Architecture (full size image). From Messmer's book, page 477.

New things:

  • 32-bit address and data buses
    • Accommodates new 32-bit i386, i486
  • More integration of components
    • Intel 82357 integrated system peripheral (ISP) - interrupt controller, DMA controller, bus arbiter, timer, NMI logic
    • Intel 82358 EISA bus controller (EBC)
  • EISA peripherals can be true bus masters
    • Hence the new bus arbiter
    • I (Charlie) believe the arbiter makes choices based on the DMA channel the peripheral is trying to use (DMA channels are now organized into priority groups; memory refresh is in the highest priority group)
  • Enhanced interrupt controller
    • Can specify which IRQs are edge- and which are level-triggered (with the 8259A, it was all edge or all level)
  • NMI Watchdog
  • Devices are configured via the I/O address space (rather than manually with jumpers and DIP switches)
  • Backward compatible with ISA cards
    • Uses tricks to split 32-bit data into chunks ISA cards can handle, etc.
    • Uses "bus clock stretching" to accommodate slower cards
    • DMA backward compatible for ISA (DMA becomes bus master on behalf of ISA card)

MCA: IBM's Micro Channel Architecture

Here's a block diagram:


IBM Microchannel Architecture (MCA) (full size image). From Messmer's book, page 495.

Similar to EISA. (Messmer says EISA was a reaction to IBM's MCA. IBM MCA was proprietary.)

  • 32-bit bus
  • Unlike EISA, the local CPU bus and the MCA bus are asynchronous (they use different clocks)
  • Integrated Video Graphics Array (VGA)
  • Level-triggered interrupts

Part 5: Early Advanced Programmable Interrupt Controllers (APICs)

Intel's first APIC was (I, Charlie, believe) the discrete 82489DX chip, used in early multiprocessor systems built around the 486 and early Pentiums. Later, with the Pentium, the local APIC was integrated onto the processor die.

APICs allow for more advanced routing of interrupts in multiprocessor systems. This manual describes Intel's multiprocessor architecture of the time (1997) with APICs. Here's a block diagram that has two 486's. There are some important notes that follow the diagram.


APICs with Intel 486 CPUs (full size image). From Intel Multiprocessor Specification, 1997, page 5-3.

There are a lot of wires here, but the important things to note are:

  • There are two APIC types: local and I/O (both fulfilled by the 82489DX, it's dual purpose)
    • Each CPU has a local APIC that receives interrupts from either a legacy 8259A PIC (via INTR), or from the I/O APIC over the ICC bus
    • There is typically one I/O APIC wired to the legacy PICs and other integrated chips
  • The legacy 8259A PICs still play a pretty significant role
    • They still signal an interrupt with an INTR assert
    • I'm guessing the PICs' data lines are hooked to the I/O APIC (not shown), and the I/O APIC sends the interrupt vector (programmed on the PIC) to the correct local APIC. It's hard to say for sure, though. The local APICs can be turned off, and the PICs can then interrupt a CPU directly. In that case, I would assume the PIC sends the interrupt vector over directly as well, so it needs to be wired to the CPU.
    • The I/O APIC and local APICs are programmed via a memory mapped table (not shown). This is how the I/O APIC knows which local APIC to route the vector to.
  • Intel's APICs are compatible with all of the bus architectures - ISA, EISA, MCA, PCI, and so on (you could run a 486 on an old ISA system, it would just be heavily underutilized)
  • The APICs use a dedicated bus for interrupts and for Interprocessor Interrupts (IPIs), a new idea at the time

Part 6: Peripheral Component Interconnect (PCI)

As Messmer writes, manufacturers had isolated the fast local bus connected to the CPU, memory, and graphics from the slower system bus (e.g., the EISA bus). But then there was a desire to hook up other things, like a hard disk controller, to the local bus via an expansion slot. Manufacturers wanted to standardize this, and they came up with PCI and VL (the VESA Local Bus).

Like its predecessors, PCI is a shared bus and peripherals can be bus masters (so there is therefore a PCI bus arbiter). But PCI goes further:

  • The PCI clock speed is a lot faster, so transfer rates are higher
  • PCI introduces a new address space, the configuration address space, for getting/setting device configuration data (see the access sketch just after this list)
  • You can hook up various bus types with bridges (PCI-PCI, PCI-ISA, PCI-EISA, ...). The bridges can be smart and cache writes and reads, and translate PCI transactions into e.g. an ISA transaction
  • PCI devices can use one of four interrupt pins (INTA, INTB, INTC, or INTD)
    • These interrupt pins are level triggered, like EISA and MCA
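
Here's a hedged sketch of reading a device's configuration header on x86 via the legacy CONFIG_ADDRESS/CONFIG_DATA port pair (0xCF8/0xCFC); the port helpers mirror the earlier in/out sketches:

    #include <stdint.h>

    static inline void outl(uint16_t port, uint32_t val)
    {
        __asm__ volatile ("outl %0, %1" : : "a"(val), "Nd"(port));
    }

    static inline uint32_t inl(uint16_t port)
    {
        uint32_t val;
        __asm__ volatile ("inl %1, %0" : "=a"(val) : "Nd"(port));
        return val;
    }

    /* Read a 32-bit register from a function's configuration header. */
    static uint32_t pci_config_read(uint8_t bus, uint8_t dev,
                                    uint8_t func, uint8_t offset)
    {
        uint32_t addr = (1u << 31)               /* enable bit           */
                      | ((uint32_t)bus  << 16)   /* 8 bits: bus 0-255    */
                      | ((uint32_t)dev  << 11)   /* 5 bits: device 0-31  */
                      | ((uint32_t)func << 8)    /* 3 bits: function 0-7 */
                      | (offset & 0xFC);         /* dword-aligned offset */
        outl(0xCF8, addr);                       /* CONFIG_ADDRESS */
        return inl(0xCFC);                       /* CONFIG_DATA    */
    }

Offset 0 of the header holds the Vendor ID (low 16 bits) and Device ID (high 16 bits), which is how device scanning works: a read of all ones means nobody is home.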

Bus, device, function

Each bus is given a unique ID (I believe by the BIOS at boot). The bus closest to the CPUs is bus 0.

A physical card or embedded chip hooked to a PCI bus is further identified by a device number. This is platform-specific and depends on how the IDSEL lines on the PCI bus are wired.

A PCI device can have multiple "logical devices", or functions. The device gives each function an ID. Each device has at least one function (function 0), and can have up to 8 functions.

If you look at the output of lshw, lspci, etc., you may see something like pci@0000:00:1a.2. This is giving you the device's ID numbers: the 0000 is the PCI domain (segment), the 00 means bus 0, the 1a means device 26 (the fields are hex), and the 2 means function 2 on that device. (For me, this is my PCI/USB controller.)
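
A quick sketch of pulling those pieces out of the string:

    #include <stdio.h>

    int main(void)
    {
        const char *s = "0000:00:1a.2";  /* domain:bus:device.function */
        unsigned domain, bus, dev, func;

        if (sscanf(s, "%x:%x:%x.%x", &domain, &bus, &dev, &func) == 4)
            printf("domain %u, bus %u, device %u, function %u\n",
                   domain, bus, dev, func);
        /* prints: domain 0, bus 0, device 26, function 2 */
        return 0;
    }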

Intel Block Diagrams

Intel also made some significant architectural changes around this time (late 1990s). Here is a typical block diagram for a single core architecture. This comes from "PCI System Architecture" by Shanley and Anderson, Fourth Edition (copy available in flux, I have it right now).


Intel Architecture from 1990s with PCI (full size image). From Shanley and Anderson, page 103.

I believe this is when the notion of a North and South bridge was introduced. The North bridge contains a host-to-PCI bridge. Here is a block diagram for two cores with an I/O APIC:


Intel Architecture from 1990s with PCI and Two Cores (full size image). From Shanley and Anderson, page 11.

You can see how interrupts are wired into the I/O APIC, and the I/O APIC will signal the cores using the APIC bus. One significant thing not shown in this picture is that all subsequent interactions are via the PCI bus. I will explain this more below. Basically, the CPU ack's the interrupt over the PCI bus (via the host/PCI bridge), and the interrupt controller(s) send the vector back over the PCI bus as well. Kind of interesting.

System Initialization

This is platform-specific, but this is roughly how it goes (see Shanley and Anderson pg 239):

  1. BIOS initializes the interrupt vector table entries to dummy values.
  2. BIOS installs interrupt service routines for embedded devices.
  3. BIOS hooks in PCI BIOS routines via an interrupt vector.
    • In old systems (maybe still today?), you invoke these routines via a software interrupt instruction (the PCI BIOS hooks INT 1Ah), putting arguments in CPU registers.
    • The PCI spec mandates certain functions the BIOS should provide for doing PCI stuff (like scanning for devices)
  4. BIOS scans legacy buses (e.g. ISA) for expansion ROMs that contain device drivers.
    • These will typically set up an interrupt service routine for an ISA device.
  5. BIOS scans all PCI devices and their functions and does the following for each
    • Reads PCI device's configuration header
    • Configures device for interrupts
      • If device wants to use Message Signaled Interrupts (see below), configures it if platform supports it
      • If instead interrupt pins must be used, looks at "Interrupt Pin" field to know whether device is using INTA, INTB, INTC, INTD lines
      • Programs the interrupt router (I/O APIC) to route the interrupt to a particular IRQ
      • Note that I/O APIC is hooked to PCI bus, so it can receive and handle MSIs
    • Loads device ROM that may contain device drivers into main memory and execs it
      • Device ROM will e.g. set up an interrupt service routine
    • Maps Base Address Registers (BARs) into memory or I/O space
      • Reads BAR entries in the configuration header to see how much space the device wants (via the sizing handshake sketched just after this list)
      • Allocates space, and writes the starting address to the header field
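
Here's a hedged sketch of that sizing handshake, building on the pci_config_read()/outl() sketches above; BAR0 lives at offset 0x10 in the configuration header:

    /* Companion to pci_config_read() above. */
    static void pci_config_write(uint8_t bus, uint8_t dev, uint8_t func,
                                 uint8_t offset, uint32_t val)
    {
        outl(0xCF8, (1u << 31) | ((uint32_t)bus << 16) | ((uint32_t)dev << 11)
                  | ((uint32_t)func << 8) | (offset & 0xFC));
        outl(0xCFC, val);
    }

    /* How much memory does BAR0 want? Write all ones and read back: the
     * device hardwires the low address bits to zero, revealing its size. */
    static uint32_t bar0_size(uint8_t bus, uint8_t dev, uint8_t func)
    {
        uint32_t orig = pci_config_read(bus, dev, func, 0x10);

        pci_config_write(bus, dev, func, 0x10, 0xFFFFFFFF);
        uint32_t probe = pci_config_read(bus, dev, func, 0x10);
        pci_config_write(bus, dev, func, 0x10, orig);   /* restore */

        probe &= ~0xFu;          /* mask off the memory BAR flag bits */
        return ~probe + 1;       /* e.g. reading back 0xFFFF0000 -> 64 KB */
    }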

I/O Space Starting To Get Painful

At this point, the I/O space (only 64 KB) was starting to get crowded, and some devices expected to be mapped at certain addresses. Just a mess. So I/O space use was discouraged starting around this time.

Interrupts In Action

There are many possibilities; it depends on the platform.

Pin-based, no I/O APIC (see Shanley and Anderson, pg 101):

  • Device asserts INTA# (for example)
  • This is routed to IRQ3 on a PIC (for example)
  • The PIC asserts INTR on the CPU
  • The CPU, via its host/PCI bridge, does two INTA PCI bus transactions
  • Only the PIC will respond to INTA's
  • On the second INTA, the PIC will send the interrupt vector to the host/PCI bridge
  • The host/PCI bridge sends the interrupt vector to the CPU over the internal system bus

If an I/O APIC is in the picture, a lot of this interaction will happen over the APIC bus instead. The INTA# would be hooked into the I/O APIC, and perhaps (I'm not sure), the INTA and interrupt vector delivery would happen over the APIC bus.

MSI-based:

  • See below

The Manuals

There's a lot I skipped over that isn't as interesting for us (electrical signaling, expansion card dimension requirements, power management, error handling, latency guarantees, fair bus sharing, etc.). You may also be curious about the lower-level details of reads and writes (the wire signaling used).

These are the manuals. I'm hosting them on my Google Drive for now (found from random websites). We should maybe find a permanent home for them? Also, should we worry about the legality of having links to these? They're thousands of bucks.

  • PCI 3.0 Specification
    • Reading suggestions:
      • Skim Chapter 1, look at diagrams
      • Skim Chapter 2. It helps to be familiar with some of the important pins, like the address pins, interrupt pins, command/byte enable pins, and so on.
      • Chapters 3 and 6 are the bulk of what you need to read. Skim at your own discretion.
      • The rest of the chapters aren't too important.
      • It's important to note that PCI uses a parallel shared bus with a bus arbiter.
  • PCI BIOS 2.1 Specification
    • More of an FYI
    • This describes the interface a BIOS on an Intel platform must provide so that an operating system can query devices and do other operations that cannot be done with memory mapped I/O, etc.

Shanley and Anderson is a great text too. Easier to read.

Part 7: Message Signalled Interrupts (MSI, MSI-X)

(This section could possibly be improved.)

MSI

MSI was introduced in PCI 2.2. Instead of asserting a pin, a device sends a regular PCI message (a PCI write) to an address partially configured by the BIOS at boot time. The data it sends is also partially configured by the BIOS. With PCI, a device can deliver up to 32 distinct interrupts this way (the maximum allowed). The BIOS/platform determines how those MSI messages are interpreted. (For example, on a platform with an I/O APIC, the BIOS programs it so that messages are routed to certain IRQs.)
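
To make "a PCI write to a configured address" concrete, here's a sketch of how the address/data pair is composed on an Intel platform (format from Intel SDM Vol. 3, 10.11); the BIOS/OS writes these values into the device's MSI capability registers:

    #include <stdint.h>

    /* The device "sends an interrupt" by performing an ordinary PCI
     * memory write of msi_data(...) to msi_address(...). */
    static uint32_t msi_address(uint8_t dest_apic_id)
    {
        return 0xFEE00000u                      /* fixed 0xFEExxxxx range */
             | ((uint32_t)dest_apic_id << 12);  /* destination APIC ID    */
    }

    static uint32_t msi_data(uint8_t vector)
    {
        return vector;  /* bits 7:0 = vector; the other bits select the
                           delivery and trigger modes (0 = fixed, edge) */
    }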

Advantages of MSI:

  • More IRQs possible. Even though Intel CPUs only support 256 interrupt vectors, you can route certain IRQs to different cores
  • Avoids some tricky ordering edge cases
  • No more sharing of IRQ lines

Disadvantage: You might experience more latency, since now the device has to gain access to the bus in order to send the interrupt. (I'm guessing in PCI Express they somehow resolved this.)

MSI-X

Allows devices to use up to 2,048 IRQs. The message address and data are no longer restricted to the extent they are in MSI. For modern MSI-X on Intel, see Intel SDM Volume 3, 10.11 "Message Signalled Interrupts". The device can target specific CPUs and specify the interrupt vector.
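
Here's a sketch of one MSI-X table entry (16 bytes, per the PCI spec). The table lives in device memory mapped through one of the device's BARs, and each entry carries its own independently routable address/data pair:

    #include <stdint.h>

    struct msix_entry {
        uint32_t msg_addr_lo;  /* where to write (e.g. 0xFEExxxxx on Intel) */
        uint32_t msg_addr_hi;  /* MSI-X allows full 64-bit addresses        */
        uint32_t msg_data;     /* what to write (vector etc.)               */
        uint32_t vector_ctrl;  /* bit 0 masks this entry                    */
    };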

Part 8: Controller Hubs

Now we're getting close to modern Intel architectures. Intel no longer has the notion of North and South bridges. Around the early 2000s, they switched to the Memory Controller Hub (MCH) / I/O Controller Hub (ICH) terminology. These hubs are no longer connected via PCI; they use an Intel proprietary "hub link". Here is a typical block diagram (from "PCI Express System Architecture", by Budruk, Anderson, and Shanley):


Intel Architecture from mid-2000s with PCI-X (full size image). From Budruk, Anderson, and Shanley, page 33.

PCI-X

As you can see, there's another bus architecture I'm skipping over here: PCI-X. PCI-X is still a shared bus, but uses some clever optimizations to increase the transfer rate. It's also important to note that ISA is no longer in the picture. (A motherboard could probably be designed to support ISA, but it would have to hang off the legacy PCI bridge, and perhaps have custom IRQ line wiring for legacy ISA interrupts.)

One clever optimization was split transactions:

  • In legacy PCI, if a target couldn't fulfill a request, it sent the master a Retry
  • This required the master to wait an arbitrary (but unknown) time period and try again
  • Was inefficient

On PCI-X:

  • The target remembers the master's request, and immediately ack's
  • When the target can fulfill the request, it notifies the master of completion

There are still two bus cycles involved, but they're split, and the master no longer has to guess when the target is ready to retry.

North Bridge Integrated on Processor Die

With Intel's Sandy Bridge architecture, released around 2009-2011, the North bridge/MCH is now on the processor die.

The IO Hub is now called the Platform Controller Hub (PCH). The PCH is connected to the processor die via Intel's proprietary Direct Media Interface (DMI) link. Here is a modern Intel block diagram (2015):


Intel x99 Chipset (2015) (full size image).

You can see that some PCI Express peripherals are connected directly to the processor die. (They somehow sync up with the integrated North bridge?)

Part 9: Later APICs

(Under construction)

See Intel SDM Volume 3, Chapter 10.

There are three technology phases:

  • Phase 1: Intel 82489DX APICs. These used a dedicated APIC bus. They sit on the "other side" of the PCI bridge.
  • Phase 2: xAPIC architecture: PCI MSI's go to PCI bridge and then to target processor (I/O APIC not involved, as far as I can tell). Other interconnects go to I/O APIC, but the I/O APIC uses PCI bus to deliver interrupts. See Figures 10-2 and 10-3.
  • Phase 3: x2APIC architecture. I think interrupt delivery is similar to xAPIC, but there are lots of other new features, including VT-d stuff (IOMMU)

Part 10: PCI Express

There's a lot to talk about here, and not much time before the seminar ;)

Terminology:

  • Endpoints: the devices
  • Root Complex: PCI Express is hierarchical. While peer-to-peer messages are possible, most go to the Root Complex. The Root Complex is usually a part of the North Bridge/MCH (and is now integrated on the processor die in modern Intel archs)
  • Posted versus non-posted:
    • Most packet types in PCI Express are non-posted
    • Non-posted means the target sends an acknowledgement (Completion) message after it receives the original message.
    • Posted means the target does not send an ack (this is used for memory writes, for example; the bus is reliable, and as long as there isn't a serious error, the memory write will happen)

Here are some important features:

  • Point-to-point topology consisting of serial links
  • Layered, packet-based protocol that resembles the Internet stack
    • Transaction layer is for initiator-to-target messages
    • Data Link Layer is like the Link layer/layer 2 of the Internet
      • Responsible for reliable data transfer across one hop
    • Physical Layer
      • Digital and electrical sublayers, describing transmission
  • Configuration is the same, via configuration headers

Here is a generic block diagram, showing the packets involved in a memory read:


PCI Express: Memory Read (full size image).

Manuals:

  • PCI Express 3.0 Specification
    • Note there is PCIe 4.0
    • I haven't gotten too far into this document yet. As you read it, you will find a striking resemblance to the networking stack (physical layer, transport layer, point-to-point serial links, etc.).
    • This one is a beast - 1000 pages.
  • SR-IOV 1.1 Specification

Part 11: Single Root I/O Virtualization (SR-IOV)

A device can present multiple virtual functions (VFs), in addition to its physical functions (PFs). Virtual functions do not have their own configuration headers, and must be configured through the physical function's configuration header by the hypervisor (unless you don't care about isolation). This means VMs that use Virtual Functions must be paravirtualized.

Here's a generic diagram from Intel:


SR-IOV (full size image).

Part 12: Intel VT-d

Again, not much time. But here are some high-level points:

  • One or more IOMMUs are distributed throughout the system
    • Each IOMMU takes addresses from devices for DMA and translates them to host physical addresses
    • You can set up a set of page tables per device (via some configuration tables)
    • Devices can use virtual addresses! (You can have two levels of paging on the IOMMUs)
    • IOMMU configuration is through memory mapped tables/registers
  • Interrupt remapping
    • This hardware is resident with the IOMMUs
    • Set up a giant Interrupt Remapping Table
    • Each entry translates an index to a CPU core, interrupt vector, trigger mode, etc. (see the sketch after this list)
    • Allows more fine-grained delivery of interrupts
    • Similar to the remapping table on IO APICs
      • But I believe those were harder to use for fast, dynamic updates of interrupt remaps. There are issues with re-ordering, missing interrupts, etc. Intel VT-d takes care of that mess for you.
  • Interrupt posting
    • This hardware is also resident with IOMMUs
    • Hardware will save interrupts as they come in, rather than delivering them immediately
    • A VMM maintains memory for this
  • Translation Lookaside Buffers (TLBs) are used to cache all of this translation information
    • IOMMU TLBs
    • Interrupt remapping TLBs
    • Interfaces for invalidating entries
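
Here's a simplified, hedged sketch of what one Interrupt Remapping Table entry conceptually holds (the real IRTE is a packed 128-bit structure; see the VT-d spec for the exact bit positions):

    #include <stdint.h>

    /* Conceptual only: a device interrupt arrives carrying an index; the
     * remapping hardware looks the entry up to decide where and how to
     * deliver it. */
    struct irte_sketch {
        unsigned present      : 1;  /* entry is valid                   */
        unsigned trigger_mode : 1;  /* 0 = edge, 1 = level              */
        unsigned dest_mode    : 1;  /* physical vs. logical destination */
        uint8_t  vector;            /* interrupt vector to deliver      */
        uint32_t dest_id;           /* target CPU (APIC ID)             */
    };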

This is the official guide.