Computer Architecture and Organization: CPUs, Memory, and I/O
Computer architecture and organization defines the structural principles governing how processors execute instructions, how memory systems store and retrieve data, and how input/output subsystems connect computational logic to the external world. This page covers the foundational mechanics of CPU design, the memory hierarchy from registers to secondary storage, I/O communication models, and the classification boundaries and tradeoffs that shape real hardware decisions. The subject underpins every layer of software—from operating systems to compilers—making it foundational knowledge for computer science practitioners at every level.
- Definition and scope
- Core mechanics or structure
- Causal relationships or drivers
- Classification boundaries
- Tradeoffs and tensions
- Common misconceptions
- Checklist or steps
- Reference table or matrix
- References
Definition and scope
Computer architecture describes the abstract model of a computing system as seen by software—the instruction set, register model, addressing modes, and memory semantics that programmers and compilers target. Computer organization describes the concrete hardware implementation of that model: how logic gates, buses, cache arrays, and control units are arranged to execute the architectural specification. The distinction matters because a single instruction set architecture (ISA) can be implemented with radically different organizations optimized for different power, performance, or cost targets.
The formal scope of the discipline, as characterized by IEEE and ACM's Computing Curricula 2020 (CC2020), spans ISA design, processor microarchitecture, memory system design, I/O subsystems, and parallel hardware structures. The ACM/IEEE Computer Engineering 2016 curriculum identifies computer architecture as one of 12 knowledge areas required for an accredited computing program.
For a broader map of how this topic fits within the discipline, see Key Dimensions and Scopes of Computer Science.
Core mechanics or structure
The CPU
A central processing unit performs four repeating operations: fetch an instruction from memory, decode it into control signals, execute the specified operation, and write results back. This fetch-decode-execute cycle is the atomic unit of computation in von Neumann and Harvard architectures alike.
Modern CPUs add pipelining, which overlaps these four stages across successive instructions. A 5-stage pipeline—Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), Write-Back (WB)—can theoretically sustain one instruction completion per clock cycle at steady state, as described in Patterson and Hennessy's Computer Organization and Design (Morgan Kaufmann, 6th edition, 2020). Out-of-order execution extends this by allowing later instructions to execute when earlier instructions are stalled, using reservation stations and a reorder buffer to maintain architectural correctness.
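The steady-state throughput claim above follows from simple arithmetic: an ideal k-stage pipeline finishes the first instruction after k cycles and one more instruction every cycle thereafter. A minimal sketch (the `stalls` parameter is an illustrative simplification for hazard-induced bubbles):

```python
def pipeline_cycles(n_instructions, stages=5, stalls=0):
    """Total cycles for an idealized in-order pipeline.

    The first instruction takes `stages` cycles to fill the pipeline;
    each subsequent instruction completes one cycle later, plus any
    stall cycles inserted by hazards.
    """
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1) + stalls

# 1000 instructions on an ideal 5-stage pipeline: 1004 cycles,
# so throughput approaches 1 instruction per cycle as n grows.
print(pipeline_cycles(1000))         # 1004
print(round(1000 / pipeline_cycles(1000), 3))  # 0.996 IPC
```

Without pipelining the same 1000 instructions would take 5000 cycles; the overlap, not any speedup of individual instructions, is where the gain comes from.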
Superscalar processors issue more than one instruction per cycle by replicating execution units. Recent Intel Core microarchitectures, for example, provide several integer ALUs and multiple floating-point/vector units per core, with exact port and unit counts documented in the Intel 64 and IA-32 Architectures Optimization Reference Manual.
The Memory Hierarchy
Memory is organized in a hierarchy where each level trades capacity against latency:
- Registers: typically 16–32 general-purpose registers (16 in x86-64, 31 in AArch64, 32 in RISC-V); access in under 1 clock cycle
- L1 cache: 32–64 KB per core; 4–5 cycle latency
- L2 cache: 256 KB–1 MB per core; 12–15 cycle latency
- L3 cache (LLC): 6–64 MB shared; 30–50 cycle latency
- DRAM: gigabytes of capacity; 60–80 nanosecond access latency
- SSD/NVMe: terabytes of capacity; microsecond-to-millisecond access latency
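The practical effect of this hierarchy is captured by average memory access time (AMAT), which folds each level's hit latency and miss rate into one expected cost. A sketch using the cycle counts above; the miss rates are assumed values for a cache-friendly workload, not vendor figures:

```python
def amat(levels, memory_latency):
    """Average memory access time for a multi-level cache hierarchy.

    `levels` is a list of (hit_latency, miss_rate) tuples ordered from
    L1 outward; `memory_latency` is the DRAM cost in the same units.
    AMAT = hit_latency + miss_rate * (cost of the next level).
    """
    penalty = memory_latency
    for hit_latency, miss_rate in reversed(levels):
        penalty = hit_latency + miss_rate * penalty
    return penalty

# Illustrative: L1 (4 cycles, 5% miss), L2 (12 cycles, 20% miss),
# L3 (40 cycles, 10% miss), DRAM at 300 cycles.
hierarchy = [(4, 0.05), (12, 0.20), (40, 0.10)]
print(round(amat(hierarchy, memory_latency=300), 2))  # 5.3
```

Even with a 300-cycle DRAM penalty, high hit rates keep the average close to the L1 latency, which is the entire justification for the hierarchy.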
Cache coherency protocols such as MESI (Modified, Exclusive, Shared, Invalid) maintain consistency across multiple cores accessing shared cache lines. The JEDEC standards body publishes specifications for DRAM interface standards including DDR4 and DDR5, governing signal timing, voltage, and capacity ranges for commodity memory modules.
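The MESI transitions for a single cache line can be written as a small state table. This is a simplified sketch: it models one core's view, treats an Invalid-state read as always filling to Shared (real protocols fill to Exclusive when no peer holds the line), and omits writebacks:

```python
# Simplified MESI transitions for one cache line in one core, keyed by
# (current_state, event). "remote_*" events are bus snoops of another
# core's accesses to the same line.
MESI = {
    ("I", "local_read"):   "S",  # fill; simplification: assume a peer shares it
    ("I", "local_write"):  "M",  # read-for-ownership, then modify
    ("S", "local_write"):  "M",  # upgrade: invalidates other sharers
    ("S", "remote_write"): "I",  # another core took ownership
    ("E", "local_write"):  "M",  # silent upgrade: no bus traffic needed
    ("E", "remote_read"):  "S",
    ("M", "remote_read"):  "S",  # write back dirty data, then share
    ("M", "remote_write"): "I",  # write back, then invalidate
}

def next_state(state, event):
    # Unlisted (state, event) pairs leave the state unchanged (e.g. a
    # local read of a line already in S, E, or M).
    return MESI.get((state, event), state)

state = "I"
for event in ["local_read", "local_write", "remote_read"]:
    state = next_state(state, event)
print(state)  # "S": the modified line was demoted when a peer read it
```

The table makes the scalability problem visible: every write to a Shared line generates bus traffic to invalidate peers, which is exactly the cost discussed under coherency vs. scalability below.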
I/O Subsystems
I/O devices communicate with the CPU through three mechanisms: programmed I/O (polling), interrupt-driven I/O, and direct memory access (DMA). DMA offloads bulk data transfer from the CPU entirely, allowing a DMA controller to write data from a device directly into DRAM while the CPU continues executing other instructions. The PCI-SIG governs the PCIe (Peripheral Component Interconnect Express) standard, which as of PCIe 5.0 provides 32 GT/s (gigatransfers per second) per lane bidirectionally.
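The GT/s figure converts to usable bandwidth once line encoding is accounted for. PCIe 3.0 through 5.0 use 128b/130b encoding, so only 128 of every 130 transferred bits are payload; a sketch of the arithmetic:

```python
def pcie_lane_bandwidth_gbs(transfer_rate_gt, encoding=(128, 130)):
    """Usable bandwidth per lane per direction, in GB/s.

    transfer_rate_gt is in gigatransfers/second (one bit per transfer
    per lane); 128b/130b encoding leaves 128/130 of the bits as payload.
    """
    payload_bits, total_bits = encoding
    return transfer_rate_gt * (payload_bits / total_bits) / 8  # bits -> bytes

# PCIe 5.0: 32 GT/s per lane, ~3.94 GB/s per direction;
# a x16 link carries ~63 GB/s each way.
per_lane = pcie_lane_bandwidth_gbs(32)
print(round(per_lane, 2), round(per_lane * 16, 1))
```

Because the link is full duplex, these figures apply independently in each direction, which is why DMA reads and writes can proceed concurrently.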
Causal relationships or drivers
Three engineering forces fundamentally shape architectural choices:
Power density constrains clock frequency. Intel's Pentium 4 Prescott core reached 3.8 GHz in 2004 but dissipated over 100 watts at full load, making further frequency scaling thermally unsustainable. This drove the industry toward multi-core designs rather than single-threaded frequency scaling, a transition documented in Intel's processor development history.
Memory wall refers to the growing gap between processor clock speeds and DRAM latency. As of DDR5-6400, peak theoretical bandwidth reaches approximately 51 GB/s, but latency remains in the 14–16 nanosecond CAS range. Cache hierarchies exist entirely to bridge this gap.
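The 51 GB/s figure is straightforward to derive: peak DDR bandwidth is transfers per second times the width of the data bus in bytes. A quick check of the arithmetic:

```python
# Peak DDR bandwidth = transfers/second x bus width in bytes.
# A DDR5-6400 channel performs 6.4 billion transfers per second over
# a 64-bit (8-byte) data bus.
transfers_per_s = 6400e6
bus_bytes = 8
peak_gb_s = transfers_per_s * bus_bytes / 1e9
print(peak_gb_s)  # 51.2 GB/s, matching the figure above
```

Note that nothing in this calculation involves latency: doubling the transfer rate doubles bandwidth while leaving the CAS latency in nanoseconds essentially untouched, which is the memory wall in miniature.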
Locality of reference is the empirical observation that programs tend to reuse recently accessed data (temporal locality) and to access addresses near recently accessed locations (spatial locality). Cache designs exploit both: temporal locality through retention and replacement policies, and spatial locality through 64-byte cache lines that bring adjacent data into the cache on every fill.
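The payoff of spatial locality can be counted directly: with 64-byte lines, sequential byte accesses share lines, while widely strided accesses touch a new line every time. A minimal counting sketch (no timing, just distinct lines touched):

```python
def lines_touched(addresses, line_size=64):
    """Count distinct cache lines touched by a sequence of byte addresses."""
    return len({addr // line_size for addr in addresses})

n = 4096
sequential = range(0, n)           # walk 4 KB byte by byte
strided    = range(0, n * 64, 64)  # same number of accesses, 64 B apart

print(lines_touched(sequential))  # 64: each line fill serves 64 accesses
print(lines_touched(strided))     # 4096: every access lands on a new line
```

Both loops perform 4096 accesses, but the sequential walk needs 64 line fills where the strided walk needs 4096, a 64x difference in memory traffic for identical access counts.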
Compiler design, covered in Compiler Design and Interpreters, interacts directly with these causal forces because compilers must generate instruction sequences that maximize cache utilization and minimize pipeline stalls.
Classification boundaries
ISA taxonomy
| ISA Class | Characteristic | Examples |
|---|---|---|
| CISC | Variable-length instructions; memory-to-memory operations | x86, x86-64 |
| RISC | Fixed-length instructions; load/store memory model | ARM, RISC-V, MIPS |
| VLIW | Multiple operations encoded per instruction word | Intel Itanium (IA-64) |
| SIMD extensions | Single instruction, multiple data lanes | x86 AVX-512, ARM NEON |
Architecture models
Von Neumann architecture uses a single address space and bus for both instructions and data. Harvard architecture uses physically separate instruction and data memories, eliminating structural hazards between instruction fetch and data access. Most modern CPUs use a modified Harvard architecture: physically separate L1 instruction and data caches that unify at the L2 level.
Memory types
SRAM (static RAM) retains state without refresh and forms cache arrays. DRAM (dynamic RAM) requires periodic refresh and forms main memory. NVRAM (non-volatile RAM) categories include Flash (NAND and NOR), 3D XPoint, and MRAM, each differing in endurance, density, and write latency.
The JEDEC Solid State Technology Association publishes formal classifications for all major memory interface standards.
Tradeoffs and tensions
IPC vs. clock frequency: Increasing instructions-per-clock (IPC) requires more complex out-of-order logic, branch predictors, and wider superscalar pipelines, all of which consume die area and power. Pushing clock frequency raises dynamic power, which scales linearly with frequency and with the square of supply voltage (P ∝ CV²f); because higher frequencies typically demand higher voltage, the cost compounds. ARM's big.LITTLE architecture addresses this by pairing high-IPC performance cores with low-power efficiency cores on the same die, a design that requires the operating system's scheduler to assign workloads to appropriate cores.
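The compounding effect of the dynamic power relation is easy to quantify. The normalized values below are illustrative, assuming (as is common in practice) that a 20% frequency increase requires roughly 20% more supply voltage:

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic CMOS switching power: P = C * V^2 * f."""
    return capacitance * voltage**2 * frequency

base    = dynamic_power(1.0, 1.0, 1.0)  # normalized baseline
boosted = dynamic_power(1.0, 1.2, 1.2)  # +20% frequency needing +20% voltage

print(round(boosted / base, 3))  # 1.728: ~73% more power for 20% more clock
```

A 20% clock gain costing 73% more power is the asymmetry that ended single-threaded frequency scaling and motivated heterogeneous designs like big.LITTLE.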
Latency vs. throughput: Shrinking a cache lowers its access latency but reduces its hit rate, raising average memory latency for working sets that exceed it. Enlarging a cache reduces misses but adds latency to every access. This is not a solvable tradeoff; it is managed through multi-level hierarchies.
Coherency vs. scalability: MESI-based coherency protocols generate bus traffic proportional to the number of cores sharing a cache line. In systems with 128 or more cores (such as AMD EPYC 9004 series), directory-based coherency protocols replace snooping protocols to bound traffic, at the cost of additional latency for inter-node communication.
Speculation vs. security: Speculative execution enables CPUs to execute instructions before confirming they will be needed, dramatically increasing throughput. However, the Spectre and Meltdown vulnerability classes (disclosed in January 2018 and documented in CVE-2017-5715 and CVE-2017-5754 by NIST's NVD) demonstrated that speculative side-channel attacks can leak privileged memory contents. Mitigations such as Retpoline and kernel page-table isolation (KPTI) each impose measurable throughput penalties.
Common misconceptions
Misconception: More cores always means faster execution
Single-threaded performance depends on per-core IPC and clock speed, not core count; an inherently sequential algorithm gains nothing from additional cores. Amdahl's Law (Gene Amdahl, 1967) establishes that if 95% of a program is parallelizable, the theoretical maximum speedup with infinite cores is 20×: the remaining 5% serial fraction forms an absolute ceiling.
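The 20× ceiling follows directly from the formula. A quick check of the arithmetic for the 95%-parallel case:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

print(round(amdahl_speedup(0.95, 8), 2))      # 5.93x on 8 cores
print(round(amdahl_speedup(0.95, 10**9), 2))  # 20.0: the infinite-core ceiling
```

Note how quickly returns diminish: going from 8 cores (5.9×) to a billion cores buys less than a 3.4× further improvement, because the 5% serial fraction dominates.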
Misconception: Cache size is the primary determinant of CPU performance
Cache hit rate, not raw cache size, governs performance. A 6 MB L3 cache with high spatial and temporal locality in the workload outperforms a 32 MB cache with poor access patterns. Cache-oblivious algorithms and data structure layout matter as much as hardware provisioning.
Misconception: 64-bit CPUs are simply "faster" than 32-bit CPUs
The primary advantage of 64-bit architecture is the expanded address space: 32-bit addressing limits a process's virtual address space to 4 GB, while 64-bit addressing extends the theoretical limit to 16 exabytes. Arithmetic on 64-bit operands is not inherently faster than on 32-bit operands and can consume more memory bandwidth when the extra width is not required.
Misconception: DRAM access latency has improved proportionally with bandwidth
DRAM bandwidth has improved by roughly 10× from DDR1 to DDR5, but access latency (CAS latency in nanoseconds) has improved by less than 2× over the same period. This asymmetry explains why the memory wall remains a persistent architectural constraint.
Checklist or steps
The following sequence describes how a single load instruction proceeds from software to hardware and back:
- The CPU's instruction fetch unit reads the load instruction from the L1 instruction cache (or fetches from L2/L3/DRAM on a miss).
- The decode stage identifies the source register and addressing mode, producing a micro-operation (µop).
- The address generation unit (AGU) computes the effective memory address from the base register and displacement.
- The load µop is issued to the load/store unit, which checks the L1 data cache for the target address.
- On an L1 hit, data returns in 4–5 clock cycles. On an L1 miss, the request propagates to L2.
- On an L2 miss, the request propagates to the L3 (last-level) cache.
- On an L3 miss, a memory access request is issued to the DRAM controller over the memory bus.
- The DRAM controller activates the target row, transfers the 64-byte cache line to the CPU, and the line is installed in L1, L2, and L3.
- The data value is forwarded to the destination register, and the µop is retired from the reorder buffer in program order.
- If another core holds a modified copy of the same cache line, the MESI protocol triggers an invalidation or writeback before the requesting core receives the data.
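The walk in steps 4–9 can be sketched as a toy hierarchy model. Line installation on the way back (step 8) is what makes the second access to the same 64-byte line an L1 hit; the cycle counts are illustrative values consistent with the latencies given earlier, and the model ignores capacity limits and coherency:

```python
def load_latency(address, caches, dram_latency=300, line_size=64):
    """Walk a 3-level hierarchy for a load, installing the line on return.

    `caches` is a list of sets of resident line numbers, ordered L1, L2, L3.
    Returns the latency (in illustrative cycles) of the level that hit.
    """
    latencies = [4, 14, 40]          # assumed L1 / L2 / L3 hit costs
    line = address // line_size      # 64-byte line granularity
    for level, cache in enumerate(caches):
        if line in cache:
            for inner in caches[:level]:  # install in inner levels (step 8)
                inner.add(line)
            return latencies[level]
    for cache in caches:             # miss everywhere: fill all levels from DRAM
        cache.add(line)
    return dram_latency

caches = [set(), set(), set()]
print(load_latency(0x1000, caches))  # 300: cold miss, serviced by DRAM
print(load_latency(0x1000, caches))  # 4: now resident in L1
print(load_latency(0x1010, caches))  # 4: same 64 B line, still an L1 hit
```

The third call is the spatial-locality payoff from earlier: 0x1010 lies in the same 64-byte line as 0x1000, so the earlier fill already covers it.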
Reference table or matrix
CPU Microarchitecture Feature Comparison
| Feature | In-Order Pipeline | Out-of-Order (OoO) | Superscalar OoO | VLIW |
|---|---|---|---|---|
| Instruction ordering | Program order | Dynamic reorder | Dynamic, multi-issue | Static compiler-scheduled |
| IPC ceiling | 1 | >1 with data independence | 4–6 (typical modern CPUs) | Compiler-dependent |
| Hardware complexity | Low | High | Very high | Low (complexity in compiler) |
| Power efficiency | High | Moderate | Lower | Moderate |
| Typical use case | Embedded, IoT | Desktop, server | Server, high-performance | DSP, media processors |
| Standards/examples | ARM Cortex-M0 | ARM Cortex-A55 | Intel Core Ultra, AMD Zen 4 | Intel Itanium |
Memory Hierarchy Reference
| Level | Typical Size | Typical Latency | Technology | Governed by |
|---|---|---|---|---|
| Registers | 128–256 bytes (16–32 × 64-bit) | <1 cycle | SRAM flip-flops | ISA specification |
| L1 cache | 32–64 KB | 4–5 cycles | SRAM | Vendor microarchitecture |
| L2 cache | 256 KB–1 MB | 12–15 cycles | SRAM | Vendor microarchitecture |
| L3 (LLC) | 6–64 MB | 30–50 cycles | SRAM | Vendor microarchitecture |
| Main memory (DRAM) | 8 GB–6 TB | ~80 ns | DRAM (DDR4/DDR5) | JEDEC standards |
| NVMe SSD | 500 GB–8 TB | ~100 µs | NAND Flash | NVM Express standard |
| HDD | 1–20 TB | 5–10 ms | Magnetic platter | INCITS T13/T10 |
This reference domain is covered in depth across the Computer Science Authority knowledge base, where topics like parallel computing, distributed systems, and embedded systems extend directly from the architectural foundations described here.
References
- ACM/IEEE-CS Computing Curricula 2020 (CC2020) — Joint ACM/IEEE curriculum framework defining computer architecture as a core knowledge area.
- ACM/IEEE Computer Engineering Curricula 2016 — Identifies architecture among 12 required knowledge areas for accredited programs.
- Intel 64 and IA-32 Architectures Optimization Reference Manual — Official Intel documentation for microarchitecture execution unit counts and pipeline behavior.
- JEDEC Solid State Technology Association — Standards body governing DRAM interface specifications including DDR4 and DDR5.
- PCI-SIG — PCIe Specification — Governing body for PCI Express interface standards including transfer rate specifications.
- NVM Express (NVMe) Standard — Industry specification governing NVMe storage interface protocol and latency characteristics.
- NIST NVD — CVE-2017-5715 (Spectre) — NIST National Vulnerability Database entry for the Spectre speculative execution vulnerability.
- NIST NVD — CVE-2017-5754 (Meltdown) — NIST National Vulnerability Database entry for the Meltdown speculative execution vulnerability.