Computer Architecture and Organization: CPUs, Memory, and I/O

Computer architecture and organization defines the structural principles governing how processors execute instructions, how memory systems store and retrieve data, and how input/output subsystems connect computational logic to the external world. This page covers the foundational mechanics of CPU design, the memory hierarchy from registers to secondary storage, I/O communication models, and the classification boundaries and tradeoffs that shape real hardware decisions. The subject underpins every layer of software—from operating systems to compilers—making it foundational knowledge for computer science practitioners at every level.



Definition and scope

Computer architecture describes the abstract model of a computing system as seen by software—the instruction set, register model, addressing modes, and memory semantics that programmers and compilers target. Computer organization describes the concrete hardware implementation of that model: how logic gates, buses, cache arrays, and control units are arranged to execute the architectural specification. The distinction matters because a single instruction set architecture (ISA) can be implemented with radically different organizations optimized for different power, performance, or cost targets.

The formal scope of the discipline, as characterized by the ACM and IEEE's Computing Curricula 2020 (CC2020), spans ISA design, processor microarchitecture, memory system design, I/O subsystems, and parallel hardware structures. The ACM/IEEE Computer Engineering 2016 curriculum identifies computer architecture and organization as one of the 12 knowledge areas in its body of knowledge.

For a broader map of how this topic fits within the discipline, see Key Dimensions and Scopes of Computer Science.


Core mechanics or structure

The CPU

A central processing unit performs four repeating operations: fetch an instruction from memory, decode it into control signals, execute the specified operation, and write results back. This fetch-decode-execute cycle is the atomic unit of computation in von Neumann and Harvard architectures alike.
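The cycle can be made concrete with a minimal sketch of an interpreter for a hypothetical accumulator machine; the three-instruction ISA here is invented for illustration and does not correspond to any real processor.

```python
# Minimal sketch of the fetch-decode-execute cycle for a toy
# accumulator machine (hypothetical ISA: LOAD, ADD, STORE).
def run(program, memory):
    acc = 0          # accumulator register
    pc = 0           # program counter
    while pc < len(program):
        opcode, operand = program[pc]   # fetch
        pc += 1
        if opcode == "LOAD":            # decode + execute
            acc = memory[operand]
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STORE":         # write-back to memory
            memory[operand] = acc
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return memory

mem = {0: 5, 1: 7, 2: 0}
run([("LOAD", 0), ("ADD", 1), ("STORE", 2)], mem)
print(mem[2])  # 12
```

Real CPUs implement the same loop in hardware, with the decode step expanding each instruction into control signals rather than a Python branch.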

Modern CPUs add pipelining, which overlaps these four stages across successive instructions. A 5-stage pipeline—Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), Write-Back (WB)—can theoretically sustain one instruction completion per clock cycle at steady state, as described in Patterson and Hennessy's Computer Organization and Design (Morgan Kaufmann, 6th edition, 2020). Out-of-order execution extends this by allowing later instructions to execute when earlier instructions are stalled, using reservation stations and a reorder buffer to maintain architectural correctness.
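The throughput benefit of an ideal pipeline follows from simple counting: the first instruction takes one cycle per stage to fill the pipeline, and each later instruction completes one cycle after its predecessor. A small sketch, assuming no stalls or hazards:

```python
# Ideal pipeline timing: k stages, n instructions, no stalls.
def pipelined_cycles(n_instr, stages=5):
    # First instruction fills the pipeline (stages cycles);
    # each subsequent instruction completes one cycle later.
    return stages + (n_instr - 1)

def unpipelined_cycles(n_instr, stages=5):
    # Without overlap, every instruction pays the full latency.
    return stages * n_instr

n = 1000
print(pipelined_cycles(n))                              # 1004
print(round(unpipelined_cycles(n) / pipelined_cycles(n), 2))  # 4.98
```

As n grows, the speedup approaches the stage count (here 5), which is why "one instruction per cycle at steady state" is the theoretical limit for a scalar pipeline.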

Superscalar processors issue more than one instruction per cycle by replicating execution units. Recent Intel Core microarchitectures, for example, keep multiple integer ALUs and floating-point units simultaneously active per core, with exact execution-port counts varying by generation (Intel 64 and IA-32 Architectures Optimization Reference Manual).

The Memory Hierarchy

Memory is organized in a hierarchy where each level trades capacity against latency: registers at the top, then L1, L2, and L3 SRAM caches, DRAM main memory, and finally non-volatile secondary storage. Representative sizes and latencies for each level appear in the Memory Hierarchy Reference table later on this page.

Cache coherency protocols such as MESI (Modified, Exclusive, Shared, Invalid) maintain consistency across multiple cores accessing shared cache lines. The JEDEC standards body publishes specifications for DRAM interface standards including DDR4 and DDR5, governing signal timing, voltage, and capacity ranges for commodity memory modules.
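A simplified sketch of MESI can be expressed as a state-transition table for a single cache line, viewed from one core. This model is an assumption-laden abstraction: real implementations also generate bus transactions and writebacks, and the I-to-E transition (no other sharer) is omitted here for brevity.

```python
# Simplified MESI state machine for one cache line in one core.
MESI = {
    # (state, event) -> next state
    ("I", "local_read"):   "S",   # simplification: assume another sharer exists
    ("I", "local_write"):  "M",   # read-for-ownership; other copies invalidated
    ("S", "local_write"):  "M",   # upgrade; other sharers invalidated
    ("E", "local_write"):  "M",   # silent upgrade, no bus traffic needed
    ("M", "remote_read"):  "S",   # supply data / write back, then share
    ("E", "remote_read"):  "S",
    ("M", "remote_write"): "I",   # another core takes ownership
    ("E", "remote_write"): "I",
    ("S", "remote_write"): "I",
}

def step(state, event):
    # Unlisted transitions (e.g. a read hit in M, E, or S) keep the state.
    return MESI.get((state, event), state)

s = "I"
for ev in ["local_read", "local_write", "remote_read", "remote_write"]:
    s = step(s, ev)   # I -> S -> M -> S -> I
print(s)  # I
```

The invalidations triggered by `remote_write` are precisely the coherency traffic discussed under "Coherency vs. scalability" below.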

I/O Subsystems

I/O devices communicate with the CPU through three mechanisms: programmed I/O (polling), interrupt-driven I/O, and direct memory access (DMA). DMA offloads bulk data transfer from the CPU entirely, allowing a DMA controller to write data from a device directly into DRAM while the CPU continues executing other instructions. The PCI-SIG governs the PCIe (Peripheral Component Interconnect Express) standard, which as of PCIe 5.0 provides 32 GT/s (gigatransfers per second) per lane bidirectionally.
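The benefit of DMA can be shown with a back-of-the-envelope cost model contrasting it with programmed I/O. The cycle counts below are illustrative assumptions, not measurements of any real controller.

```python
# Rough CPU-cycle cost model: programmed I/O vs. DMA for one transfer.
def pio_cpu_cycles(n_bytes, cycles_per_byte=4):
    # Programmed I/O: the CPU polls status and copies every byte itself.
    return n_bytes * cycles_per_byte

def dma_cpu_cycles(setup_cycles=200, interrupt_cycles=500):
    # DMA: the CPU only programs the controller and services one
    # completion interrupt; the controller moves the data.
    return setup_cycles + interrupt_cycles

n = 64 * 1024  # a 64 KB transfer
print(pio_cpu_cycles(n))   # 262144 CPU cycles consumed
print(dma_cpu_cycles())    # 700 CPU cycles consumed
```

Even with generous assumptions for polling efficiency, the CPU cost of programmed I/O scales with transfer size while the DMA cost is roughly constant, which is why bulk transfers use DMA.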


Causal relationships or drivers

Three engineering forces fundamentally shape architectural choices:

Power density constrains clock frequency. Intel's Pentium 4 Prescott core reached 3.8 GHz in 2004 but dissipated over 100 watts at full load, making further frequency scaling thermally unsustainable. This drove the industry toward multi-core designs rather than single-threaded frequency scaling, a transition documented in Intel's processor development history.

Memory wall refers to the growing gap between processor clock speeds and DRAM latency. As of DDR5-6400, peak theoretical bandwidth reaches approximately 51 GB/s, but latency remains in the 14–16 nanosecond CAS range. Cache hierarchies exist entirely to bridge this gap.
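The bandwidth figure above follows directly from the interface parameters: a DDR5-6400 channel performs 6400 mega-transfers per second over a 64-bit (8-byte) data bus.

```python
# Peak theoretical bandwidth of one DDR5-6400 channel.
transfers_per_sec = 6400e6   # 6400 MT/s
bus_width_bytes = 8          # 64-bit data bus
bandwidth = transfers_per_sec * bus_width_bytes
print(bandwidth / 1e9)       # 51.2 GB/s
```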

Locality of reference is the empirical observation that programs tend to reuse recently accessed data (temporal locality) and access addresses near recently accessed locations (spatial locality). Cache designs exploit both: temporal locality through cache retention policies, and spatial locality through cache line sizes of 64 bytes that prefetch adjacent memory locations.
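The effect of spatial locality can be quantified by counting how many distinct 64-byte cache lines an access pattern touches; sequential access amortizes one line fill over many elements, while a stride larger than a line touches a new line on every access. A minimal sketch:

```python
# Count distinct 64-byte cache lines touched by an access pattern.
LINE = 64  # bytes per cache line

def lines_touched(addresses):
    return len({addr // LINE for addr in addresses})

n = 1024  # number of 8-byte elements accessed
sequential = [i * 8 for i in range(n)]    # contiguous array walk
strided = [i * 128 for i in range(n)]     # stride larger than a line

print(lines_touched(sequential))  # 128 line fills for 1024 accesses
print(lines_touched(strided))     # 1024 line fills, one per access
```

The strided pattern does 8x the memory traffic for the same number of loads, which is why data-structure layout matters for cache performance.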

Compiler design, covered in Compiler Design and Interpreters, interacts directly with these causal forces because compilers must generate instruction sequences that maximize cache utilization and minimize pipeline stalls.


Classification boundaries

ISA taxonomy

ISA Class | Characteristic | Examples
CISC | Variable-length instructions; memory-to-memory operations | x86, x86-64
RISC | Fixed-length instructions; load/store memory model | ARM, RISC-V, MIPS
VLIW | Multiple operations encoded per instruction word | Intel Itanium (IA-64)
SIMD extensions | Single instruction, multiple data lanes | x86 AVX-512, ARM NEON

Architecture models

Von Neumann architecture uses a single address space and bus for both instructions and data. Harvard architecture uses physically separate instruction and data memories, eliminating structural hazards between instruction fetch and data access. Most modern CPUs use a modified Harvard architecture: physically separate L1 instruction and data caches that unify at the L2 level.

Memory types

SRAM (static RAM) retains state without refresh and forms cache arrays. DRAM (dynamic RAM) requires periodic refresh and forms main memory. NVRAM (non-volatile RAM) categories include Flash (NAND and NOR), 3D XPoint, and MRAM, each differing in endurance, density, and write latency.

The JEDEC Solid State Technology Association publishes formal classifications for all major memory interface standards.


Tradeoffs and tensions

IPC vs. clock frequency: Increasing instructions-per-clock (IPC) requires more complex out-of-order logic, branch predictors, and wider superscalar pipelines, all of which consume die area and power. Pushing clock frequency increases dynamic power consumption, which scales linearly with frequency and quadratically with supply voltage; higher frequencies also typically require higher voltage, compounding the cost. ARM's big.LITTLE architecture addresses this by pairing high-IPC performance cores with low-power efficiency cores on the same die, a design that requires the operating system's scheduler to assign workloads to appropriate cores.
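The voltage-frequency relationship can be worked through numerically using the standard dynamic power model P = CV²f; the capacitance, voltage, and frequency values below are purely illustrative.

```python
# Dynamic power model: P = C * V^2 * f for switching capacitance C.
def dynamic_power(c_farads, volts, hertz):
    return c_farads * volts**2 * hertz

base = dynamic_power(1e-9, 1.0, 3.0e9)    # 3.0 GHz at 1.0 V
boost = dynamic_power(1e-9, 1.2, 3.9e9)   # 30% higher clock, needing 1.2 V
print(round(base, 2))          # 3.0 (watts)
print(round(boost / base, 2))  # 1.87
```

A 30% frequency gain that requires a 20% voltage bump costs roughly 87% more dynamic power, which is the arithmetic behind the industry's pivot to multi-core designs.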

Latency vs. throughput: Reducing cache latency by shrinking cache size lowers the hit rate, increasing average memory latency for working sets that exceed the cache. Increasing cache size reduces misses but adds latency to every access. This tradeoff cannot be eliminated; it is managed through multi-level hierarchies.
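This tension is conventionally captured by the average memory access time (AMAT) formula: AMAT = hit time + miss rate x miss penalty. A small sketch with illustrative (assumed, not measured) cycle counts:

```python
# AMAT = hit_time + miss_rate * miss_penalty (all in cycles).
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Small, fast cache: low hit latency but more misses.
small_fast = amat(hit_time=3, miss_rate=0.10, miss_penalty=40)
# Larger, slower cache: higher hit latency but fewer misses.
large_slow = amat(hit_time=5, miss_rate=0.02, miss_penalty=40)

print(round(small_fast, 2))  # 7.0
print(round(large_slow, 2))  # 5.8
```

Which design wins depends entirely on the workload's miss rate, which is why vendors ship multi-level hierarchies rather than one "optimal" cache.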

Coherency vs. scalability: MESI-based coherency protocols generate bus traffic proportional to the number of cores sharing a cache line. In systems with 128 or more cores (such as AMD EPYC 9004 series), directory-based coherency protocols replace snooping protocols to bound traffic, at the cost of additional latency for inter-node communication.

Speculation vs. security: Speculative execution enables CPUs to execute instructions before confirming they will be needed, dramatically increasing throughput. However, the Spectre and Meltdown vulnerability classes (disclosed in January 2018 and documented in CVE-2017-5715 and CVE-2017-5754 by NIST's NVD) demonstrated that speculative side-channel attacks can leak privileged memory contents. Mitigations such as Retpoline and kernel page-table isolation (KPTI) each impose measurable throughput penalties.


Common misconceptions

Misconception: More cores always means faster execution
Single-threaded performance depends entirely on per-core IPC and clock speed. A sequential algorithm cannot be parallelized by adding cores. Amdahl's Law (Gene Amdahl, 1967) establishes that if 95% of a program is parallelizable, the theoretical maximum speedup with infinite cores is 20×—the remaining 5% serialized fraction forms an absolute ceiling.
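Amdahl's Law can be checked directly: speedup = 1 / ((1 - p) + p/n) for parallel fraction p on n cores, so with p = 0.95 the limit as n grows is 1/0.05 = 20.

```python
# Amdahl's Law: speedup on n cores with parallel fraction p.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

print(round(amdahl_speedup(0.95, 8), 2))      # 5.93 on 8 cores
print(round(amdahl_speedup(0.95, 10**9), 2))  # 20.0 ceiling
```

Note how quickly returns diminish: 8 cores already deliver less than 6x, far below the naive 8x.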

Misconception: Cache size is the primary determinant of CPU performance
Cache hit rate, not raw cache size, governs performance. A 6 MB L3 cache with high spatial and temporal locality in the workload outperforms a 32 MB cache with poor access patterns. Cache-oblivious algorithms and data structure layout matter as much as hardware provisioning.

Misconception: 64-bit CPUs are simply "faster" than 32-bit CPUs
The primary advantage of a 64-bit architecture is the expanded address space: 32-bit addressing limits each process's virtual address space to 4 GB, while 64-bit addressing extends the theoretical limit to 16 exabytes. Arithmetic on 64-bit operands is not inherently faster than on 32-bit operands and can consume more memory bandwidth when wider types are not required.
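The address-space figures follow directly from pointer width:

```python
# Addressable bytes as a function of pointer width.
print(2**32 / 2**30)   # 4.0 GiB with 32-bit pointers
print(2**64 / 2**60)   # 16.0 EiB with 64-bit pointers
```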

Misconception: DRAM access latency has improved proportionally with bandwidth
DRAM bandwidth has improved by roughly 10× from DDR1 to DDR5, but access latency (CAS latency in nanoseconds) has improved by less than 2× over the same period. This asymmetry explains why the memory wall remains a persistent architectural constraint.


Checklist or steps

The following sequence describes how a single load instruction proceeds from software to hardware and back:

  1. The CPU's instruction fetch unit reads the load instruction from the L1 instruction cache (or fetches from L2/L3/DRAM on a miss).
  2. The decode stage identifies the source register and addressing mode, producing a micro-operation (µop).
  3. The address generation unit (AGU) computes the effective memory address from the base register and displacement.
  4. The load µop is issued to the load/store unit, which checks the L1 data cache for the target address.
  5. On an L1 hit, data returns in 4–5 clock cycles. On an L1 miss, the request propagates to L2.
  6. On an L2 miss, the request propagates to the L3 (last-level) cache.
  7. On an L3 miss, a memory access request is issued to the DRAM controller over the memory bus.
  8. The DRAM controller activates the target row, transfers the 64-byte cache line to the CPU, and the line is installed in L1, L2, and L3.
  9. The data value is forwarded to the destination register, and the µop is retired from the reorder buffer in program order.
  10. If another core holds a modified copy of the same cache line, the MESI protocol triggers an invalidation or writeback during steps 5–8, before the requesting core receives the data.
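Steps 4–8 above can be sketched as a simple hierarchy walk: a load probes each cache level in turn and, on a miss everywhere, pays the DRAM penalty and installs the line at every level. The per-level latencies are illustrative cycle counts, not vendor figures.

```python
# Sketch of a load walking the cache hierarchy (steps 4-8 above).
def load(addr, hierarchy, dram_latency=200):
    line = addr // 64                  # 64-byte cache line address
    total = 0
    for name, latency, lines in hierarchy:
        total += latency               # pay this level's lookup cost
        if line in lines:              # hit: data returns from here
            return total
    for _, _, lines in hierarchy:      # miss everywhere: go to DRAM
        lines.add(line)                # install line in L1, L2, and L3
    return total + dram_latency

hierarchy = [("L1", 4, set()), ("L2", 14, set()), ("L3", 40, set())]
print(load(0x1000, hierarchy))  # 258 cycles: cold miss all the way to DRAM
print(load(0x1000, hierarchy))  # 4 cycles: L1 hit on the reused line
```

The second access illustrates temporal locality paying off: the same address that cost 258 cycles cold now returns in 4.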

Reference table or matrix

CPU Microarchitecture Feature Comparison

Feature | In-Order Pipeline | Out-of-Order (OoO) | Superscalar OoO | VLIW
Instruction ordering | Program order | Dynamic reorder | Dynamic, multi-issue | Static compiler-scheduled
IPC ceiling | 1 | >1 with data independence | 4–6 (typical modern CPUs) | Compiler-dependent
Hardware complexity | Low | High | Very high | Low (complexity in compiler)
Power efficiency | High | Moderate | Lower | Moderate
Typical use case | Embedded, IoT | Desktop, server | Server, high-performance | DSP, media processors
Standards/examples | ARM Cortex-M0 | ARM Cortex-A76 | Intel Core Ultra, AMD Zen 4 | Intel Itanium

Memory Hierarchy Reference

Level | Typical Size | Typical Latency | Technology | Governed by
Registers | 256–512 bytes | <1 cycle | SRAM flip-flops | ISA specification
L1 cache | 32–64 KB | 4–5 cycles | SRAM | Vendor microarchitecture
L2 cache | 256 KB–1 MB | 12–15 cycles | SRAM | Vendor microarchitecture
L3 (LLC) | 6–64 MB | 30–50 cycles | SRAM | Vendor microarchitecture
Main memory (DRAM) | 8 GB–6 TB | ~80 ns | DRAM (DDR4/DDR5) | JEDEC standards
NVMe SSD | 500 GB–8 TB | ~100 µs | NAND Flash | NVM Express standard
HDD | 1–20 TB | 5–10 ms | Magnetic platter | INCITS T13/T10

This reference domain is covered in depth across the Computer Science Authority knowledge base, where topics like parallel computing, distributed systems, and embedded systems extend directly from the architectural foundations described here.

