Computer Architecture and Organization: CPUs, Memory, and I/O
Computer architecture and organization defines the structural principles governing how processors execute instructions, how memory systems store and retrieve data, and how input/output subsystems connect computational logic to the external world. This page covers the foundational mechanics of CPU design, the memory hierarchy from registers to secondary storage, I/O communication models, and the classification boundaries and tradeoffs that shape real hardware decisions. The subject underpins every layer of software—from operating systems to compilers—making it foundational knowledge for computer science practitioners at every level.
- Definition and scope
- Core mechanics or structure
- Causal relationships or drivers
- Classification boundaries
- Tradeoffs and tensions
- Common misconceptions
- Checklist or steps
- Reference table or matrix
- References
Definition and scope
Computer architecture describes the abstract model of a computing system as seen by software—the instruction set, register model, addressing modes, and memory semantics that programmers and compilers target. Computer organization describes the concrete hardware implementation of that model: how logic gates, buses, cache arrays, and control units are arranged to execute the architectural specification. The distinction matters because a single instruction set architecture (ISA) can be implemented with radically different organizations optimized for different power, performance, or cost targets.
The formal scope of the discipline, as characterized by IEEE and ACM's Computing Curricula 2020 (CC2020), spans ISA design, processor microarchitecture, memory system design, I/O subsystems, and parallel hardware structures. The ACM/IEEE Computer Engineering 2016 curriculum identifies computer architecture as one of 12 knowledge areas required for an accredited computing program.
For a broader map of how this topic fits within the discipline, see Key Dimensions and Scopes of Computer Science.
Core mechanics or structure
The CPU
A central processing unit performs four repeating operations: fetch an instruction from memory, decode it into control signals, execute the specified operation, and write results back. This fetch-decode-execute cycle is the atomic unit of computation in von Neumann and Harvard architectures alike.
Modern CPUs add pipelining, which overlaps these four stages across successive instructions. A 5-stage pipeline—Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), Write-Back (WB)—can theoretically sustain one instruction completion per clock cycle at steady state, as described in Patterson and Hennessy's Computer Organization and Design (Morgan Kaufmann, 6th edition, 2020). Out-of-order execution extends this by allowing later instructions to execute when earlier instructions are stalled, using reservation stations and a reorder buffer to maintain architectural correctness.
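The steady-state throughput claim above follows from simple arithmetic: an ideal k-stage pipeline finishes the first instruction after k cycles and one more instruction every cycle thereafter. A minimal sketch (the `stalls` parameter is an illustrative simplification for hazard-induced bubbles):

```python
def pipeline_cycles(n_instructions, stages=5, stalls=0):
    """Total cycles for an idealized in-order pipeline.

    The first instruction takes `stages` cycles to fill the pipeline;
    each subsequent instruction completes one cycle later, plus any
    stall cycles inserted by hazards.
    """
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1) + stalls

# 1000 instructions on an ideal 5-stage pipeline: 1004 cycles,
# so throughput approaches 1 instruction per cycle as n grows.
print(pipeline_cycles(1000))         # 1004
print(round(1000 / pipeline_cycles(1000), 3))  # 0.996 IPC
```

Without pipelining the same 1000 instructions would take 5000 cycles; the overlap, not any speedup of individual instructions, is where the gain comes from.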
Superscalar processors issue more than one instruction per cycle by replicating execution units. Recent Intel Core microarchitectures, for example, provide several integer ALUs and multiple floating-point/vector units per core, with exact port and unit counts documented in the Intel 64 and IA-32 Architectures Optimization Reference Manual.
The Memory Hierarchy
Memory is organized in a hierarchy where each level trades capacity against latency:
- Registers: typically 16–32 general-purpose registers (16 in x86-64, 31 in AArch64, 32 in RISC-V); access in under 1 clock cycle
- L1 cache: 32–64 KB per core; 4–5 cycle latency
- L2 cache: 256 KB–1 MB per core; 12–15 cycle latency
- L3 cache (LLC): 6–64 MB shared; 30–50 cycle latency
- DRAM: gigabytes of capacity; 60–80 nanosecond access latency
- SSD/NVMe: terabytes of capacity; microsecond-to-millisecond access latency
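The practical effect of this hierarchy is captured by average memory access time (AMAT), which folds each level's hit latency and miss rate into one expected cost. A sketch using the cycle counts above; the miss rates are assumed values for a cache-friendly workload, not vendor figures:

```python
def amat(levels, memory_latency):
    """Average memory access time for a multi-level cache hierarchy.

    `levels` is a list of (hit_latency, miss_rate) tuples ordered from
    L1 outward; `memory_latency` is the DRAM cost in the same units.
    AMAT = hit_latency + miss_rate * (cost of the next level).
    """
    penalty = memory_latency
    for hit_latency, miss_rate in reversed(levels):
        penalty = hit_latency + miss_rate * penalty
    return penalty

# Illustrative: L1 (4 cycles, 5% miss), L2 (12 cycles, 20% miss),
# L3 (40 cycles, 10% miss), DRAM at 300 cycles.
hierarchy = [(4, 0.05), (12, 0.20), (40, 0.10)]
print(round(amat(hierarchy, memory_latency=300), 2))  # 5.3
```

Even with a 300-cycle DRAM penalty, high hit rates keep the average close to the L1 latency, which is the entire justification for the hierarchy.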
Cache coherency protocols such as MESI (Modified, Exclusive, Shared, Invalid) maintain consistency across multiple cores accessing shared cache lines. The JEDEC standards body publishes specifications for DRAM interface standards including DDR4 and DDR5, governing signal timing, voltage, and capacity ranges for commodity memory modules.
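The MESI transitions for a single cache line can be written as a small state table. This is a simplified sketch: it models one core's view, treats an Invalid-state read as always filling to Shared (real protocols fill to Exclusive when no peer holds the line), and omits writebacks:

```python
# Simplified MESI transitions for one cache line in one core, keyed by
# (current_state, event). "remote_*" events are bus snoops of another
# core's accesses to the same line.
MESI = {
    ("I", "local_read"):   "S",  # fill; simplification: assume a peer shares it
    ("I", "local_write"):  "M",  # read-for-ownership, then modify
    ("S", "local_write"):  "M",  # upgrade: invalidates other sharers
    ("S", "remote_write"): "I",  # another core took ownership
    ("E", "local_write"):  "M",  # silent upgrade: no bus traffic needed
    ("E", "remote_read"):  "S",
    ("M", "remote_read"):  "S",  # write back dirty data, then share
    ("M", "remote_write"): "I",  # write back, then invalidate
}

def next_state(state, event):
    # Unlisted (state, event) pairs leave the state unchanged (e.g. a
    # local read of a line already in S, E, or M).
    return MESI.get((state, event), state)

state = "I"
for event in ["local_read", "local_write", "remote_read"]:
    state = next_state(state, event)
print(state)  # "S": the modified line was demoted when a peer read it
```

The table makes the scalability problem visible: every write to a Shared line generates bus traffic to invalidate peers, which is exactly the cost discussed under coherency vs. scalability below.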
I/O Subsystems
I/O devices communicate with the CPU through three mechanisms: programmed I/O (polling), interrupt-driven I/O, and direct memory access (DMA). DMA offloads bulk data transfer from the CPU entirely, allowing a DMA controller to write data from a device directly into DRAM while the CPU continues executing other instructions. The PCI-SIG governs the PCIe (Peripheral Component Interconnect Express) standard, which as of PCIe 5.0 provides 32 GT/s (gigatransfers per second) per lane bidirectionally.
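The GT/s figure converts to usable bandwidth once line encoding is accounted for. PCIe 3.0 through 5.0 use 128b/130b encoding, so only 128 of every 130 transferred bits are payload; a sketch of the arithmetic:

```python
def pcie_lane_bandwidth_gbs(transfer_rate_gt, encoding=(128, 130)):
    """Usable bandwidth per lane per direction, in GB/s.

    transfer_rate_gt is in gigatransfers/second (one bit per transfer
    per lane); 128b/130b encoding leaves 128/130 of the bits as payload.
    """
    payload_bits, total_bits = encoding
    return transfer_rate_gt * (payload_bits / total_bits) / 8  # bits -> bytes

# PCIe 5.0: 32 GT/s per lane, ~3.94 GB/s per direction;
# a x16 link carries ~63 GB/s each way.
per_lane = pcie_lane_bandwidth_gbs(32)
print(round(per_lane, 2), round(per_lane * 16, 1))
```

Because the link is full duplex, these figures apply independently in each direction, which is why DMA reads and writes can proceed concurrently.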
Causal relationships or drivers
Three engineering forces fundamentally shape architectural choices:
Power density constrains clock frequency. Intel's Pentium 4 Prescott core reached 3.8 GHz in 2004 but dissipated over 100 watts at full load, making further frequency scaling thermally unsustainable. This drove the industry toward multi-core designs rather than single-threaded frequency scaling, a transition documented in Intel's processor development history.
Memory wall refers to the growing gap between processor clock speeds and DRAM latency. As of DDR5-6400, peak theoretical bandwidth reaches approximately 51 GB/s, but latency remains in the 14–16 nanosecond CAS range. Cache hierarchies exist entirely to bridge this gap.
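The 51 GB/s figure is straightforward to derive: peak DDR bandwidth is transfers per second times the width of the data bus in bytes. A quick check of the arithmetic:

```python
# Peak DDR bandwidth = transfers/second x bus width in bytes.
# A DDR5-6400 channel performs 6.4 billion transfers per second over
# a 64-bit (8-byte) data bus.
transfers_per_s = 6400e6
bus_bytes = 8
peak_gb_s = transfers_per_s * bus_bytes / 1e9
print(peak_gb_s)  # 51.2 GB/s, matching the figure above
```

Note that nothing in this calculation involves latency: doubling the transfer rate doubles bandwidth while leaving the CAS latency in nanoseconds essentially untouched, which is the memory wall in miniature.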
Locality of reference is the empirical observation that programs tend to reuse recently accessed data (temporal locality) and to access addresses near recently accessed locations (spatial locality). Cache designs exploit both: temporal locality through retention and replacement policies, and spatial locality through 64-byte cache lines that bring adjacent data into the cache on every fill.
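The payoff of spatial locality can be counted directly: with 64-byte lines, sequential byte accesses share lines, while widely strided accesses touch a new line every time. A minimal counting sketch (no timing, just distinct lines touched):

```python
def lines_touched(addresses, line_size=64):
    """Count distinct cache lines touched by a sequence of byte addresses."""
    return len({addr // line_size for addr in addresses})

n = 4096
sequential = range(0, n)           # walk 4 KB byte by byte
strided    = range(0, n * 64, 64)  # same number of accesses, 64 B apart

print(lines_touched(sequential))  # 64: each line fill serves 64 accesses
print(lines_touched(strided))     # 4096: every access lands on a new line
```

Both loops perform 4096 accesses, but the sequential walk needs 64 line fills where the strided walk needs 4096, a 64x difference in memory traffic for identical access counts.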
Compiler design, covered in Compiler Design and Interpreters, interacts directly with these causal forces because compilers must generate instruction sequences that maximize cache utilization and minimize pipeline stalls.
Classification boundaries
ISA taxonomy
| ISA Class | Characteristic | Examples |
|---|---|---|
| CISC | Variable-length instructions; memory-to-memory operations | x86, x86-64 |
| RISC | Fixed-length instructions; load/store memory model | ARM, RISC-V, MIPS |
| VLIW | Multiple operations encoded per instruction word | Intel Itanium (IA-64) |
| SIMD extensions | Single instruction, multiple data lanes | x86 AVX-512, ARM NEON |
Architecture models
Von Neumann architecture uses a single address space and bus for both instructions and data. Harvard architecture uses physically separate instruction and data memories, eliminating structural hazards between instruction fetch and data access. Most modern CPUs use a modified Harvard architecture: physically separate L1 instruction and data caches that unify at the L2 level.
Memory types
SRAM (static RAM) retains state without refresh and forms cache arrays. DRAM (dynamic RAM) requires periodic refresh and forms main memory. NVRAM (non-volatile RAM) categories include Flash (NAND and NOR), 3D XPoint, and MRAM, each differing in endurance, density, and write latency.
The JEDEC Solid State Technology Association publishes formal classifications for all major memory interface standards.
Tradeoffs and tensions
IPC vs. clock frequency: Increasing instructions-per-clock (IPC) requires more complex out-of-order logic, branch predictors, and wider superscalar pipelines, all of which consume die area and power. Pushing clock frequency raises dynamic power, which scales linearly with frequency and with the square of supply voltage (P ∝ CV²f); because higher frequencies typically demand higher voltage, the cost compounds. ARM's big.LITTLE architecture addresses this by pairing high-IPC performance cores with low-power efficiency cores on the same die, a design that requires the operating system's scheduler to assign workloads to appropriate cores.
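The compounding effect of the dynamic power relation is easy to quantify. The normalized values below are illustrative, assuming (as is common in practice) that a 20% frequency increase requires roughly 20% more supply voltage:

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic CMOS switching power: P = C * V^2 * f."""
    return capacitance * voltage**2 * frequency

base    = dynamic_power(1.0, 1.0, 1.0)  # normalized baseline
boosted = dynamic_power(1.0, 1.2, 1.2)  # +20% frequency needing +20% voltage

print(round(boosted / base, 3))  # 1.728: ~73% more power for 20% more clock
```

A 20% clock gain costing 73% more power is the asymmetry that ended single-threaded frequency scaling and motivated heterogeneous designs like big.LITTLE.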
Latency vs. throughput: Shrinking a cache lowers its access latency but reduces its hit rate, raising average memory latency for working sets that exceed it. Enlarging a cache reduces misses but adds latency to every access. This is not a solvable tradeoff; it is managed through multi-level hierarchies.
Coherency vs. scalability: MESI-based coherency protocols generate bus traffic proportional to the number of cores sharing a cache line. In systems with 128 or more cores (such as AMD EPYC 9004 series), directory-based coherency protocols replace snooping protocols to bound traffic, at the cost of additional latency for inter-node communication.
Speculation vs. security: Speculative execution enables CPUs to execute instructions before confirming they will be needed, dramatically increasing throughput. However, the Spectre and Meltdown vulnerability classes (disclosed in January 2018 and documented in CVE-2017-5715 and CVE-2017-5754 by NIST's NVD) demonstrated that speculative side-channel attacks can leak privileged memory contents. Mitigations such as Retpoline and kernel page-table isolation (KPTI) each impose measurable throughput penalties.
Common misconceptions
Misconception: More cores always means faster execution
Single-threaded performance depends on per-core IPC and clock speed, not core count; an inherently sequential algorithm gains nothing from additional cores. Amdahl's Law (Gene Amdahl, 1967) establishes that if 95% of a program is parallelizable, the theoretical maximum speedup with infinite cores is 20×: the remaining 5% serial fraction forms an absolute ceiling.
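The 20× ceiling follows directly from the formula. A quick check of the arithmetic for the 95%-parallel case:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

print(round(amdahl_speedup(0.95, 8), 2))      # 5.93x on 8 cores
print(round(amdahl_speedup(0.95, 10**9), 2))  # 20.0: the infinite-core ceiling
```

Note how quickly returns diminish: going from 8 cores (5.9×) to a billion cores buys less than a 3.4× further improvement, because the 5% serial fraction dominates.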
Misconception: Cache size is the primary determinant of CPU performance
Cache hit rate, not raw cache size, governs performance. A 6 MB L3 cache with high spatial and temporal locality in the workload outperforms a 32 MB cache with poor access patterns. Cache-oblivious algorithms and data structure layout matter as much as hardware provisioning.
Misconception: 64-bit CPUs are simply "faster" than 32-bit CPUs
The primary advantage of 64-bit architecture is the expanded address space: 32-bit addressing limits a process's virtual address space to 4 GB, while 64-bit addressing extends the theoretical limit to 16 exabytes. Arithmetic on 64-bit operands is not inherently faster than on 32-bit operands and can consume more memory bandwidth when the extra width is not required.
Misconception: DRAM access latency has improved proportionally with bandwidth
DRAM bandwidth has improved by roughly 10× from DDR1 to DDR5, but access latency (CAS latency in nanoseconds) has improved by less than 2× over the same period. This asymmetry explains why the memory wall remains a persistent architectural constraint.
Checklist or steps
The following sequence describes how a single load instruction proceeds from software to hardware and back:
- The CPU's instruction fetch unit reads the load instruction from the L1 instruction cache (or fetches from L2/L3/DRAM on a miss).
- The decode stage identifies the source register and addressing mode, producing a micro-operation (µop).
- The address generation unit (AGU) computes the effective memory address from the base register and displacement.
- The load µop is issued to the load/store unit, which checks the L1 data cache for the target address.
- On an L1 hit, data returns in 4–5 clock cycles. On an L1 miss, the request propagates to L2.
- On an L2 miss, the request propagates to the L3 (last-level) cache.
- On an L3 miss, a memory access request is issued to the DRAM controller over the memory bus.
- The DRAM controller activates the target row, transfers the 64-byte cache line to the CPU, and the line is installed in L1, L2, and L3.
- The data value is forwarded to the destination register, and the µop is retired from the reorder buffer in program order.
- If another core holds a modified copy of the same cache line, the MESI protocol triggers an invalidation or writeback before the requesting core receives the data.
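The walk in steps 4–9 can be sketched as a toy hierarchy model. Line installation on the way back (step 8) is what makes the second access to the same 64-byte line an L1 hit; the cycle counts are illustrative values consistent with the latencies given earlier, and the model ignores capacity limits and coherency:

```python
def load_latency(address, caches, dram_latency=300, line_size=64):
    """Walk a 3-level hierarchy for a load, installing the line on return.

    `caches` is a list of sets of resident line numbers, ordered L1, L2, L3.
    Returns the latency (in illustrative cycles) of the level that hit.
    """
    latencies = [4, 14, 40]          # assumed L1 / L2 / L3 hit costs
    line = address // line_size      # 64-byte line granularity
    for level, cache in enumerate(caches):
        if line in cache:
            for inner in caches[:level]:  # install in inner levels (step 8)
                inner.add(line)
            return latencies[level]
    for cache in caches:             # miss everywhere: fill all levels from DRAM
        cache.add(line)
    return dram_latency

caches = [set(), set(), set()]
print(load_latency(0x1000, caches))  # 300: cold miss, serviced by DRAM
print(load_latency(0x1000, caches))  # 4: now resident in L1
print(load_latency(0x1010, caches))  # 4: same 64 B line, still an L1 hit
```

The third call is the spatial-locality payoff from earlier: 0x1010 lies in the same 64-byte line as 0x1000, so the earlier fill already covers it.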
Reference table or matrix
CPU Microarchitecture Feature Comparison
| Feature | In-Order Pipeline | Out-of-Order (OoO) | Superscalar OoO | VLIW |
|---|---|---|---|---|
| Instruction ordering | Program order | Dynamic reorder | Dynamic, multi-issue | Static compiler-scheduled |
| IPC ceiling | 1 | >1 with data independence | 4–6 (typical modern CPUs) | Compiler-dependent |
| Hardware complexity | Low | High | Very high | Low (complexity in compiler) |
| Power efficiency | High | Moderate | Lower | Moderate |
| Typical use case | Embedded, IoT | Desktop, server | Server, high-performance | DSP, media processors |
| Standards/examples | ARM Cortex-M0 | ARM Cortex-A55 | Intel Core Ultra, AMD Zen 4 | Intel Itanium |
Memory Hierarchy Reference
| Level | Typical Size | Typical Latency | Technology | Governed by |
|---|---|---|---|---|
| Registers | 128–256 bytes (16–32 × 64-bit) | <1 cycle | SRAM flip-flops | ISA specification |
| L1 cache | 32–64 KB | 4–5 cycles | SRAM | Vendor microarchitecture |
| L2 cache | 256 KB–1 MB | 12–15 cycles | SRAM | Vendor microarchitecture |
| L3 (LLC) | 6–64 MB | 30–50 cycles | SRAM | Vendor microarchitecture |
| Main memory (DRAM) | 8 GB–6 TB | ~80 ns | DRAM (DDR4/DDR5) | JEDEC standards |
| NVMe SSD | 500 GB–8 TB | ~100 µs | NAND Flash | NVM Express standard |
| HDD | 1–20 TB | 5–10 ms | Magnetic platter | INCITS T13/T10 |
This reference domain is covered in depth across the Computer Science Authority knowledge base, where topics like parallel computing, distributed systems, and embedded systems extend directly from the architectural foundations described here.
References
- ACM/IEEE-CS Computing Curricula 2020 (CC2020) — Joint ACM/IEEE curriculum framework defining computer architecture as a core knowledge area.
- ACM/IEEE Computer Engineering Curricula 2016 — Identifies architecture among 12 required knowledge areas for accredited programs.
- Intel 64 and IA-32 Architectures Optimization Reference Manual — Official Intel documentation for microarchitecture execution unit counts and pipeline behavior.
- JEDEC Solid State Technology Association — Standards body governing DRAM interface specifications including DDR4 and DDR5.
- PCI-SIG — PCIe Specification — Governing body for PCI Express interface standards including transfer rate specifications.
- NVM Express (NVMe) Standard — Industry specification governing NVMe storage interface protocol and latency characteristics.
- NIST NVD — CVE-2017-5715 (Spectre) — NIST National Vulnerability Database entry for the Spectre speculative execution vulnerability.
- NIST NVD — CVE-2017-5754 (Meltdown) — NIST National Vulnerability Database entry for the Meltdown speculative execution vulnerability.