Parallel Computing: Threads, Multiprocessing, and GPU Programming

Parallel computing restructures computation so that multiple operations execute simultaneously rather than sequentially, allowing programs, in favorable cases, to complete work orders of magnitude faster than single-threaded execution permits. This page covers the three dominant execution models — threading, multiprocessing, and GPU programming — along with their mechanical structure, classification boundaries, practical tradeoffs, and common points of confusion. The topic spans hardware architecture, operating system scheduling, and programming language abstractions, all of which interact to determine what parallelism is actually achievable on a given system.


Definition and scope

Parallel computing is the simultaneous execution of computational tasks across two or more processing units, whether those units are CPU cores within a single chip, separate processors sharing memory, independent nodes in a cluster, or the thousands of shader cores inside a graphics processing unit. The field is formally bounded by Flynn's taxonomy, a classification scheme published by Michael Flynn in 1972 that partitions computer architectures into four categories based on instruction stream and data stream counts: SISD (single instruction, single data), SIMD (single instruction, multiple data), MISD (multiple instruction, single data), and MIMD (multiple instruction, multiple data). Most practical parallel systems fall into SIMD (common in GPU workloads) or MIMD (common in multi-core CPU programming).

The scope of parallel computing intersects with the broader study of distributed systems, though the two are distinct: parallel systems share memory or a common clock domain, while distributed systems communicate across independent nodes with separate memory spaces. Within the foundational landscape of the discipline — covered across computer science subfields — parallel computing occupies the layer between hardware architecture and algorithm design.

The IEEE Technical Committee on Parallel Processing defines parallel computing as encompassing parallel algorithm design, parallel programming languages and compilers, parallel architectures, and performance analysis (IEEE TCPP). Scope boundaries are relevant because funding bodies, curriculum boards, and job classifications treat parallel computing and distributed computing as overlapping but non-identical domains.


Core mechanics or structure

Threads

A thread is the smallest unit of scheduled execution within an operating system process. Threads within the same process share heap memory, file descriptors, and global variables, but each maintains its own stack pointer, program counter, and register set. The POSIX Threads (pthreads) standard (The Open Group Base Specifications Issue 7), adopted across UNIX-derived operating systems, defines the system calls and data structures that govern thread creation, synchronization, and termination.

Synchronization primitives — mutexes, semaphores, condition variables, and read-write locks — prevent race conditions when threads access shared data. A mutex grants exclusive access to one thread at a time; a semaphore allows up to N threads through a critical section simultaneously, where N is a configurable integer.
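
The two primitives can be sketched in Python's standard threading module — a minimal example with a hypothetical shared counter, where the semaphore throttles entry and the mutex makes the read-modify-write atomic:

```python
import threading

counter = 0
lock = threading.Lock()                 # mutex: one thread at a time
slots = threading.BoundedSemaphore(3)   # semaphore: up to N=3 threads at once

def increment(n):
    global counter
    for _ in range(n):
        with slots:                     # throttles entry to the section
            with lock:                  # makes the read-modify-write atomic
                counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: no lost updates
```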

Multiprocessing

Multiprocessing assigns separate processes to separate CPU cores or sockets. Because each process has its own memory space, inter-process communication (IPC) requires explicit mechanisms: pipes, shared memory segments, message queues, or sockets. Python's multiprocessing module, documented by the Python Software Foundation, creates worker processes by forking on most UNIX platforms (or by spawning a fresh interpreter, the default on Windows and macOS) and passes data between them via the pickle serialization protocol.

GPU Programming

GPUs execute thousands of lightweight threads simultaneously using a Single Instruction, Multiple Data (SIMD) execution model, also described as SIMT (Single Instruction, Multiple Thread) in NVIDIA's architecture documentation (NVIDIA CUDA C++ Programming Guide). The CUDA programming model organizes threads into blocks, and blocks into grids. A warp — the fundamental scheduling unit on NVIDIA hardware — consists of 32 threads that execute in lockstep. NVIDIA's A100 GPU contains 6,912 CUDA cores, illustrating the scale differential between CPU and GPU parallelism.
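
The grid/block hierarchy assigns each thread a unique global index via the canonical CUDA mapping blockIdx.x * blockDim.x + threadIdx.x; a pure-Python sketch of that arithmetic, with hypothetical grid dimensions:

```python
def global_index(block_idx, block_dim, thread_idx):
    # CUDA's canonical 1-D mapping: blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

def warp_id(thread_idx, warp_size=32):
    # Threads 0-31 of a block form warp 0, threads 32-63 form warp 1, etc.
    return thread_idx // warp_size

# A grid of 4 blocks of 256 threads covers 1,024 data elements without gaps:
indices = [global_index(b, 256, t) for b in range(4) for t in range(256)]
assert indices == list(range(1024))

# Each 256-thread block contains 8 warps:
assert warp_id(255) == 7
```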

OpenCL, maintained by the Khronos Group, provides a vendor-neutral alternative that targets GPUs, FPGAs, and multi-core CPUs through a unified programming model (Khronos OpenCL Registry).


Causal relationships or drivers

Three hardware trends drive the adoption of parallel computing as a necessity rather than an optimization:

Clock speed stagnation. After hitting physical limits in heat dissipation around 2004, CPU manufacturers shifted from increasing clock frequency to increasing core count. Intel's Tick-Tock and Process-Architecture-Optimization roadmaps, documented in Intel's public architecture disclosures, reflect this structural shift. Sequential code gains no automatic speedup from additional cores.

Memory bandwidth asymmetry. CPU performance has historically improved much faster than DRAM access latency, a widening gap called the "memory wall." Parallel execution hides memory latency by allowing other threads to proceed while some threads wait for data from main memory.

Workload characteristics. Machine learning training, scientific simulation, image processing, and financial modeling all exhibit data parallelism — the same operation applied to independent data elements — which maps directly onto parallel hardware. The relationship between workload parallelizability and achievable speedup is governed by Amdahl's Law, formalized by Gene Amdahl in 1967, which states that the maximum speedup S from N processors is bounded by the fraction of the program that cannot be parallelized (1 − P): S ≤ 1 / ((1 − P) + P/N). If 5% of a program is strictly sequential, the theoretical maximum speedup from any number of processors is 20×.
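
Amdahl's bound is a one-line computation; a small Python helper makes the 20× ceiling concrete:

```python
def amdahl_speedup(parallel_fraction, n_processors):
    # S <= 1 / ((1 - P) + P / N)
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_processors)

# 5% strictly sequential (P = 0.95): the bound approaches 20x as N grows.
print(round(amdahl_speedup(0.95, 8), 2))          # 5.93
print(round(amdahl_speedup(0.95, 1_000_000), 2))  # 20.0
```

Note how quickly returns diminish: even 8 processors already capture less than a third of the theoretical ceiling.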


Classification boundaries

Parallel computing subdivides along four orthogonal axes, each of which determines programming model, hardware requirements, and performance characteristics:

Memory model: Shared memory (threads see the same address space) versus distributed memory (processes communicate via messages). OpenMP targets shared memory; MPI (Message Passing Interface), standardized by the MPI Forum, targets distributed memory.

Granularity: Fine-grained parallelism (thousands of short-lived threads, typical of GPU workloads) versus coarse-grained parallelism (a small number of long-running processes, typical of MPI job launches on HPC clusters).

Synchronization model: Synchronous (all threads reach a barrier before proceeding) versus asynchronous (threads proceed independently, with explicit signaling only where needed).

Hardware coupling: Tightly coupled (shared-memory multiprocessors) versus loosely coupled (networked nodes). The boundary between parallel and distributed systems is drawn here — distributed computing, covered separately in the computer architecture and organization context, assumes loosely coupled nodes with independent failure domains.


Tradeoffs and tensions

Concurrency vs. parallelism. Concurrency describes program structure (tasks logically overlap); parallelism describes execution (tasks physically overlap). A single-core system can be concurrent but not parallel. Python's asyncio framework enables concurrency on a single OS thread; Go's goroutines are concurrent by construction and become parallel only when the runtime schedules them across multiple OS threads.
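
A minimal asyncio sketch of concurrency without parallelism — three hypothetical tasks overlap in time on one OS thread because their simulated I/O waits interleave, yet no two ever execute simultaneously:

```python
import asyncio

async def fetch(name, delay):
    # Simulated I/O wait; the event loop runs other tasks meanwhile.
    await asyncio.sleep(delay)
    return name

async def main():
    # Three tasks overlap logically: total wall time is roughly the
    # longest delay (~0.03 s), not the sum (~0.06 s).
    return await asyncio.gather(
        fetch("a", 0.01), fetch("b", 0.02), fetch("c", 0.03)
    )

results = asyncio.run(main())
print(results)  # ['a', 'b', 'c']
```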

Thread overhead vs. process isolation. Threads are cheaper to create than processes: Linux clone() with CLONE_VM shares the caller's address space, whereas creating a new process requires duplicating page tables and setting up copy-on-write mappings. But shared memory creates correctness hazards. Processes provide isolation at the cost of IPC overhead and serialization latency.

GPU utilization vs. data transfer cost. Moving data between host (CPU) RAM and device (GPU) VRAM over a PCIe 4.0 x16 bus delivers a theoretical peak bandwidth of 32 GB/s in each direction. A computation that runs 10× faster on GPU than CPU may still show no net speedup if the data transfer time dominates total runtime. This is the primary reason GPU acceleration is effective for large batch operations but not for low-latency, small-data workloads.
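
The break-even arithmetic can be sketched with a simplified bandwidth-only cost model (a deliberate simplification: real transfers also pay fixed launch and latency overheads not modeled here):

```python
def gpu_worthwhile(data_bytes, cpu_time_s, gpu_time_s, bandwidth_bytes_s=32e9):
    # Round trip over PCIe 4.0 x16: host->device before the kernel,
    # device->host after. Bandwidth-only model; fixed overheads ignored.
    transfer_s = 2 * data_bytes / bandwidth_bytes_s
    return (gpu_time_s + transfer_s) < cpu_time_s

# 1 GiB workload, CPU 1.0 s, GPU kernel 0.1 s: transfer ~0.067 s, a clear win.
print(gpu_worthwhile(2**30, 1.0, 0.1))       # True

# 8 GiB moved for a short computation: the same 10x kernel speedup is erased,
# because ~0.54 s of transfer dominates the 0.3 s CPU baseline.
print(gpu_worthwhile(8 * 2**30, 0.3, 0.03))  # False
```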

Scalability vs. synchronization cost. Adding cores increases potential throughput but also increases contention on shared resources — locks, cache lines, and memory buses. Gustafson's Law offers a more optimistic scaling model than Amdahl's Law by holding the per-processor workload constant rather than total problem size, but neither law eliminates synchronization overhead in practice.
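
The divergence between the two laws is easy to quantify — Gustafson's scaled speedup is S = (1 − P) + P·N, with per-processor work held fixed:

```python
def amdahl_speedup(parallel_fraction, n_processors):
    # Fixed problem size: S <= 1 / ((1 - P) + P / N)
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_processors)

def gustafson_speedup(parallel_fraction, n_processors):
    # Scaled problem size: S = (1 - P) + P * N
    p = parallel_fraction
    return (1.0 - p) + p * n_processors

# Same 95%-parallel program on 1,024 processors: the models diverge sharply.
print(round(amdahl_speedup(0.95, 1024), 1))   # 19.6  (fixed problem size)
print(round(gustafson_speedup(0.95, 1024)))   # 973   (problem grows with N)
```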


Common misconceptions

Misconception: More threads always means faster execution.
Thread creation, context switching, and synchronization all carry overhead. On a 4-core machine, spawning 1,000 threads for a CPU-bound workload produces worse throughput than 4 threads pinned to physical cores, because the OS spends significant time in scheduler decisions rather than useful work.

Misconception: Python's threading module enables CPU parallelism.
CPython's Global Interpreter Lock (GIL) serializes bytecode execution so that only one thread runs Python code at a time, regardless of core count. The Python Software Foundation's documentation explicitly describes this constraint. Threading in CPython achieves I/O concurrency, not CPU parallelism; multiprocessing or external C extensions that release the GIL are required for CPU-bound parallelism.

Misconception: GPU programming is universally faster than CPU programming.
GPUs excel at data-parallel, arithmetic-intensive workloads with regular memory access patterns. Control-flow-heavy algorithms, pointer-chasing data structures (linked lists, trees), and workloads with high branch divergence perform worse on GPU than on CPU because branch divergence causes warp threads to serialize.

Misconception: Race conditions are prevented by operating systems automatically.
The OS guarantees atomicity of individual system calls but not of multi-step critical sections in user space. Two threads can each read a shared counter, increment their local copy, and write back — producing a lost update — entirely within user space, with no OS intervention point.
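
The lost-update interleaving can be replayed deterministically in plain Python — simulated sequential steps standing in for two real threads:

```python
# Deterministic replay of a lost update: two "threads" each read the
# shared counter, increment a local copy, then write back.
shared = 0
t1_local = shared       # thread 1 reads 0
t2_local = shared       # thread 2 reads 0, before thread 1 writes
shared = t1_local + 1   # thread 1 writes 1
shared = t2_local + 1   # thread 2 writes 1 -- thread 1's update is lost
print(shared)  # 1, not 2
```

With real threads the interleaving is non-deterministic, which is precisely what makes such bugs hard to reproduce; the OS has no reason to intervene at any point in this sequence.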


Checklist or steps

The following phases describe the process of analyzing and structuring a parallel program:

  1. Decompose the problem. Identify independent work units — data decomposition (splitting datasets) or task decomposition (splitting distinct operations).
  2. Identify dependencies. Map data flow and control flow dependencies that constrain execution order.
  3. Choose a parallelism model. Select threading (shared memory, same process), multiprocessing (separate address spaces, same or different machines), or GPU execution (SIMT, data-parallel) based on workload characteristics.
  4. Select a programming interface. OpenMP for shared-memory CPU parallelism; MPI for distributed multiprocessing; CUDA or OpenCL for GPU; language-native constructs (Java's ForkJoinPool, Python's concurrent.futures) as higher-level abstractions.
  5. Minimize critical sections. Reduce the scope and duration of locks to limit contention.
  6. Measure baseline sequential performance. Amdahl's Law requires knowledge of the sequential fraction P before speedup projections are meaningful.
  7. Profile parallel execution. Use hardware performance counters, profiling tools (NVIDIA Nsight, Intel VTune), and thread sanitizers to identify bottlenecks and race conditions.
  8. Validate correctness under concurrent conditions. Stress-test with thread count, load, and timing variations to surface non-deterministic bugs.

Reference table or matrix

Model           | Memory Space              | Typical Core Count       | Communication Mechanism        | Representative API          | Best-fit Workload
Multithreading  | Shared                    | 2–128 CPU cores          | Mutexes, condition variables   | POSIX pthreads, OpenMP      | I/O-bound, shared-data workloads
Multiprocessing | Separate (per process)    | 2–128+ CPU cores         | Pipes, sockets, shared memory  | Python multiprocessing, MPI | CPU-bound, isolation-required workloads
GPU (CUDA)      | Host + device (separate)  | 128–16,384+ shader cores | PCIe transfer, unified memory  | NVIDIA CUDA                 | Large-batch matrix math, ML training
GPU (OpenCL)    | Host + device (separate)  | Vendor-variable          | PCIe transfer                  | Khronos OpenCL              | Cross-vendor data-parallel workloads
Distributed MPI | Distributed (per node)    | Thousands of nodes       | Network messages               | MPI-3.1 standard            | HPC simulation, scientific computing

For practitioners working through the broader hierarchy of foundational topics — from algorithms and data structures to computational complexity theory — parallel computing represents the point at which algorithmic design constraints intersect directly with physical hardware limits.

