Big Data Technologies: Hadoop, Spark, and Scalable Pipelines

Big data technologies address the engineering challenge of storing, processing, and analyzing datasets too large or fast-moving for single-machine relational systems to handle cost-effectively. Apache Hadoop and Apache Spark are the two most widely deployed open-source frameworks in this space, each occupying distinct positions within scalable data pipeline architectures. Understanding how these systems differ, where each fits, and how pipelines are structured around them is essential background for data science and computer science practitioners, infrastructure engineers, and architects designing large-scale analytical systems.


Definition and scope

Big data as an engineering domain is commonly framed around three structural dimensions first formalized by analyst Doug Laney in 2001: volume (dataset size exceeding the capacity of single-node storage or RAM), velocity (ingestion rates requiring near-real-time processing), and variety (structured tables, semi-structured JSON or XML, and unstructured text, audio, or binary formats co-existing in the same pipeline). The Apache Software Foundation (ASF), which governs both Hadoop and Spark as top-level projects, defines the scope of each framework through its project governance documentation at apache.org.

Apache Hadoop is a distributed batch-processing ecosystem built on two foundational components: the Hadoop Distributed File System (HDFS) and the MapReduce execution engine. HDFS replicates data blocks — 128 MB per block by default in Hadoop 3.x — across commodity nodes, tolerating node failures without data loss. MapReduce executes computation in two discrete phases: a Map phase that processes each data partition independently and a Reduce phase that aggregates results. Hadoop also includes YARN (Yet Another Resource Negotiator), the cluster resource manager that schedules jobs across available nodes.
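
The two-phase contract can be sketched in plain Python. This is a single-process illustration of the Map/shuffle/Reduce pattern, not the distributed implementation; the function names are ours:

```python
from collections import defaultdict

def map_phase(partition):
    # Map: process each record independently, emitting (key, value) pairs.
    for line in partition:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key (the framework does this between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

partition = ["big data big pipelines", "big data"]
counts = reduce_phase(shuffle(map_phase(partition)))
# counts == {"big": 3, "data": 2, "pipelines": 1}
```

In a real cluster, many mappers run one per input split and the shuffle moves data across the network, but the per-phase contract is the same.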

Apache Spark is an in-memory distributed processing engine released as an ASF top-level project in 2014. Spark represents computation as a Directed Acyclic Graph (DAG) of transformations applied to Resilient Distributed Datasets (RDDs) or, in its structured API, DataFrames and Datasets. By caching intermediate results in memory rather than writing them to disk between stages, Spark can run iterative workloads such as machine learning training loops up to 100 times faster than MapReduce in benchmarks, according to ASF project documentation (Apache Spark Overview).
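
The lazy-evaluation model behind Spark's DAG can be illustrated with a toy stdlib-Python class of our own (not Spark's API): transformations only record work, and nothing executes until an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, defers execution."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # the recorded transformation chain (a linear "DAG")

    def map(self, fn):
        # Transformation: returns a new dataset, runs nothing yet.
        return LazyDataset(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + [("filter", pred)])

    def count(self):
        # Action: replay the recorded operations, then materialize a result.
        items = self._data
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return sum(1 for _ in items)

ds = LazyDataset(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
n = ds.count()  # only now does any computation run
# n == 4  (the values 12, 14, 16, 18 survive the filter)
```

Deferring execution this way is what lets a real engine see the whole graph before running it, so it can pipeline stages and choose an efficient physical plan.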

The broader scalable pipeline ecosystem also includes Apache Kafka (distributed message streaming), Apache Flink (stateful stream processing), and Apache Hive (SQL abstraction over HDFS). These components connect to distributed systems principles governing fault tolerance, consistency, and partitioning that apply across all large-scale architectures.


How it works

A production big data pipeline typically moves through five discrete phases:

  1. Ingestion — Raw data arrives from source systems (databases, application logs, IoT sensors, API streams) via batch file transfer or streaming brokers such as Apache Kafka. Kafka stores messages in partitioned, replicated logs and retains them for a configurable period, commonly 7 days, enabling replay.

  2. Storage — Ingested data lands in a distributed storage layer. HDFS stores data across the cluster's local disks. Cloud-native pipelines substitute object storage (such as Amazon S3 or Google Cloud Storage) using the S3A or GCS connector interfaces compatible with the Hadoop filesystem API.

  3. Processing — Spark or Hadoop MapReduce reads stored data, applies transformation logic, and writes results. Spark organizes this as a DAG: each map() or filter() call registers a lazy transformation; execution is deferred until an action (such as count() or write()) is called, at which point Spark's optimizer (Catalyst, for the DataFrame API) selects a physical execution plan.

  4. Serving — Processed results are written to an analytical database, data warehouse, or low-latency key-value store for query access. Apache Hive provides HiveQL, a SQL dialect that compiles queries into MapReduce or Spark jobs over HDFS.

  5. Orchestration — A workflow scheduler such as Apache Airflow sequences tasks, handles dependencies, and retries failed stages. The National Institute of Standards and Technology (NIST) Big Data Public Working Group's reference architecture document, NIST SP 1500-6, maps these functional layers as Application Provider, Data Consumer, and Big Data Framework Provider roles.
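
The core loop of phase 5 (sequence tasks in dependency order, retry failed stages) can be sketched with stdlib Python. Production schedulers such as Airflow add persistence, parallelism, and backfills on top of this idea; all names below are ours:

```python
import time

def run_pipeline(tasks, max_retries=2, delay=0.0):
    """Run tasks in dependency order; retry each stage before failing the run."""
    results = {}
    for name, fn in tasks:  # tasks listed in dependency order
        for attempt in range(max_retries + 1):
            try:
                results[name] = fn(results)  # a stage may read upstream results
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted: fail the whole run
                time.sleep(delay)  # back off before retrying
    return results

tasks = [
    ("ingest",  lambda r: [3, 1, 2]),
    ("process", lambda r: sorted(r["ingest"])),
    ("serve",   lambda r: {"rows": len(r["process"])}),
]
out = run_pipeline(tasks)
# out["serve"] == {"rows": 3}
```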

Fault tolerance takes different forms at each layer. HDFS replicates blocks (three copies by default) so node failures cause no data loss, and MapReduce reruns failed tasks from their input splits. Spark instead relies on lineage: if a cached partition is lost, the engine recomputes it from its parent partitions using the recorded transformation graph rather than maintaining redundant copies in memory.
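
Lineage-based recovery amounts to keeping the recipe rather than a second copy. A minimal sketch of the idea (our own toy structures, not Spark's internals):

```python
def make_partition(parent_data, transform):
    """A partition records its parent data and the transformation deriving it."""
    return {
        "parent": parent_data,
        "transform": transform,
        "cache": [transform(x) for x in parent_data],
    }

def recover(partition):
    # Simulate losing the in-memory copy, then recomputing from lineage.
    partition["cache"] = None
    partition["cache"] = [partition["transform"](x) for x in partition["parent"]]
    return partition["cache"]

p = make_partition([1, 2, 3], lambda x: x * 10)
recovered = recover(p)
# recovered == [10, 20, 30]
```

The trade-off is recomputation cost on failure in exchange for not holding redundant copies of every intermediate result.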


Common scenarios

Batch ETL pipelines represent the dominant Hadoop use case. A financial institution might run nightly jobs that read 10 terabytes of transaction logs from HDFS, apply aggregation MapReduce jobs, and populate a data warehouse. MapReduce's disk-based execution makes it resilient for very large single-pass jobs where intermediate data volume exceeds cluster RAM.

Iterative machine learning is where Spark's in-memory model delivers measurable advantages. Training a gradient-boosted model over 50 iterations on the same dataset requires 50 passes through the data; Spark caches the dataset in memory after the first read, while MapReduce writes to disk between each iteration. The MLlib library, distributed with Spark, provides classification, regression, clustering, and collaborative filtering algorithms tuned for this execution model — a domain covered in greater depth on the machine learning fundamentals reference.
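
The 50-iteration arithmetic can be made concrete with a toy source that counts materializations. This is our illustration of the caching advantage, not MLlib code:

```python
class CountingSource:
    """Counts how many times the underlying dataset is materialized."""

    def __init__(self, data):
        self.data, self.reads = data, 0

    def read(self):
        self.reads += 1
        return list(self.data)

def train_without_cache(source, iterations):
    total = 0.0
    for _ in range(iterations):
        total += sum(source.read())  # every iteration re-reads (disk, in MapReduce)
    return total, source.reads

def train_with_cache(source, iterations):
    cached = source.read()  # one read, then reuse the in-memory copy
    return sum(sum(cached) for _ in range(iterations)), source.reads

_, reads_mr = train_without_cache(CountingSource([1, 2, 3]), 50)
_, reads_spark = train_with_cache(CountingSource([1, 2, 3]), 50)
# reads_mr == 50, reads_spark == 1
```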

Stream processing uses Spark Structured Streaming or Apache Flink to process data as it arrives. A retail analytics platform might compute rolling 5-minute revenue aggregates over a Kafka topic, enabling operational dashboards to reflect near-real-time sales without batch latency.
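
A rolling time-window aggregate of this kind can be sketched in stdlib Python for a time-ordered stream. The function is our single-process illustration; a streaming engine additionally handles out-of-order events, state checkpointing, and parallelism:

```python
from collections import deque

def rolling_revenue(events, window_seconds=300):
    """Yield (timestamp, windowed_sum) for a time-ordered (ts, amount) stream."""
    window = deque()
    total = 0.0
    for ts, amount in events:
        window.append((ts, amount))
        total += amount
        # Evict events older than the window from the running sum.
        while window and window[0][0] <= ts - window_seconds:
            _, old_amount = window.popleft()
            total -= old_amount
        yield ts, total

events = [(0, 10.0), (100, 5.0), (301, 2.0)]  # (seconds, revenue)
windowed = list(rolling_revenue(events))
# at ts=301 the ts=0 event has aged out: [(0, 10.0), (100, 15.0), (301, 7.0)]
```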

Graph analytics on large social or network datasets uses GraphX (Spark's graph library) or Apache Giraph, which runs on Hadoop YARN, to execute algorithms such as PageRank or connected-component detection across billions of edges.
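
PageRank itself is a short iterative fixed point, which is why it benefits so much from in-memory iteration. A minimal stdlib sketch over a three-node graph (our own implementation, not GraphX's; it assumes every node has at least one out-edge):

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Iterative PageRank over an adjacency dict {node: [out-neighbors]}."""
    nodes = set(edges) | {n for outs in edges.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        contrib = {n: 0.0 for n in nodes}
        for node, outs in edges.items():
            if outs:
                share = rank[node] / len(outs)  # spread rank over out-edges
                for n in outs:
                    contrib[n] += share
        rank = {n: (1 - damping) / len(nodes) + damping * contrib[n]
                for n in nodes}
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
# "c" ends up highest: it is linked from both "a" and "b"
```

On billions of edges the same per-iteration structure is partitioned across the cluster, with each superstep exchanging contributions between partitions.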


Decision boundaries

Choosing between Hadoop MapReduce and Spark, or combining them in a single architecture, depends on four measurable dimensions:

Dimension          | Hadoop MapReduce                           | Apache Spark
Execution model    | Disk-based, sequential Map → Reduce        | In-memory DAG with lazy evaluation
Latency            | Minutes to hours for typical jobs          | Seconds to minutes; sub-second for streaming
Best workload fit  | Single-pass batch ETL, very large shuffles | Iterative ML, interactive queries, streaming
Memory requirement | Low (disk-backed)                          | High; cluster RAM must accommodate working sets

When to prefer Hadoop MapReduce: Jobs whose shuffle sizes exceed available cluster memory, legacy ecosystem integration requirements, or environments where the hardware cost of provisioning large amounts of cluster RAM dominates the infrastructure budget.

When to prefer Spark: Iterative algorithms, interactive data exploration via Spark SQL or notebooks, unified batch-and-streaming pipelines using a single API, or workloads where latency below one minute is a hard requirement.

Hybrid architectures are common in practice. HDFS and YARN remain the storage and resource layers, while Spark replaces MapReduce as the execution engine — a configuration the ASF documents as "Spark on YARN." This preserves Hadoop's mature storage reliability while gaining Spark's processing speed.
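
A Spark-on-YARN submission typically looks like the following. The resource values are illustrative and the application artifact path is hypothetical:

```shell
# Submit a Spark application to a YARN cluster. "cluster" deploy mode runs
# the driver inside YARN; executor counts and sizes are illustrative.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 4g \
  --executor-cores 2 \
  path/to/app.jar
```

YARN then allocates the requested executor containers across the cluster's nodes, with Spark scheduling tasks onto them.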

The computer science authority reference index provides orientation across the broader subfield landscape, including parallel computing and cloud computing concepts that directly intersect with big data infrastructure design decisions.


References