What Is Hadoop, and Why It Isn't 10 Microservices on K8s

Someone asked me this exact question last week, and it’s a good one because both setups look the same if you squint. A bunch of machines, some shared storage in the middle, work spread across nodes. So why does one get called “big data” and the other “microservices”? Are they just two words for the same cluster? Honestly, no. They’re built on opposite assumptions about one thing: where the data lives and who moves to whom.

Let me unpack what Hadoop actually is first, then we’ll put the two side by side.

What is Hadoop, really?

Apache Hadoop is a framework for storing and processing datasets that are too big to fit on one machine, anywhere from terabytes to petabytes. It is an open-source project maintained under the Apache Software Foundation, and it follows a classic master-slave design across a cluster of cheap commodity machines [1]. The whole point: instead of buying one giant expensive server, you buy 50 ordinary ones and make them work together.

It has three layers, and you really need all three to understand it:

HDFS (Hadoop Distributed File System) — the storage layer. It takes your big files, chops them into block-sized chunks, and scatters those blocks across the disks of all the machines in the cluster [1].
YARN (Yet Another Resource Negotiator) — the resource manager. It decides which machine runs which piece of work, splitting resource management and job scheduling into separate daemons [1].
MapReduce — the original processing model. You write a “map” step and a “reduce” step, and the framework runs them in parallel across the cluster [1].
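To make the map and reduce steps concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop's API: the names `map_words`, `reduce_counts`, and `run_job` are hypothetical, and the in-memory grouping only imitates the shuffle that the real framework performs across machines.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce step: collapse all counts emitted for one word into a total."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle (group by key), and reduce inside one process."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        for word, count in map_words(line):
            grouped[word].append(count)  # the "shuffle": group values by key
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data moves slowly"]))
```

On a real cluster, many copies of the map function would run at once, each on a different block, and the grouped pairs would travel to reducer tasks instead of sitting in one dictionary.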

How HDFS spreads your data

When you drop a 1 TB file into HDFS, it doesn’t sit on one disk. HDFS breaks it into blocks (128 MB each by default) and stores those blocks across the slave nodes. A master daemon called the NameNode keeps the metadata — filenames, which blocks belong to which file, and crucially, which machine holds each block. The actual blocks live on DataNodes, the slave daemons running on each machine [1].

Each block also gets replicated (usually 3 copies on different machines), so if a disk dies — and with cheap hardware, disks die constantly — your data survives. This is the part people gloss over: Hadoop assumes failure is normal, not exceptional.
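The block and replica arithmetic is easy to check by hand. The sketch below uses the 128 MB block size and 3x replication quoted above, and assumes "1 TB" means 1024^4 bytes; the function name `hdfs_footprint` is hypothetical, not part of any Hadoop tool.

```python
BLOCK_SIZE_BYTES = 128 * 1024**2  # default block size quoted above
REPLICATION = 3                   # typical replication factor quoted above


def hdfs_footprint(file_size_bytes: int) -> tuple[int, int, int]:
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    # Integer ceiling division: a partial final block still counts as a block.
    blocks = (file_size_bytes + BLOCK_SIZE_BYTES - 1) // BLOCK_SIZE_BYTES
    replicas = blocks * REPLICATION
    # Every byte is stored REPLICATION times across the cluster.
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, replicas, raw_bytes


if __name__ == "__main__":
    blocks, replicas, raw = hdfs_footprint(1024**4)  # assumed: 1 TB = 1024^4 bytes
    print(blocks, replicas, raw // 1024**4)  # 8192 blocks, 24576 replicas, 3 TB raw
```

So a single 1 TB file turns into 8,192 blocks and 24,576 block replicas, and the NameNode has to track where each of them lives.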

How MapReduce processes it

Here’s the clever bit. In a typical Hadoop setup, the compute nodes and the storage nodes are the same machines. The MapReduce framework and HDFS run on the same set of nodes. This lets the framework schedule each task on the node where the data already physically sits, giving you very high aggregate bandwidth across the cluster [2].

That last sentence is the whole ballgame. Let me make it loud.

The one idea that defines Hadoop: move the code, not the data

A design goal baked into Hadoop from day one: move computation to the data, rather than moving data to the computation [3]. This is called data locality, and it’s not a nice-to-have optimization — it’s the reason Hadoop exists.

Think about it. Your code — the map function — is maybe a few kilobytes. The data block it needs to process is 128 MB. If you ship the data to where the code is, you’re pushing 128 MB across the network for every block, and you’ve got thousands of blocks. The network melts. Instead, Hadoop ships the tiny code to the machine that already holds the block, and runs it there, reading from the local disk [3].
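A rough back-of-the-envelope comparison shows the gap. The numbers below are assumptions for illustration: a 10 KB code payload stands in for "a few kilobytes", 8,192 blocks corresponds to a 1 TB file at 128 MB per block, and the code is counted as shipped once per block.

```python
BLOCK_SIZE_BYTES = 128 * 1024**2  # block size quoted above
CODE_SIZE_BYTES = 10 * 1024       # assumed size of the map code


def network_bytes(num_blocks: int, ship_data: bool) -> int:
    """Bytes crossing the network if every task ships either its block or the code."""
    per_task = BLOCK_SIZE_BYTES if ship_data else CODE_SIZE_BYTES
    return num_blocks * per_task


if __name__ == "__main__":
    blocks = 8192  # assumed: a 1 TB file at 128 MB per block
    data_moved = network_bytes(blocks, ship_data=True)
    code_moved = network_bytes(blocks, ship_data=False)
    print(data_moved // 1024**2, "MB if data moves")  # 1048576 MB, i.e. 1 TB
    print(code_moved // 1024**2, "MB if code moves")  # 80 MB
```

Under these assumptions, moving the code costs about 80 MB of traffic while moving the data costs the full terabyte, a difference of roughly four orders of magnitude.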

Hadoop even ranks how good a task placement is:

Data-local — the task runs on the exact node holding the block. Best case.
Rack-local — can’t get the exact node, so run it on another machine in the same server rack (fast network within a rack) [4].
Off-rack — worst case, data crosses racks. Hadoop tries hard to avoid this.
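The three tiers can be expressed as a small classification function. This is a toy sketch, not the actual YARN scheduler logic: the `Node` type, the `locality` and `best_node` functions, and the node and rack names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str


def locality(task_node: Node, replica_nodes: list[Node]) -> str:
    """Classify a placement relative to the nodes holding the block's replicas."""
    if any(task_node.name == r.name for r in replica_nodes):
        return "data-local"
    if any(task_node.rack == r.rack for r in replica_nodes):
        return "rack-local"
    return "off-rack"


def best_node(candidates: list[Node], replica_nodes: list[Node]) -> Node:
    """Pick the candidate with the best locality tier (ties keep list order)."""
    order = {"data-local": 0, "rack-local": 1, "off-rack": 2}
    return min(candidates, key=lambda n: order[locality(n, replica_nodes)])


if __name__ == "__main__":
    replicas = [Node("n1", "rackA"), Node("n4", "rackB")]
    free = [Node("n9", "rackC"), Node("n2", "rackA")]
    print(best_node(free, replicas).name)  # n2: rack-local beats off-rack
```

Replication helps here too: with three copies of each block, there are three nodes where a task can be fully data-local.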

Why does this matter so much? Because of something the storage world calls data gravity — large, active datasets attract applications to wherever they live, because moving the data becomes too slow and too expensive [5]. The storage itself isn’t the bottleneck; data movement is. Physics imposes limits you can’t engineer around — you can’t ship data as fast as you can create it [5]. Hadoop’s answer to data gravity was to stop fighting it. Bring the compute to the data. Surrender to gravity, basically.


Now the microservices-on-Kubernetes setup

Picture the other thing the question described: 10 microservices running in a Kubernetes cluster, all talking to some shared storage (say a network file system, or a cloud object store, or a managed database).

Kubernetes was fundamentally designed for stateless applications — services that don’t need to remember anything when they go down. That statelessness is what makes the magic work: declarative deployments, high availability, autoscaling, the ability to stop, restart, and clone a service with ease [6]. Your shopping-cart service, your auth service, your payment service — each one is a small box that handles a request and forgets about it.

And the architectural principle here is the exact opposite of Hadoop’s. In a Kubernetes microservices setup, you deliberately decouple storage from compute. The storage layer is completely separated from the compute layer that Kubernetes manages [6]. Stateless app-server compute connects to data services that often run outside the cluster entirely [6].

There’s another rule that comes from the microservices world itself: services shouldn’t share a data store. Each service is supposed to own its own dataset, precisely to avoid hidden dependencies and accidental coupling between services [6]. So the phrase “10 microservices with shared storage” is already a bit of a smell — true microservices architecture pushes away from shared storage, toward each service owning its slice. (You can read more about this tension in Microsoft’s AKS microservices reference architecture.)

When you do attach storage in Kubernetes, you usually wire the cluster to traditional infrastructure exposed over NFS, GlusterFS, or cloud file systems like Amazon EFS, Azure Files, and Google Cloud Filestore [6]. The compute and the storage are on different machines, connected by the network. The data is remote by design.

Side by side: the actual difference

So let’s stop being abstract. Here’s where they genuinely diverge.

Dimension	Hadoop	10 microservices on K8s + shared storage
Core job	Process huge datasets in bulk	Handle many independent requests
Where data lives	On the same nodes that compute (HDFS)	On remote/shared storage, separate from compute
Guiding principle	Move code to the data [3]	Decouple compute from storage [6]
State	Storage-centric, data is the system	Compute is stateless, state pushed out [6]
Unit of work	A batch job split into map/reduce tasks	A request/response per service
What crosses the network	Mostly tiny code; data stays local	The data itself, on every operation
Coupling	Tightly co-located compute + storage	Loosely coupled, each service independent
Scaling target	Throughput on massive data	Concurrency and availability of requests
Failure model	Assumes disks die; replicates blocks	Assumes pods die; reschedules stateless pods

The headline: Hadoop co-locates compute and storage on purpose; Kubernetes microservices separate them on purpose. They aren’t two flavors of the same idea. They’re answers to two different questions.

Hadoop asks: “I have a mountain of data sitting still. How do I run computation over all of it without choking the network?”

Kubernetes microservices ask: “I have a flood of small, independent requests. How do I serve them reliably, scale each piece on its own, and survive crashes?”

A scenario to make it click

Say you run an e-commerce site.

The microservices on Kubernetes are your storefront in motion: a user clicks “add to cart,” the cart service handles it, the inventory service checks stock, the payment service charges the card. Ten small services, each scaling to traffic, each ideally owning its own data. Requests are tiny, the data each touches is tiny, and latency is everything. Shipping a few KB to a shared database over the network is totally fine here.

Now, at 2 AM, you want to analyze every order from the last five years to find buying patterns. That’s 8 TB of historical logs. You’re not going to pull 8 TB through a microservice over the network: both your network and your wallet would buckle under that much data movement, which is data gravity at work [5]. This is the Hadoop-shaped job: park the data on HDFS, ship the analysis code to where the blocks live, crunch it in parallel, write out a summary. The data never moves; the code does.
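To put rough numbers on that 8 TB, here is a small estimate. Everything except the 8 TB figure is an assumption for illustration: a single fully utilized 1 Gbit/s link, the 50-machine cluster from the earlier example, and 150 MB/s of sequential read throughput per node. The function names are hypothetical.

```python
def transfer_hours(num_bytes: float, link_bits_per_s: float) -> float:
    """Hours to push num_bytes through one fully utilized network link."""
    return num_bytes * 8 / link_bits_per_s / 3600


def local_scan_hours(num_bytes: float, nodes: int, disk_bytes_per_s: float) -> float:
    """Hours for `nodes` machines to each read an equal share from local disk."""
    return num_bytes / nodes / disk_bytes_per_s / 3600


if __name__ == "__main__":
    logs = 8e12  # 8 TB of historical logs, in decimal bytes
    print(round(transfer_hours(logs, 1e9), 1))           # 17.8 hours over 1 Gbit/s
    print(round(local_scan_hours(logs, 50, 150e6), 2))   # 0.3 hours across 50 nodes
```

Real jobs add overhead on both sides, but the shape holds: one pipe to a remote store takes most of a day, while 50 disks read in parallel finish in under 20 minutes.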

Same company, two completely different problems, two completely different architectures. They’re not competitors — honestly, they often coexist in the same building.

“But my microservices process data too!”

Sure they do. This is where it gets tricky, and where the comparison feels blurry. A microservice can absolutely read a file, transform it, and write it back. So what’s the real line?

It comes down to three things:

Volume and locality. A microservice fetches a row, a document, a small blob — it pulls remote data to the compute. Hadoop refuses to do that at scale because the data is too big to move, so it sends compute to the data [3]. If your job’s bottleneck is “I can’t even read all the data fast enough,” you’re in Hadoop territory.
Granularity of work. Microservices think in requests — short, isolated, low-latency. Hadoop thinks in jobs — long-running batch passes over an entire dataset where you happily wait minutes or hours [7]. MapReduce’s batch nature actually causes latency issues for real-time work, which is exactly why it’s the wrong tool for serving live requests [7].
Coupling philosophy. Microservices want loose coupling and independent storage ownership so teams can move fast without stepping on each other [6]. Hadoop wants tight co-location of compute and storage so it can win the network fight. These goals literally contradict each other.

So a microservice doing a bit of data processing is still a microservice. It becomes a “big data system” when the data is the immovable center of gravity and the compute has to orbit it.

A wrinkle worth knowing: even big data moved away from “pure” Hadoop

I’d be doing you a disservice if I made it sound like Hadoop is still the default for big data. It isn’t anymore. Hadoop helped popularize large-scale distributed processing on commodity clusters, but it’s declining for new projects [7]. MapReduce writes intermediate results to disk between steps, which is slow, so Apache Spark came along and did the same kind of distributed processing largely in memory, dramatically speeding up iterative and interactive work [7].

And the broader ecosystem is huge — Hadoop isn’t just MapReduce. It grew tools like:

Hive — a data-warehouse layer that lets you query HDFS data with a SQL-like language called HiveQL [8].
HBase — a distributed database for storing structured data in tables with billions of rows and millions of columns, sitting directly on HDFS [8].
Pig — a high-level platform using PigLatin to load, filter, and transform large datasets [8].

Interestingly, the modern cloud-data world has partly un-learned Hadoop’s core lesson. Tools like BigQuery and Snowflake deliberately separate storage from compute again — the same decoupling Kubernetes microservices use — because in the cloud, network bandwidth between object storage and compute got fast and cheap enough that data locality matters less than it did in 2008. So in a funny way, the pendulum swung from “co-locate everything” (Hadoop) toward “separate everything” (cloud data warehouses and microservices alike). Whether that holds as AI workloads explode data volumes again is an open question — some folks argue data gravity is roaring back and never really left [5].

So, are they the same thing or not?

No. And here’s the cleanest way I can put it.

A Kubernetes cluster running microservices is a request-serving machine. Its instinct is to keep compute stateless and lean, and to push data out to wherever it can be shared or owned independently. The network carries data to the compute, and that’s acceptable because each request touches a tiny amount [6].

Hadoop is a data-crunching machine. Its instinct is to nail the data down on the same disks that compute, because moving petabytes is a losing battle against physics. The network carries code to the data [3].

If you remember just one line: microservices move data to code; Hadoop moves code to data. Everything else (the daemons, the replication, the YARN scheduling, even the StatefulSets that Kubernetes offers for pods that do need stable storage) flows from that single decision.

They overlap in superficial ways (clusters, distribution, fault tolerance), which is why the question is so natural to ask. But the architectures are built on opposite bets about the most expensive thing in any distributed system: moving bytes across a wire.
