Understanding MinIO Erasure Coding: Erasure Sets, Quorums, Degraded States, and Recovery at Scale
January 22, 2026
This post explains how MinIO implements erasure coding (EC) in practice—covering erasure sets, shard layout, read/write quorums, disk-full behavior, degraded operation, recovery (rebalance + heal), and how MinIO can safely scan very large object namespaces without collapsing under load.
Summarized by ChatGPT from a series of Q&As in ChatGPT.
0. Why Erasure Coding? MinIO’s Answer to Data Durability and Availability #
Unlike traditional storage systems that rely on RAID or simple replication for data protection, MinIO implements erasure coding as its core strategy for achieving both intra-cluster data durability and high availability.
Traditional Approaches: Trade-offs and Limitations #
RAID (e.g., RAID6):
- [+] Simple, hardware-based, well-understood
- [+] Low CPU overhead (hardware controllers)
- [-] Protects against only 2 drive failures
- [-] Rebuilds operate at the volume level (high downtime)
- [-] Cannot leverage distributed compute for recovery
- [-] Single controller bottleneck
Replication (e.g., 3x copies):
- [+] Simple to implement and reason about
- [+] Fast reads (can read from any copy)
- [+] No reconstruction overhead
- [-] High storage overhead: 3x replication = 200% overhead
- To tolerate 2 failures, you need 3 copies
- 1TB of user data requires 3TB of raw storage
- Usable capacity = 33% (1TB usable / 3TB raw)
- [-] Linear cost increase with redundancy
- [-] No flexible durability-vs-capacity trade-offs
Erasure Coding Benefits #
How Reed-Solomon Erasure Coding Works
Example: 2 data shards + 2 parity shards (N=4)
Encoding (Write):
- Assume we have a 1TB file
- Split it into 2 equal data chunks (D1 = 0.5TB, D2 = 0.5TB)
- Use Reed-Solomon algorithm to compute 2 parity chunks (P1, P2)
- Parity chunks are mathematical combinations, e.g.:
- P1 = D1 + D2
- P2 = D1 + 2×D2 (using different coefficients)
- Store: D1, D2, P1, P2 (each on a different drive)
Decoding (Read with failures):
- All drives healthy: Read D1 + D2 directly → assemble the file ✓
- Lose D1: Have D2, P1, P2. Solve: D1 = P1 - D2 → reconstruct file ✓
- Lose D1 and D2 (both data!): Have P1, P2. Solve system of equations to recover D1 and D2 ✓
- Lose any 2 drives: With 2 parity shards, you can always solve for any 2 missing shards
Simplified overview: Parity shards aren’t just “copies”—they’re mathematical combinations that let us solve for missing data algebraically. To reconstruct, we need any K of the N shards, in any combination of data and parity, where K is the number of data shards. This is why 2 data + 2 parity can tolerate 2 failures with 100% overhead, vs. 3x replication’s 200% overhead for the same fault tolerance.
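To make the algebra concrete, here is a minimal runnable sketch using the klauspost/reedsolomon Go library (the Reed-Solomon implementation MinIO builds on). The payload and the 2 data + 2 parity layout are purely illustrative:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	payload := []byte("hello erasure coding")

	// 2 data + 2 parity shards, matching the example above.
	enc, err := reedsolomon.New(2, 2)
	if err != nil {
		log.Fatal(err)
	}

	// Split allocates 4 shards: D1, D2, plus two (still empty) parity shards.
	shards, err := enc.Split(payload)
	if err != nil {
		log.Fatal(err)
	}

	// Compute P1 and P2 from the data shards.
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing both data shards (two drive failures).
	shards[0], shards[1] = nil, nil

	// Solve for the missing shards from P1 and P2 alone.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	// Reassemble the original payload from the repaired shards.
	var buf bytes.Buffer
	if err := enc.Join(&buf, shards, len(payload)); err != nil {
		log.Fatal(err)
	}
	fmt.Println(buf.String()) // "hello erasure coding"
}
```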
Erasure coding provides:
- Superior Fault Tolerance: Lose up to half the drives in an erasure set (vs. RAID6’s fixed 2 drives)
- Lower Storage Overhead: ~100% overhead with N/2 parity (default configuration)
- To tolerate 2 failures with 4 drives: 2 data + 2 parity
- 1TB of user data requires 2TB of raw storage
- Usable capacity = 50% (1TB usable / 2TB raw)
- Storage overhead = 100% (1TB extra / 1TB usable)
- Compare to 3x replication: 200% overhead for similar fault tolerance
- Object-Level Granularity: Each object is independently erasure-coded, enabling:
- Incremental, per-object healing
- No volume-level rebuild storms
- Parallel recovery across the cluster
- Bit Rot Protection: Built-in checksums (HighwayHash) detect silent data corruption (a sketch follows this list)
- Distributed Recovery: No single controller—each node participates in healing
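To illustrate the bit-rot check, here is a minimal sketch using the minio/highwayhash Go package. The all-zero key is purely for illustration:

```go
package main

import (
	"fmt"
	"log"

	"github.com/minio/highwayhash"
)

func main() {
	// HighwayHash requires a 256-bit key; a zero key is used here
	// purely for illustration.
	key := make([]byte, 32)

	h, err := highwayhash.New(key) // 256-bit output
	if err != nil {
		log.Fatal(err)
	}
	h.Write([]byte("shard contents as stored on disk"))
	fmt.Printf("checksum: %x\n", h.Sum(nil))
	// On read, the shard checksum is recomputed; a mismatch marks the
	// shard as corrupt, so it can be rebuilt from the remaining shards.
}
```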
MinIO’s Design Choice: Explicit Over Automatic #
MinIO deliberately chooses explicit, operator-driven recovery over automatic background processes. This design decision ensures:
- Predictable performance (no surprise rebuild storms)
- Stable foreground I/O during recovery
- Clear operational semantics
- Bounded resource consumption
This is not a limitation—it’s a design principle for production systems at scale.
1. Erasure Coding Basics (What MinIO Actually Implements) #
MinIO uses Reed–Solomon erasure coding with a default rule per erasure set:
data shards = floor(N / 2)
parity shards = ceil(N / 2)
Where N is the number of drives in one erasure set (can be 2 to 16).
Examples:
| N | Layout | Max Tolerable Failures |
|---|---|---|
| 4 | 2 data + 2 parity | 2 drives |
| 8 | 4 data + 4 parity | 4 drives |
| 9 | 4 data + 5 parity | 5 drives |
| 12 | 6 data + 6 parity | 6 drives |
| 16 | 8 data + 8 parity | 8 drives |
Note: The parity level can be customized using storage classes, but the default configuration is recommended for best protection.
Key invariants:
- Each shard (data or parity) is placed on a different drive
- Default layout provides maximum protection (ceil(N/2) failures tolerable)
- Parity ≥ data by design (safety-first)
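In code, the default split is one line of integer arithmetic. This small sketch reproduces the table above:

```go
package main

import "fmt"

// defaultLayout returns the default data/parity split for an
// erasure set of n drives, per the floor/ceil rule above.
func defaultLayout(n int) (data, parity int) {
	data = n / 2      // floor(N/2)
	parity = n - data // ceil(N/2)
	return
}

func main() {
	for _, n := range []int{4, 8, 9, 12, 16} {
		d, p := defaultLayout(n)
		fmt.Printf("N=%2d → %d data + %d parity\n", n, d, p)
	}
}
```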
2. Erasure Sets vs the Cluster (The Most Important Distinction) #
A MinIO cluster is not one giant EC system.
Instead:
Cluster
├── Erasure Set 1 (N=8)
├── Erasure Set 2 (N=8)
├── Erasure Set 3 (N=8)
└── ...
Properties:
- Objects belong to exactly one erasure set
- Objects never span erasure sets
- Failure, quorum, heal, and rebalance are all per erasure set
- N is per erasure set, not cluster-wide
Only in very small clusters (e.g., 4 drives total) do these coincide.
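Placement is deterministic: the object name is hashed to pick its one erasure set. A simplified sketch of the idea (MinIO uses a keyed SipHash-based scheme internally; the FNV hash below is a stand-in purely for illustration):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// erasureSetFor maps an object name to exactly one erasure set.
// Illustrative only: any stable hash mod setCount shows the idea.
func erasureSetFor(object string, setCount int) int {
	h := fnv.New32a()
	h.Write([]byte(object))
	return int(h.Sum32() % uint32(setCount))
}

func main() {
	for _, obj := range []string{"bucket/a.txt", "bucket/b.txt", "bucket/videos/c.mp4"} {
		fmt.Printf("%-22s → erasure set %d\n", obj, erasureSetFor(obj, 3))
	}
}
```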
3. How MinIO Chooses Erasure Set Size #
Erasure set size is auto-selected, not user-configurable.
MinIO considers together:
- Number of nodes
- Drives per node
- Total number of drives
It then chooses a bounded, conservative size, typically:
4, 8, or 16
Key consequences:
- Large clusters (e.g., 100 drives) are partitioned into many erasure sets
- MinIO will never create a 100-drive erasure set
- Existing erasure sets are never resized
- Adding drives creates new sets (server pools)
This design bounds failure blast radius and recovery cost.
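As a rough illustration of the idea (a deliberately simplified heuristic, not MinIO's actual selection algorithm, which also weighs node count and drives per node):

```go
package main

import "fmt"

// pickSetSize is a simplified stand-in for MinIO's selection logic:
// choose the largest bounded candidate that evenly divides the total
// drive count, so the cluster partitions into whole erasure sets.
func pickSetSize(totalDrives int) int {
	for _, size := range []int{16, 8, 4} { // conservative candidates
		if totalDrives%size == 0 {
			return size
		}
	}
	return 0 // no valid partition with these candidates
}

func main() {
	fmt.Println(pickSetSize(100)) // 4 → 25 sets of 4, never one 100-drive set
	fmt.Println(pickSetSize(64))  // 16 → 4 sets of 16
}
```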
4. Quorums: Reads and Writes Are Different #
MinIO uses two different quorums.
Read quorum (data quorum) #
read quorum = floor(N / 2) (data shards)
Reads succeed as long as MinIO can gather enough shards to reconstruct the object.
Example (N=4):
- Read quorum = 2
- Reads tolerate up to 2 unavailable drives
Example (N=9):
- Read quorum = 4
- Reads tolerate up to 5 unavailable drives
Write quorum #
write quorum = floor(N / 2) + 1
Writes require a strict majority to commit safely.
Example (N=4):
- Write quorum = 3
- Writes tolerate up to 1 unavailable drive
Example (N=9):
- Write quorum = 5
- Writes tolerate up to 4 unavailable drives
This prevents:
- Split-brain object versions
- Unhealable writes
- Partial visibility anomalies
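The quorum arithmetic in code, a sketch assuming the default floor/ceil layout above; it reproduces the tolerance numbers from both examples:

```go
package main

import "fmt"

// Quorum rules from above.
func readQuorum(n int) int  { return n / 2 }   // floor(N/2): enough shards to decode
func writeQuorum(n int) int { return n/2 + 1 } // floor(N/2) + 1: strict majority

func main() {
	for _, n := range []int{4, 9} {
		for down := 0; down <= n; down++ {
			up := n - down
			fmt.Printf("N=%d, %d drive(s) down → reads: %v, writes: %v\n",
				n, down, up >= readQuorum(n), up >= writeQuorum(n))
		}
	}
}
```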
5. What Happens When a Drive Becomes Full #
A full drive is treated as failed #
When a drive hits ENOSPC:
- It is marked unavailable
- It is excluded from both reads and writes
- MinIO does not treat it as “readable but not writable”
Reason:
- Metadata updates and fsync may fail
- Quorum math requires binary participation
- Predictable failure behavior is prioritized
Operationally:
A full drive is logically dead until space is available again.
6. Read Behavior Under Disk Full #
- Missing shards are treated as absent
- Reads reconstruct from parity
- Reads succeed if read quorum exists
Example (N=4):
| Unavailable drives | Reads |
|---|---|
| 1 | ✅ |
| 2 | ✅ |
| 3 | ❌ |
7. Write Behavior and Degraded Writes #
Writes require write quorum.
Example (N=4):
| Unavailable drives | Writes |
|---|---|
| 1 | ✅ (degraded) |
| 2 | ❌ |
| 3 | ❌ |
What “degraded write” means #
- All data shards are written
- Some parity shards are missing
- Object is readable
- Object is temporarily under-protected
MinIO allows degraded writes only while safety guarantees still hold.
8. Recovery Is Explicit: Rebalance and Heal #
Nothing happens automatically.
Rebalance (capacity & placement) #
mc admin rebalance start <alias>
- Makes new drives eligible for writes
- Redistributes objects across erasure sets
- Driven by capacity imbalance, not degradation
Heal (durability & parity) #
mc admin heal -r <alias>
- Reconstructs missing shards
- Repairs objects written in degraded mode
- Requires available writable space
Correct order:
rebalance → heal
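For example, after adding a new server pool (subcommands as documented for recent mc releases; verify against your mc version):
mc admin rebalance start <alias>
mc admin rebalance status <alias>
mc admin heal -r <alias>
Waiting for the rebalance to complete before healing ensures the heal has writable space for reconstructed shards.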
9. Multiple Degraded Erasure Sets: No Global Priority #
When multiple erasure sets are degraded and new drives are added:
- MinIO does not choose “which erasure set to recover first”
- Recovery is object-scoped, not set-scoped
- Rebalance and heal both operate over objects
Think:
MinIO heals objects; erasure sets recover implicitly as their objects are repaired.
10. Yes, Rebalance and Heal Scan Objects — Here’s Why That Works #
At first glance, “scanning all objects” sounds infeasible. It works because MinIO does not do a naïve scan.
10.1 Metadata-first, not data-first #
- Scans object metadata, not payload
- Metadata is tiny compared to data
- Healthy objects are skipped early
Result:
Most objects incur near-zero data I/O.
10.2 Object independence (critical) #
Each object is:
- Immutable
- Versioned
- An independent EC unit
There is:
- No global stripe table
- No block-group rebuild
- No set-wide dependency
This enables massive parallelism.
10.3 Distributed scanning #
- Each node scans only its own disks and its own metadata
- No central scanner
- No global coordinator bottleneck
Scan throughput scales with cluster size.
10.4 Incremental, resumable progress #
Rebalance and heal are:
- Checkpointed
- Pause/resume safe
- Long-lived background jobs
If interrupted:
- Progress is not lost
- No restart-from-zero penalty
10.5 Aggressive throttling and yielding #
Background operations:
- Yield to foreground reads/writes
- Are rate-limited (IOPS, bandwidth, CPU)
- Never block client I/O
Recovery is slow by design, but safe.
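Putting 10.1–10.5 together, the control flow looks roughly like the sketch below. This is a hypothetical illustration, not MinIO's actual internals; listLocalMetadata, healObject, saveCheckpoint, and foregroundBusy are invented names:

```go
package main

import "time"

// Hypothetical types and helpers, named purely for illustration.
type objectMeta struct {
	name    string
	healthy bool
}

func listLocalMetadata(fromCheckpoint string) []objectMeta { return nil }
func healObject(name string) error                         { return nil }
func saveCheckpoint(name string)                           {}
func foregroundBusy() bool                                 { return false }

// scanAndHeal sketches the metadata-first, checkpointed, throttled
// loop described in 10.1–10.5. Each node runs it over its own disks only.
func scanAndHeal(checkpoint string) {
	for _, meta := range listLocalMetadata(checkpoint) {
		// Yield to foreground client traffic before doing any work (10.5).
		for foregroundBusy() {
			time.Sleep(100 * time.Millisecond)
		}
		// Healthy objects are skipped with near-zero data I/O (10.1).
		if meta.healthy {
			saveCheckpoint(meta.name)
			continue
		}
		// Data is read only for degraded objects (10.6).
		if err := healObject(meta.name); err == nil {
			// Progress survives restarts (10.4).
			saveCheckpoint(meta.name)
		}
	}
}

func main() { scanAndHeal("") }
```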
10.6 Selective data movement #
- Rebalance moves only objects that must move
- Heal reads data only for degraded objects
Data I/O scales with damage and imbalance, not with cluster size.
10.7 Small erasure sets limit blast radius #
Because each object belongs to only one erasure set:
- Damage is localized
- Recovery touches only affected subsets
- Most of the cluster remains untouched
This is a key reason MinIO avoids large erasure sets.
11. What MinIO Deliberately Avoids #
MinIO does not:
- Auto-rebalance on capacity change
- Auto-heal synchronously
- Perform full-speed cluster scans
- Rebuild disks as a single unit
All of these cause rebuild storms and latency collapse in other systems.
12. The Correct Mental Model #
Do not think:
“MinIO repairs disks or erasure sets.”
Think instead:
“MinIO runs a long-lived, throttled, distributed map-reduce over object metadata.”
This is why:
- Large-scale recovery is feasible
- Foreground traffic remains stable
- Operator intent is explicit
13. Final Takeaways #
- Erasure coding is per erasure set, not per cluster
- Erasure sets are small, fixed, and conservative
- Reads and writes use different quorums
- A full drive is treated as failed
- Degraded writes are allowed—but bounded
- Rebalance and heal are explicit
- Recovery is object-scoped, metadata-first, incremental
One-sentence summary #
MinIO trades automatic, opaque recovery for explicit, predictable, object-driven convergence—and that is exactly why it scales safely.