Understanding MinIO Erasure Coding: Erasure Sets, Quorums, Degraded States, and Recovery at Scale
January 22, 2026
This post explains how MinIO implements erasure coding (EC) in practice—covering erasure sets, shard layout, read/write quorums, disk-full behavior, degraded operation, recovery (rebalance + heal), and how MinIO can safely scan very large object namespaces without collapsing under load.
Summarized by ChatGPT from a series of Q&As in ChatGPT.
0. Why Erasure Coding? MinIO’s Answer to Data Durability and Availability #
Unlike traditional storage systems that rely on RAID or simple replication for data protection, MinIO implements erasure coding as its core strategy for achieving both intra-cluster data durability and high availability.
Traditional Approaches: Trade-offs and Limitations #
RAID (e.g., RAID6):
- [+] Simple, hardware-based, well-understood
- [+] Low CPU overhead (hardware controllers)
- [-] Protects against only 2 drive failures
- [-] Rebuilds operate at the volume level (high downtime)
- [-] Cannot leverage distributed compute for recovery
- [-] Single controller bottleneck
Replication (e.g., 3x copies):
- [+] Simple to implement and reason about
- [+] Fast reads (can read from any copy)
- [+] No reconstruction overhead
- [-] High storage overhead: 3x replication = 200% overhead
- To tolerate 2 failures, you need 3 copies
- 1TB of user data requires 3TB of raw storage
- Usable capacity = 33% (1TB usable / 3TB raw)
- [-] Linear cost increase with redundancy
- [-] No flexible durability-vs-capacity trade-offs
Erasure Coding Benefits #
How Reed-Solomon Erasure Coding Works
Example: 2 data shards + 2 parity shards (N=4)
Encoding (Write):
- Assume we have a 1TB file
- Split it into 2 equal data chunks (D1 = 0.5TB, D2 = 0.5TB)
- Use Reed-Solomon algorithm to compute 2 parity chunks (P1, P2)
- Parity chunks are mathematical combinations, e.g.:
- P1 = D1 + D2
- P2 = D1 + 2×D2 (using different coefficients)
- Store: D1, D2, P1, P2 (each on a different drive)
Decoding (Read with failures):
- All drives healthy: Read D1 + D2 directly → assemble the file ✓
- Lose D1: Have D2, P1, P2. Solve: D1 = P1 - D2 → reconstruct file ✓
- Lose D1 and D2 (both data!): Have P1, P2. Solve system of equations to recover D1 and D2 ✓
- Lose any 2 drives: With 2 parity shards, you can always solve for any 2 missing shards
Simplified overview: Parity shards aren’t just “copies”—they’re mathematical combinations that let us solve for missing data algebraically. To reconstruct, we need any K of the N shards, in any combination of data and parity, where K is the number of data shards. This is why 2 data + 2 parity can tolerate 2 failures with 100% overhead, vs. 3x replication’s 200% overhead for the same fault tolerance.
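To make the algebra concrete, here is a minimal runnable sketch using the klauspost/reedsolomon Go library (the Reed-Solomon implementation MinIO builds on). The payload and the 2 data + 2 parity layout are purely illustrative:

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	payload := []byte("hello erasure coding")

	// 2 data + 2 parity shards, matching the example above.
	enc, err := reedsolomon.New(2, 2)
	if err != nil {
		log.Fatal(err)
	}

	// Split allocates 4 shards: D1, D2, plus two (still empty) parity shards.
	shards, err := enc.Split(payload)
	if err != nil {
		log.Fatal(err)
	}

	// Compute P1 and P2 from the data shards.
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing both data shards (two drive failures).
	shards[0], shards[1] = nil, nil

	// Solve for the missing shards from P1 and P2 alone.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	// Reassemble the original payload from the repaired shards.
	var buf bytes.Buffer
	if err := enc.Join(&buf, shards, len(payload)); err != nil {
		log.Fatal(err)
	}
	fmt.Println(buf.String()) // "hello erasure coding"
}
```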
Erasure coding provides:
- Superior Fault Tolerance: Lose up to half the drives in an erasure set (vs. RAID6’s fixed 2 drives)
- Lower Storage Overhead: ~100% overhead with N/2 parity (default configuration)
- To tolerate 2 failures with 4 drives: 2 data + 2 parity
- 1TB of user data requires 2TB of raw storage
- Usable capacity = 50% (1TB usable / 2TB raw)
- Storage overhead = 100% (1TB extra / 1TB usable)
- Compare to 3x replication: 200% overhead for similar fault tolerance
- Object-Level Granularity: Each object is independently erasure-coded, enabling:
- Incremental, per-object healing
- No volume-level rebuild storms
- Parallel recovery across the cluster
- Bit Rot Protection: Built-in checksums (HighwayHash) detect silent data corruption (a sketch follows this list)
- Distributed Recovery: No single controller—each node participates in healing
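To illustrate the bit-rot check, here is a minimal sketch using the minio/highwayhash Go package. The all-zero key is purely for illustration:

```go
package main

import (
	"fmt"
	"log"

	"github.com/minio/highwayhash"
)

func main() {
	// HighwayHash requires a 256-bit key; a zero key is used here
	// purely for illustration.
	key := make([]byte, 32)

	h, err := highwayhash.New(key) // 256-bit output
	if err != nil {
		log.Fatal(err)
	}
	h.Write([]byte("shard contents as stored on disk"))
	fmt.Printf("checksum: %x\n", h.Sum(nil))
	// On read, the shard checksum is recomputed; a mismatch marks the
	// shard as corrupt, so it can be rebuilt from the remaining shards.
}
```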
MinIO’s Design Choice: Explicit Over Automatic #
MinIO deliberately chooses explicit, operator-driven recovery over automatic background processes. This design decision ensures:
- Predictable performance (no surprise rebuild storms)
- Stable foreground I/O during recovery
- Clear operational semantics
- Bounded resource consumption
This is not a limitation—it’s a design principle for production systems at scale.
1. Erasure Coding Basics (What MinIO Actually Implements) #
MinIO uses Reed–Solomon erasure coding with a default rule per erasure set:
data shards = floor(N / 2)
parity shards = ceil(N / 2)
Where N is the number of drives in one erasure set (can be 2 to 16).
Examples:
| N | Layout | Max Tolerable Failures |
|---|---|---|
| 4 | 2 data + 2 parity | 2 drives |
| 8 | 4 data + 4 parity | 4 drives |
| 9 | 4 data + 5 parity | 5 drives |
| 12 | 6 data + 6 parity | 6 drives |
| 16 | 8 data + 8 parity | 8 drives |
Note: The parity level can be customized using storage classes, but the default configuration is recommended for best protection.
Key invariants:
- Each shard (data or parity) is placed on a different drive
- Default layout provides maximum protection (ceil(N/2) failures tolerable)
- Parity ≥ data by design (safety-first)
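In code, the default split is one line of integer arithmetic. This small sketch reproduces the table above:

```go
package main

import "fmt"

// defaultLayout returns the default data/parity split for an
// erasure set of n drives, per the floor/ceil rule above.
func defaultLayout(n int) (data, parity int) {
	data = n / 2      // floor(N/2)
	parity = n - data // ceil(N/2)
	return
}

func main() {
	for _, n := range []int{4, 8, 9, 12, 16} {
		d, p := defaultLayout(n)
		fmt.Printf("N=%2d → %d data + %d parity\n", n, d, p)
	}
}
```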
2. Erasure Sets vs the Cluster (The Most Important Distinction) #
A MinIO cluster is not one giant EC system.
Instead:
Cluster
├── Erasure Set 1 (N=8)
├── Erasure Set 2 (N=8)
├── Erasure Set 3 (N=8)
└── ...
Properties:
- Objects belong to exactly one erasure set
- Objects never span erasure sets
- Failure, quorum, heal, and rebalance are all per erasure set
- N is per erasure set, not cluster-wide
Only in very small clusters (e.g., 4 drives total) do these coincide.
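Placement is deterministic: the object name is hashed to pick its one erasure set. A simplified sketch of the idea (MinIO uses a keyed SipHash-based scheme internally; the FNV hash below is a stand-in purely for illustration):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// erasureSetFor maps an object name to exactly one erasure set.
// Illustrative only: any stable hash mod setCount shows the idea.
func erasureSetFor(object string, setCount int) int {
	h := fnv.New32a()
	h.Write([]byte(object))
	return int(h.Sum32() % uint32(setCount))
}

func main() {
	for _, obj := range []string{"bucket/a.txt", "bucket/b.txt", "bucket/videos/c.mp4"} {
		fmt.Printf("%-22s → erasure set %d\n", obj, erasureSetFor(obj, 3))
	}
}
```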
3. How MinIO Chooses Erasure Set Size #
Erasure set size is auto-selected, not user-configurable.
MinIO considers together:
- Number of nodes
- Drives per node
- Total number of drives
It then chooses a bounded, conservative size, typically:
4, 8, or 16
Key consequences:
- Large clusters (e.g., 100 drives) are partitioned into many erasure sets
- MinIO will never create a 100-drive erasure set
- Existing erasure sets are never resized
- Adding drives creates new sets (server pools)
This design bounds failure blast radius and recovery cost.
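As a rough illustration of the idea (a deliberately simplified heuristic, not MinIO's actual selection algorithm, which also weighs node count and drives per node):

```go
package main

import "fmt"

// pickSetSize is a simplified stand-in for MinIO's selection logic:
// choose the largest bounded candidate that evenly divides the total
// drive count, so the cluster partitions into whole erasure sets.
func pickSetSize(totalDrives int) int {
	for _, size := range []int{16, 8, 4} { // conservative candidates
		if totalDrives%size == 0 {
			return size
		}
	}
	return 0 // no valid partition with these candidates
}

func main() {
	fmt.Println(pickSetSize(100)) // 4 → 25 sets of 4, never one 100-drive set
	fmt.Println(pickSetSize(64))  // 16 → 4 sets of 16
}
```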
4. Quorums: Reads and Writes Are Different #
MinIO uses two different quorums.
Read quorum (data quorum) #
read quorum = floor(N / 2) (data shards)
Reads succeed as long as MinIO can gather enough shards to reconstruct the object.
Example (N=4):
- Read quorum = 2
- Reads tolerate up to 2 unavailable drives
Example (N=9):
- Read quorum = 4
- Reads tolerate up to 5 unavailable drives
Write quorum #
write quorum = floor(N / 2) + 1
Writes require a strict majority to commit safely.
Example (N=4):
- Write quorum = 3
- Writes tolerate up to 1 unavailable drive
Example (N=9):
- Write quorum = 5
- Writes tolerate up to 4 unavailable drives
This prevents:
- Split-brain object versions
- Unhealable writes
- Partial visibility anomalies
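The quorum arithmetic in code, a sketch assuming the default floor/ceil layout above; it reproduces the tolerance numbers from both examples:

```go
package main

import "fmt"

// Quorum rules from above.
func readQuorum(n int) int  { return n / 2 }   // floor(N/2): enough shards to decode
func writeQuorum(n int) int { return n/2 + 1 } // floor(N/2) + 1: strict majority

func main() {
	for _, n := range []int{4, 9} {
		for down := 0; down <= n; down++ {
			up := n - down
			fmt.Printf("N=%d, %d drive(s) down → reads: %v, writes: %v\n",
				n, down, up >= readQuorum(n), up >= writeQuorum(n))
		}
	}
}
```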
5. What Happens When a Drive Becomes Full #
A full drive is treated as failed #
When a drive hits ENOSPC:
- It is marked unavailable
- It is excluded from both reads and writes
- MinIO does not treat it as “readable but not writable”
Reason:
- Metadata updates and fsync may fail
- Quorum math requires binary participation
- Predictable failure behavior is prioritized
Operationally:
A full drive is logically dead until space is available again.
6. Read Behavior Under Disk Full #
- Missing shards are treated as absent
- Reads reconstruct from parity
- Reads succeed if read quorum exists
Example (N=4):
| Unavailable drives | Reads |
|---|---|
| 1 | ✅ |
| 2 | ✅ |
| 3 | ❌ |
7. Write Behavior and Degraded Writes #
Writes require write quorum.
Example (N=4):
| Unavailable drives | Writes |
|---|---|
| 1 | ✅ (degraded) |
| 2 | ❌ |
| 3 | ❌ |
What “degraded write” means #
- All data shards are written
- Some parity shards are missing
- Object is readable
- Object is temporarily under-protected
MinIO allows degraded writes only while safety guarantees still hold.
8. Recovery Is Explicit: Rebalance and Heal #
Nothing happens automatically.
Rebalance (capacity & placement) #
mc admin rebalance start <alias>
- Makes new drives eligible for writes
- Redistributes objects across erasure sets
- Driven by capacity imbalance, not degradation
Heal (durability & parity) #
mc admin heal -r <alias>
- Reconstructs missing shards
- Repairs objects written in degraded mode
- Requires available writable space
Correct order:
rebalance → heal
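For example, after adding a new server pool (subcommands as documented for recent mc releases; verify against your mc version):
mc admin rebalance start <alias>
mc admin rebalance status <alias>
mc admin heal -r <alias>
Waiting for the rebalance to complete before healing ensures the heal has writable space for reconstructed shards.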
9. Multiple Degraded Erasure Sets: No Global Priority #
When multiple erasure sets are degraded and new drives are added:
- MinIO does not choose “which erasure set to recover first”
- Recovery is object-scoped, not set-scoped
- Rebalance and heal both operate over objects
Think:
MinIO heals objects; erasure sets recover implicitly as their objects are repaired.
10. Yes, Rebalance and Heal Scan Objects — Here’s Why That Works #
At first glance, “scanning all objects” sounds infeasible. It works because MinIO does not do a naïve scan.
10.1 Metadata-first, not data-first #
- Scans object metadata, not payload
- Metadata is tiny compared to data
- Healthy objects are skipped early
Result:
Most objects incur near-zero data I/O.
10.2 Object independence (critical) #
Each object is:
- Immutable
- Versioned
- An independent EC unit
There is:
- No global stripe table
- No block-group rebuild
- No set-wide dependency
This enables massive parallelism.
10.3 Distributed scanning #
- Each node scans only its own disks and its own metadata
- No central scanner
- No global coordinator bottleneck
Scan throughput scales with cluster size.
10.4 Incremental, resumable progress #
Rebalance and heal are:
- Checkpointed
- Pause/resume safe
- Long-lived background jobs
If interrupted:
- Progress is not lost
- No restart-from-zero penalty
10.5 Aggressive throttling and yielding #
Background operations:
- Yield to foreground reads/writes
- Are rate-limited (IOPS, bandwidth, CPU)
- Never block client I/O
Recovery is slow by design, but safe.
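Putting 10.1–10.5 together, the control flow looks roughly like the sketch below. This is a hypothetical illustration, not MinIO's actual internals; listLocalMetadata, healObject, saveCheckpoint, and foregroundBusy are invented names:

```go
package main

import "time"

// Hypothetical types and helpers, named purely for illustration.
type objectMeta struct {
	name    string
	healthy bool
}

func listLocalMetadata(fromCheckpoint string) []objectMeta { return nil }
func healObject(name string) error                         { return nil }
func saveCheckpoint(name string)                           {}
func foregroundBusy() bool                                 { return false }

// scanAndHeal sketches the metadata-first, checkpointed, throttled
// loop described in 10.1–10.5. Each node runs it over its own disks only.
func scanAndHeal(checkpoint string) {
	for _, meta := range listLocalMetadata(checkpoint) {
		// Yield to foreground client traffic before doing any work (10.5).
		for foregroundBusy() {
			time.Sleep(100 * time.Millisecond)
		}
		// Healthy objects are skipped with near-zero data I/O (10.1).
		if meta.healthy {
			saveCheckpoint(meta.name)
			continue
		}
		// Data is read only for degraded objects (10.6).
		if err := healObject(meta.name); err == nil {
			// Progress survives restarts (10.4).
			saveCheckpoint(meta.name)
		}
	}
}

func main() { scanAndHeal("") }
```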
10.6 Selective data movement #
- Rebalance moves only objects that must move
- Heal reads data only for degraded objects
Data I/O scales with damage and imbalance, not with cluster size.
10.7 Small erasure sets limit blast radius #
Because each object belongs to only one erasure set:
- Damage is localized
- Recovery touches only affected subsets
- Most of the cluster remains untouched
This is a key reason MinIO avoids large erasure sets.
11. What MinIO Deliberately Avoids #
MinIO does not:
- Auto-rebalance on capacity change
- Auto-heal synchronously
- Perform full-speed cluster scans
- Rebuild disks as a single unit
All of these cause rebuild storms and latency collapse in other systems.
12. The Correct Mental Model #
Do not think:
“MinIO repairs disks or erasure sets.”
Think instead:
“MinIO runs a long-lived, throttled, distributed map-reduce over object metadata.”
This is why:
- Large-scale recovery is feasible
- Foreground traffic remains stable
- Operator intent is explicit
13. Final Takeaways #
- Erasure coding is per erasure set, not per cluster
- Erasure sets are small, fixed, and conservative
- Reads and writes use different quorums
- A full drive is treated as failed
- Degraded writes are allowed—but bounded
- Rebalance and heal are explicit
- Recovery is object-scoped, metadata-first, incremental
One-sentence summary #
MinIO trades automatic, opaque recovery for explicit, predictable, object-driven convergence—and that is exactly why it scales safely.