Understanding MinIO Erasure Coding: Erasure Sets, Quorums, Degraded States, and Recovery at Scale

January 22, 2026
storage, MinIO, erasure-coding, distributed-systems
This post explains how MinIO implements erasure coding (EC) in practice—covering erasure sets, shard layout, read/write quorums, disk-full behavior, degraded operation, recovery (rebalance + heal), and how MinIO can safely scan very large object namespaces without collapsing under load.

Summarized by ChatGPT from a series of Q&As with ChatGPT.


0. Why Erasure Coding? MinIO’s Answer to Data Durability and Availability #

Unlike traditional storage systems that rely on RAID or simple replication for data protection, MinIO implements erasure coding as its core strategy for achieving both intra-cluster data durability and high availability.

Traditional Approaches: Trade-offs and Limitations #

RAID (e.g., RAID6):

  • [+] Simple, hardware-based, well-understood
  • [+] Low CPU overhead (hardware controllers)
  • [-] Protects against only 2 drive failures
  • [-] Rebuilds operate at the volume level (high downtime)
  • [-] Cannot leverage distributed compute for recovery
  • [-] Single controller bottleneck

Replication (e.g., 3x copies):

  • [+] Simple to implement and reason about
  • [+] Fast reads (can read from any copy)
  • [+] No reconstruction overhead
  • [-] High storage overhead: 3x replication = 200% overhead
    • To tolerate 2 failures, you need 3 copies
    • 1TB of user data requires 3TB of raw storage
    • Usable capacity = 33% (1TB usable / 3TB raw)
  • [-] Linear cost increase with redundancy
  • [-] No flexible durability-vs-capacity trade-offs

Erasure Coding Benefits #

How Reed-Solomon Erasure Coding Works

Example: 2 data shards + 2 parity shards (N=4)

Encoding (Write):

  • Assume we have a 1TB file
  • Split it into 2 equal data chunks (D1 = 0.5TB, D2 = 0.5TB)
  • Use Reed-Solomon algorithm to compute 2 parity chunks (P1, P2)
  • Parity chunks are mathematical combinations, e.g.:
    • P1 = D1 + D2
    • P2 = D1 + 2×D2 (using different coefficients)
  • Store: D1, D2, P1, P2 (each on a different drive)

Decoding (Read with failures):

  • All drives healthy: Read D1 + D2 directly → reconstruct file ✓
  • Lose D1: Have D2, P1, P2. Solve: D1 = P1 - D2 → reconstruct file ✓
  • Lose D1 and D2 (both data!): Have P1, P2. Solve system of equations to recover D1 and D2 ✓
  • Lose any 2 drives: With 2 parity shards, you can always solve for any 2 missing shards

Simplified overview: Parity shards aren’t just “copies”—they’re mathematical combinations that allow us to solve for missing data algebraically. We need at least K surviving shards (any combination of data and parity), where K is the number of data shards, to reconstruct. This is why 2 data + 2 parity can tolerate 2 failures with 100% overhead, vs. 3x replication’s 200% overhead for the same fault tolerance.
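
To make the shard arithmetic above concrete, here is a small Go sketch using the github.com/klauspost/reedsolomon library (a common Reed–Solomon implementation in the Go ecosystem, which, as far as I know, MinIO itself builds on) with the same 2 data + 2 parity layout. The payload and the simulated shard loss are made up for illustration; this is not MinIO’s internal code path.

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 2 data + 2 parity shards, matching the example above.
	enc, err := reedsolomon.New(2, 2)
	if err != nil {
		panic(err)
	}

	original := []byte("pretend this is a much larger object payload")

	// Split the payload into 2 data shards (D1, D2), then compute
	// the 2 parity shards (P1, P2).
	shards, err := enc.Split(original)
	if err != nil {
		panic(err)
	}
	if err := enc.Encode(shards); err != nil {
		panic(err)
	}

	// Simulate losing D1 and D2 ("both data!") by nil-ing them out.
	shards[0], shards[1] = nil, nil

	// Reconstruct the missing shards from the surviving parity shards.
	if err := enc.Reconstruct(shards); err != nil {
		panic(err)
	}

	// Join the data shards back into the original payload.
	var buf bytes.Buffer
	if err := enc.Join(&buf, shards, len(original)); err != nil {
		panic(err)
	}
	fmt.Println("recovered:", buf.String())
}
```

Note that the decoder does not need to know in advance which shards will be lost; any 2 of the 4 shards are sufficient.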

Erasure coding provides:

  1. Superior Fault Tolerance: Lose up to N/2 drives (vs. RAID6’s 2 drives)
  2. Lower Storage Overhead: ~100% overhead with N/2 parity (default configuration)
    • To tolerate 2 failures with 4 drives: 2 data + 2 parity
    • 1TB of user data requires 2TB of raw storage
    • Usable capacity = 50% (1TB usable / 2TB raw)
    • Storage overhead = 100% (1TB extra / 1TB usable)
    • Compare to 3x replication: 200% overhead for similar fault tolerance
  3. Object-Level Granularity: Each object is independently erasure-coded, enabling:
    • Incremental, per-object healing
    • No volume-level rebuild storms
    • Parallel recovery across the cluster
  4. Bit Rot Protection: Built-in checksums (HighwayHash) detect silent data corruption (see the sketch just after this list)
  5. Distributed Recovery: No single controller—each node participates in healing
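
For the bit-rot point in particular (item 4 above), the idea is: compute a keyed checksum per shard at write time, then recompute and compare on every read. Below is a minimal sketch of that idea using the github.com/minio/highwayhash Go package; the zero key and in-memory shard are placeholders, not MinIO’s actual key handling or on-disk format.

```go
package main

import (
	"fmt"

	"github.com/minio/highwayhash"
)

func main() {
	// HighwayHash requires a 32-byte key; an all-zero key is used here
	// purely for illustration.
	key := make([]byte, 32)

	shard := []byte("contents of one erasure-coded shard")

	// Checksum computed at write time and stored alongside the shard.
	stored := highwayhash.Sum64(shard, key)

	// Later, on read: recompute and compare to detect bit rot.
	recomputed := highwayhash.Sum64(shard, key)
	if stored != recomputed {
		fmt.Println("bit rot detected: reconstruct this shard from parity")
		return
	}
	fmt.Println("shard checksum OK")
}
```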

MinIO’s Design Choice: Explicit Over Automatic #

MinIO deliberately chooses explicit, operator-driven recovery over automatic background processes. This design decision ensures:

  • Predictable performance (no surprise rebuild storms)
  • Stable foreground I/O during recovery
  • Clear operational semantics
  • Bounded resource consumption

This is not a limitation—it’s a design principle for production systems at scale.


1. Erasure Coding Basics (What MinIO Actually Implements) #

MinIO uses Reed–Solomon erasure coding with a default rule per erasure set:

data shards   = floor(N / 2)
parity shards = ceil(N / 2)

Where N is the number of drives in one erasure set (can be 2 to 16).

Examples:

| N  | Layout            | Max Tolerable Failures |
|----|-------------------|------------------------|
| 4  | 2 data + 2 parity | 2 drives               |
| 8  | 4 data + 4 parity | 4 drives               |
| 9  | 4 data + 5 parity | 5 drives               |
| 12 | 6 data + 6 parity | 6 drives               |
| 16 | 8 data + 8 parity | 8 drives               |

Note: The parity level can be customized using storage classes, but the default configuration is recommended for best protection.
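
As a quick cross-check of the table above, here is a tiny, hypothetical helper (not MinIO source code) that applies the default rule and reprints the layouts:

```go
package main

import "fmt"

// layout applies the default rule described above: data = floor(N/2),
// parity = ceil(N/2). Illustrative helper only, not MinIO code.
func layout(n int) (data, parity int) {
	data = n / 2
	parity = n - data
	return
}

func main() {
	for _, n := range []int{4, 8, 9, 12, 16} {
		d, p := layout(n)
		fmt.Printf("N=%2d -> %d data + %d parity (reads survive %d lost drives)\n", n, d, p, p)
	}
}
```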

Key invariants:

  • Each shard (data or parity) is placed on a different drive
  • Default layout provides maximum protection (ceil(N/2) failures tolerable)
  • Parity ≥ data by design (safety-first)

2. Erasure Sets vs the Cluster (The Most Important Distinction) #

A MinIO cluster is not one giant EC system.

Instead:

Cluster
├── Erasure Set 1 (N=8)
├── Erasure Set 2 (N=8)
├── Erasure Set 3 (N=8)
└── ...

Properties:

  • Objects belong to exactly one erasure set
  • Objects never span erasure sets
  • Failure, quorum, heal, and rebalance are all per erasure set
  • N is per erasure set, not cluster-wide

Only in very small clusters (e.g., 4 drives total) do the cluster and a single erasure set coincide.


3. How MinIO Chooses Erasure Set Size #

Erasure set size is auto-selected, not user-configurable.

MinIO considers the following factors together:

  1. Number of nodes
  2. Drives per node
  3. Total number of drives

It then chooses a bounded, conservative size, typically:

4, 8, or 16

Key consequences:

  • Large clusters (e.g., 100 drives) are partitioned into many erasure sets
  • MinIO will never create a 100-drive erasure set
  • Existing erasure sets are never resized
  • Adding drives creates new sets (server pools)

This design bounds failure blast radius and recovery cost.
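
To make the “many small sets” behavior concrete, the toy sketch below partitions a drive count using only the typical sizes listed above (16, 8, 4). MinIO’s real selection also weighs node count and drives per node and is not user-configurable, so treat this purely as an illustration of the principle, not the actual algorithm.

```go
package main

import "fmt"

// toySetSize picks an erasure-set size for totalDrives by trying the
// typical sizes mentioned in this post, largest first. Deliberately
// simplified; not MinIO's actual selection logic.
func toySetSize(totalDrives int) (setSize, numSets int) {
	for _, s := range []int{16, 8, 4} {
		if totalDrives%s == 0 {
			return s, totalDrives / s
		}
	}
	return 0, 0 // no clean partition among the toy candidates
}

func main() {
	size, count := toySetSize(128)
	// Prints: 128 drives -> 8 erasure sets of 16 drives each
	fmt.Printf("128 drives -> %d erasure sets of %d drives each\n", count, size)
}
```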


4. Quorums: Reads and Writes Are Different #

MinIO uses two different quorums.

Read quorum (data quorum) #

read quorum = floor(N / 2)   (equal to the number of data shards)

Reads succeed as long as MinIO can gather enough shards to reconstruct the object.

Example (N=4):

  • Read quorum = 2
  • Reads tolerate up to 2 unavailable drives

Example (N=9):

  • Read quorum = 4
  • Reads tolerate up to 5 unavailable drives

Write quorum #

write quorum = floor(N / 2) + 1

Writes require a strict majority to commit safely.

Example (N=4):

  • Write quorum = 3
  • Writes tolerate up to 1 unavailable drive

Example (N=9):

  • Write quorum = 5
  • Writes tolerate up to 4 unavailable drives

This prevents:

  • Split-brain object versions
  • Unhealable writes
  • Partial visibility anomalies
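
The quorum arithmetic is simple enough to sketch. The hypothetical helper below mirrors the formulas in this section and prints how many unavailable drives each quorum tolerates:

```go
package main

import "fmt"

// quorums mirrors the formulas above: read quorum = floor(N/2) (the data
// shard count) and write quorum = floor(N/2) + 1. Illustrative only.
func quorums(n int) (readQ, writeQ int) {
	readQ = n / 2
	writeQ = n/2 + 1
	return
}

func main() {
	for _, n := range []int{4, 9} {
		r, w := quorums(n)
		fmt.Printf("N=%d: read quorum=%d (tolerates %d down), write quorum=%d (tolerates %d down)\n",
			n, r, n-r, w, n-w)
	}
	// N=4: read quorum=2 (tolerates 2 down), write quorum=3 (tolerates 1 down)
	// N=9: read quorum=4 (tolerates 5 down), write quorum=5 (tolerates 4 down)
}
```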

5. What Happens When a Drive Becomes Full #

A full drive is treated as failed #

When a drive hits ENOSPC:

  • It is marked unavailable
  • It is excluded from both reads and writes
  • MinIO does not treat it as “readable but not writable”

Reason:

  • Metadata updates and fsync may fail
  • Quorum math requires binary participation
  • Predictable failure behavior is prioritized

Operationally:

A full drive is logically dead until space is available again.


6. Read Behavior Under Disk Full #

  • Missing shards are treated as absent
  • Reads reconstruct from parity
  • Reads succeed if read quorum exists

Example (N=4):

| Unavailable drives | Reads |
|--------------------|-------|
| 1                  | ✅    |
| 2                  | ✅    |
| 3                  | ❌    |

7. Write Behavior and Degraded Writes #

Writes require write quorum.

Example (N=4):

| Unavailable drives | Writes        |
|--------------------|---------------|
| 1                  | ✅ (degraded) |
| 2                  | ❌            |
| 3                  | ❌            |

What “degraded write” means #

  • All data shards are written
  • Some parity shards are missing
  • Object is readable
  • Object is temporarily under-protected

MinIO allows degraded writes only while safety guarantees still hold.


8. Recovery Is Explicit: Rebalance and Heal #

Nothing happens automatically.

Rebalance (capacity & placement) #

mc admin rebalance start <alias>

  • Makes new drives eligible for writes
  • Redistributes objects across erasure sets
  • Driven by capacity imbalance, not degradation

Heal (durability & parity) #

mc admin heal -r <alias>

  • Reconstructs missing shards
  • Repairs objects written in degraded mode
  • Requires available writable space

Correct order:

rebalance → heal

9. Multiple Degraded Erasure Sets: No Global Priority #

When multiple erasure sets are degraded and new drives are added:

  • MinIO does not choose “which erasure set to recover first”
  • Recovery is object-scoped, not set-scoped
  • Rebalance and heal both operate over objects

Think:

MinIO heals objects; erasure sets recover implicitly as their objects are repaired.


10. Yes, Rebalance and Heal Scan Objects — Here’s Why That Works #

At first glance, “scanning all objects” sounds infeasible. It works because MinIO does not do a naïve scan.

10.1 Metadata-first, not data-first #

  • Scans object metadata, not payload
  • Metadata is tiny compared to data
  • Healthy objects are skipped early

Result:

Most objects incur near-zero data I/O.


10.2 Object independence (critical) #

Each object is:

  • Immutable
  • Versioned
  • An independent EC unit

There is:

  • No global stripe table
  • No block-group rebuild
  • No set-wide dependency

This enables massive parallelism.


10.3 Distributed scanning #

  • Each node scans only:
    • Its own disks
    • Its own metadata
  • No central scanner
  • No global coordinator bottleneck

Scan throughput scales with cluster size.


10.4 Incremental, resumable progress #

Rebalance and heal are:

  • Checkpointed
  • Pause/resume safe
  • Long-lived background jobs

If interrupted:

  • Progress is not lost
  • No restart-from-zero penalty

10.5 Aggressive throttling and yielding #

Background operations:

  • Yield to foreground reads/writes
  • Are rate-limited (IOPS, bandwidth, CPU)
  • Never block client I/O

Recovery is slow by design, but safe.


10.6 Selective data movement #

  • Rebalance moves only objects that must move
  • Heal reads data only for degraded objects

Data I/O scales with:

damage and imbalance, not cluster size.


10.7 Small erasure sets limit blast radius #

Because each object belongs to only one erasure set:

  • Damage is localized
  • Recovery touches only affected subsets
  • Most of the cluster remains untouched

This is a key reason MinIO avoids large erasure sets.


11. What MinIO Deliberately Avoids #

MinIO does not:

  • Auto-rebalance on capacity change
  • Auto-heal synchronously
  • Perform full-speed cluster scans
  • Rebuild disks as a single unit

All of these cause rebuild storms and latency collapse in other systems.


12. The Correct Mental Model #

Do not think:

“MinIO repairs disks or erasure sets.”

Think instead:

“MinIO runs a long-lived, throttled, distributed map-reduce over object metadata.”

This is why:

  • Large-scale recovery is feasible
  • Foreground traffic remains stable
  • Operator intent is explicit

13. Final Takeaways #

  • Erasure coding is per erasure set, not per cluster
  • Erasure sets are small, fixed, and conservative
  • Reads and writes use different quorums
  • A full drive is treated as failed
  • Degraded writes are allowed—but bounded
  • Rebalance and heal are explicit
  • Recovery is object-scoped, metadata-first, incremental

One-sentence summary #

MinIO trades automatic, opaque recovery for explicit, predictable, object-driven convergence—and that is exactly why it scales safely.