A memo for Linux I/O and Filesystem
November 27, 2025
A memo for Linux I/O and Filesystem, collected from AI chat threads.
This is an extended memo for this post, consolidating knowledge about:
- VFS & Filesystem Delegation
- File descriptors & open file table
- stdio / unbuffered / direct I/O
- Page Cache, Writeback
- fsync vs fdatasync (accurate across ext4, XFS, Btrfs, ZFS)
- open flags
- Pipes, dup2
- fork, clone, exec
- splice / zero-copy design
Designed as a reference entry for systems and storage engine development.
1. VFS & Filesystem Delegation #
Linux I/O has a layered architecture:
Application → syscalls → VFS → filesystem → block layer → disk
1.1 VFS Responsibilities #
VFS handles:
- FD table and struct file
- Page cache management
- Generic read/write dispatch
- Permission checking
- Pipes and non-regular file types
- Dispatch to filesystem-specific callbacks
1.2 Filesystem Responsibilities #
The filesystem implements:
- inode_operations, file_operations, address_space_operations
- Block/extent mapping
- Metadata layout
- Journaling or CoW mechanisms
1.3 Example — How a Read Works #
When you call read(fd, buf, len):
read()
↓
vfs_read()
↓
file->f_op->read_iter() (filesystem-specific)
↓
generic_file_read_iter() (common)
↓
page cache
↓ page miss
fs->aops->readpage() → disk (renamed read_folio() in newer kernels)
Example Code #
int fd = open("data.txt", O_RDONLY);
char buf[128];
ssize_t n = read(fd, buf, sizeof(buf));
close(fd);
2. File Descriptors, Open File Table, and Inodes #
Linux tracks file state in three layers:
Process FD Table → Open File Description (struct file) → Inode
Example — One FD #
int fd = open("file.txt", O_RDONLY);
// fd 3 → struct file → inode #123
Example — dup shares offset #
int a = open("file.txt", O_RDWR);
int b = dup(a);
read(a, buf, 10); // offset = 10
read(b, buf, 10); // offset = 20 (shared)
Example — two opens → two offsets #
int a = open("file.txt", O_RDWR);
int b = open("file.txt", O_RDWR);
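The shared-versus-independent offset behavior can be verified with lseek(); a minimal sketch (the file name offset_demo.tmp is illustrative):

```c
#include <fcntl.h>
#include <unistd.h>

// After reading 10 bytes on one descriptor, report the other's offset.
// dup()ed fds share one open file description (one offset);
// two independent open()s each get their own.
int offset_demo(off_t *shared, off_t *independent) {
    const char *path = "offset_demo.tmp";
    int w = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (w < 0) return -1;
    write(w, "0123456789ABCDEFGHIJ", 20);
    close(w);

    char buf[10];

    int a = open(path, O_RDONLY);
    int b = dup(a);                       // same open file description
    read(a, buf, 10);
    *shared = lseek(b, 0, SEEK_CUR);      // 10: b's offset moved too

    int c = open(path, O_RDONLY);
    int d = open(path, O_RDONLY);         // separate file description
    read(c, buf, 10);
    *independent = lseek(d, 0, SEEK_CUR); // 0: d is unaffected

    close(a); close(b); close(c); close(d);
    unlink(path);
    return 0;
}
```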
3. Stdio vs Unbuffered vs Direct I/O #
3.1 Stdio (FILE*) #
Buffered in user space.
FILE* f = fopen("out.txt", "w");
fprintf(f, "hello\n"); // buffered
fclose(f); // flush to kernel
Unsuitable for DBMS durability.
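Stdio output can still be made durable by flushing the user-space buffer and then fsyncing the underlying descriptor; a sketch of that pattern (the helper name is ours):

```c
#include <stdio.h>
#include <unistd.h>

// fflush() moves the stdio buffer into the kernel page cache;
// fsync() on the underlying fd then pushes it to stable storage.
int durable_fprintf(const char *path, const char *msg) {
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%s", msg);                     // lands in the stdio buffer
    if (fflush(f) != 0) { fclose(f); return -1; }        // buffer → page cache
    if (fsync(fileno(f)) != 0) { fclose(f); return -1; } // page cache → disk
    return fclose(f);
}
```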
3.2 POSIX Unbuffered I/O (read/write) #
Data goes directly to kernel page cache.
int fd = open("out", O_WRONLY | O_CREAT, 0644);
write(fd, "abc", 3);
fsync(fd);
3.3 Direct I/O (O_DIRECT) #
Bypasses page cache → alignment required.
void* buf;
posix_memalign(&buf, 4096, 4096);   // O_DIRECT requires aligned buffers
memset(buf, 0, 4096);               // fill before writing
int fd = open("raw.dat", O_DIRECT | O_RDWR | O_CREAT, 0644);
write(fd, buf, 4096);               // length must be block-aligned too
close(fd);
Used in DB engines.
4. Page Cache, Dirty Pages, and Writeback #
Buffered writes behave like:
write() → page cache (dirty) → return
later: writeback thread flushes to disk
Durability requires calling fsync or fdatasync.
Example #
int fd = open("log", O_WRONLY | O_CREAT, 0644);
write(fd, "entry", 5);
fsync(fd); // durable
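Writeback can also be initiated explicitly for a byte range with Linux's sync_file_range(); a minimal sketch (the helper name is ours, and note this call is only an I/O hint, not a durability barrier, since it flushes no metadata):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Ask the kernel to start writing back a byte range of dirty pages.
// This initiates I/O only; unlike fsync/fdatasync it flushes no
// metadata and by itself gives no durability guarantee.
int start_writeback(int fd, off_t off, off_t len) {
    return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}
```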
5. fsync vs fdatasync #
fsync() and fdatasync() both ensure durability, but they differ in how much metadata must reach stable storage and how much journaling or CoW flushing they trigger. The behavior is broken down below per filesystem family (ext4/XFS vs Btrfs/ZFS).
5.1 What fdatasync() Guarantees #
fdatasync(fd) ensures minimal metadata durability:
- All file data pages are flushed
- Only the metadata required to read the file back is durable:
  - file size (i_size)
  - block pointers / extents
fdatasync does not flush:
- timestamps
- uid/gid/mode
- xattrs
- directory entries
- parent directory inode metadata
- quota/security metadata
Result on ext4/XFS: #
- Typically one journal transaction + one commit
Example #
int fd = open("log", O_WRONLY | O_CREAT, 0644);
write(fd, "hello", 5);
fdatasync(fd);
5.2 What fsync() Must Guarantee #
fsync(fd) must persist the file's full metadata:
- File data
- All inode metadata (ctime, mtime, uid/gid/mode)
- Extended attributes
- Directory entry
- Parent directory inode metadata (mtime)
- Security/quota metadata
This leads to:
- More metadata updates
- Multiple journal transactions (ext4/XFS)
- More I/O
- CoW tree flushes (Btrfs/ZFS)
Example #
int fd = open("newfile", O_WRONLY | O_CREAT, 0644);
write(fd, "data", 4);
fsync(fd);
5.3 Does fsync write the inode block? #
ext4 (ordered/journal mode) and XFS #
fsync/fdatasync do NOT need to immediately write the inode table block.
Inode updates are recorded in the journal, and the journal commit record
ensures durability. The inode table block is written later by normal
writeback.
Meaning:
- Metadata durability comes from the journal transaction
- The inode table block is not flushed during fsync
- It is flushed lazily afterward
Btrfs and ZFS (Copy-on-Write filesystems) #
fsync MUST write the updated CoW blocks durably. Btrfs keeps a dedicated
fsync log tree and ZFS an intent log (ZIL), but in both cases fsync
physically writes the new blocks (log records or tree nodes) rather than
relying on a metadata journal whose contents are replayed lazily.
Thus:
- CoW filesystems always write metadata/log blocks during fsync
- There is no lazily-flushed inode table block as on ext4/XFS
5.4 Cost Difference #
Why fsync Is More Expensive
fdatasync: #
- Writes data blocks
- Writes minimal inode updates
- Issues one journal transaction
- No directory flush
- No parent directory flush
fsync: #
- Writes data blocks
- Writes full inode metadata
- Writes directory entry
- Writes parent directory metadata
- Issues multiple journal transactions (ext4/XFS)
- Writes CoW metadata nodes (Btrfs/ZFS)
fsync = multiple metadata writes + multiple journal txns
fdatasync = minimal metadata + one journal txn
5.5 Summary #
fdatasync: Durable data + essential inode fields. Minimal metadata. One journal txn.
fsync: Durable data + full metadata + directory entry + parent directory metadata.
Multiple journal txns (ext4/XFS) or metadata block flushes (Btrfs/ZFS).
ext4/XFS: Inode block not flushed during fsync; durability comes from journal.
Btrfs/ZFS: Metadata blocks must be physically written during fsync.
The real cost difference comes from the journaling or the metadata-block writes.
6. Important open(2) Flags #
| Flag | Meaning |
|---|---|
| O_RDONLY, O_WRONLY, O_RDWR | Access mode |
| O_APPEND | Append writes |
| O_CREAT | Create file |
| O_TRUNC | Truncate file |
| O_CLOEXEC | Auto-close on exec |
| O_DIRECT | Bypass page cache |
| O_SYNC | fsync-on-every-write |
| O_DSYNC | fdatasync-on-every-write |
Example #
int fd = open("data", O_WRONLY | O_CREAT | O_DSYNC, 0644);
write(fd, buf, len);
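O_APPEND's seek-to-EOF-before-every-write behavior can be demonstrated directly; a small sketch (the helper and file names are ours):

```c
#include <fcntl.h>
#include <unistd.h>

// With O_APPEND, each write atomically repositions to EOF first,
// so an explicit lseek() cannot make a write land mid-file.
off_t append_twice(const char *path) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0644);
    if (fd < 0) return -1;
    write(fd, "one\n", 4);
    lseek(fd, 0, SEEK_SET);        // ignored by the next write
    write(fd, "two\n", 4);         // still lands at EOF
    off_t end = lseek(fd, 0, SEEK_END);
    close(fd);
    unlink(path);
    return end;                    // 8: both records were appended
}
```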
7. Pipes — Kernel FIFO With Backpressure #
- ~64KB kernel buffer
- Blocking producer/consumer semantics
- No file offsets
- Used as a synchronization mechanism
Example #
int p[2];
pipe(p);
write(p[1], "ABC", 3);
char buf[3];
read(p[0], buf, 3);
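On Linux, the pipe buffer capacity behind this backpressure can be queried with fcntl(F_GETPIPE_SZ); a sketch (the helper name is ours):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Report a pipe's kernel buffer capacity (typically 64 KiB by default).
// A writer blocks once this buffer fills until a reader drains it.
long pipe_capacity(void) {
    int p[2];
    if (pipe(p) != 0) return -1;
    long cap = fcntl(p[1], F_GETPIPE_SZ);   // Linux-specific fcntl
    close(p[0]);
    close(p[1]);
    return cap;
}
```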
8. dup, dup2, dup3 #
Redirect stdout → pipe #
int p[2];
pipe(p);
dup2(p[1], STDOUT_FILENO);
execlp("echo", "echo", "hi", NULL);
9. fork, clone, exec #
fork() #
- Creates new process
- Memory is CoW
- File descriptions are shared
if (fork() == 0)
printf("child\n");
clone() — threads #
char *stack = malloc(STACK_SIZE);
clone(start, stack + STACK_SIZE,   // stack grows down: pass the top
      CLONE_VM | CLONE_THREAD | CLONE_FILES | CLONE_SIGHAND,
      arg);
Threads share the FD table.
exec() #
Replaces the process image.
if (fork() == 0)
execl("/bin/ls", "ls", "-l", NULL);
wait(NULL);
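The fork + exec + wait pattern above can be wrapped into a small helper that reports the child's exit status; a sketch (the helper name run is ours, and the test assumes /bin/true exists):

```c
#include <sys/wait.h>
#include <unistd.h>

// Fork, exec a program in the child, and wait for its exit status.
// Returns the exit code, or -1 on fork/wait failure or abnormal exit.
int run(const char *path, char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        execv(path, argv);   // only returns on failure
        _exit(127);          // conventional "exec failed" status
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```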
10. splice / vmsplice / tee — Kernel Zero-Copy #
10.1 splice() #
Zero-copy movement between FDs.
splice(file_fd, NULL, pipefd[1], NULL, 65536, 0);
splice(pipefd[0], NULL, sock_fd, NULL, 65536, 0);
10.2 tee() #
Duplicate pipe data zero-copy.
tee(p1[0], p2[1], 65536, 0);
10.3 vmsplice() #
Pin user memory into pipe buffer.
struct iovec iov = { buf, len };
vmsplice(p[1], &iov, 1, 0);
11. Zero-Copy Pipeline: file → pipe → socket #
int pipefd[2];
pipe(pipefd);
while (remaining > 0) {
    ssize_t n = splice(file_fd, NULL, pipefd[1], NULL, 65536, 0);
    if (n <= 0) break;               // 0 = EOF, -1 = error
    ssize_t m = splice(pipefd[0], NULL, sock_fd, NULL, n, 0);
    if (m <= 0) break;
    remaining -= m;
}
Zero-copy path:
Disk → page cache → pipe → socket → network
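A self-contained variant of this pipeline, with the socket replaced by a regular-file destination so it can run anywhere (the helper name is ours; splice() requires that one side of each call be a pipe):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Copy up to `len` bytes from in_fd to out_fd through a pipe using
// splice(). Data moves between kernel buffers without entering
// user space. Returns the number of bytes copied, or -1 on error.
ssize_t splice_copy(int in_fd, int out_fd, size_t len) {
    int p[2];
    if (pipe(p) != 0) return -1;
    ssize_t total = 0;
    while ((size_t)total < len) {
        ssize_t n = splice(in_fd, NULL, p[1], NULL, 65536, 0);
        if (n <= 0) break;                         // EOF or error
        ssize_t m = splice(p[0], NULL, out_fd, NULL, n, 0);
        if (m <= 0) break;
        total += m;
    }
    close(p[0]);
    close(p[1]);
    return total;
}
```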
12. Summary #
This document covers:
- VFS architecture and filesystem delegation
- FD tables, file descriptions, inode behavior
- Stdio vs POSIX vs Direct I/O
- Page cache and writeback
- fsync/fdatasync semantics
- Pipes, dup2, fork/clone/exec
- splice/vmsplice/tee
- Inline practical examples throughout
A reference entry for storage engine and systems engineering.