A memo for Linux I/O and Filesystem
November 27, 2025
A memo for Linux I/O and Filesystem, collected from AI chat threads.
This is an extended memo for this post, consolidating knowledge about:
- VFS & Filesystem Delegation
- File descriptors & open file table
- stdio / unbuffered / direct I/O
- Page Cache, Writeback
- fsync vs fdatasync (accurate across ext4, XFS, Btrfs, ZFS)
- open flags
- Pipes, dup2
- fork, clone, exec
- splice / zero-copy design
Designed as a reference entry for systems and storage engine development.
1. VFS & Filesystem Delegation #
Linux I/O has a layered architecture:
Application → syscalls → VFS → filesystem → block layer → disk
1.1 VFS Responsibilities #
VFS handles:
- FD table and struct file
- Page cache management
- Generic read/write dispatch
- Permission checking
- Pipes and non-regular file types
- Dispatch to filesystem-specific callbacks
1.2 Filesystem Responsibilities #
The filesystem implements:
- inode_operations, file_operations, address_space_operations
- Block/extent mapping
- Metadata layout
- Journaling or CoW mechanisms
1.3 Example — How a Read Works #
When you call read(fd, buf, len):
read()
↓
vfs_read()
↓
file->f_op->read_iter() (filesystem-specific)
↓
generic_file_read_iter() (common)
↓
page cache
↓ page miss
fs->aops->readpage() → disk (renamed read_folio() in newer kernels)
Example Code #
int fd = open("data.txt", O_RDONLY);
char buf[128];
ssize_t n = read(fd, buf, sizeof(buf));
close(fd);
2. File Descriptors, Open File Table, and Inodes #
Linux tracks file state in three layers:
Process FD Table → Open File Description (struct file) → Inode
Example — One FD #
int fd = open("file.txt", O_RDONLY);
// fd 3 → struct file → inode #123
Example — dup shares offset #
int a = open("file.txt", O_RDWR);
int b = dup(a);
read(a, buf, 10); // offset = 10
read(b, buf, 10); // offset = 20 (shared)
Example — two opens → two offsets #
int a = open("file.txt", O_RDWR);
int b = open("file.txt", O_RDWR);
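The shared-versus-independent offset behavior can be verified with lseek(); a minimal sketch (the file name offset_demo.tmp is illustrative):

```c
#include <fcntl.h>
#include <unistd.h>

// After reading 10 bytes on one descriptor, report the other's offset.
// dup()ed fds share one open file description (one offset);
// two independent open()s each get their own.
int offset_demo(off_t *shared, off_t *independent) {
    const char *path = "offset_demo.tmp";
    int w = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (w < 0) return -1;
    write(w, "0123456789ABCDEFGHIJ", 20);
    close(w);

    char buf[10];

    int a = open(path, O_RDONLY);
    int b = dup(a);                       // same open file description
    read(a, buf, 10);
    *shared = lseek(b, 0, SEEK_CUR);      // 10: b's offset moved too

    int c = open(path, O_RDONLY);
    int d = open(path, O_RDONLY);         // separate file description
    read(c, buf, 10);
    *independent = lseek(d, 0, SEEK_CUR); // 0: d is unaffected

    close(a); close(b); close(c); close(d);
    unlink(path);
    return 0;
}
```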
3. Stdio vs Unbuffered vs Direct I/O #
3.1 Stdio (FILE*) #
Buffered in user space.
FILE* f = fopen("out.txt", "w");
fprintf(f, "hello\n"); // buffered
fclose(f); // flush to kernel
Unsuitable for DBMS durability.
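Stdio output can still be made durable by flushing the user-space buffer and then fsyncing the underlying descriptor; a sketch of that pattern (the helper name is ours):

```c
#include <stdio.h>
#include <unistd.h>

// fflush() moves the stdio buffer into the kernel page cache;
// fsync() on the underlying fd then pushes it to stable storage.
int durable_fprintf(const char *path, const char *msg) {
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%s", msg);                     // lands in the stdio buffer
    if (fflush(f) != 0) { fclose(f); return -1; }        // buffer → page cache
    if (fsync(fileno(f)) != 0) { fclose(f); return -1; } // page cache → disk
    return fclose(f);
}
```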
3.2 POSIX Unbuffered I/O (read/write) #
Data goes directly to kernel page cache.
int fd = open("out", O_WRONLY | O_CREAT, 0644);
write(fd, "abc", 3);
fsync(fd);
3.3 Direct I/O (O_DIRECT) #
Bypasses page cache → alignment required.
void* buf;
posix_memalign(&buf, 4096, 4096);   // O_DIRECT requires aligned buffers
memset(buf, 0, 4096);               // fill before writing
int fd = open("raw.dat", O_DIRECT | O_RDWR | O_CREAT, 0644);
write(fd, buf, 4096);               // length must be block-aligned too
close(fd);
Used in DB engines.
4. Page Cache, Dirty Pages, and Writeback #
Buffered writes behave like:
write() → page cache (dirty) → return
later: writeback thread flushes to disk
Durability requires calling fsync or fdatasync.
Example #
int fd = open("log", O_WRONLY | O_CREAT, 0644);
write(fd, "entry", 5);
fsync(fd); // durable
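Writeback can also be initiated explicitly for a byte range with Linux's sync_file_range(); a minimal sketch (the helper name is ours, and note this call is only an I/O hint, not a durability barrier, since it flushes no metadata):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Ask the kernel to start writing back a byte range of dirty pages.
// This initiates I/O only; unlike fsync/fdatasync it flushes no
// metadata and by itself gives no durability guarantee.
int start_writeback(int fd, off_t off, off_t len) {
    return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}
```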
5. fsync vs fdatasync #
fsync() and fdatasync() both ensure durability, but they differ in how much metadata must reach stable storage and how much journaling or CoW flushing they trigger. The behavior is broken down below per filesystem family (ext4/XFS vs Btrfs/ZFS).
5.1 What fdatasync() Guarantees #
fdatasync(fd) ensures minimal metadata durability:
- All file data pages are flushed
- Only the metadata required to read the file back is durable:
  - file size (i_size)
  - block pointers / extents
fdatasync does not flush:
- timestamps
- uid/gid/mode
- xattrs
- directory entries
- parent directory inode metadata
- quota/security metadata
Result on ext4/XFS: #
- Typically one journal transaction + one commit
Example #
int fd = open("log", O_WRONLY | O_CREAT, 0644);
write(fd, "hello", 5);
fdatasync(fd);
5.2 What fsync() Must Guarantee #
fsync(fd) must persist the file's full metadata:
- File data
- All inode metadata (ctime, mtime, uid/gid/mode)
- Extended attributes
- Directory entry
- Parent directory inode metadata (mtime)
- Security/quota metadata
This leads to:
- More metadata updates
- Multiple journal transactions (ext4/XFS)
- More I/O
- CoW tree flushes (Btrfs/ZFS)
Example #
int fd = open("newfile", O_WRONLY | O_CREAT, 0644);
write(fd, "data", 4);
fsync(fd);
5.3 Does fsync write the inode block? #
ext4 (ordered/journal mode) and XFS #
fsync/fdatasync do NOT need to immediately write the inode table block.
Inode updates are recorded in the journal, and the journal commit record
ensures durability. The inode table block is written later by normal
writeback.
Meaning:
- Metadata durability comes from the journal transaction
- The inode table block is not flushed during fsync
- It is flushed lazily afterward
Btrfs and ZFS (Copy-on-Write filesystems) #
fsync MUST write the updated CoW blocks durably. Btrfs keeps a dedicated
fsync log tree and ZFS an intent log (ZIL), but in both cases fsync
physically writes the new blocks (log records or tree nodes) rather than
relying on a metadata journal whose contents are replayed lazily.
Thus:
- CoW filesystems always write metadata/log blocks during fsync
- There is no lazily-flushed inode table block as on ext4/XFS
5.4 Cost Difference #
Why fsync Is More Expensive
fdatasync: #
- Writes data blocks
- Writes minimal inode updates
- Issues one journal transaction
- No directory flush
- No parent directory flush
fsync: #
- Writes data blocks
- Writes full inode metadata
- Writes directory entry
- Writes parent directory metadata
- Issues multiple journal transactions (ext4/XFS)
- Writes CoW metadata nodes (Btrfs/ZFS)
fsync = multiple metadata writes + multiple journal txns
fdatasync = minimal metadata + one journal txn
5.5 Summary #
fdatasync: Durable data + essential inode fields. Minimal metadata. One journal txn.
fsync: Durable data + full metadata + directory entry + parent directory metadata.
Multiple journal txns (ext4/XFS) or metadata block flushes (Btrfs/ZFS).
ext4/XFS: Inode block not flushed during fsync; durability comes from journal.
Btrfs/ZFS: Metadata blocks must be physically written during fsync.
The real cost difference comes from the journaling or the metadata-block writes.
6. Important open(2) Flags #
| Flag | Meaning |
|---|---|
| O_RDONLY, O_WRONLY, O_RDWR | Access mode |
| O_APPEND | Append writes |
| O_CREAT | Create file |
| O_TRUNC | Truncate file |
| O_CLOEXEC | Auto-close on exec |
| O_DIRECT | Bypass page cache |
| O_SYNC | fsync-on-every-write |
| O_DSYNC | fdatasync-on-every-write |
Example #
int fd = open("data", O_WRONLY | O_CREAT | O_DSYNC, 0644);
write(fd, buf, len);
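O_APPEND's seek-to-EOF-before-every-write behavior can be demonstrated directly; a small sketch (the helper and file names are ours):

```c
#include <fcntl.h>
#include <unistd.h>

// With O_APPEND, each write atomically repositions to EOF first,
// so an explicit lseek() cannot make a write land mid-file.
off_t append_twice(const char *path) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0644);
    if (fd < 0) return -1;
    write(fd, "one\n", 4);
    lseek(fd, 0, SEEK_SET);        // ignored by the next write
    write(fd, "two\n", 4);         // still lands at EOF
    off_t end = lseek(fd, 0, SEEK_END);
    close(fd);
    unlink(path);
    return end;                    // 8: both records were appended
}
```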
7. Pipes — Kernel FIFO With Backpressure #
- ~64KB kernel buffer
- Blocking producer/consumer semantics
- No file offsets
- Used as a synchronization mechanism
Example #
int p[2];
pipe(p);
write(p[1], "ABC", 3);
char buf[3];
read(p[0], buf, 3);
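On Linux, the pipe buffer capacity behind this backpressure can be queried with fcntl(F_GETPIPE_SZ); a sketch (the helper name is ours):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Report a pipe's kernel buffer capacity (typically 64 KiB by default).
// A writer blocks once this buffer fills until a reader drains it.
long pipe_capacity(void) {
    int p[2];
    if (pipe(p) != 0) return -1;
    long cap = fcntl(p[1], F_GETPIPE_SZ);   // Linux-specific fcntl
    close(p[0]);
    close(p[1]);
    return cap;
}
```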
8. dup, dup2, dup3 #
Redirect stdout → pipe #
int p[2];
pipe(p);
dup2(p[1], STDOUT_FILENO);
execlp("echo", "echo", "hi", NULL);
9. fork, clone, exec #
fork() #
- Creates new process
- Memory is CoW
- File descriptions are shared
if (fork() == 0)
printf("child\n");
clone() — threads #
char *stack = malloc(STACK_SIZE);
clone(start, stack + STACK_SIZE,   // stack grows down: pass the top
      CLONE_VM | CLONE_THREAD | CLONE_FILES | CLONE_SIGHAND,
      arg);
Threads share the FD table.
exec() #
Replaces the process image.
if (fork() == 0)
execl("/bin/ls", "ls", "-l", NULL);
wait(NULL);
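The fork + exec + wait pattern above can be wrapped into a small helper that reports the child's exit status; a sketch (the helper name run is ours, and the test assumes /bin/true exists):

```c
#include <sys/wait.h>
#include <unistd.h>

// Fork, exec a program in the child, and wait for its exit status.
// Returns the exit code, or -1 on fork/wait failure or abnormal exit.
int run(const char *path, char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        execv(path, argv);   // only returns on failure
        _exit(127);          // conventional "exec failed" status
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```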
10. splice / vmsplice / tee — Kernel Zero-Copy #
10.1 splice() #
Zero-copy movement between FDs.
splice(file_fd, NULL, pipefd[1], NULL, 65536, 0);
splice(pipefd[0], NULL, sock_fd, NULL, 65536, 0);
10.2 tee() #
Duplicate pipe data zero-copy.
tee(p1[0], p2[1], 65536, 0);
10.3 vmsplice() #
Pin user memory into pipe buffer.
struct iovec iov = { buf, len };
vmsplice(p[1], &iov, 1, 0);
11. Zero-Copy Pipeline: file → pipe → socket #
int pipefd[2];
pipe(pipefd);
while (remaining > 0) {
    ssize_t n = splice(file_fd, NULL, pipefd[1], NULL, 65536, 0);
    if (n <= 0) break;               // 0 = EOF, -1 = error
    ssize_t m = splice(pipefd[0], NULL, sock_fd, NULL, n, 0);
    if (m <= 0) break;
    remaining -= m;
}
Zero-copy path:
Disk → page cache → pipe → socket → network
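A self-contained variant of this pipeline, with the socket replaced by a regular-file destination so it can run anywhere (the helper name is ours; splice() requires that one side of each call be a pipe):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Copy up to `len` bytes from in_fd to out_fd through a pipe using
// splice(). Data moves between kernel buffers without entering
// user space. Returns the number of bytes copied, or -1 on error.
ssize_t splice_copy(int in_fd, int out_fd, size_t len) {
    int p[2];
    if (pipe(p) != 0) return -1;
    ssize_t total = 0;
    while ((size_t)total < len) {
        ssize_t n = splice(in_fd, NULL, p[1], NULL, 65536, 0);
        if (n <= 0) break;                         // EOF or error
        ssize_t m = splice(p[0], NULL, out_fd, NULL, n, 0);
        if (m <= 0) break;
        total += m;
    }
    close(p[0]);
    close(p[1]);
    return total;
}
```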
12. Summary #
This document covers:
- VFS architecture and filesystem delegation
- FD tables, file descriptions, inode behavior
- Stdio vs POSIX vs Direct I/O
- Page cache and writeback
- fsync/fdatasync semantics
- Pipes, dup2, fork/clone/exec
- splice/vmsplice/tee
- Inline practical examples throughout
A reference entry for storage engine and systems engineering.