Scheduling In Go: Part 1 - Overview #

Most of the content is copied from this Quora answer.

This is the first post in a three-part series that will provide an understanding of the mechanics and semantics behind the scheduler in Go. This post focuses on a high-level overview of N:M scheduling.

Index of the three part series:

  1. Scheduling In Go: Part 1 - Overview
  2. Scheduling In Go: Part 2 - Go Scheduler
  3. Scheduling In Go: Part 3 - Concurrency

Goroutine #

What goroutines bring us is in fact a version of N:M scheduling, where there are N goroutines backed by M OS threads.

Almost all of the benefit revolves around call conversion scheduling. This is where the scheduler takes what would be a blocking call and trades it for a non-blocking call plus a context switch in user space.
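
To make call conversion concrete, here is a minimal sketch (my example, not from the original answer) of what it looks like from the programmer’s side, assuming a loopback TCP connection: the Read below appears to block, but the Go runtime registers the socket with its network poller (epoll/kqueue) and parks the goroutine in user space, leaving the OS thread free to run other goroutines.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Listen on an ephemeral local port.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		buf := make([]byte, 16)
		// This Read looks blocking, but only the goroutine is parked;
		// the backing OS thread moves on to run other goroutines.
		n, _ := conn.Read(buf)
		fmt.Printf("server read %q\n", buf[:n])
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// While the server goroutine is parked, this goroutine keeps running
	// without a kernel-level context switch per wait.
	time.Sleep(10 * time.Millisecond)
	conn.Write([]byte("hello"))
	time.Sleep(10 * time.Millisecond)
}
```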

The remainder of the benefit comes from voluntary preemption; this is where a thread that would be suspended pending a condition, with the condition check taking place in the OS kernel, instead has the condition check moved to user space. The suspended goroutine is placed on a condition wait queue, and there is a context switch in user space, just like in the call conversion case.
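
A rough illustration of that idea using sync.Cond from the standard library (my sketch, not the author’s): the condition check happens in user space, and a goroutine that has to wait is parked on the cond’s wait queue by the Go scheduler rather than sleeping inside the kernel.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	ready := false

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		mu.Lock()
		for !ready { // the condition check happens in user space
			// Wait parks this goroutine on the cond's wait queue;
			// the OS thread is free to run other goroutines.
			cond.Wait()
		}
		mu.Unlock()
		fmt.Println("condition satisfied, goroutine resumed")
	}()

	mu.Lock()
	ready = true
	mu.Unlock()
	cond.Signal() // wake the parked goroutine via the runtime's wait queue
	wg.Wait()
}
```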

N:M threading was invented specifically to address the issue of partial quantum context switch overhead.

It turns out that making a context switch from one kernel thread to another kernel thread is relatively expensive. We have two system calls of overhead, full register and context spills, and context reloading.

Effectively, we have four protection domain crossings, and a lot of data movement overhead.

If we can accomplish this in user space, we have zero protection domain crossings, and a much smaller data movement overhead.

N:M threading does this.

Additionally, if we make a blocking call partway into our quantum, we pay that context switch expense to run another thread instead. So if we were given 100ms of quantum and used on average 33ms of quantum per thread, then we pay 3 heavyweight context switch overheads per quantum. One of those we would pay anyway: it is the involuntary preemption at the expiration of the quantum, since the CPU scheduler implements time sharing between the many processes running on a system.

If we do it in user space, instead of 3 we pay only 1. That’s a 66% savings in context switch overhead, and N:M threading, such as goroutines, accomplishes exactly that.

Here is the quote from Terry Lambert:

The OS gave me the quantum; it is my damn quantum, and I am going to run threads in my process with it, if I have another thread ready to run, and I am not giving the remainder of that quantum back to the OS to give to someone else.

That’s goroutines.

But because it divides the scheduling responsibility between the kernel scheduler and a second scheduler that lives in the Go runtime library, there’s complexity.

Every time a potentially blocking system call is added to the system, we have to examine it to see if we can make a call conversion using it, or if it has to be a blocking call.

And if we get to the point where we are about to block all of our OS threads (our M’s) because we have that many unconvertible blocking operations outstanding, but we have goroutines whose wait condition has been satisfied and which are therefore ready to run, then we have to start another OS thread (M := M + 1) to back the ready-to-run goroutines while we block the Mth OS thread.

In the limit, it’s possible for M to grow to equal N (if we have N blocking operations outstanding).

Worse, as an implementation detail, almost all user space N:M schedulers of this type tend to keep a separate reserve thread-spawning-thread; so the degenerate case is N effective threads (goroutines) + 1 thread-spawning-thread (N goroutines vs. M + 1 kernel threads).

Typically, the way goroutines are intended to be utilized means that N > M for all the interesting cases, and usually N ≫ M (N very much larger than M).
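
To watch M grow in practice, here is a rough, Linux-specific sketch (my example; the goroutine count and timings are arbitrary): every goroutine below enters a raw nanosleep(2), which the runtime cannot convert, so each one occupies an OS thread and the runtime creates more, which the threadcreate profile makes visible.

```go
package main

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"syscall"
	"time"
)

func main() {
	threads := pprof.Lookup("threadcreate")
	fmt.Println("OS threads created so far:", threads.Count())

	var wg sync.WaitGroup
	for i := 0; i < 32; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// nanosleep(2) is a raw blocking syscall the runtime cannot
			// convert, so each goroutine pins its M for the full second
			// and the runtime starts extra Ms to keep goroutines moving.
			ts := syscall.Timespec{Sec: 1}
			syscall.Nanosleep(&ts, nil)
		}()
	}

	time.Sleep(100 * time.Millisecond)
	fmt.Println("OS threads created while blocked:", threads.Count())
	wg.Wait()
}
```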

Context switch in kernel threads #

A context switch refers to the process of storing and restoring the state (context) of a thread or an instruction stream so that execution can be resumed from the same point at a later time. This allows multiple threads or instruction streams to share a single CPU, as the operating system can switch between them to give the illusion of concurrent execution (on a single-core CPU), or can actually execute them concurrently (on a multi-core CPU).

Switching from one kernel thread to another is considered “expensive” in terms of computing resources for several reasons:

  1. CPU Cycles: Saving and restoring the context of a thread involves storing and retrieving data to and from memory, which consumes CPU cycles.

  2. Cache Flushes: During a context switch, the contents of the CPU’s caches may be invalidated or simply become useless to the new thread, which is likely to need completely different data. Thus, valuable time may be spent refilling the cache.

  3. TLB Flushes: The Translation Lookaside Buffer (TLB) is a CPU cache used by memory management hardware to improve virtual address translation speed. Similar to CPU caches, TLB entries could be invalidated during a context switch, which can lead to a performance penalty.

  4. Scheduler Overhead: Determining which thread to switch to requires some computation, adding to the overhead.

  5. Latency: During the actual switch, both the thread being switched out and the thread being switched in are effectively stalled, which can have a notable impact on overall system performance, particularly if context switches are happening frequently.

All of these factors together contribute to the “cost” of a context switch. This is why in high-performance systems, there is often a focus on reducing the frequency of context switches, or using techniques like user-level threads or green threads to avoid the need for full context switches.
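
To get a feel for how cheap a user-space switch is, here is a crude sketch rather than a rigorous benchmark (my example): the loop ping-pongs a token between two goroutines over unbuffered channels, so each round trip costs roughly two goroutine context switches performed entirely by the Go scheduler, with no kernel thread switch required.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const rounds = 1_000_000
	ping := make(chan struct{})
	pong := make(chan struct{})

	// Echo goroutine: every token received on ping is answered on pong.
	go func() {
		for range ping {
			pong <- struct{}{}
		}
	}()

	start := time.Now()
	for i := 0; i < rounds; i++ {
		ping <- struct{}{} // park here, switch to the echo goroutine
		<-pong             // switch back once the echo goroutine answers
	}
	elapsed := time.Since(start)
	close(ping)

	fmt.Printf("%d round trips in %v (~%v per switch)\n",
		rounds, elapsed, elapsed/(2*rounds))
}
```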

System call recap #

A system call is a programmatic way in which a computer program requests a service from the kernel of the operating system it is executed on. This service could include I/O operations, creating processes, communication with other processes, accessing system hardware, etc.

System calls provide an interface between a process and the operating system. They allow user-space programs to interact with hardware resources and other facilities of the system, without having to know the details of how the hardware or the kernel works. This provides a level of abstraction that simplifies programming.

When a system call is executed, the execution context switches from user mode (where user programs normally run) to kernel mode (a privileged mode where the kernel and some system tasks run), where the required service is performed. This is necessary because many operations that system calls need to perform (such as accessing hardware or managing memory) require privileged access, which user mode programs don’t have. After the system call is completed, control is returned to the calling program in user mode.

Some common examples of system calls include open(), read(), write(), close(), fork(), exec(), exit(), wait(), etc., which are used for operations such as file handling, process management, inter-process communication, and so on.

Let’s take C++ as an example:

In a C++ program that is statically linked, the libraries that the program uses, including the system’s C library, are incorporated into the final executable. However, system calls themselves cannot be statically linked because they are not regular functions.

System calls are interfaces to services provided by the operating system kernel. They are not implemented in libraries but rather in the kernel itself. Libraries, including the standard C library, provide functions (such as open, read, write, close, etc.) that act as wrappers around these system calls. These wrapper functions set up the appropriate registers and then use a special instruction (SYSCALL on Linux, INT 0x80 on older x86 systems, etc.) to transition from user mode to kernel mode, triggering the actual system call within the kernel.

When a program is statically linked, all of these wrapper functions from the library are included in the final executable. However, the actual system calls that these functions trigger are not part of the executable, but rather part of the running kernel. So even with a statically linked program, system calls still require a transition to kernel mode.

And in the case of Go:

When making system calls, Go bypasses the C standard library and interacts directly with the kernel. This is done using assembly language to implement the low-level details of each system call.

The Go runtime contains separate system call packages for different operating systems, each containing the necessary assembly code to perform system calls on that OS. For example, the syscall package in Go’s standard library provides an interface to many of the system calls and low-level functions available on the underlying operating system.
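
As a rough, Linux-specific illustration of that transition (my sketch; the message is arbitrary), here is write(2) invoked directly through syscall.Syscall, which loads the arguments and executes the SYSCALL instruction to cross from user mode into kernel mode and back:

```go
package main

import (
	"syscall"
	"unsafe"
)

func main() {
	msg := []byte("hello from a raw system call\n")
	// Arguments for write(2): fd 1 (stdout), buffer pointer, buffer length.
	// The return values (bytes written, errno) are ignored in this sketch.
	syscall.Syscall(syscall.SYS_WRITE,
		uintptr(1),
		uintptr(unsafe.Pointer(&msg[0])),
		uintptr(len(msg)))
}
```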

Let’s take a simple example like opening a file. Here’s how it works:

  1. A Go program uses the os.Open function to open a file.

  2. The os package uses the syscall.Open function to make a system call to open the file.

  3. The syscall.Open function is actually a wrapper around the raw system call number for the open system call. This function sets up the appropriate registers and then uses a special assembly instruction (SYSCALL on Linux, INT 0x80 on older x86 systems, etc.) to transition from user mode to kernel mode, triggering the actual system call within the kernel.

  4. The kernel handles the system call and then returns control to the Go runtime, which then resumes the execution of the original goroutine.
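
The same chain can be exercised one level down by calling the syscall wrappers directly. The sketch below is mine (the path is arbitrary and it assumes a Unix-like system), but it goes through wrapper functions of the same kind that os.Open relies on.

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// syscall.Open is a thin Go wrapper around the open(2) system call.
	fd, err := syscall.Open("/etc/hostname", syscall.O_RDONLY, 0)
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer syscall.Close(fd)

	// syscall.Read wraps read(2) on the raw file descriptor.
	buf := make([]byte, 128)
	n, err := syscall.Read(fd, buf)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("read %d bytes: %q\n", n, buf[:n])
}
```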

In summary, Go does not use the C standard library; it performs system calls by using assembly language to interface directly with the kernel. This approach provides Go with the flexibility and performance benefits that come with direct access to system calls, while still providing developers with a high-level, easy-to-use API via Go’s standard library.

References #