Memory consistency models #

A learning note for CMU 15-418/618 Lecture 13.

Memory coherence vs. memory consistency #

Memory coherence defines the requirements for the observed behavior of reads and writes to the same memory location:

  • All processors must agree on the order of reads/writes to X
  • In other words: It is possible to put operations involving X on a timeline such that the observations of all processors are consistent with that timeline
  • The goal of cache/memory coherence is to ensure that the memory system in a parallel computer behaves as if the caches were not there, i.e., just like how the memory system in a uni-processor machine behaves as if the cache was not there
  • A system without caches would have no need for cache coherence
Figure 1: memory coherence example

However, memory consistency defines the behavior of reads and writes to different memory locations (as observed by other processors).

  • Coherence only guarantees that writes to address X will eventually propagate to other processors
  • Consistency deals with when writes to X propagate to other processors, relative to reads and writes to other addresses
  • Memory consistency defines the allowed behavior of loads and stores to different addresses in a parallel system, i.e., the allowed behavior of memory should be specified whether or not caches are present (and that’s what a memory consistency model does)

Memory operation ordering #

A program defines a sequence of load and store operations; this is the “program order” of the loads and stores.

There are four types of memory operation orderings:

  • W -> R: write to X must commit before subsequent read from Y
  • R -> R: read from X must commit before subsequent read from Y
  • R -> W: read from X must commit before subsequent write to Y
  • W -> W: write to X must commit before subsequent write to Y

Note: “write must commit before subsequent read” means: when a write comes before a read in program order, the write must commit (its results are visible) by the time the read occurs.

A sequentially consistent memory system maintains all four memory operation orderings.

Sequential consistency #

A parallel system is sequentially consistent if the result of any parallel execution is the same as if all the memory operations were executed in some sequential order, and the memory operations of any one processor are executed in program order.

Figure 2: sequential consistency example. Note that the timeline now lists operations to addresses X and Y

A switch metaphor for sequential consistency

  • All processors issue loads and stores in program order
  • Memory chooses a processor, performs a memory operation to completion, then chooses another processor, …
Figure 3: a switch metaphor for sequential consistency

Quick example

Figure 4: quick example. Assume A and B are initialized to 0. Question: imagine threads 1 and 2 are being run simultaneously on a two-processor system. What will get printed?

Answer: assuming writes propagate immediately (e.g., P1 won’t continue to the if statement until P2 observes the write to A), the code will print “hello”, “world”, or nothing, but not both.
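
Figure 4 contains the actual code; as a hedged reconstruction, the canonical form of this example in C++11 looks like the sketch below. std::atomic operations default to sequentially consistent ordering, matching the assumption in the answer above:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Assumed reconstruction of Figure 4's two threads; A and B start at 0.
    std::atomic<int> A{0}, B{0};  // default ordering is sequentially consistent

    void thread1() {
        A.store(1);                                 // write A
        if (B.load() == 0) std::printf("hello\n");  // read B
    }

    void thread2() {
        B.store(1);                                 // write B
        if (A.load() == 0) std::printf("world\n");  // read A
    }

    int main() {
        std::thread t1(thread1), t2(thread2);
        t1.join();
        t2.join();
    }

Under sequential consistency, at least one thread must observe the other’s write, which is why “hello” and “world” can never both be printed.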

Relaxing memory operation ordering #

  • A sequentially consistent memory system maintains all four memory operation orderings (W -> R, R -> R, R -> W, W -> W).
  • Relaxed memory consistency models allow certain orderings to be violated. Back to the quick example:
Figure 5: ILP as a motivation to reorder the instruction execution order

Motivation for relaxed consistency: hiding latency #

Why are we interested in relaxing ordering requirements?

  • To gain performance
  • Specifically, hiding memory latency: overlap memory access operations with other operations when they are independent
  • Remember, a memory access in a cache-coherent system may entail much more work than simply reading bits from memory (finding data, sending invalidations, etc.)
Figure 6: overlap memory access operations with other independent operations to hide latency

Another way of thinking about relaxed ordering

Figure 7: another way of thinking about relaxed ordering

Allowing reads to move ahead of writes #

Allowing reads to move ahead of writes means the W -> R ordering is relaxed:

  • W -> R: write to X must commit before subsequent read from Y (relaxed)
  • R -> R: read from X must commit before subsequent read from Y
  • R -> W: read from X must commit before subsequent write to Y
  • W -> W: write to X must commit before subsequent write to Y

Allowing reads to move ahead of writes lets the processor hide the latency of writes.

Figure 8: another way of interpreting Figure 6: allowing reads to move ahead of writes to hide latency

Write buffering example #

Write buffering is a common processor optimization that allows reads to proceed in front of prior writes (relaxing the W -> R ordering).

Figure 9: write buffering as an example to hide latency
  • When a store is issued, the processor buffers the store in the write buffer (assume the store is to address X)
  • The processor immediately begins executing subsequent loads, provided they are not accessing address X (this exploits ILP in the program)
  • Further writes can also be added to the write buffer (the write buffer is processed in order, so there is no W -> W reordering); the observable consequence is sketched below
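
The observable consequence of write buffering is the classic “store buffering” outcome. Below is a hedged C++11 sketch (my own illustration, not the lecture’s code) using relaxed atomics; on a machine that relaxes W -> R, both loads may return 0:

    #include <atomic>

    std::atomic<int> X{0}, Y{0};
    int r1 = -1, r2 = -1;

    void p1() {
        X.store(1, std::memory_order_relaxed);   // sits in P1's write buffer
        r1 = Y.load(std::memory_order_relaxed);  // may execute before X drains
    }

    void p2() {
        Y.store(1, std::memory_order_relaxed);   // sits in P2's write buffer
        r2 = X.load(std::memory_order_relaxed);  // may execute before Y drains
    }

    // With W -> R relaxed, the outcome r1 == 0 && r2 == 0 is allowed,
    // even though no sequentially consistent interleaving can produce it.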

Relaxed consistency performance #

Figure 10: W -> R relaxed consistency performance

Different W -> R relaxed orderings #

Total store ordering (TSO)

  • Processor P can read B before its write to A is seen by all processors (a processor can move its own reads in front of its own writes)
  • Reads by other processors cannot return the new value of A until the write to A is observed by all processors

Processor consistency (PC)

  • Any processor can read the new value of A before the write is observed by all processors

In TSO and PC, only the W -> R order is relaxed. The W -> W constraint still exists: writes by the same thread are not reordered (they occur in program order).

For examples of TSO & PC, see Figure 11.

Figure 11: examples of TSO & PC

Clarification #

The cache coherency problem exists because of the optimization of duplicating data in multiple processor caches. The copies of the data must be kept coherent.

Relaxed memory consistency issues arise from the optimization of reordering memory operations. (Consistency is unrelated to whether there are caches in the system.)

Allowing writes to be reordered (PSO) #

Allowing writes to be reordered means both the W -> R and W -> W orderings are relaxed:

  • W -> R: write to X must commit before subsequent read from Y (relaxed)
  • R -> R: read from X must commit before subsequent read from Y
  • R -> W: read from X must commit before subsequent write to Y
  • W -> W: write to X must commit before subsequent write to Y (relaxed)

This is the so-called partial store ordering (PSO).

Figure 12: example of PSO

Execution may not match sequential consistency on this program (P2 may observe the change to flag before the change to A), i.e., the program may now print 1 or 0, whereas under TSO it can only print 1. Printing 0 can happen because writes can now be reordered: if flag is set to 1 and thread 2 executes before A is set to 1, then printing A yields 0.
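
To make the program always print 1 even under PSO, the write to flag must be ordered after the write to A. A hedged C++11 sketch of this fix (my reconstruction of the figure’s program, not the lecture’s code) using release/acquire:

    #include <atomic>
    #include <cstdio>

    std::atomic<int> A{0};
    std::atomic<int> flag{0};

    void thread1() {
        A.store(1, std::memory_order_relaxed);     // could otherwise be reordered
        flag.store(1, std::memory_order_release);  // release: prior writes become visible first
    }

    void thread2() {
        while (flag.load(std::memory_order_acquire) == 0) {
            // spin; the acquire load pairs with the release store above
        }
        std::printf("%d\n", A.load(std::memory_order_relaxed));  // always prints 1
    }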

Allowing all reorderings #

Now all four orderings may be relaxed:

  • W -> R: write to X must commit before subsequent read from Y
  • R -> R: read from X must commit before subsequent read from Y
  • R -> W: read from X must commit before subsequent write to Y
  • W -> W: write to X must commit before subsequent write to Y

Why might it be useful to allow more aggressive memory operation reorderings?

  • W -> W: the processor might reorder write operations in the write buffer (e.g., one is a cache miss while the other is a hit)
  • R -> W, R -> R: the processor might reorder independent instructions in an instruction stream (out-of-order execution)
  • Keep in mind these are all valid optimizations if a program consists of a single instruction stream.

Examples

  • Weak ordering (WO)
  • Release consistency (RC)

With these aggressive memory operation reorderings, it is the programmer’s responsibility to ensure that the program still has a well-defined meaning, although in practice this is handled implicitly through synchronization libraries. The library handles the architectural details in order to present the typical synchronization abstractions (locks, etc.). All the programmer needs to do is work with the sync abstractions.

For example, synchronization libraries may use processor-supported special synchronization operations like this:

reorderable reads and writes here
...
MEMORY FENCE
...
reorderable reads and writes here
...
MEMORY FENCE
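
In C++11, MEMORY FENCE maps to std::atomic_thread_fence. A minimal sketch (my illustration; the relaxed atomics are needed because C++ fences order atomic operations):

    #include <atomic>

    std::atomic<int> x{0}, y{0};

    void worker() {
        x.store(1, std::memory_order_relaxed);  // reorderable reads and writes here
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // MEMORY FENCE
        int r1 = y.load(std::memory_order_relaxed);  // reorderable reads and writes here
        int r2 = x.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // MEMORY FENCE
        (void)r1; (void)r2;  // silence unused-variable warnings
    }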

Synchronization #

The basic idea of synchronization is to ensure that a program has a single, well-defined meaning no matter what kind of memory operation reorderings are used.

Example: expressing synchronization in relaxed models

Intel x86/x64 processors use total store ordering as their memory consistency model. They provide sync instructions in case software requires a specific instruction ordering not guaranteed by the TSO consistency model.

  • _mm_lfence (“load fence”: wait for all loads to complete)
  • _mm_sfence (“store fence”: wait for all stores to complete)
  • _mm_mfence (“mem fence”: wait for all memory operations to complete)

ARM processors, in contrast, use a much more relaxed consistency model.
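
A small sketch using the intrinsics above (the publish/consume pattern is my assumed example; on real x86, TSO already orders plain stores, so the fences here are illustrative):

    #include <emmintrin.h>  // _mm_lfence / _mm_mfence (SSE2); _mm_sfence comes via SSE

    int data = 0;
    volatile int flag = 0;

    void publish() {
        data = 42;
        _mm_sfence();  // all prior stores complete before the store below
        flag = 1;
    }

    void consume() {
        while (flag == 0) { /* spin */ }
        _mm_lfence();  // all prior loads complete before the load below
        int v = data;
        (void)v;
    }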

Acquire/release semantics #

  • Operation X with acquire semantics: prevent reordering of X with any load/store after X in program order.
  • Other processors see X’s effect before the effect of all subsequent operations
  • Example: taking a lock must have acquire semantics
Figure 13: operation X with acquire semantics
  • Operation X with release semantics: prevent reordering of X with any load/store before X in program order.
  • Other processors see the effects of all prior operations before seeing the effect of X
  • Example: releasing a lock must have release semantics
Figure 14: operation X with release semantics
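
A minimal spinlock sketch (my illustration, not from the lecture) showing why taking a lock needs acquire semantics and releasing it needs release semantics:

    #include <atomic>

    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            // Acquire semantics: the critical section cannot move above the lock.
            while (locked.exchange(true, std::memory_order_acquire)) {
                // spin until the lock becomes free
            }
        }
        void unlock() {
            // Release semantics: the critical section cannot move below the unlock.
            locked.store(false, std::memory_order_release);
        }
    };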

C++11 atomic #

  • Provides atomic read, write, and read-modify-write of entire objects. The atomicity may be implemented by a mutex or efficiently by processor-supported atomic instructions (if T is a basic type). For example:

  #include <atomic>

  std::atomic<int> i;
  i++; // atomically increment i

  int a = i;
  // do stuff
  i.compare_exchange_strong(a, 10); // if i has the same value as a, set i to 10
  bool b = i.is_lock_free();        // true if the implementation of atomicity is lock-free

  • Provides memory ordering semantics for operations before and after atomic operations; the default is sequential consistency. See std::memory_order for more detail.
Figure 15: C++11 atomic ensures sequentially consistent behavior by default

The memory ordering semantics can be controlled explicitly:

Figure 16: C++11 explicit control over memory ordering semantics
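
For example, a statistics counter needs atomicity but no ordering with surrounding operations, so relaxed ordering avoids the fences implied by the default seq_cst (a sketch; the counter is my assumed example):

    #include <atomic>

    std::atomic<long> hits{0};

    void record_hit() {
        // Atomic increment with no ordering constraint on surrounding code:
        hits.fetch_add(1, std::memory_order_relaxed);
    }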

Conflicting data accesses #

Two memory accesses by different processors conflict if:

  • They access the same memory location
  • At least one is a write

A program is unsynchronized if conflicting accesses are not ordered by synchronization (e.g., a fence, an operation with release/acquire semantics, a barrier, etc.).

An unsynchronized program contains data races: the output of the program depends on the relative speed of the processors (i.e., the program result is non-deterministic).

Synchronized programs yield SC results on non-SC systems, i.e., they are data-race-free.
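
A minimal sketch of a conflicting, unsynchronized access (my illustration):

    int x = 0;  // shared, non-atomic

    void writer() { x = 1; }     // write to x
    int  reader() { return x; }  // read of x: same location, one access is a write,
                                 // and nothing orders them, so this is a data race

Making x a std::atomic, or ordering the two accesses with a lock or fence, removes the race.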

Summary: relaxed consistency #

  • Motivation: obtain higher performance by allowing reordering of memory operations (reordering is not allowed by sequential consistency)
  • One cost is software complexity: the programmer or compiler must correctly insert synchronization to ensure certain specific operation orderings when needed. In practice, however, the complexities are encapsulated in libraries that provide intuitive primitives like lock/unlock and barrier (or lower-level primitives like fence)
  • Relaxed consistency is optimized for the common case: most memory accesses are not conflicting, so don’t design a system that pays the cost of ordering as if they are

Relaxed consistency models differ in which memory ordering constraints they ignore.

References #