
Parallel Architecture & Programming VIII

15-618 PP8: Implementing Synchronization

Posted by freeCookie🍪 on February 1, 2019

Surrounded by 🍋🍋🍋, but I'll keep at it today. I'm WFH, so I'm not shut in; in fact I'm not even working, I'm studying how to implement synchronization instead, not that it's much use. Ahhhhh, I feel really awful inside. Ugh _(:з」∠)_ I do want to get some proper work done.

Implementing Locks

Primitives for ensuring mutual exclusion

  • Locks
  • Atomic primitives
  • Transactions

Primitives for event signaling

  • Barriers
  • Flags

Three phases of a synchronization event

  • Acquire method
  • Waiting algorithm
    • Busy waiting (spinning): preferable when
      • Scheduling overhead is larger than the expected wait time
      • The processor’s resources are not needed for other tasks
    • Blocking: if progress cannot be made because a resource cannot be acquired, it is desirable to free up execution resources for another thread
  • Release method

Desirable lock performance

  • Low latency: If the lock is free and no other processors are trying to acquire it, a processor should be able to acquire the lock quickly

  • Low interconnect traffic: If all processors are trying to acquire lock at once, they should acquire the lock in succession with as little traffic as possible
  • Scalability: Latency / traffic should scale reasonably with number of processors
  • Low storage cost
  • Fairness: Avoid starvation or substantial unfairness

Test-and-test-and-set lock

Characteristics:

  • Slightly higher latency than test-and-set in uncontended case
  • Generates much less interconnect traffic
    • One invalidation per waiting processor per lock release (O(P) invalidations)
    • O(P²) interconnect traffic if all processors have the lock cached
    • In contrast, the test-and-set lock generates one invalidation per waiting processor per test
  • More scalable (due to less traffic)
  • Storage cost unchanged (one int)
  • Still no provisions for fairness
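
To make the comparison concrete, here is a minimal C11 sketch of both styles (the names tas_lock and ttas_lock are mine, not from the notes). Plain test-and-set spins on the atomic exchange itself, so every test generates coherence traffic; test-and-test-and-set spins on an ordinary load of the cached line and only attempts the exchange once the lock looks free.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;  // 0 = free, 1 = held

// Plain test-and-set: every spin iteration is an atomic read-modify-write,
// so each test causes invalidation traffic.
void tas_lock(spinlock_t *l) {
    while (atomic_exchange(&l->locked, 1) == 1)
        ;
}

// Test-and-test-and-set: spin on an ordinary load (the line stays cached),
// and only attempt the expensive exchange when the lock looks free.
void ttas_lock(spinlock_t *l) {
    for (;;) {
        while (atomic_load(&l->locked) == 1)
            ;                                 // local spinning, no traffic
        if (atomic_exchange(&l->locked, 1) == 0)
            return;                           // won the race
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);              // triggers O(P) invalidations
}
```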

Test-and-set lock with back-off:

  • Upon failure to acquire the lock, delay for a while before retrying
  • Same uncontended latency as test-and-set, but potentially higher latency under contention
  • Generates less traffic than test-and-set
  • Improves scalability (due to less traffic)
  • Storage cost unchanged
  • Exponential back-off can cause severe unfairness (newer requesters back off for shorter intervals)
  • Main problem with test-and-set style locks: upon release, all waiting processors attempt to acquire lock using test-and-set
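
A hedged sketch of the back-off idea; the delay loop and the MAX_DELAY cap are illustrative choices of mine, not values from the notes:

```c
#include <stdatomic.h>

#define MAX_DELAY 1024   // illustrative cap on the back-off interval

void tas_backoff_lock(atomic_int *l) {
    int delay = 1;
    while (atomic_exchange(l, 1) == 1) {
        for (volatile int i = 0; i < delay; i++)
            ;                // crude busy delay; real code would use a pause hint
        if (delay < MAX_DELAY)
            delay *= 2;      // exponential growth: newer arrivals wait less (unfair)
    }
}

void tas_backoff_unlock(atomic_int *l) {
    atomic_store(l, 0);
}
```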

I want to go back to 工大 for New Year's Eve dinner. The test-and-set lock guarantees that a shared resource has exactly one lock owner at any moment, while all other processors busy-wait.

Ticket lock

  • Acquire takes one atomic increment (to grab a ticket); the waiting loop itself needs no atomic operation, only a read
  • Only one invalidation per lock release (O(P) interconnect traffic)
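
A minimal ticket lock sketch in the same C11 style (field names are mine): one atomic fetch-add takes a ticket, the spin is a plain read of a cached line, and release is one increment of now_serving by the holder.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int next_ticket;   // next ticket to hand out
    atomic_int now_serving;   // ticket currently holding the lock
} ticket_lock_t;

void ticket_lock(ticket_lock_t *l) {
    // one atomic op per acquire: take a ticket
    int my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my_ticket)
        ;                     // waiting is only a read
}

void ticket_unlock(ticket_lock_t *l) {
    // only the holder writes now_serving: one invalidation per release,
    // and exactly the next ticket holder proceeds (FIFO fairness)
    atomic_fetch_add(&l->now_serving, 1);
}
```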

Array-based lock

  • Each processor spins on a different memory address
  • Uses an atomic operation to assign each acquiring processor its own address
  • O(1) interconnect traffic per release, but the lock requires space linear in P
  • The atomic circular increment is a more complex operation (higher overhead)
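
A sketch of the idea, assuming a fixed maximum of P waiters and padding each slot onto its own cache line (both values are illustrative). The simple fetch-and-add modulo P below stands in for the atomic circular increment the notes mention:

```c
#include <stdatomic.h>

#define P   64   // max waiters (illustrative)
#define PAD 16   // ints per cache line, so each slot gets its own line

typedef struct {
    atomic_int status[P * PAD];   // slot i lives at status[i * PAD]; 1 = go
    atomic_int next_slot;
} array_lock_t;

void array_lock_init(array_lock_t *l) {
    for (int i = 0; i < P; i++)
        atomic_store(&l->status[i * PAD], 0);
    atomic_store(&l->status[0], 1);   // first arrival may proceed
    atomic_store(&l->next_slot, 0);
}

int array_lock(array_lock_t *l) {
    // fetch-add modulo P stands in for the atomic circular increment;
    // a real implementation must also handle counter wrap-around
    int me = atomic_fetch_add(&l->next_slot, 1) % P;
    while (atomic_load(&l->status[me * PAD]) == 0)
        ;                             // spin on a private location
    return me;                        // caller passes this to array_unlock
}

void array_unlock(array_lock_t *l, int me) {
    atomic_store(&l->status[me * PAD], 0);               // reset my slot
    atomic_store(&l->status[((me + 1) % P) * PAD], 1);   // wake next slot only
}
```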

Queue-based Lock (MCS lock)

  • Create a queue of waiters
    • Each thread allocates a local space on which to wait
  • Pseudo-code:
    • Glock – global lock
    • Mlock – my lock (state, next pointer)
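
Fleshing out that pseudo-code, here is a hedged C11 sketch of the MCS lock: the global lock (Glock) is just a tail pointer, and each thread spins on a flag inside its own locally allocated node (Mlock). Names beyond Glock/Mlock are mine.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;                    // true while this thread must wait
} mcs_node_t;

typedef _Atomic(mcs_node_t *) mcs_lock_t;  // "Glock": tail of the waiter queue

void mcs_acquire(mcs_lock_t *glock, mcs_node_t *mlock) {
    atomic_store(&mlock->next, NULL);
    // append my node; the previous tail (if any) is my predecessor
    mcs_node_t *prev = atomic_exchange(glock, mlock);
    if (prev != NULL) {
        atomic_store(&mlock->locked, true);
        atomic_store(&prev->next, mlock);  // link in behind my predecessor
        while (atomic_load(&mlock->locked))
            ;                              // spin on my own node only
    }
}

void mcs_release(mcs_lock_t *glock, mcs_node_t *mlock) {
    mcs_node_t *succ = atomic_load(&mlock->next);
    if (succ == NULL) {
        // no visible successor: try to swing the tail back to empty
        mcs_node_t *expected = mlock;
        if (atomic_compare_exchange_strong(glock, &expected, NULL))
            return;                        // nobody was waiting
        // a waiter is mid-enqueue; wait for it to link itself in
        while ((succ = atomic_load(&mlock->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);    // hand the lock to the successor
}
```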

Implementing Barriers

Centralized barrier

  • O(P) traffic on interconnect per barrier:
    • All threads: 2P write transactions to obtain barrier lock and update counter
    • Last thread: 2 write transactions to write to the flag and reset the counter
  • Still serialization on a single shared lock
    • So span (latency) of entire operation is O(P)
    • Optimization: tree implementation
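
Before moving to the tree optimization, a minimal sketch of the centralized barrier itself, written with sense reversal so it can be reused across iterations (NUM_THREADS and all names are illustrative; the pthread mutex plays the role of the "barrier lock" in the counts above):

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 8   // illustrative

typedef struct {
    pthread_mutex_t lock;
    int arrived;         // protected by lock
    atomic_int flag;     // flips once per barrier round
} barrier_t;

barrier_t b = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };

// Each thread keeps its own `int my_sense = 0;` and passes its address here.
void barrier_wait(barrier_t *bar, int *local_sense) {
    *local_sense = !*local_sense;           // sense for this round
    pthread_mutex_lock(&bar->lock);         // P lock acquisitions ...
    int n = ++bar->arrived;                 // ... and P counter updates
    pthread_mutex_unlock(&bar->lock);
    if (n == NUM_THREADS) {
        bar->arrived = 0;                   // last thread: reset counter ...
        atomic_store(&bar->flag, *local_sense);  // ... then release everyone
    } else {
        while (atomic_load(&bar->flag) != *local_sense)
            ;                               // spin until the flag flips
    }
}
```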

Tree implementation of barrier

  • Combining trees make better use of parallelism in interconnect topologies
    • lg(P) span (latency)
    • Strategy makes less sense on a bus (all traffic still serialized on single shared bus)
  • Barrier acquire: when a processor arrives at the barrier, it increments its parent’s counter
  • Barrier release: beginning from the root, notify children of the release
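
A hedged sketch of a sense-reversing combining-tree barrier along these lines (the fan-in, depth bound, and all names are my own choices): the last arrival at each node propagates the arrival upward, and release cascades back down as each winner flips the sense flag of the nodes it won.

```c
#include <stdatomic.h>
#include <stddef.h>

#define FAN_IN    2    // illustrative fan-in
#define MAX_DEPTH 32   // illustrative bound on tree depth

typedef struct tnode {
    struct tnode *parent;   // NULL at the root
    atomic_int count;       // arrivals at this node this round
    atomic_int sense;       // waiters at this node spin on this flag
} tnode_t;

// Each thread calls this with its leaf node and a thread-local sense.
void tree_barrier(tnode_t *leaf, int *local_sense) {
    int sense = (*local_sense = !*local_sense);
    tnode_t *won[MAX_DEPTH];
    int depth = 0;
    tnode_t *n = leaf;
    for (;;) {
        // arrival: increment this node's counter
        if (atomic_fetch_add(&n->count, 1) + 1 < FAN_IN) {
            while (atomic_load(&n->sense) != sense)
                ;                       // not last here: wait for release
            break;
        }
        atomic_store(&n->count, 0);     // last arrival: reset for next round
        won[depth++] = n;               // remember nodes whose waiters we free
        if (n->parent == NULL)
            break;                      // last at the root: all have arrived
        n = n->parent;                  // propagate arrival up the tree
    }
    // release: flip the sense at every node we won, cascading downward
    while (depth-- > 0)
        atomic_store(&won[depth]->sense, sense);
}
```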

Fine-grained locking

  • Use fine-grained locking to reduce contention (maximize parallelism) in operations on shared data structures
    • But finer granularity can increase code complexity (and hence errors) and execution overhead

Fine-grained synchronization, using insertion of a node into a sorted linked list as an example:

When multiple threads modify the list at the same time, nodes can be lost.

Solution 1: protect the list with a single lock

  • Good: Simple
  • Bad: Operations on the data structure are serialized

Solution 2: fine-grained locking

  • Goal: enable parallelism in data structure operations
    • Reduces contention for global data structure lock
    • In linked-list example: a single monolithic lock is overly conservative
  • Challenge: tricky to ensure correctness
  • Costs
    • Overhead of taking a lock at each traversal step
    • Extra storage cost

Each node has its own lock; after acquiring the locks of the two adjacent nodes (predecessor and successor), the new node can be inserted between them.
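
A sketch of that hand-over-hand ("lock coupling") insertion, assuming a dummy head node so there is always a predecessor to lock; struct and function names are mine:

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct node {
    int value;
    pthread_mutex_t lock;
    struct node *next;
} node_t;

// List has a dummy head node, so `prev` always exists.
void insert(node_t *head, int value) {
    node_t *n = malloc(sizeof *n);
    n->value = value;
    pthread_mutex_init(&n->lock, NULL);

    pthread_mutex_lock(&head->lock);
    node_t *prev = head;
    node_t *cur = head->next;
    if (cur) pthread_mutex_lock(&cur->lock);
    while (cur && cur->value < value) {
        node_t *next = cur->next;
        if (next) pthread_mutex_lock(&next->lock);  // take the next lock ...
        pthread_mutex_unlock(&prev->lock);          // ... before dropping the oldest
        prev = cur;
        cur = next;
    }
    // holding prev (and cur, if non-NULL): no one can splice here concurrently
    n->next = cur;
    prev->next = n;
    pthread_mutex_unlock(&prev->lock);
    if (cur) pthread_mutex_unlock(&cur->lock);
}
```

Locks are always taken in list order, so concurrent inserters cannot deadlock; at most two node locks are held at any moment during traversal.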

Lock-free data structures

  • An algorithm that uses locks is blocking, regardless of whether the lock implementation uses spinning or pre-emption
  • Non-blocking algorithms are lock-free if some thread is guaranteed to make progress (“system-wide progress”)
  • Lock-free data structures: non-blocking solution to avoid overheads due to locks
    • But can be tricky to implement
    • Still requires appropriate memory fences on modern relaxed consistency hardware
    • A lock-free design does not eliminate contention

Single reader, single writer unbounded queue:

  • Tail points to last element added
  • Head points to element BEFORE head of queue
  • Allocation and deletion performed by the same thread
    • Only push modifies tail & reclaim; only pop modifies head
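
A C11 sketch of this queue under exactly those rules: head points to the node before the logical front, and only the pushing thread allocates and frees nodes via a reclaim pointer (all names are mine):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;
    int value;
} qnode_t;

typedef struct {
    _Atomic(qnode_t *) head;   // advanced only by pop
    qnode_t *tail;             // touched only by push
    qnode_t *reclaim;          // touched only by push
} queue_t;

void queue_init(queue_t *q) {
    qnode_t *dummy = malloc(sizeof *dummy);
    atomic_store(&dummy->next, NULL);
    atomic_store(&q->head, dummy);     // head points BEFORE the front
    q->tail = q->reclaim = dummy;
}

void push(queue_t *q, int value) {     // producer thread only
    qnode_t *n = malloc(sizeof *n);
    n->value = value;
    atomic_store(&n->next, NULL);
    atomic_store(&q->tail->next, n);   // publish: pop may now see n
    q->tail = n;
    // lazily free nodes pop has already moved past (head never regresses)
    qnode_t *head = atomic_load(&q->head);
    while (q->reclaim != head) {
        qnode_t *tmp = q->reclaim;
        q->reclaim = atomic_load(&tmp->next);
        free(tmp);
    }
}

bool pop(queue_t *q, int *value) {     // consumer thread only
    qnode_t *head = atomic_load(&q->head);
    qnode_t *next = atomic_load(&head->next);
    if (next == NULL) return false;    // queue empty
    *value = next->value;
    atomic_store(&q->head, next);      // old head now reclaimable by push
    return true;
}
```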

ABA problem

The ABA problem is an issue that can arise when using compare-and-swap; it can also be solved with a version stamp.

  • Maintain counter of pop operations
  • Requires machine to support “double compare and swap” (DCAS) or doubleword CAS
  • Could also solve ABA problem with node allocation and/or element reuse policies
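
A sketch of the version-counter fix, assuming nodes live in a pre-allocated pool so that a 32-bit index plus a 32-bit pop count fit into a single 64-bit CAS (a real pointer-plus-counter pair would need the DCAS / double-word CAS mentioned above; NIL, pool, and all names are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

#define NIL 0xFFFFFFFFu                // sentinel index for "empty"

typedef struct { int value; uint32_t next; } node_t;
extern node_t pool[];                  // assumed pre-allocated node pool

static _Atomic uint64_t top;           // high 32 bits: pop count, low 32: top index

static uint64_t pack(uint32_t count, uint32_t idx) {
    return ((uint64_t)count << 32) | idx;
}

int pop(int *value) {
    uint64_t old = atomic_load(&top);
    for (;;) {
        uint32_t idx = (uint32_t)old;          // current top node
        uint32_t cnt = (uint32_t)(old >> 32);  // pops so far
        if (idx == NIL)
            return 0;                          // stack is empty
        uint64_t desired = pack(cnt + 1, pool[idx].next);
        // even if idx reappears on top (A-B-A), the stale count won't match
        if (atomic_compare_exchange_weak(&top, &old, desired)) {
            *value = pool[idx].value;
            return 1;
        }
        // CAS failure reloaded `old`; retry with the fresh snapshot
    }
}
```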

Another ABA solution: Hazard Pointers

  • A node cannot be recycled or reused while its address matches any thread’s hazard pointer
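
A very small hazard-pointer sketch with one hazard slot per thread (MAX_THREADS and the function names are mine): a reader publishes the pointer it is about to dereference, and a reclaimer frees a retired node only if it appears in no slot.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 8   // illustrative

static _Atomic(void *) hazard[MAX_THREADS];   // one slot per thread

// Reader: publish the pointer, then re-check the source still holds it.
void *protect(int tid, _Atomic(void *) *src) {
    void *p;
    do {
        p = atomic_load(src);
        atomic_store(&hazard[tid], p);
    } while (atomic_load(src) != p);   // retry if src changed under us
    return p;                          // safe to dereference until cleared
}

void clear(int tid) {
    atomic_store(&hazard[tid], NULL);
}

// Reclaimer: a retired node may be freed only if no thread protects it.
int safe_to_free(void *node) {
    for (int i = 0; i < MAX_THREADS; i++)
        if (atomic_load(&hazard[i]) == node)
            return 0;                  // matches a hazard pointer: keep it
    return 1;
}
```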

Reference

Non-Blocking Linked List

Lock-Free Code: A False Sense of Security

Common Pitfalls in Writing Lock-Free Algorithms