
Parallel Architecture & Programming III

15618 PP3 - Parallel Performance and Optimization

Posted by freeCookie🍪 on October 10, 2018


I don't really know what I'm doing; probably just hallucinating. _(:з」∠)_ The usual hallucination.

Hallucinations aside, this post reviews performance optimization.

Balancing workload

Ideally: all processors are computing all the time during program execution

Key goals

  • balance workload onto available execution resources
  • reduce communication (avoid stalls)
  • reduce extra work (overhead)

Static assignment

  • assignment of work to threads is pre-determined
  • two approaches: blocked and interleaved
  • simple and zero runtime overhead
  • works well when the cost/amount of work is predictable
  • “semi-static” assignment - periodically profiles and re-adjusts
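The blocked and interleaved schemes can be written as simple index mappings; a minimal sketch (the helper names are mine, not from the course):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Blocked assignment: thread t owns one contiguous chunk of the n rows.
// Returns the half-open range [begin, end) for thread t of p threads.
std::pair<int, int> blocked_range(int n, int p, int t) {
    int chunk = (n + p - 1) / p;              // ceiling division
    int begin = std::min(n, t * chunk);
    int end   = std::min(n, begin + chunk);
    return {begin, end};
}

// Interleaved assignment: thread t owns rows t, t + p, t + 2p, ...
std::vector<int> interleaved_rows(int n, int p, int t) {
    std::vector<int> rows;
    for (int r = t; r < n; r += p) rows.push_back(r);
    return rows;
}
```

Both mappings are computed at thread start, which is why static assignment has zero runtime scheduling overhead.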

Dynamic assignment

  • determines assignment dynamically at runtime - not pre-determined
  • workload balance and overhead (sync cost) trade-off
  • smarter scheduling - long tasks first
  • work stealing - idle threads steal tasks from other threads' queues; working from the local queue preserves locality
    • child stealing, continuation stealing
    • greedy join scheduling policy
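A minimal sketch of dynamic assignment (names are mine): worker threads repeatedly claim the next chunk of task indices from a shared atomic counter. A larger `grain` means fewer synchronized counter updates (less overhead) but coarser load balancing, which is exactly the trade-off noted above.

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Dynamic assignment: assignment is determined at runtime by whichever
// thread reaches the shared counter first, not pre-determined.
void run_dynamic(int num_tasks, int num_threads, int grain,
                 const std::function<void(int)>& work) {
    std::atomic<int> next{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                int begin = next.fetch_add(grain);   // claim the next chunk
                if (begin >= num_tasks) break;       // no work left
                int end = std::min(num_tasks, begin + grain);
                for (int i = begin; i < end; ++i) work(i);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```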

Fork-join parallelism

Natural way to express independent work in divide-and-conquer algorithms (says me, who can't even remember divide-and-conquer, orz)

Amnesiac as usual, I don't remember this being covered at all. The fork-join parallel pattern looks roughly like this. Am I really going to pass?

Fork: create new logical thread of control

  • called by the parent thread to create a new concurrent child thread
  • the child and parent threads then execute asynchronously

Join: returns when all calls spawned by the current function have completed

  • called when the child threads finish
  • the parent thread waits for its children to complete

I found another school's slides that cover the fork-join model in detail.
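The fork-join pattern above can be sketched with std::async (my own example, not the course's Cilk Plus code): fork launches the left half on a new logical thread, the parent keeps working on the right half, and join (future::get) waits for the child before combining results.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Fork-join divide and conquer: sum v[lo, hi) by recursively splitting.
long parallel_sum(const std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo < 1024)                                   // small: run serially
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
    std::size_t mid = lo + (hi - lo) / 2;
    auto child = std::async(std::launch::async,           // fork: new logical thread
                            parallel_sum, std::cref(v), lo, mid);
    long right = parallel_sum(v, mid, hi);                // parent keeps working
    return child.get() + right;                           // join: wait for child
}
```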

Minimizing communication costs

(three months later)

I want to go home for New Year too~ Today I finally unboxed the figure I bought two months ago and put it on display; it really looks great. Then again there's no point buying so many figures; once the pre-ordered Atobe Keigo arrives I probably won't buy any more. (Famous last words?)

Message passing solver

An example of the message passing model: the data is split into partitions, each living in one thread's address space; after finishing its local computation, each thread exchanges data with its neighboring threads.

Synchronous (blocking) send & receive

Blocking communication: send() returns once the sender receives the receiver's ack; recv() returns once the data has been copied into the receiver's address space, at which point the ack is sent back. These are the simplest semantics to implement, so they are widely used, but without large buffers on both the sending and receiving side they can deadlock: since send() blocks until a matching recv() acknowledges it, a sender can run ahead of its partner or keep sending until buffers overflow, and if every thread calls send() first, all of them block forever. One fix is to order the sends and receives by thread parity (tid mod 2); the code below avoids the deadlock this way.

void solve() {
    bool done = false;
    while (!done) {
        float my_diff = 0.0f;
        // Exchange ghost rows with neighbors. Even threads send first and
        // odd threads receive first, so every blocking send() is met by a
        // matching recv() and no cycle of waiting threads can form.
        if (tid % 2 == 0) {
            sendDown(); recvDown();
            sendUp();   recvUp();
        } else {
            recvUp();   sendUp();
            recvDown(); sendDown();
        }
        // ... update local rows, accumulating per-cell change into my_diff,
        // then reduce my_diff across threads and set done on convergence ...
    }
}

Non-blocking asynchronous send/recv

Non-blocking communication: both send() and recv() return immediately, so sender and receiver can do other work while the message is in flight. checksend() and checkrecv() report whether the transfer has completed, i.e., whether it is safe to reuse or read the buffer.
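One way to picture the pattern (a toy model, not a real messaging API; the names here are my own stand-ins): the "transfer" runs in the background, and check_recv() plays the role of checkrecv(), returning immediately and answering true only once the data has arrived and is safe to read.

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>

// Toy non-blocking "recv": returns at once; the transfer completes later.
std::future<std::string> async_recv() {
    return std::async(std::launch::async, [] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20)); // in flight
        return std::string("payload");
    });
}

// Non-blocking completion check, like checkrecv(): true once data arrived.
bool check_recv(std::future<std::string>& msg) {
    return msg.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```

While check_recv() reports false, the thread is free to do other useful work, which is exactly the overlap the non-blocking interface buys.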

Pipelining

Speaking of pipelining, I can't help remembering how I sat there before the exam unable to get the cycle counts right, head pounding... In the end I still pulled off a B in this course, the best I could do! 🤦‍♂️ It must have been an unexplained brain crash, not because I'm dumb _(:з」∠)_ I really did study hard for the second midterm, and for the first one too!

Goal: maximize throughput of many loads

First, a quick review of latency and bandwidth:

  • Latency: the amount of time needed for an operation to complete
  • Bandwidth: the rate at which operations are performed
  • Throughput: the rate of successful message delivery over a communication channel (wiki)

An example instruction pipeline is shown below; as drawn, the latency is 4 cycles/instruction and the throughput is 1 instruction/cycle:

Simple non-pipelined communication:

Pipelined communication:
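The gain can be sketched with a little cycle arithmetic (my own helpers, assuming one-cycle stages as in the 4-cycle example above):

```cpp
// Rough cycle counts for n operations through a pipeline of `depth` stages.
// Non-pipelined: each operation occupies the whole unit for `depth` cycles.
long non_pipelined_cycles(long n, long depth) { return n * depth; }

// Pipelined: the first result appears after `depth` cycles, then one result
// completes every cycle, so throughput approaches 1 op/cycle for large n.
long pipelined_cycles(long n, long depth) { return depth + (n - 1); }
```

Latency per operation is unchanged; only throughput improves.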

Cost

communication cost = communication time - overlap

Two reasons for communication

Inherent communication

Communication that must occur in a parallel algorithm; it is fundamental given how the problem is decomposed and how work is assigned

Communication-to-computation ratio

Definition:

Arithmetic intensity: 1 / (communication-to-computation ratio)

High arithmetic intensity (low communication-to-computation ratio) is required to efficiently utilize modern parallel processors since the ratio of compute capability to available bandwidth is high.
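As a concrete instance (my own arithmetic, for a row-blocked grid solver of the kind used earlier):

```cpp
// Communication-to-computation ratio for an N x N grid solver with a blocked
// row assignment: each of p processors updates n*n/p elements per iteration
// but exchanges only two ghost rows (2n elements), giving a ratio of 2p/n.
double comm_to_comp_blocked_rows(double n, double p) {
    return (2.0 * n) / (n * n / p);   // == 2 * p / n
}
```

Note that the ratio shrinks as n grows: arithmetic intensity rises with problem size.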

Inherent-communication calculations for three different work assignment schemes:

Artifactual communication

All other communication results from practical details of the system implementation and depends on machine implementation details

Examples:

  • System might have a minimum granularity of transfer
  • System might have rules of operation that result in unnecessary communication
  • Poor placement of data in distributed memories
  • Finite replication capacity

Review of Cs

  • Cold miss: First time data touched. Unavoidable in a sequential program
  • Capacity miss: Working set is larger than cache. Can be avoided/reduced by increasing cache size
  • Conflict miss: Miss induced by cache management policy. Can be avoided/reduced by changing cache associativity, or data access pattern in application
  • Communication miss: Due to inherent or artifactual communication in parallel system

I'm not sure of the Chinese terms for these; they appear to be 冷失效, 容量失效, 冲突失效, and 通信失效.

Reducing communication costs

  • Reduce overhead of communication to sender/receiver
    • Send fewer messages, make messages larger (amortize overhead)

    • Coalesce many small messages into large ones

  • Reduce delay
    • Application writer: restructure code to exploit locality

    • HW implementor: improve communication architecture

  • Reduce contention

    • Replicate contended resources

    • Stagger access to contended resources

  • Increase communication/computation overlap
    • Application writer: use asynchronous communication
    • HW implementor: pipelining, multi-threading, pre-fetching, out-of-order exec
    • Requires additional concurrency in application

Techniques for reducing communication

Take this grid data access pattern as the running example:

  • Improved temporal locality by changing grid traversal order: “Blocked” iteration order

  • Improved temporal locality by fusing loops

  • Improve arithmetic intensity by sharing data
    • Exploit sharing: co-locate tasks that operate on the same data
      • Reduces inherent communication
      • Schedule threads working on the same data structure at the same time on the same processor
    • Example: CUDA thread block
  • Reducing artifactual communication: blocked data layout
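Of these, loop fusion is the easiest to show concretely; a minimal sketch (function names are mine):

```cpp
#include <vector>

// Unfused: two passes over `data`, so every element is pulled through the
// cache twice (poor temporal locality).
void two_passes(std::vector<float>& data) {
    for (float& x : data) x = x * 2.0f;   // pass 1
    for (float& x : data) x = x + 1.0f;   // pass 2
}

// Fused: both operations happen while the element is hot in cache, halving
// the memory traffic and doubling the arithmetic intensity of the loop.
void fused(std::vector<float>& data) {
    for (float& x : data) x = x * 2.0f + 1.0f;
}
```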

Spatial locality

Granularity

  • Granularity of communication can be important because it may introduce artifactual communication
    • Granularity of communication / data transfer
    • Granularity of cache coherence

Reading the slides now I can actually follow them; what on earth was wrong with me back then that I just couldn't?

Contention

Contention occurs when many requests to a resource are made within a small window of time

As an example, consider building a grid data structure on a parallel machine, say computing interactions between neighboring particles:

Solution 1: parallelize over cells

  • For each cell independently compute particles within it
  • Eliminates contention because no synchronization is required

  • Insufficient parallelism: only 16 parallel tasks, but need thousands of independent tasks to efficiently utilize GPU

  • Work inefficient: performs 16 times more particle-in-cell computations than sequential algorithm
list cell_lists[16]; // 2D array of lists
for each cell c                 // in parallel
    for each particle p         // sequentially
        if (p is within c)
            append p to cell_lists[c]

Solution 2: parallelize over particles

  • Assign one particle to each CUDA thread. Thread computes cell containing particle, then atomically updates list
    • Massive contention: thousands of threads contending for access to update single shared data structure
list cell_list[16]; // 2D array of lists
lock cell_list_lock;
for each particle p             // in parallel
    c = compute cell containing p
    lock(cell_list_lock)
    append p to cell_list[c]
    unlock(cell_list_lock)

Solution 3: use finer-granularity locks

  • Alleviate contention for single global lock by using per-cell locks
    • 16x less contention than solution 2, assuming uniform distribution of particles in 2D space
list cell_list[16]; // 2D array of lists
lock cell_list_lock[16];
for each particle p             // in parallel
    c = compute cell containing p
    lock(cell_list_lock[c])
    append p to cell_list[c]
    unlock(cell_list_lock[c])

Solution 4: compute partial results + merge

  • Generate N “partial” grids in parallel, then combine
    • Requires extra work: merging the N grids at the end of the computation
    • Requires extra memory footprint: Store N grids of lists, rather than 1
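A sketch of the merge step (types and names are mine): each worker fills its own private grid of per-cell particle lists with no locking at all, and the partial grids are combined at the end. Zero contention during the parallel phase, paid for with extra work (the merge) and N times the memory.

```cpp
#include <vector>

using Grid = std::vector<std::vector<int>>;   // cell index -> particle indices

// Combine N privately-built partial grids into one; this loop is the
// "extra work" of solution 4.
Grid merge_partials(const std::vector<Grid>& partials, int num_cells) {
    Grid result(num_cells);
    for (const Grid& g : partials)
        for (int c = 0; c < num_cells; ++c)
            result[c].insert(result[c].end(), g[c].begin(), g[c].end());
    return result;
}
```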

Solution 5: data-parallel approach

  • Step 1: compute cell containing each particle (parallel over input particles)
  • Step 2: sort results by cell (particle index array permuted based on sort)
  • Step 3: find start/end of each cell (parallel over particle_index elements)
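The three steps can be sketched serially (names are mine; on a GPU each step would be a flat parallel map, sort, or scan):

```cpp
#include <algorithm>
#include <vector>

struct CellRange { int start, end; };   // half-open range into `order`

// Data-parallel binning: given each particle's cell id, produce a particle
// index array sorted by cell (`order`) and the start/end of each cell's run.
std::vector<CellRange> bin_particles(const std::vector<int>& cell_of_particle,
                                     int num_cells,
                                     std::vector<int>& order) {
    int n = (int)cell_of_particle.size();
    order.resize(n);
    for (int i = 0; i < n; ++i) order[i] = i;         // identity permutation
    std::stable_sort(order.begin(), order.end(),      // step 2: sort by cell
                     [&](int a, int b) {
                         return cell_of_particle[a] < cell_of_particle[b];
                     });
    std::vector<CellRange> ranges(num_cells, {0, 0});
    for (int i = 0; i < n; ++i) {                     // step 3: find boundaries
        int c = cell_of_particle[order[i]];
        if (i == 0 || c != cell_of_particle[order[i - 1]]) ranges[c].start = i;
        ranges[c].end = i + 1;
    }
    return ranges;
}
```

No locks or atomics are needed anywhere; contention disappears because the structure is built by permutation rather than by concurrent appends.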

Improving program performance

  • Identify and exploit locality: communicate less (increase arithmetic intensity)

  • Reduce overhead (send fewer, larger messages)
  • Reduce contention
  • Maximize overlap of communication and processing (hide latency so as to not incur cost)


More than half a year after graduating, I'm still going through lecture slides and videos, orz. It's pointless; I'd be better off grinding problems. But people always do unimportant things to avoid the important ones, easing the anxiety of not doing what they should by doing something else. So here I am, still writing, just like making Meet Fresh taro desserts late at night back then. It's my own blog anyway, so a little moping is fine. My head hurts.