Parallel Architecture & Programing 贰
本来打算放弃写笔记了不过最近因为学妹住到家里又想起了CMU生涯,捡起之前的笔记看看,一个挣扎着起来的垃圾。
并行编程的基础知识,包括抽象和实现的区分,几种常用的并发编程模型及其应用,识别依赖。
Abstraction Vs. Implementation
之前不太能分清上层抽象和下层实现的关系_(:з」∠)_
举个ISPC(Intel SPMD Programing Complier)的例子。ISPC code:
export void sinx(
uniform int N,
uniform int terms,
uniform float* x,
uniform float* result)
{
// assume N % programCount = 0
for (uniform int i=0; i<N; i+=programCount)
{
int idx = i + programIndex;
float value = x[idx];
float numer = x[idx] * x[idx] * x[idx];
uniform int denom = 6; // 3!
uniform int sign = -1;
for (uniform int j=1; j<=terms; j++)
{
value += sign * numer / denom
numer *= x[idx] * x[idx];
denom *= (2*j+2) * (2*j+3);
sign *= -1;
}
result[idx] = value;
}
}
- SPMD programing abstraction: call to ISPC function spawns “gang” of ISPC “program instances”, all instances run ISPC code concurrently, upon return, all instances have complete
- ISPC compiler generates SIMD implementation: number of instances in a gang is the SIMD width of the hardware, ISPC compiler generates binary with SIMD instructions, C++ code links against object files
上层抽象和具体实现没关系,我内心复杂和Parallel也没关系∠( ᐛ 」∠)_
感觉Abstracion明确了程序功能的实现,具体如何使其实现的硬件Implementation可以有很多种方法呢
Two ways program instances loop iterations:
-
Interleaved assignment (交错?
For all program instances, the four values are contiguous in memory
-
Blocked assignment (列块?
Touches four non-contiguous values in memory.
Need “gather” instruction to implement.
Interleaved | Blocked |
---|---|
Interleaved分配方式在instance中连续的数据存在连续的内存中。
Three parallel programing model
Shared address space
共享地址空间模型(抽象.jpg), 所有的线程可以读写共享变量
- Threads communicate by reading/writing to shared variables
- Threads manipulate synchronization primitives: locks, semaphors, etc
- Logical extension of uniprocessor programming
Impementation: e.g. Non-uniform memory access(NUMA)
(所有processor可以存取任何地址,成本较高
Message passing
- Threads operate own private address spaces
- Threads communicate by sending/receiving message (想写go语言了_(:з」∠)_
Data parallel
- Map a function onto a large colleciton of data
- Often takes form of SPMD programming
- Related to Stream programing model
- Gather/Scatter communication primitive
实际应用常常3种模式混用。Typora的bug似乎有点多
等我再打开这个文件都快过了2个月了,我在干嘛呀orz 这篇不知道在干嘛就这样吧反正连我自己都不会再看了emmmmmm工作了这么久果然还是比上学时代轻松多了虽然自己菜。实在没想到现在从事的东西和我最耗尽心力的dspp一点关系都没得。想想其实自己也没有讨厌ml,也没有多热爱system。果然还是…呃,果然还是失礼了。要是我不菜也不蠢就好了。学习了学习了,学习使我快乐
Update10/10 到底在写什么鬼玩意
Speedup and Dependency
For a fix computation: Speedup(P processors) = Time(1 processor)/Time(P processors)
Amdahl’s law: dependencies limit maximum speedup due to parallelism
aka S = inherently sequential fraction of execution, maximum speed up <=
For p processors, max speed up <=
Decomposition
- breakup to tasks can be carried out in parallel
- identifying dependencies
- programmer responsible
Assignment
- balance workload, reduce communication costs
- statically or dynamically during execution
- languages/runtimes responsible
Orchestration
- reduce costs of communication/sync, preserve locality, reduce overhead
Mapping to hardware
- mapping threads to hardware execution units