DS

Distributed System VIII

15640 DS8 - GFS/HDFS/Spanner

Posted by freeCookie🍪 on December 24, 2017

Distributed System VIII

现在PIT飞往DEN的✈️。Daniel🎄要飞回🇩🇪,觉得偶遇机会为0,果然机场没遇到Daniel好可惜啊。这一篇主要关于谷歌和Hadoop的文件系统案例学习,GFS/HDFS。以及谷歌Database,Spanner。从来不晕机的我好像有点晕机,糟糕(;´༎ຶД༎ຶ`)

GFS/HDFS

Google file system和Hadoop file system的结构很相似,都是将文件分成blocks分别存在chunk结构中,根据master中存储的metadata来进行访问。

GFS

GFS is a distributed fault-tolerant file system.

Assumption: large files/large sequential writes and append/large streaming reads/concurrent appends by multiple clients

GFS的重点在于理解GFS的结构和读写操作模型,其他的倒也没什么…

GFS Architecture

  • One master server
  • Many chunk servers: 64MB portion of file, identified by global ids
  • Many clients accessing same/different files stored on same cluster

GFS的master server存储所有的metadata,负责与Client通信并且保证系统的一致性,同时通过与Chunk servers通信来迁移数据。Master server将所有的metadata存在RAM中,因此具有很快的速度。Chunk server对于data没有整体性的认知,通过与master sever通信来传递/更新数据。Chunk server不缓存任何数据,数据存在硬盘的block中,定期向master server发送heartbeat。Client缓存metadata,不缓存任何数据,通过与master server进行W/R操作访问数据。

GFS Client Read
  • Client sends master: read(file name, chunk index)
  • Master’s reply: chunk ID, chunk version number, locations of replicas
  • Client sends “closest” chunkserver with replica:
    • read (chunk ID, byte range)
    • “Closest” determined by IP address on simple rack based network topology
  • Chunkserver replies with data
GFS Client Write
  • All replicas acknowledge data write to client
  • Client sends write request to primary (commit phase)
  • Primary assigns serial number to write request, providing ordering
  • Primary forwards write request with same serial number to secondary replicas
  • Secondary replicas all reply to primary after completing writes in the same order
  • Primary replies to client

GFS的写操作,数据从Client流向最近的replica,直到所有的replica都得到了更新的数据。Replica得到了更新的数据后会与primary通信。Primary知道所有的replica都到了更新会通知Client。

GFS Consistency Model
  • Changes to data are primary ordered as chosen by a primary
  • Record append completes at least once, at offset of GFS’s choosing
  • Failures can cause inconsistency
GFS Limitations
  • Master biggest impediment to scaling
    • Performance bottleneck
    • Holds all data structures in memory
    • Takes long time to rebuild metadata
    • Must vulnerable point for reliability
    • Solution:
      • Have systems with multiple master nodes, all sharing set of chunk servers.
      • Not a uniform name space.
  • Large chunk size
    • Can’t afford to make smaller, since this would create more work for master.
GFS Summary
  • Success: used actively by Google to support search service and other applications
  • Semantics not transparent to apps
    • Must verify file contents to avoid inconsistent regions, repeated appends (at least once)
  • Performance not good for all apps
    • Assumes read once, write once workload (no client caching!)

HDFS

Hadoop file system, 和GFS非常相似。将master server换成name node,chunk server换做data note,chunk大小不是64MB而是128MB。结构几乎一致。没有什么重点的样子。

GFS Vs HDFS

### Spanner

…四天后 事实证明我的话假期果然很难进行学术活动_(:з」∠)_

Spanner,谷歌的分布式数据库,然而谷歌不给我面试,不想写Spanner,气愤。

数据库嘛,SQL语言,ACID什么的。对于分布式的事务具有外部一致性的特征,利用2PC实现。Fault-tolerance和replica基于Paxos实现。Spanner的中心思想是对于每个Transaction保存一个时间段(作为Timestamp),根据两个时间段是否重叠来判断两个事务的发生次序, Timetamp的顺序就是commit的顺序。我已经全忘了

Spanner Concurrency Control

Key aspect of differentiating Spanner – using globally meaningful timestamps for distributed transactions in achieving external consistency

对于每个Transaction,当天时间分别+-时间段ε(好像一般是传输时间来着),构成时间段。

保证外部一致性也就是,如果T1事务的结束比T2事务的开始早,可以认为T1的timestamp小于T2.

参考资料:

Google Filesystem

GFS followup interview

Spanner paper

Consistent hashing paper