DS

Distributed System IX

15640 DS9 - Hashing & P2P/Blockchains

Posted by freeCookie🍪 on December 24, 2017

Distributed System IX

还在去往丹佛的飞机上瞎写,越写越差。日常怀疑自己是垃圾。

Hashing

…六天后 假期回家摸鱼🐟 还以为在飞机和高铁上能都写完呢 可我太困了还把围巾丢了 哭泣ಥ_ಥ

当前主要两种Hashing比较流行,consistent hashing和content-based hashing.

Conventional Hash

bucket = hash(item)/num_buckets

convenitonal hashing可以将item平均分配在server上,但是当item/server的进行改变时会导致大量的迁移。

减少一个server,需要迁移1/2的item。

Consistent Hash

Desired Features:

  • Balanced - in any one view, load is equal across buckets
  • Smoothness – little impact on hash bucket contents when buckets are added/removed
  • Spread – small set of hash buckets that may hold an object regardless of views
  • Load – across all views # of objects assigned to hash bucket is small

Main idea:

  • map both keys and nodes to the same (metric) identifier space
  • find a “rule” how to assign keys to nodes

Consistent hashing利用SHA-1分别哈希key和node,将key存在哈希值不小于其哈希的下一个node里。为了保证每个node的存储key数目一致,利用virtual node将所有的node分布在ring上,保证在crash时当前node的key不会全部迁移到下一个node上面。

Properties:

  • Ring-based construction using hash of key and node

  • Smoothness: addition of bucket does not cause much movement

    between existing buckets

  • Load: for N nodes and K keys, with high probability, each node holds at most (1+ε)K/N keys

  • Spread: small set of buckets that lie near object

  • Balance: no bucket is responsible for large number of objects

Hashing for naming

完全没有印象,难道是因为欣赏Daniel的美颜而走神了吗?

大概讲Hash的应用。一个是在Storage System中,把内容hash可以避免存储相同的数据,啊dropbox的方法喽。然后利用hash来命名,同样可以避免存储一个item两次。利用hash也可以检验从位置源下载的文件。

Desirable properities of Hashes

  • Compression: Maps a variable-length length input to a fixed-length output
  • Ease of computation: A relative metric…
  • Pre-image resistance: For all outputs, computationally infeasible to find input that produces output.
  • 2nd pre-image resistance: For all inputs, computationally infeasible to find second input that produces same output as a given input.
  • collision resistance: For all outputs, computationally infeasible to find two inputs that produce the same output.

P2P System

P2P, 上课走神OH找帅助教Felipe讨论了一波终于理解了。之前课上讨论的分布式系统大多是由一个master server和许多participants组成的模式,利用master和participants的通信来完成数据的交换和保证一致性。但是当系统的规模比较大时master会成为系统的bottleneck影响系统整体的性能。在这种情况下引入了P2P系统的概念,对于P2P系统,所有的peers被视为平等的,能够和neighbours进行通信。

P2P Networks

  • Typically each member stores/provides access to content
  • Basically a replication system for files
    • tradeoff between possible location of files and searching difficulty
  • Challenge 1: searching / lookup (files can be anywhere)
  • Challenge 2: fairness (everyone gets rightful share and everyone contributes)
    • Contribution to P2P system must be incentivized.
  • Challenge 3: trust (downloading files from unknown sources)

Efficient lookups

  Centralized Decentralized BitTorrent DHT
Example Napster Gnutella BitTorrent use Chord
Join on startup, clinet contacts central server on startup, client contacts a few other nodes, becomes neighbours contact centralized “tracker” server, get a list of peers on startup, contact a “bootstrap” node and get a node id
Publish reports list of files to central server no need run a tracker server route publication for file id toward a close node id
Search query the server, O(1) ask neighbors continuouslly until found and reply to sender, O(n) out-of-band (e.g, use Google), O(???) route a query for file id toward a close node id, O(lgN)
Fetch get the file directly from peer get the file directly from peer download chunks of the file from your peers, upload chunks you have to them. fetch from where query stops/use IP routing to get X
Pros simple, search O(1), cotrollable decentralized, search cost distributed works reasonably well in practice, efficient out of band search mechanism decentralized, guaranteed lookup, O(lgN) per node state and search scope
Cons server maintain O(N) state and does all processing, single point of failu search scope O(n), nodes leave often network unstable, does not search every node causing re-issue central tracker server needed to bootstrap swarm, tracker needs to maintain peer list for every file supporting non-exact match search is hard

DHT uses consistent hashing. Chord的查找重点。Consistent hashing保证了有M个nodeN个key情况下当有key发生改变时,需要移动的key大约N/M个,而不是convenient hashing的N/2个。但是对于P2P,查找仍需要迭代地对邻居进行查找,O(N)时间复杂度。为了优化查找的时间复杂度,就在每个node存finger table,就每次需要优化时间都cache吧。每个table所占用空间O(lgN),查找的时间复杂度变为O(lgN)。

每个node的finger table具有m个entry,这里假设2^m = N. 对每个entry number i计算id + 2^i,successor是这个结果应该存储的node number。这样每个finger table的size = lgN,查找时首先在本地进行查找,否则到finger table的successor表中第一个>=key的node进行查找,这样查找时间O(lgN)。

Fairness

如何保证每个peers都会上传文件并且保证文件时正确的。

  • Freeloader problem
    • All peers must contribute
    • How to prevent peers leeching resources
  • Efficiency
    • Single peer is bottleneck => similar to bottleneck in a centralized approach
    • Load balancing across peers

BitTorrent sharing strategy

  • Employ “Tit-for-tat” sharing strategy
    • A is downloading from some other people
      • A will let the fastest N of those download from him
    • Be optimistic: occasionally let freeloaders download
  • Distribute chunks across peers
    • When a new file is seeded, peers download random chunks
    • Results in load balancing as different peers have different subset of chunks

Trust

BitTorrent:

  • Check that the file is correct
  • Check that the chunk downloaded from a peer is valid

Blockchains

Motivation: decentralized transactions

Key problem: double spending -> someone track all transactions

Solution: Ledger: every transactions made

Blockchain consensus:

Solution: Blockchain Consensus

  • broadcast new transactions
  • each member collects transactions into a block
  • each member seeks proof-of-work for its block
    • proof-of-work (PoW): solve a computationally hard problem
  • member who finds PoW broadcasts block+PoW
  • other member check block, seek next PoW
  • consensus over time

piazza17F Disributed System已经没有了 本来打算把Daniel的科普贴上的ಥ_ಥ… 18天后: 找到了…

Blockchain and BitCoin Q&A

 

1)  When is a transaction committed?

 

Blockchain miners/peers can include transactions in any order they chose (or even exclude transactions). Each transaction includes a transaction fee it is willing to pay. So in practice, miners includes those with high transaction fees with higher priority.

A transaction is committed only once it is included in a block and there are X successive blocks also included in the blockchain.

 

If a miner receives two valid blocks, A and B, with the same parent, it chooses one of the two blocks and continues mining the next block. As soon as there is a success for A, the miner will switch over working on this branch because it is the longest one. Being the longest, the A branch has the highest likelihood of continuing to be part of the blockchain.

 

As a consequence, BitCoin clients keep both tips of a fork until one exceeds the other by X.

 

(A typical value of X is 4.)

2) Why do we need to adjust the mining difficulty?

 

If too many blocks are successfully mined in a short period of time, then there might be many collisions. So, the network tries to keep the number of blocks mined per time interval constant (BitCoing: 2016 blocks per two weeks).

3) How do we adjust the mining difficulty?

 

By changing the difficulty K of the hashing puzzle: picking a nonce such that the hash value has K leading zeros (or, equivalently, the hash value as an integer is less than 2^(256-K)).

 

The nonce is a 32-bit field. By picking a different nonce, the hash changes. Mining means randomly trying nonces until a matching hash has been found (due to the cryptographic properties of SHA, even sequentially trying nonces is basically random).

4) If a new transaction arrives at a miner, does the miner try to include the transaction into the block currently being mined?

 

BitCoin has a block limit of 1MB. So, if the block is full (and the transaction has a low fee) the new transaction will be ignored for the moment.

 

If the block is not full yet, the miner probably includes the transaction (to get the fee). Due to the mining (search for nonces) being purely random, this does not lead to lost effort.

 

5) Is there a guarantee that trying out all nonces actually solves the puzzle?

 

No, there is no such guarantee. It is in the interest of the miner to add randomness (e.g., add or switch out transactions).

6) Why is there a limit on the total supply of BitCoins?

 

For purely economic reasons. Anything of which there is an infinite amount has no value.

However, BitCoins are infinitely divisible.

7) How is the total BitCoin supply controlled?

 

The number of bitcoins generated per block decreases geometrically. Starting with 50 BitCoins per block initially, the reward for mining is reduced to 50% every 210,000 blocks.

 

8) Why should the miners continue mining if the reward becomes very small?

 

As people want to use BitCoin to make transactions, they will incentivize miners by raising their transaction fees (which go to the miner who solved the block included that transaction).