RaftでReadをどう考えればよいか

引き続きRaft. Rustは書いていない. Rustを書く時間はすべてオーバーウォッチに捧げてしまった.

の6.4を説明する. Raftでreadを行う時はどう考えればいいか.

Read-only client commands only query the replicated state machine; they do not change it. Thus, it is natural to ask whether these queries can bypass the Raft log, whose purpose is to replicate changes to the servers’ state machines in the same order. Bypassing the log offers an attractive performance advantage: read-only queries are common in many applications, and the synchronous disk writes needed to append entries to the log are time-consuming.

Raftでは, State Machine (以下SM)に適用した時にクライアントにackする. だから, read queryは直接SMを読めばいいんじゃないの？と思うかも知れないけど, それはほとんどの場合正しいけど完全ではない.

For example, a leader might be partitioned from the rest of the cluster, and the rest of the cluster might have elected a new leader and committed new entries to the Raft log. If the partitioned leader responded to a read-only query without consulting the other servers, it would return stale results, which are not linearizable.

仮にそのleaderがネットワーク分断されていたらどうか？実は他にleaderがいて, たくさん新しいログが追加されていて, もっと先までackしているかも知れない. だとすると, 自分がackしたところを基準にしてread requestに応答するのはstale readとなる可能性がある. (これはネットワーク分断にかぎらず, 自分がしばらくstallしていた場合にも起こりうるはず. 起きた瞬間にread requestを受け取った場合)

This approach is more efficient than committing read-only queries as new entries in the log, since it avoids synchronous disk writes.

This approach(以下で説明するefficientなもの)はnaiveなアプローチより効率が良いとしている. naiveなアプローチとは, read requestも更新系と同様にコマンドとしてログに格納して, 適用順序についての合意に乗せるというやり方. 効率が良い理由は, logへの同期ライトを省けるから (だから, これが微小であればこの主張は嘘になる. 例えば不揮発メモリやSSDを使えば？)

This approachの考え方:

自分のログのうちcommitしたところまではSMに適用してからackしよう (このindexをreadIndexと呼んでいる)
readIndexが全ノードの中で最新であることを保証しよう
全ノードで最新のreadIndexを適用したSMは最新だからstaleにならない！！！

そのために5つの手続きをする必要がある:

leaderになった瞬間にnoopコマンドをコミットする: こうすることでleaderのcommitIndexはそのtermにおける全ノード中最大であることを保証出来る (これはread requestの実装うんぬんに関わらずleader safety propertyの確保に必要だからどのみち省けない)
今のcommitIndexをreadIndexとして記録する (これはtempなのでメモリでおｋ)
ハートビートを飛ばして, majorityからackを得る: これは自分が本当にleaderであることを確かめるために行う
SMにreadIndexまで適用する
read requestに対してSMからのresponseを返す

To improve efficiency further, the leader can amortize the cost of confirming its leadership: it can use a single round of heartbeats for any number of read-only queries that it has accumulated.

明らかに重いのは, ハートビートを飛ばすところだ. しかしこれは最適化として, 複数のread requestをまとめて, それに対して1回のハートビートを飛ばすことでamortize出来ると言っている. しかし, おれの考えではこの考え方は都合が良すぎると思う. そして, ハートビートがamortize出来ないならば, このアプローチのどこがefficientなのだ？という話になってくる. あと, read requestが溜まるのを一定時間待つという自然な実装を考えると, readのレイテンシに影響する.

Followers could also help offload the processing of read-only queries. This would improve the system’s read throughput, and it would also divert load away from the leader, allowing the leader to process more read-write requests. However, these reads would also run the risk of returning stale data without additional precautions. For example, a partitioned follower might not receive any new log entries from the leader for long periods of time, or even if a follower received a heartbeat from a leader, that leader might itself be deposed and not yet know it. To serve reads safely, the follower could issue a request to the leader that just asked for a current readIndex (the leader would execute steps 1–3 above); the follower could then execute steps 4 and 5 on its own state machine for any number of accumulated read-only queries.

Raftではleaderがすべてのリクエストに応答することになっているが, writeはともかくreadはfollower(非leader)がoffloadした方が効率良いよねという考え方は出来る. しかしこれはもちろんstale readに繋がる. ここでは, leaderに3までを実行させて, readIndexを得た上で4,5はfollowerが実行するという解法が書かれている.

LogCabin implements the above algorithm on leaders, and it amortizes the cost of the heartbeats across multiple read-only queries under high load. Followers in LogCabin do not currently serve read-only requests.

しかし彼がRaftのアプリケーションとして作ったLogCabinではfollowerからのreadは実装していないとのこと. (つまりアイデアのみ. もちろん, うまく行きそうではあるけど)

まとめ

readを実装する時はstale readに注意しよう. その上で性能要件なども考えて, 実装方式を決定しよう.