Introduction to ZooKeeper

Why study ZooKeeper

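High performance
- Provides an asynchronous interface
- Does not provide fully linearizable reads and writes: reads can be served by replicas, which improves performance in read-heavy workloads
Widely applicable
- As a distributed coordination service, ZooKeeper mainly addresses the common consistency problems that applications face in a distributed cluster
- It is like a "Swiss Army knife": it provides many basic operations, and what it can do depends largely on how the user combines them.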

Performance

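Writes
- zk is also built on a replicated state machine, and all writes go through the zk leader
- client write -> zkServer -> ZAB layer (similar to Raft: the log entry for each write must be committed before the leader replies)
  - the ZAB layer provides fault tolerance and makes writes linearizable
Reads
- To improve read performance, replicas must answer client reads, but this violates linearizability, because:
  - Replica may not be in majority, so may not have seen a completed write.
  - Replica may not yet have seen a commit for a completed write.
  - Replica may be entirely cut off from the leader (same as above).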
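
The stale-read cases above can be illustrated with a toy model. This is only a sketch: `ToyReplica`, `catch_up`, and `write` are hypothetical names for an in-memory simulation, not part of ZooKeeper or any client library.

```python
class ToyReplica:
    """Toy replica that applies committed log entries in order (not real ZooKeeper)."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def catch_up(self, log):
        # Apply every log entry this replica has not applied yet.
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)


log = []                      # the committed write log, in commit order
leader, follower = ToyReplica(), ToyReplica()


def write(key, value):
    # In this toy, a write is "complete" once the leader has applied it.
    log.append((key, value))
    leader.catch_up(log)


write("x", 1)
follower.catch_up(log)        # follower has seen the first write
write("x", 2)                 # completed, but the follower has not seen the commit

stale = follower.data["x"]    # 1: a read served by the lagging follower
fresh = leader.data["x"]      # 2: the value a linearizable read must return
```

The follower returns the older value until it catches up, which is exactly the behavior a linearizable read would forbid.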

Consistency guarantees provided by zk

Linearizable writes

Client writes to zk go through the leader, which guarantees that writes are linearizable.

FIFO client order

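Operations from a given client are executed in the order in which that client sent them.
Writes
1. Each client's writes are applied by zk in the order the client issued them
2. Atomicity of a multi-step update is achieved with a "ready file"
  1. The rough idea: before using some data, check whether the corresponding "ready file" marker exists, and use the data only if it does
  2. When modifying the data, first delete the "ready file" marker, then create the "ready file" again once the modification is complete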
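
The "ready file" idea above can be sketched with a dictionary standing in for the znode tree. The paths `/cfg/ready`, `/cfg/a`, `/cfg/b` and the helper names are hypothetical, and the toy reader checks the marker and reads the data in one step, whereas a real client would need a watch on the marker to notice a deletion between its check and its reads.

```python
# Toy znode tree: path -> data. Initially the config is complete and marked ready.
tree = {"/cfg/ready": b"", "/cfg/a": b"old-a", "/cfg/b": b"old-b"}


def writer_ops():
    """Ops the writer submits; FIFO client order means they are applied in this order."""
    return [
        ("delete", "/cfg/ready", None),   # 1. remove the marker first
        ("set", "/cfg/a", b"new-a"),      # 2. update the data
        ("set", "/cfg/b", b"new-b"),
        ("create", "/cfg/ready", b""),    # 3. re-create the marker last
    ]


def apply(op):
    kind, path, data = op
    if kind == "delete":
        tree.pop(path)
    else:  # "set" and "create" both store data in this toy
        tree[path] = data


def read_config():
    """Return (a, b) only if the ready marker exists, else None."""
    if "/cfg/ready" not in tree:
        return None
    return tree["/cfg/a"], tree["/cfg/b"]


# Let a reader look at the tree before, between, and after the writer's ops.
observed = [read_config()]
for op in writer_ops():
    apply(op)
    observed.append(read_config())
```

In this run the reader sees either the complete old configuration, nothing, or the complete new configuration, never a mix of `new-a` with `old-b`.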

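Reads
1. Each client's reads are ordered consistently with that client's own sequence of reads and writes
2. No "backward reads": the client records the highest zxid of the data it has seen, and later reads never return an older version
3. A client's read waits until all of that client's earlier writes have completed
4. zk also provides a sync operation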
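
The "no backward reads" rule can be sketched as follows. `ToyReplica` and `ToyClient` are hypothetical names, and raising an error when the replica is behind is an assumption of this sketch; it only shows that the client's highest seen zxid is what prevents an older read.

```python
class ToyReplica:
    """Toy replica holding one value and the zxid of its latest applied write."""

    def __init__(self, zxid, value):
        self.zxid = zxid
        self.value = value


class ToyClient:
    """Toy client that remembers the highest zxid it has observed."""

    def __init__(self):
        self.last_zxid = 0

    def read(self, replica):
        # Refuse to read from a replica that is behind what this client has seen.
        if replica.zxid < self.last_zxid:
            raise RuntimeError("replica is behind the client's last seen zxid")
        self.last_zxid = replica.zxid
        return replica.value


up_to_date = ToyReplica(zxid=5, value="v5")
lagging = ToyReplica(zxid=3, value="v3")

client = ToyClient()
first = client.read(up_to_date)   # client has now seen zxid 5

try:
    client.read(lagging)          # would return older data
    went_backward = True
except RuntimeError:
    went_backward = False
```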

What the figure shows

With a write-only workload, more servers means fewer ops.
As the value on the horizontal axis grows, more servers means more ops.

Other ways zk improves performance

Clients can send async writes to leader (async = don’t have to wait).

Leader batches up many requests to reduce net and disk-write overhead.

Assumes lots of active clients.
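
As a rough illustration of why batching helps, the sketch below groups pending requests so that one disk flush (or one network message) covers several of them. `make_batches` and the batch size are hypothetical, and nothing here reflects how the zk leader actually chooses batch boundaries.

```python
def make_batches(requests, batch_size):
    """Group requests into lists of at most batch_size, preserving order."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


requests = [f"write-{i}" for i in range(10)]

unbatched = make_batches(requests, 1)   # one flush per request: 10 flushes
batched = make_batches(requests, 4)     # up to 4 requests per flush: 3 flushes
```

The gain only appears when many requests are pending at once, which is why the notes assume lots of active clients.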

How zk serves as a general-purpose distributed coordination framework

Structure of zk [figure]

the state: a file-system-like tree of znodes
- file names, file content, directories, path names
- typical use: configuration info in znodes
  - set of machines that participate in the application
  - which machine is the primary
- each znode has a version number
- types of znodes:
  - regular
  - ephemeral
  - sequential: name + seqno
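
The znode types above can be sketched with a small in-memory tree. `ToyTree`, its method names, and the paths are hypothetical. Two simplifications are assumed: the toy uses a single counter for sequential names (ZooKeeper keeps one per parent znode), and a session "ends" only when `close_session` is called explicitly.

```python
class ToyTree:
    """Toy znode tree supporting regular, ephemeral, and sequential creates."""

    def __init__(self):
        self.nodes = {}      # path -> (data, owning session or None)
        self.next_seq = 0    # simplification: one counter for the whole tree

    def create(self, path, data, session=None, ephemeral=False, sequential=False):
        if sequential:
            # sequential: name + seqno (zero-padded so names sort in creation order)
            path = f"{path}{self.next_seq:010d}"
            self.next_seq += 1
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = (data, session if ephemeral else None)
        return path

    def close_session(self, session):
        # Ephemeral znodes disappear when the session that created them ends.
        owned = [p for p, (_, owner) in self.nodes.items()
                 if owner is not None and owner == session]
        for path in owned:
            del self.nodes[path]


tree = ToyTree()
m1 = tree.create("/app/member-", b"", session="s1", ephemeral=True, sequential=True)
m2 = tree.create("/app/member-", b"", session="s2", ephemeral=True, sequential=True)
tree.create("/app/config", b"x")   # regular znode, survives session loss

tree.close_session("s1")           # only s1's ephemeral znode is removed
remaining = sorted(tree.nodes)
```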

The zk API

- create(path, data, flags)
  - exclusive: only the first create indicates success
- delete(path, version)
  - if znode.version = version, then delete
- exists(path, watch)
  - watch=true means also send notification if path is later created/deleted
- getData(path, watch)
- setData(path, data, version)
  - if znode.version = version, then update
- getChildren(path, watch)
- sync()
  - sync then read ensures writes before sync are visible to same client's read
  - client could instead submit a write
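
The exclusive create and the watch flag on exists can be sketched as below. `ToyZk` is a hypothetical in-memory stand-in: the API above takes a boolean watch and delivers the notification to the client, while this toy takes a callback directly, and it assumes a watch fires once and is then removed.

```python
class ToyZk:
    """Toy model of exclusive create plus one-shot watches on a path."""

    def __init__(self):
        self.nodes = {}
        self.watches = {}   # path -> list of callbacks waiting on that path

    def exists(self, path, watch=None):
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return path in self.nodes

    def _fire(self, path, event):
        # One-shot: registered callbacks are removed as they are notified.
        for callback in self.watches.pop(path, []):
            callback(event, path)

    def create(self, path, data):
        if path in self.nodes:
            return False            # exclusive: only the first create succeeds
        self.nodes[path] = data
        self._fire(path, "created")
        return True

    def delete(self, path):
        if path not in self.nodes:
            return False
        del self.nodes[path]
        self._fire(path, "deleted")
        return True


events = []
zk = ToyZk()

present = zk.exists("/lock", watch=lambda ev, p: events.append((ev, p)))
first_create = zk.create("/lock", b"owner-1")    # succeeds and fires the watch
second_create = zk.create("/lock", b"owner-2")   # fails: znode already exists
zk.delete("/lock")                               # no watch left, so no new event
```

Because the watch is consumed by the first event, a client that wants to keep following the path has to call exists again with a new watch.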

Properties of the API

ZooKeeper API well tuned to synchronization:

+ exclusive file creation; exactly one concurrent create returns success

+ getData()/setData(x, version) supports mini-transactions

+ sessions automate actions when clients fail (e.g. release lock on failure)

+ sequential files create order among multiple clients

+ watches – avoid polling

Some examples

Example: add one to a number stored in a ZooKeeper znode

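- what if the read returns stale data?
  - write will write the wrong value!
- what if another client concurrently updates?
  - will one of the increments be lost?
- setData succeeds only if the znode's version is still v, so a stale read or a concurrent update makes it fail and the loop retries:

    while true:
        x, v := getData("f")
        if setData("f", x + 1, version=v):
            break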
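
A runnable sketch of the same retry loop, using an in-memory stand-in for a single znode. `ToyZnode`, `get_data`, and `set_data` are hypothetical names rather than a real client API, and a local lock stands in for the ordering that the zk leader would provide.

```python
import threading


class ToyZnode:
    """Toy znode with a value and a version; set_data is conditional on the version."""

    def __init__(self, value):
        self._lock = threading.Lock()   # stands in for the leader ordering writes
        self._value = value
        self._version = 0

    def get_data(self):
        with self._lock:
            return self._value, self._version

    def set_data(self, value, version):
        with self._lock:
            if version != self._version:
                return False            # someone else wrote first: reject
            self._value = value
            self._version += 1
            return True


def increment(node):
    # Retry until the conditional write succeeds.
    while True:
        x, v = node.get_data()
        if node.set_data(x + 1, version=v):
            return


def worker(node, times):
    for _ in range(times):
        increment(node)


node = ToyZnode(0)
threads = [threading.Thread(target=worker, args=(node, 100)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_value, final_version = node.get_data()   # no increment is lost
```

Each successful write bumps the version exactly once, so four workers doing 100 increments each leave both the value and the version at 400.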

Example: Locks without Herd Effect

(look at pseudo-code in paper, Section 2.4, page 6)

1. create a “sequential” file

2. list files

3. if no lower-numbered, lock is acquired!

4. if exists(next-lower-numbered, watch=true)

5. wait for event…

6. goto 2
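
Steps 3 and 4 of the lock recipe above come down to choosing which znode to watch. The sketch below shows only that choice; `node_to_watch` and the `lock-` names are hypothetical, and it assumes sequence numbers are zero-padded to a fixed width so that string order matches numeric order.

```python
def node_to_watch(my_node, children):
    """Return None if my_node holds the lock, else the next-lower-numbered node."""
    lower = sorted(c for c in children if c < my_node)
    return lower[-1] if lower else None


children = ["lock-0000000002", "lock-0000000000", "lock-0000000001"]

# Each waiter watches only its immediate predecessor.
watched = {c: node_to_watch(c, children) for c in children}

# If the middle client fails, its znode goes away and the last waiter
# re-lists (step 6, goto 2) and now watches the current holder instead.
remaining = [c for c in children if c != "lock-0000000001"]
after_failure = node_to_watch("lock-0000000002", remaining)
```

Since every znode is watched by at most one waiter, a release wakes a single client rather than all of them, which is what avoids the herd effect.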

How zk is used in Kafka
https://time.geekbang.org/column/article/137655?cid=100032301
