6.824 Lab 1: MapReduce
Document created: March 6, 14:43
All tests passing: March 13, 13:08:42
Starting material: a simple single-process, sequential MapReduce implementation
src/main/mrsequential.go
Two MapReduce applications:
mrapps/wc.go — a word counter (WordCounter)
mrapps/indexer.go — a text indexer
How to run the word counter:
$ cd ~/6.824
$ cd src/main
$ go build -race -buildmode=plugin ../mrapps/wc.go   # compile wc into a .so plugin
$ rm mr-out*
$ go run -race mrsequential.go wc.so pg*.txt         # run the sequential mr, passing in the wc.so app and the pg*.txt inputs
$ more mr-out-0
A 509
about 2
ACT 8
...

Job
Implement two programs, the coordinator and the worker: a master that hands out tasks, and workers that complete them.
The lab runs one coordinator process and several worker processes on a single machine. The workers will talk to the coordinator via RPC. Each worker asks the coordinator for a task. If a worker fails to finish its task within a time limit, the coordinator reassigns that task to another worker.
The main flow lives in main/mrcoordinator.go and main/mrworker.go; do not modify these two files.
Modify mr/coordinator.go, mr/worker.go, and mr/rpc.go.
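As a sketch of what the worker-to-coordinator RPC exchange might look like: all type, field, and method names below are hypothetical (not from the lab skeleton), and the real lab dials a Unix socket rather than TCP. It also follows the rule, stated in the hints later, of leaving the reply struct zero-valued before the call.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Hypothetical RPC types of the kind you would declare in mr/rpc.go.
// Field names must be exported (capitalized) for net/rpc's gob encoding.
type TaskArgs struct {
	WorkerID int
}
type TaskReply struct {
	TaskType string
	FileName string
}

type Coordinator struct{}

// GetTask is the coordinator-side handler a worker calls to request work.
func (c *Coordinator) GetTask(args *TaskArgs, reply *TaskReply) error {
	reply.TaskType = "map"
	reply.FileName = "pg-grimm.txt"
	return nil
}

func main() {
	srv := rpc.NewServer()
	srv.Register(&Coordinator{})
	l, err := net.Listen("tcp", "127.0.0.1:0") // the lab uses a Unix socket instead
	if err != nil {
		panic(err)
	}
	go srv.Accept(l)

	client, err := rpc.Dial("tcp", l.Addr().String())
	if err != nil {
		panic(err)
	}
	reply := TaskReply{} // leave the reply zero-valued before the call
	if err := client.Call("Coordinator.GetTask", &TaskArgs{WorkerID: 1}, &reply); err != nil {
		panic(err)
	}
	fmt.Println(reply.TaskType, reply.FileName)
}
```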
How to run your own changes
- Compile wc.go into a .so, then start the coordinator with all the input files as arguments, ready to schedule:
$ rm mr-out*
$ go run -race mrcoordinator.go pg-*.txt
- Start a worker:
$ go run -race mrworker.go wc.so

Judge
main/test-mr.sh
- Checks: correctness of the wc and indexer output, workers completing tasks in parallel, and fault tolerance.
- The map phase should partition its intermediate output into nReduce buckets; each mapper creates nReduce intermediate files for the reduce phase to consume.
- The worker handling the x-th reduce task writes its result to mr-out-x. Each line of mr-out-x is one line of reduce output; follow the output format used in main/mrsequential.go.
- The three files named above may be modified; everything else is restored to the original version during testing.
- Map output (the intermediate files) is stored in the current directory so the reduce phase can read it.
- main/mrcoordinator.go expects mr/coordinator.go to implement a Done() method that returns true once the job is finished, at which point main/mrcoordinator.go exits.
- After all tasks complete, the worker processes should also stop. One approach: when a worker can no longer reach the coordinator, it concludes there are no more tasks and exits. Alternatively, the coordinator can hand out a special "exit" task.
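A sketch of the map-side partitioning step, using the ihash helper the lab skeleton suggests for picking a reduce bucket. The mr-X-r intermediate-file naming mentioned in the comment is one common convention, not mandated by the lab.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// KeyValue matches the pair type used by mrsequential.go.
type KeyValue struct {
	Key   string
	Value string
}

// ihash is the hash the lab skeleton provides for choosing a reduce bucket.
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

// partition splits one map task's output into nReduce buckets; bucket r
// would then be written to an intermediate file such as mr-X-r for map task X.
func partition(kvs []KeyValue, nReduce int) [][]KeyValue {
	buckets := make([][]KeyValue, nReduce)
	for _, kv := range kvs {
		r := ihash(kv.Key) % nReduce
		buckets[r] = append(buckets[r], kv)
	}
	return buckets
}

func main() {
	kvs := []KeyValue{{"a", "1"}, {"b", "1"}, {"a", "1"}}
	buckets := partition(kvs, 4)
	for r, b := range buckets {
		for _, kv := range b {
			fmt.Printf("bucket %d: %s %s\n", r, kv.Key, kv.Value)
		}
	}
}
```

Because equal keys always hash to the same bucket, every occurrence of a key ends up in the same intermediate file, which is what lets each reduce task see all values for its keys.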
Hints — every one of them is useful; these are the most useful:
Use Go's race detector, with go build -race and go run -race, to catch data races. test-mr.sh runs the tests with the race detector by default.
Workers will sometimes need to wait. How should a worker wait?
e.g. reduces can’t start until the last map has finished.
One possibility is for workers to periodically ask the coordinator for work, sleeping with time.Sleep() between each request. (Option 1: the worker sleeps and polls.)
Another possibility is for the relevant RPC handler in the coordinator to have a loop that waits, either with time.Sleep() or sync.Cond. Go runs the handler for each RPC in its own thread, so the fact that one handler is waiting won't prevent the coordinator from processing other RPCs. (Option 2: the handler's own thread blocks inside the coordinator until work is ready.)
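Option 2 can be sketched like this, assuming a hypothetical pending-task queue inside the coordinator (all names are illustrative, not the lab skeleton's):

```go
package main

import (
	"fmt"
	"sync"
)

// Coordinator holds a queue of pending task names, protected by mu;
// cond lets blocked handlers sleep until the queue is non-empty.
type Coordinator struct {
	mu    sync.Mutex
	cond  *sync.Cond
	queue []string
}

func NewCoordinator() *Coordinator {
	c := &Coordinator{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// GetTask stands in for an RPC handler; because Go runs each handler in
// its own goroutine, blocking here does not stall other RPCs.
func (c *Coordinator) GetTask() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	for len(c.queue) == 0 {
		c.cond.Wait() // releases mu while asleep, reacquires on wake
	}
	t := c.queue[0]
	c.queue = c.queue[1:]
	return t
}

// AddTask makes a task available and wakes any waiting handlers.
func (c *Coordinator) AddTask(name string) {
	c.mu.Lock()
	c.queue = append(c.queue, name)
	c.mu.Unlock()
	c.cond.Broadcast()
}

func main() {
	c := NewCoordinator()
	done := make(chan string)
	go func() { done <- c.GetTask() }() // a "worker" blocks waiting for work
	c.AddTask("reduce-0")
	fmt.Println("handed out:", <-done)
}
```

The for-loop around cond.Wait() matters: a woken goroutine must re-check the condition, since another handler may have taken the task first.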
The coordinator can't reliably distinguish between crashed workers, workers that are alive but have stalled for some reason, and workers that are executing but too slowly to be useful. The best you can do is have the coordinator wait for some amount of time, and then give up and re-issue the task to a different worker. For this lab, have the coordinator wait for ten seconds; after that the coordinator should assume the worker has died (of course, it might not have). In short: the coordinator can't tell crash from stall from slow, so a 10-second timeout is the pragmatic answer.
To ensure that nobody observes partially written files in the presence of crashes, the MapReduce paper mentions the trick of using a temporary file and atomically renaming it once it is completely written. You can use ioutil.TempFile to create a temporary file and os.Rename to atomically rename it.
When passing a pointer to a reply struct to the RPC system, the object that *reply points to should be zero-allocated. The code for RPC calls should always look like
reply := SomeType{}
call(..., &reply)
without setting any fields of reply before the call. If you don’t follow this requirement, there will be a problem when you pre-initialize a reply field to the non-default value for that datatype, and the server on which the RPC executes sets that reply field to the default value; you will observe that the write doesn’t appear to take effect, and that on the caller side, the non-default value remains.
My hints
- RPC struct field names must be capitalized (exported); otherwise some fields silently fail to come through.
- "dialing: dial unix /var/tmp/824-mr-501: connect: connection refused" means the coordinator has exited, which is why the worker's call fails.
- Reduce must run strictly after map.
- Files produced by a faulty worker can't be trusted, which is why the intermediate-file (temp file plus rename) mechanism is needed.
- There is a lot of shared state, so it needs locking; I used a read-write lock (sync.RWMutex), which should be marginally better than locking everything with a plain mutex.



