- Introduction
- Merging
- Inspecting the files
In a Hadoop cluster, the NameNode manages the metadata. So where is that metadata stored?
If it lived only on disk, access would be far too slow; if it lived only in memory, it would not be safe.
So the metadata is kept in memory, with a backup file, fsimage, on disk. This guards against losing the metadata on a power failure.
Now, as the in-memory metadata grows, do we have to keep syncing fsimage? That again would be very inefficient. But the on-disk copy does have to be updated somehow, so instead every mutation is appended to a separate file, edits.
Loading fsimage and edits into memory together yields the complete metadata.
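The recovery idea above can be sketched as a toy model (my own illustration, not Hadoop code): start from the on-disk snapshot, then replay every logged operation in order.

```python
# Toy model: replaying an edit log over an fsimage snapshot
# reproduces the full in-memory namespace.
def load_namespace(fsimage, edits):
    """fsimage: dict snapshot of path -> type; edits: append-only op list."""
    namespace = dict(fsimage)  # start from the on-disk checkpoint
    for op, path in edits:     # replay every logged operation in order
        if op == "MKDIR":
            namespace[path] = "DIRECTORY"
        elif op == "CREATE":
            namespace[path] = "FILE"
        elif op == "DELETE":
            namespace.pop(path, None)
    return namespace

fsimage = {"/tmp": "DIRECTORY"}
edits = [("MKDIR", "/user"), ("CREATE", "/user/a.txt"), ("DELETE", "/tmp")]
print(load_namespace(fsimage, edits))
# {'/user': 'DIRECTORY', '/user/a.txt': 'FILE'}
```

The operation names here are made up for illustration; the real edit log uses opcodes such as OP_START_LOG_SEGMENT, as shown later.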
Once the edits log grows large, loading it into memory becomes slow, so fsimage and edits also need to be merged periodically. This merging work is best offloaded to someone else.
In an ordinary cluster, the SecondaryNameNode does it.
The merge can be triggered either after a fixed amount of time or once the number of operations reaches a threshold.
Since I am running Hadoop 3.1.3, I looked up the hdfs-default.xml for that version:
- `dfs.namenode.checkpoint.period` = `3600s` — The number of seconds between two periodic checkpoints. Support multiple time unit suffix (case insensitive), as described in `dfs.heartbeat.interval`.
- `dfs.namenode.checkpoint.txns` = `1000000` — The Secondary NameNode or CheckpointNode will create a checkpoint of the namespace every `dfs.namenode.checkpoint.txns` transactions, regardless of whether `dfs.namenode.checkpoint.period` has expired.
- `dfs.namenode.checkpoint.check.period` = `60s` — The SecondaryNameNode and CheckpointNode will poll the NameNode every `dfs.namenode.checkpoint.check.period` seconds to query the number of uncheckpointed transactions. Support multiple time unit suffix (case insensitive), as described in `dfs.heartbeat.interval`.
So by default a checkpoint is made every hour, or as soon as the operation count reaches 1,000,000; the operation count is polled once a minute.
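The two triggers can be sketched as follows (my own illustration of the polling logic, not Hadoop source code):

```python
# Checkpoint triggers: a time limit OR a transaction-count limit,
# evaluated on a fixed polling interval.
CHECKPOINT_PERIOD = 3600      # dfs.namenode.checkpoint.period (seconds)
CHECKPOINT_TXNS = 1_000_000   # dfs.namenode.checkpoint.txns
CHECK_PERIOD = 60             # dfs.namenode.checkpoint.check.period (seconds)

def should_checkpoint(seconds_since_last, uncheckpointed_txns):
    """Called by the poller every CHECK_PERIOD seconds."""
    return (seconds_since_last >= CHECKPOINT_PERIOD
            or uncheckpointed_txns >= CHECKPOINT_TXNS)

print(should_checkpoint(120, 500))        # False: neither limit reached
print(should_checkpoint(3700, 500))       # True: one hour has elapsed
print(should_checkpoint(120, 1_200_000))  # True: too many transactions
```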
On another note:
Looking at my own Hadoop metadata, some edits files are indeed rolled at one-hour intervals, while others are irregular because startups and shutdowns happened at different times. But none of the edits files exceeds 1 MB — is there also a size limit? I don't know. If the default really were 1 MB, that would be awfully small.
As for HA, there we no longer need a SecondaryNameNode for periodic checking and merging; the standby NameNode does that job. From the official documentation:

> Note that, in an HA cluster, the Standby NameNodes also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.
I deleted all files except tmp:
Finally, let's look at a few of the files:
First, the current maximum operation (transaction) id is 29325.
Running

hdfs oev -p XML -i edits_0000000000000029321-0000000000000029322 -o /opt/module/hadoop-3.1.3/edits.xml

produces the XML version of edits_0000000000000029321-0000000000000029322:
    <EDITS>
      <EDITS_VERSION>-64</EDITS_VERSION>
      <RECORD>
        <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
        <DATA>
          <TXID>29321</TXID>
        </DATA>
      </RECORD>
      <RECORD>
        <OPCODE>OP_END_LOG_SEGMENT</OPCODE>
        <DATA>
          <TXID>29322</TXID>
        </DATA>
      </RECORD>
    </EDITS>
These two operations simply mark the start and end of a log segment (a log roll).
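A small parser can pull the opcode and transaction id out of each record in an `hdfs oev -p XML` dump. This is my own helper, not part of Hadoop; the sample string mirrors the record structure of the edits_...29321-29322 dump above.

```python
# Parse an `hdfs oev -p XML` dump and list (opcode, txid) per record.
import xml.etree.ElementTree as ET

sample = """<EDITS>
  <EDITS_VERSION>-64</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA><TXID>29321</TXID></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_END_LOG_SEGMENT</OPCODE>
    <DATA><TXID>29322</TXID></DATA>
  </RECORD>
</EDITS>"""

def list_ops(xml_text):
    root = ET.fromstring(xml_text)
    return [(r.findtext("OPCODE"), int(r.findtext("DATA/TXID")))
            for r in root.iter("RECORD")]

print(list_ops(sample))
# [('OP_START_LOG_SEGMENT', 29321), ('OP_END_LOG_SEGMENT', 29322)]
```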
Inspecting edits_inprogress_0000000000000029325 the same way:
    <EDITS>
      <EDITS_VERSION>-64</EDITS_VERSION>
      <RECORD>
        <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
        <DATA>
          <TXID>29325</TXID>
        </DATA>
      </RECORD>
    </EDITS>
Again a log-roll operation, and its TXID equals the current maximum transaction id. The SecondaryNameNode is the only node that does not have this file; you could say it holds exactly the operations added since Hadoop was restarted.
We can view the fsimage with

hdfs oiv -p XML -i fsimage_0000000000000029324 -o /opt/module/hadoop-3.1.3/fsimage_new.xml

which gives:
The INode section reports lastInodeId = 22648 and 27 inodes. Reformatted as a table (modification times, replication factor 3, block size 134217728, and block ids 1073744990–1073745015 omitted for readability):

| id | type | name | owner:group:mode | size (bytes) |
|---|---|---|---|---|
| 16385 | DIRECTORY | (root) | root:supergroup:0777 | - |
| 22554 | DIRECTORY | tmp | ocean:supergroup:0777 | - |
| 22555 | DIRECTORY | hadoop-yarn | ocean:supergroup:0777 | - |
| 22556 | DIRECTORY | staging | ocean:supergroup:0777 | - |
| 22557 | DIRECTORY | history | ocean:supergroup:0777 | - |
| 22558 | DIRECTORY | done | ocean:supergroup:0777 | - |
| 22559 | DIRECTORY | done_intermediate | ocean:supergroup:0777 | - |
| 22573 | DIRECTORY | ocean | ocean:supergroup:0777 | - |
| 22574 | DIRECTORY | .staging | ocean:supergroup:0777 | - |
| 22580 | DIRECTORY | logs | ocean:ocean:0777 | - |
| 22581 | DIRECTORY | ocean | ocean:ocean:0777 | - |
| 22582 | DIRECTORY | logs-tfile | ocean:ocean:0777 | - |
| 22583 | DIRECTORY | application_1629280557686_0001 | ocean:ocean:0777 | - |
| 22584 | DIRECTORY | ocean | ocean:supergroup:0777 | - |
| 22608 | FILE | job_1629280557686_0001-1629289108111-ocean-hadoop%2Dmapreduce%2Dclient%2Djobclient%2D3.1.3%2D-1629289231294-10-1-SUCCEEDED-default-1629289112345.jhist | ocean:supergroup:0777 | 55176 |
| 22609 | FILE | job_1629280557686_0001_conf.xml | ocean:supergroup:0777 | 216419 |
| 22610 | FILE | hadoop102_35246 | ocean:ocean:0777 | 172209 |
| 22611 | FILE | hadoop103_33694 | ocean:ocean:0777 | 352999 |
| 22612 | DIRECTORY | 2021 | ocean:supergroup:0777 | - |
| 22613 | DIRECTORY | 08 | ocean:supergroup:0777 | - |
| 22614 | DIRECTORY | 18 | ocean:supergroup:0777 | - |
| 22615 | DIRECTORY | 000000 | ocean:supergroup:0777 | - |
| 22632 | DIRECTORY | application_1629280557686_0002 | ocean:ocean:0777 | - |
| 22645 | FILE | job_1629280557686_0002-1629290147211-ocean-hadoop%2Dmapreduce%2Dclient%2Djobclient%2D3.1.3%2D-1629290163180-10-1-SUCCEEDED-default-1629290150116.jhist | ocean:supergroup:0777 | 54938 |
| 22646 | FILE | job_1629280557686_0002_conf.xml | ocean:supergroup:0777 | 216417 |
| 22647 | FILE | hadoop103_33694 | ocean:ocean:0777 | 203114 |
| 22648 | FILE | hadoop102_35246 | ocean:ocean:0777 | 281375 |
Apart from these system-generated files, none of the files or directories I created myself remain.
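A dump like this is easy to summarize programmatically. Here is a hypothetical helper (not a Hadoop API) for an `hdfs oiv -p XML` dump that counts FILE versus DIRECTORY inodes; the sample string imitates two entries from the listing above, keeping only the fields the helper reads.

```python
# Count inode types in an `hdfs oiv -p XML` fsimage dump.
import xml.etree.ElementTree as ET
from collections import Counter

sample = """<fsimage><INodeSection>
  <inode><id>22554</id><type>DIRECTORY</type><name>tmp</name></inode>
  <inode><id>22610</id><type>FILE</type><name>hadoop102_35246</name></inode>
</INodeSection></fsimage>"""

def count_inode_types(xml_text):
    root = ET.fromstring(xml_text)
    return Counter(inode.findtext("type") for inode in root.iter("inode"))

print(count_inode_types(sample))
```

Run against the full dump above, it would report 18 directories and 9 files.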
We can inspect the other fsimage the same way.
Notice the two corresponding numbers (another merge happened while I was writing this post, so the numbers differ from the earlier screenshots):
This effectively stores the last two versions. So why are there two fsimage files?
- `dfs.namenode.num.checkpoints.retained` = `2` — The number of image checkpoint files (fsimage_*) that will be retained by the NameNode and Secondary NameNode in their storage directories. All edit logs (stored on edits_* files) necessary to recover an up-to-date namespace from the oldest retained checkpoint will also be retained.
The default is indeed 2.
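The retention policy amounts to: keep only the newest N fsimage files, ranked by the transaction id embedded in the file name. A sketch of that logic (my own illustration, not Hadoop source):

```python
# Keep only the newest dfs.namenode.num.checkpoints.retained
# fsimage files (default 2), ranked by the txid in the file name.
NUM_CHECKPOINTS_RETAINED = 2

def images_to_keep(fsimage_names, retained=NUM_CHECKPOINTS_RETAINED):
    # fsimage_<txid>: a larger txid means a newer checkpoint
    by_txid = sorted(fsimage_names, key=lambda n: int(n.rsplit("_", 1)[1]))
    return by_txid[-retained:]

names = ["fsimage_0000000000000029320",
         "fsimage_0000000000000029322",
         "fsimage_0000000000000029324"]
print(images_to_keep(names))
# ['fsimage_0000000000000029322', 'fsimage_0000000000000029324']
```

Per the description quoted above, any edits_* files needed to roll forward from the oldest retained checkpoint are kept as well.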



