- Introduction
- Merging
- Inspecting the files
In a Hadoop cluster, the NameNode manages the metadata. So where is that metadata stored?
If it lived only on disk, access would be far too slow; if it lived only in memory, it would not be safe.
So the metadata is kept in memory, with a backup file, fsimage, on disk. This guards against losing the metadata on a power failure.
Now, as the in-memory metadata grows, do we have to keep syncing fsimage? That again would be very inefficient. But the on-disk copy does have to be updated somehow, so instead every mutation is appended to a separate file, edits.
Loading fsimage and edits into memory together yields the complete metadata.
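The recovery idea above can be sketched as a toy model (my own illustration, not Hadoop code): start from the on-disk snapshot, then replay every logged operation in order.

```python
# Toy model: replaying an edit log over an fsimage snapshot
# reproduces the full in-memory namespace.
def load_namespace(fsimage, edits):
    """fsimage: dict snapshot of path -> type; edits: append-only op list."""
    namespace = dict(fsimage)  # start from the on-disk checkpoint
    for op, path in edits:     # replay every logged operation in order
        if op == "MKDIR":
            namespace[path] = "DIRECTORY"
        elif op == "CREATE":
            namespace[path] = "FILE"
        elif op == "DELETE":
            namespace.pop(path, None)
    return namespace

fsimage = {"/tmp": "DIRECTORY"}
edits = [("MKDIR", "/user"), ("CREATE", "/user/a.txt"), ("DELETE", "/tmp")]
print(load_namespace(fsimage, edits))
# {'/user': 'DIRECTORY', '/user/a.txt': 'FILE'}
```

The operation names here are made up for illustration; the real edit log uses opcodes such as OP_START_LOG_SEGMENT, as shown later.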
Once the edits log grows large, loading it into memory becomes slow, so fsimage and edits also need to be merged periodically. This merging work is best offloaded to someone else.
In an ordinary cluster, the SecondaryNameNode does it.
The merge can be triggered either after a fixed amount of time or once the number of operations reaches a threshold.
Since I am running Hadoop 3.1.3, I looked up the hdfs-default.xml for that version:
- `dfs.namenode.checkpoint.period` = `3600s` — The number of seconds between two periodic checkpoints. Support multiple time unit suffix (case insensitive), as described in `dfs.heartbeat.interval`.
- `dfs.namenode.checkpoint.txns` = `1000000` — The Secondary NameNode or CheckpointNode will create a checkpoint of the namespace every `dfs.namenode.checkpoint.txns` transactions, regardless of whether `dfs.namenode.checkpoint.period` has expired.
- `dfs.namenode.checkpoint.check.period` = `60s` — The SecondaryNameNode and CheckpointNode will poll the NameNode every `dfs.namenode.checkpoint.check.period` seconds to query the number of uncheckpointed transactions. Support multiple time unit suffix (case insensitive), as described in `dfs.heartbeat.interval`.
So by default a checkpoint is made every hour, or as soon as the operation count reaches 1,000,000; the operation count is polled once a minute.
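The two triggers can be sketched as follows (my own illustration of the polling logic, not Hadoop source code):

```python
# Checkpoint triggers: a time limit OR a transaction-count limit,
# evaluated on a fixed polling interval.
CHECKPOINT_PERIOD = 3600      # dfs.namenode.checkpoint.period (seconds)
CHECKPOINT_TXNS = 1_000_000   # dfs.namenode.checkpoint.txns
CHECK_PERIOD = 60             # dfs.namenode.checkpoint.check.period (seconds)

def should_checkpoint(seconds_since_last, uncheckpointed_txns):
    """Called by the poller every CHECK_PERIOD seconds."""
    return (seconds_since_last >= CHECKPOINT_PERIOD
            or uncheckpointed_txns >= CHECKPOINT_TXNS)

print(should_checkpoint(120, 500))        # False: neither limit reached
print(should_checkpoint(3700, 500))       # True: one hour has elapsed
print(should_checkpoint(120, 1_200_000))  # True: too many transactions
```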
On another note:
Looking at my own Hadoop metadata, some edits files are indeed rolled at one-hour intervals, while others are irregular because startups and shutdowns happened at different times. But none of the edits files exceeds 1 MB — is there also a size limit? I don't know. If the default really were 1 MB, that would be awfully small.
As for HA, there we no longer need a SecondaryNameNode for periodic checking and merging; the standby NameNode does that job. From the official documentation:

> Note that, in an HA cluster, the Standby NameNodes also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.
I deleted all files except tmp:
Finally, let's look at a few of the files:
First, the current maximum operation (transaction) id is 29325.
Running

hdfs oev -p XML -i edits_0000000000000029321-0000000000000029322 -o /opt/module/hadoop-3.1.3/edits.xml

produces the XML version of edits_0000000000000029321-0000000000000029322:
    <EDITS>
      <EDITS_VERSION>-64</EDITS_VERSION>
      <RECORD>
        <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
        <DATA>
          <TXID>29321</TXID>
        </DATA>
      </RECORD>
      <RECORD>
        <OPCODE>OP_END_LOG_SEGMENT</OPCODE>
        <DATA>
          <TXID>29322</TXID>
        </DATA>
      </RECORD>
    </EDITS>
These two operations simply mark the start and end of a log segment (a log roll).
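A small parser can pull the opcode and transaction id out of each record in an `hdfs oev -p XML` dump. This is my own helper, not part of Hadoop; the sample string mirrors the record structure of the edits_...29321-29322 dump above.

```python
# Parse an `hdfs oev -p XML` dump and list (opcode, txid) per record.
import xml.etree.ElementTree as ET

sample = """<EDITS>
  <EDITS_VERSION>-64</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA><TXID>29321</TXID></DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_END_LOG_SEGMENT</OPCODE>
    <DATA><TXID>29322</TXID></DATA>
  </RECORD>
</EDITS>"""

def list_ops(xml_text):
    root = ET.fromstring(xml_text)
    return [(r.findtext("OPCODE"), int(r.findtext("DATA/TXID")))
            for r in root.iter("RECORD")]

print(list_ops(sample))
# [('OP_START_LOG_SEGMENT', 29321), ('OP_END_LOG_SEGMENT', 29322)]
```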
Inspecting edits_inprogress_0000000000000029325 the same way:
    <EDITS>
      <EDITS_VERSION>-64</EDITS_VERSION>
      <RECORD>
        <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
        <DATA>
          <TXID>29325</TXID>
        </DATA>
      </RECORD>
    </EDITS>
Again a log-roll operation, and its TXID equals the current maximum transaction id. The SecondaryNameNode is the only node that does not have this file; you could say it holds exactly the operations added since Hadoop was restarted.
We can view the fsimage with

hdfs oiv -p XML -i fsimage_0000000000000029324 -o /opt/module/hadoop-3.1.3/fsimage_new.xml

which gives:
The INode section reports lastInodeId = 22648 and 27 inodes. Reformatted as a table (modification times, replication factor 3, block size 134217728, and block ids 1073744990–1073745015 omitted for readability):

| id | type | name | owner:group:mode | size (bytes) |
|---|---|---|---|---|
| 16385 | DIRECTORY | (root) | root:supergroup:0777 | - |
| 22554 | DIRECTORY | tmp | ocean:supergroup:0777 | - |
| 22555 | DIRECTORY | hadoop-yarn | ocean:supergroup:0777 | - |
| 22556 | DIRECTORY | staging | ocean:supergroup:0777 | - |
| 22557 | DIRECTORY | history | ocean:supergroup:0777 | - |
| 22558 | DIRECTORY | done | ocean:supergroup:0777 | - |
| 22559 | DIRECTORY | done_intermediate | ocean:supergroup:0777 | - |
| 22573 | DIRECTORY | ocean | ocean:supergroup:0777 | - |
| 22574 | DIRECTORY | .staging | ocean:supergroup:0777 | - |
| 22580 | DIRECTORY | logs | ocean:ocean:0777 | - |
| 22581 | DIRECTORY | ocean | ocean:ocean:0777 | - |
| 22582 | DIRECTORY | logs-tfile | ocean:ocean:0777 | - |
| 22583 | DIRECTORY | application_1629280557686_0001 | ocean:ocean:0777 | - |
| 22584 | DIRECTORY | ocean | ocean:supergroup:0777 | - |
| 22608 | FILE | job_1629280557686_0001-1629289108111-ocean-hadoop%2Dmapreduce%2Dclient%2Djobclient%2D3.1.3%2D-1629289231294-10-1-SUCCEEDED-default-1629289112345.jhist | ocean:supergroup:0777 | 55176 |
| 22609 | FILE | job_1629280557686_0001_conf.xml | ocean:supergroup:0777 | 216419 |
| 22610 | FILE | hadoop102_35246 | ocean:ocean:0777 | 172209 |
| 22611 | FILE | hadoop103_33694 | ocean:ocean:0777 | 352999 |
| 22612 | DIRECTORY | 2021 | ocean:supergroup:0777 | - |
| 22613 | DIRECTORY | 08 | ocean:supergroup:0777 | - |
| 22614 | DIRECTORY | 18 | ocean:supergroup:0777 | - |
| 22615 | DIRECTORY | 000000 | ocean:supergroup:0777 | - |
| 22632 | DIRECTORY | application_1629280557686_0002 | ocean:ocean:0777 | - |
| 22645 | FILE | job_1629280557686_0002-1629290147211-ocean-hadoop%2Dmapreduce%2Dclient%2Djobclient%2D3.1.3%2D-1629290163180-10-1-SUCCEEDED-default-1629290150116.jhist | ocean:supergroup:0777 | 54938 |
| 22646 | FILE | job_1629280557686_0002_conf.xml | ocean:supergroup:0777 | 216417 |
| 22647 | FILE | hadoop103_33694 | ocean:ocean:0777 | 203114 |
| 22648 | FILE | hadoop102_35246 | ocean:ocean:0777 | 281375 |
Apart from these system-generated files, none of the files or directories I created myself remain.
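A dump like this is easy to summarize programmatically. Here is a hypothetical helper (not a Hadoop API) for an `hdfs oiv -p XML` dump that counts FILE versus DIRECTORY inodes; the sample string imitates two entries from the listing above, keeping only the fields the helper reads.

```python
# Count inode types in an `hdfs oiv -p XML` fsimage dump.
import xml.etree.ElementTree as ET
from collections import Counter

sample = """<fsimage><INodeSection>
  <inode><id>22554</id><type>DIRECTORY</type><name>tmp</name></inode>
  <inode><id>22610</id><type>FILE</type><name>hadoop102_35246</name></inode>
</INodeSection></fsimage>"""

def count_inode_types(xml_text):
    root = ET.fromstring(xml_text)
    return Counter(inode.findtext("type") for inode in root.iter("inode"))

print(count_inode_types(sample))
```

Run against the full dump above, it would report 18 directories and 9 files.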
We can inspect the other fsimage the same way.
Notice the two corresponding numbers (another merge happened while I was writing this post, so the numbers differ from the earlier screenshots):
This effectively stores the last two versions. So why are there two fsimage files?
- `dfs.namenode.num.checkpoints.retained` = `2` — The number of image checkpoint files (fsimage_*) that will be retained by the NameNode and Secondary NameNode in their storage directories. All edit logs (stored on edits_* files) necessary to recover an up-to-date namespace from the oldest retained checkpoint will also be retained.
The default is indeed 2.
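The retention policy amounts to: keep only the newest N fsimage files, ranked by the transaction id embedded in the file name. A sketch of that logic (my own illustration, not Hadoop source):

```python
# Keep only the newest dfs.namenode.num.checkpoints.retained
# fsimage files (default 2), ranked by the txid in the file name.
NUM_CHECKPOINTS_RETAINED = 2

def images_to_keep(fsimage_names, retained=NUM_CHECKPOINTS_RETAINED):
    # fsimage_<txid>: a larger txid means a newer checkpoint
    by_txid = sorted(fsimage_names, key=lambda n: int(n.rsplit("_", 1)[1]))
    return by_txid[-retained:]

names = ["fsimage_0000000000000029320",
         "fsimage_0000000000000029322",
         "fsimage_0000000000000029324"]
print(images_to_keep(names))
# ['fsimage_0000000000000029322', 'fsimage_0000000000000029324']
```

Per the description quoted above, any edits_* files needed to roll forward from the oldest retained checkpoint are kept as well.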



