Understanding the HDFS NameSpace in One Article

Contents

- NameNode
- NameSpace
  - File structure
    - File system metadata (first line): imgVersion, numFiles, genStamp
    - Directory metadata: path, replicas, mtime, atime, blocksize, nsQuota, dsQuota, username, group, perm
    - File metadata (includes the directory metadata): blockid, numBytes, genStamp
  - FSImage
  - Edits
  - BlocksMap
    - Data structure
    - How it works
      - LightWeightGSet: constructor, put method, getIndex method, get method
      - BlockInfo: private Object[] triplets
  - Summary

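NameNode

- Holds the file tree of the entire HDFS: its files and directories
- Manages the DataNodes (heartbeat mechanism)
  - DataNodes going online and offline
  - Replica migration
  - Data balancing, since data may be unevenly distributed across the cluster
- Handles client read and write requests

A quick question: how do you tell that an entry is the root directory?
If the length of its path is 0, it is the root directory.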

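NameSpace file structure

File system metadata (first line)

imgVersion

The version number of the current fsimage file

namespaceID

The ID of the current namespace, which stays the same for the whole lifetime of the NameNode.
When a DataNode registers, this ID is returned to it as its registrationID.
It is checked on every communication with the NameNode, and a connection carrying an unrecognized namespaceID is refused.

numFiles

The number of files in the file system

genStamp

The generation stamp saved with this fsimage file (a version counter used to version blocks, rather than a wall-clock time)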
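The namespaceID check described above can be pictured with a small sketch. The class and method names here (`NamespaceRegistrySketch`, `register`, `accept`) are hypothetical and are not Hadoop APIs; the sketch only illustrates that a DataNode receives the ID when it registers and is rejected later if it presents a different one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are Hadoop APIs.
final class NamespaceRegistrySketch {
    private final int namespaceId;   // fixed for the NameNode's lifetime
    private final Map<String, Integer> registered = new HashMap<>();

    NamespaceRegistrySketch(int namespaceId) {
        this.namespaceId = namespaceId;
    }

    // A DataNode registers and receives the namespace ID as its registration ID.
    int register(String dataNodeName) {
        registered.put(dataNodeName, namespaceId);
        return namespaceId;
    }

    // Checked on every later request: an unknown node or a mismatched ID is rejected.
    boolean accept(String dataNodeName, int presentedId) {
        Integer known = registered.get(dataNodeName);
        return known != null && known == presentedId && presentedId == namespaceId;
    }
}
```

A DataNode that registered against a different namespace would fail `accept`, which mirrors the "refuse the connection" rule.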

Directory metadata

path

replicas

mtime

Modification time

atime

Access time

blocksize

nsQuota

dsQuota

username

group

The name of the group the user belongs to

perm

Short for permission: the access permissions

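File metadata (includes the directory metadata)

blockid

The ID of a block of the file

numBytes

The number of bytes in the block, i.e. the block's size

genStamp

The block's generation stamp (its version number)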

FSImage

Holds the most recent checkpoint of the metadata

Edits

Holds the log of namespace changes made after the most recent checkpoint
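How the checkpoint and the edit log combine can be sketched as follows. This is a deliberately simplified model: the namespace is assumed to be just a set of paths, and an edit is assumed to be either an ADD or a DELETE of one path. The real fsimage and edits files are binary and carry many more operation types, and every name below is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: namespace = set of paths, edit = ADD or DELETE of one path.
final class CheckpointReplaySketch {
    enum Op { ADD, DELETE }

    static final class Edit {
        final Op op;
        final String path;
        Edit(Op op, String path) { this.op = op; this.path = path; }
    }

    // Rebuild the current namespace: load the checkpoint, then replay later edits in order.
    static TreeSet<String> recover(List<String> fsImage, List<Edit> edits) {
        TreeSet<String> namespace = new TreeSet<>(fsImage);
        for (Edit e : edits) {
            if (e.op == Op.ADD) {
                namespace.add(e.path);
            } else {
                namespace.remove(e.path);
            }
        }
        return namespace;
    }

    public static void main(String[] args) {
        List<String> image = Arrays.asList("/a", "/a/f1");
        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit(Op.ADD, "/a/f2"));
        edits.add(new Edit(Op.DELETE, "/a/f1"));
        System.out.println(recover(image, edits)); // prints [/a, /a/f2]
    }
}
```

The order of replay matters: applying the same edits in a different order could leave a different namespace.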

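BlocksMap

The FsImage covers the file namespace: directories, file paths, block IDs and so on. It does not, however, associate those blocks with DataNodes, so a separate mapping from blocks to DataNodes is needed.

Data structure

private final int capacity
- the initial size to use when the entries array is created
private volatile GSet blocks
- implemented by LightWeightGSet
- protected LightWeightGSet.LinkedElement[] entries;

Note: BlockInfo extends Block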
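The two-step lookup this implies (path to block IDs, then block ID to DataNodes) can be sketched with plain `HashMap`s standing in for the namespace and for BlocksMap. This is not how Hadoop implements it (BlocksMap uses LightWeightGSet), and all class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain HashMaps stand in for the namespace and for BlocksMap.
final class BlockLookupSketch {
    private final Map<String, long[]> pathToBlockIds = new HashMap<>();       // namespace role
    private final Map<Long, List<String>> blockToDataNodes = new HashMap<>(); // BlocksMap role

    void addFile(String path, long... blockIds) {
        pathToBlockIds.put(path, blockIds);
    }

    // In HDFS this association is learned from DataNode block reports, not from the fsimage.
    void reportReplica(long blockId, String dataNode) {
        blockToDataNodes.computeIfAbsent(blockId, k -> new ArrayList<>()).add(dataNode);
    }

    // Resolve a path to the DataNodes holding each of its blocks, in block order.
    List<List<String>> locate(String path) {
        List<List<String>> result = new ArrayList<>();
        for (long id : pathToBlockIds.getOrDefault(path, new long[0])) {
            result.add(blockToDataNodes.getOrDefault(id, new ArrayList<>()));
        }
        return result;
    }
}
```

For example, after `addFile("/a/f1", 1L, 2L)` and a few `reportReplica` calls, `locate("/a/f1")` returns one list of DataNode names per block.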

How it works

LightWeightGSet

Constructor

    public LightWeightGSet(int recommended_length) {
        // Turn the recommended length into the length actually used for the array
        int actual = actualArrayLength(recommended_length);
        if (LOG.isDebugEnabled()) {
            LOG.debug("recommended=" + recommended_length + ", actual=" + actual);
        }

        this.entries = new LightWeightGSet.LinkedElement[actual];
        // The mask used by getIndex to map a hash code to an array slot
        this.hash_mask = this.entries.length - 1;
    }
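Using `entries.length - 1` as `hash_mask` only works as a bit mask when the array length is a power of two. The sketch below shows one way a recommended length could be rounded up to such a length. The helper `roundUpToPowerOfTwo` is an assumption made for illustration; the real `actualArrayLength` may also clamp the result to minimum and maximum sizes.

```java
final class ArrayLengthSketch {
    // Round a positive length up to the nearest power of two (hypothetical helper).
    static int roundUpToPowerOfTwo(int recommended) {
        if (recommended < 1 || recommended > (1 << 30)) {
            // Above 2^30 the shift below would overflow an int.
            throw new IllegalArgumentException("recommended out of range: " + recommended);
        }
        int highest = Integer.highestOneBit(recommended);
        return highest == recommended ? highest : highest << 1;
    }

    public static void main(String[] args) {
        int length = roundUpToPowerOfTwo(1000);
        int mask = length - 1;
        // 1000 rounds up to 1024, so the mask is ten 1-bits.
        System.out.println(length + " " + Integer.toBinaryString(mask)); // prints 1024 1111111111
    }
}
```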

put method

    public E put(E element) {
        if (element == null) {
            throw new NullPointerException("Null element is not supported.");
        } else {
            LightWeightGSet.LinkedElement e = null;

            try {
                e = (LightWeightGSet.LinkedElement) element;
            } catch (ClassCastException var5) {
                throw new HadoopIllegalArgumentException("!(element instanceof LinkedElement), element.getClass()=" + element.getClass());
            }
            // Compute the array index for this element
            int index = this.getIndex(element);
            // If an equal element already exists, remove it first
            E existing = this.remove(index, element);
            ++this.modification;
            ++this.size;
            // Insert the new node at the head of the linked list
            e.setNext(this.entries[index]);
            // Point the array slot at the new head
            this.entries[index] = e;
            return existing;
        }
    }

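getIndex method

    protected int getIndex(K key) {
        // Block's hashCode is
        // (int)(blockId ^ (blockId >>> 32))
        return key.hashCode() & this.hash_mask;
    }

  * Folding the high 32 bits of the block ID into the low 32 bits before masking is mainly meant to reduce hash collisions, which keeps the linked lists short and speeds up lookups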
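As a worked example of the index computation, the sketch below applies the same fold-then-mask arithmetic to a few block IDs. The class `BlockIndexSketch` and its method names are hypothetical; only the two expressions are taken from the code above.

```java
final class BlockIndexSketch {
    // Fold the 64-bit block ID into 32 bits, as in the hashCode comment above.
    static int blockHash(long blockId) {
        return (int) (blockId ^ (blockId >>> 32));
    }

    // Keep only the low bits; valid because the table length is a power of two.
    static int index(long blockId, int tableLength) {
        return blockHash(blockId) & (tableLength - 1);
    }

    public static void main(String[] args) {
        // 1073741825 is 2^30 + 1: the high 32 bits are zero, and the low 4 bits are 0001.
        System.out.println(index(1073741825L, 16)); // prints 1
        // High word 5, low word 3: the fold gives 5 ^ 3 = 6.
        System.out.println(blockHash((5L << 32) | 3L)); // prints 6
    }
}
```

Because the mask is non-negative, the resulting index is always within the array bounds, even when the hash code itself is negative.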

get method

    // Compute the index from the Block, take the linked list at that array slot,
    // then walk the list until the element equal to the key is found
    public E get(K key) {
        if (key == null) {
            throw new NullPointerException("key == null");
        } else {
            int index = this.getIndex(key);

            for (LightWeightGSet.LinkedElement e = this.entries[index]; e != null; e = e.getNext()) {
                if (e.equals(key)) {
                    return this.convert(e);
                }
            }
            return null;
        }
    }
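To see put and get working together, here is a minimal, self-contained sketch of the same idea: an array of buckets whose elements carry their own `next` link. It is a simplified stand-in keyed by a `long` ID, not the Hadoop class, and `TinyGSetSketch` and `Node` are hypothetical names.

```java
// Minimal sketch: an array of buckets whose elements link to each other.
final class TinyGSetSketch {
    static final class Node {
        final long id;
        Node next;   // intrusive link: the element itself is the list node
        Node(long id) { this.id = id; }
    }

    private final Node[] entries;
    private final int hashMask;
    private int size;

    TinyGSetSketch(int length) {
        if (length < 1 || Integer.bitCount(length) != 1) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        entries = new Node[length];
        hashMask = length - 1;
    }

    private int getIndex(long id) {
        return (int) (id ^ (id >>> 32)) & hashMask;
    }

    // Remove any node with the same id, then insert the new node at the bucket head.
    Node put(Node element) {
        int index = getIndex(element.id);
        Node existing = remove(index, element.id);
        element.next = entries[index];
        entries[index] = element;
        size++;
        return existing;
    }

    // Walk the bucket's chain until a node with the requested id is found.
    Node get(long id) {
        for (Node e = entries[getIndex(id)]; e != null; e = e.next) {
            if (e.id == id) return e;
        }
        return null;
    }

    // Unlink the node with this id from its bucket, if present, and shrink the size.
    private Node remove(int index, long id) {
        Node prev = null;
        for (Node e = entries[index]; e != null; prev = e, e = e.next) {
            if (e.id == id) {
                if (prev == null) entries[index] = e.next; else prev.next = e.next;
                e.next = null;
                size--;
                return e;
            }
        }
        return null;
    }

    int size() { return size; }
}
```

Because `remove` decrements the size when it finds an existing node, `put` can increment it unconditionally and the count stays correct on replacement.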

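BlockInfo

private Object[] triplets;

Stores the concrete mapping from a block to the DataNodes that hold its replicas. For each replica it also keeps a previous and a next BlockInfo reference.

Capacity
- the number of replicas

triplets[3*i], where i is the replica index
- the DatanodeDescriptor: descriptive information about the DataNode, such as its IP and ID
triplets[3*i+1]
- the previous BlockInfo in the list of blocks stored on that DataNode
triplets[3*i+2]
- the next BlockInfo in the list of blocks stored on that DataNode

These two references link together all blocks held by one DataNode. They do not give the order of a file's blocks: a file is split into multiple blocks, and the path -> blockid list kept in the NameNode's namespace is what records that order.

Constructor

    public BlockInfo(Block blk, int replication) {
        super(blk);
        // Three slots per replica: DataNode, previous BlockInfo, next BlockInfo
        this.triplets = new Object[3 * replication];
        this.bc = null;
    }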
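The index arithmetic of the triplets array can be sketched as below. This is a hypothetical stand-in for illustration, not the Hadoop `BlockInfo` class: DataNodes are represented by plain strings, and the accessor names are assumptions.

```java
// Hypothetical sketch of the triplets layout; not the Hadoop BlockInfo class.
final class TripletsSketch {
    private final Object[] triplets;

    TripletsSketch(int replication) {
        this.triplets = new Object[3 * replication];   // three slots per replica
    }

    // Slot 3*i: the DataNode that stores replica i (a String here for simplicity).
    void setDatanode(int i, String datanode) { triplets[3 * i] = datanode; }
    String getDatanode(int i) { return (String) triplets[3 * i]; }

    // Slots 3*i+1 and 3*i+2: the per-replica previous and next references.
    void setPrevious(int i, TripletsSketch prev) { triplets[3 * i + 1] = prev; }
    void setNext(int i, TripletsSketch next) { triplets[3 * i + 2] = next; }
    TripletsSketch getPrevious(int i) { return (TripletsSketch) triplets[3 * i + 1]; }
    TripletsSketch getNext(int i) { return (TripletsSketch) triplets[3 * i + 2]; }

    // Number of replica slots the array was sized for.
    int capacity() { return triplets.length / 3; }

    // Number of replica slots that currently hold a DataNode.
    int numNodes() {
        int n = 0;
        for (int i = 0; i < capacity(); i++) {
            if (triplets[3 * i] != null) n++;
        }
        return n;
    }
}
```

Packing the three references into one flat `Object[]` avoids allocating a separate list node per replica, which matters when the NameNode tracks a very large number of blocks.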

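Summary

The NameNode holds the namespace, that is, the list of files and directories, and every block of a file has its own blockid. When a client requests a file by path, the NameNode finds the corresponding block IDs, looks them up in the BlocksMap to find the DataNodes that actually store them, and the client then reads the data from those DataNodes.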
