- Role of the NameNode: manages metadata. FsImage file (image file: holds almost all metadata; not updated immediately). Edits file (log file: holds the most recent metadata operations; its format is different and slow to replay).
- SecondaryNameNode assists with metadata management: every so often it copies the fsimage and edits files to its own host and merges them; the merge produces a new fsimage.ckpt file that replaces the old fsimage, while the NameNode writes to a fresh edits.new file that finally becomes the new edits. Trigger conditions: every hour, or when the edits file exceeds 64 MB.
- Merging edits and fsimage costs the SecondaryNameNode about as much memory as the NameNode itself uses, so the NameNode and SecondaryNameNode are normally not placed on the same machine.
- NameNode metadata recovery: the metadata can be restored from the SecondaryNameNode's checkpoint.
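A minimal sketch of that recovery, assuming the NameNode's name directory was lost but the SecondaryNameNode's checkpoint directory is intact (the host name and paths here are illustrative, not from the notes):

```shell
# 1. Copy the latest checkpoint from the SecondaryNameNode host
#    (the dfs.namenode.checkpoint.dir / fs.checkpoint.dir directory)
#    back to the NameNode host. Host and paths are placeholders.
scp -r root@snn-host:/path/to/checkpoint/dir /path/to/checkpoint/dir

# 2. Start the NameNode with -importCheckpoint (a standard NameNode
#    startup option) so it loads the copied checkpoint into its
#    empty dfs.namenode.name.dir and saves it as the new fsimage.
hdfs namenode -importCheckpoint
```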
- Configuring Hadoop on Windows
1.1 Step 1: extract the pre-built Windows version of Hadoop to a path containing no Chinese characters and no spaces.
1.2 Step 2: configure the Hadoop environment variable HADOOP_HOME on Windows, and add %HADOOP_HOME%\bin to Path.
1.3 Step 3: copy the hadoop.dll file from the bin directory of the hadoop-2.7.5 folder into the system directory C:\Windows\System32.
1.4 Step 4: restart Windows.
- Import the Maven dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
</dependencies>
- Accessing data through the file system API
3.1 Main classes involved: Configuration encapsulates client or server configuration; FileSystem is a file system object whose methods operate on files (obtained via the static get() method).
package com.hlzq.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class TestDemo1 {
    @Test
    public void meth01GetFileSystem() throws IOException {
        // 1. Create a Configuration object
        Configuration configuration = new Configuration();
        // 2. Specify the file system type
        configuration.set("fs.defaultFS", "hdfs://node1:8020");
        // 3. Get the specified file system
        FileSystem fileSystem = FileSystem.get(configuration);
        System.out.println(fileSystem);
    }

    @Test
    public void meth02GetFileSystem() throws IOException, URISyntaxException {
        // Steps 1-3 above collapsed into a single call:
        // pass the URI and a fresh Configuration directly to get()
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
        System.out.println(fileSystem);
    }
}
// Traverse files on HDFS
// (also requires imports: org.apache.hadoop.fs.Path, RemoteIterator, LocatedFileStatus, BlockLocation)
@Test
public void bianLi() throws URISyntaxException, IOException {
    // Get the FileSystem object
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // Get details of all files under the given directory (true = recursive)
    RemoteIterator<LocatedFileStatus> iterator = fileSystem.listFiles(new Path("/"), true);
    // Iterate and print information for each file
    while (iterator.hasNext()) {
        LocatedFileStatus next = iterator.next();
        // The file's path
        Path path = next.getPath();
        System.out.println(path);
        // The file's block information
        BlockLocation[] blockLocations = next.getBlockLocations();
        System.out.println(blockLocations.length); // number of blocks the file is split into
        for (BlockLocation blockLocation : blockLocations) {
            String[] hosts = blockLocation.getHosts();
            for (String host : hosts) {
                System.out.println(host);
            }
            System.out.println("#########################################################");
        }
    }
    // Close the FileSystem object
    fileSystem.close();
}
Creating a directory:
@Test
public void mkdir() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    boolean exists = fileSystem.exists(new Path("/xx/yy/zz")); // exists() checks whether the directory exists
    if (!exists) {
        System.out.println("Does not exist; creating it");
        fileSystem.mkdirs(new Path("/xx/yy/zz"));
    } else {
        System.out.println("Already exists; not creating");
    }
    fileSystem.close();
}
Download
// Download, method one
@Test
public void dowlo() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // Open an input stream on the source file
    FSDataInputStream inputStream = fileSystem.open(new Path("/anaconda-ks.cfg"));
    // Open an output stream on the local file
    FileOutputStream outputStream = new FileOutputStream("F:\\dnxx");
    // Copy the file with the IOUtils utility (org.apache.commons.io.IOUtils)
    IOUtils.copy(inputStream, outputStream);
    // Close the streams
    outputStream.close();
    inputStream.close();
    fileSystem.close();
}
// Download, method two
@Test
public void dowlo2() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    fileSystem.copyToLocalFile(new Path("/anaconda-ks.cfg"), new Path("F:\\dnxx"));
    fileSystem.close();
}
Upload
// File upload
@Test
public void Sha() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    fileSystem.copyFromLocalFile(new Path("F:\\dnxx"), new Path("/xx/yy/zz"));
    fileSystem.close();
}
Merging small files (local small files, appended into one HDFS file)
@Test
public void Shaheb() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // Create the target file on HDFS
    FSDataOutputStream outputStream = fileSystem.create(new Path("/a.txt"));
    // Get the local file system
    LocalFileSystem local = FileSystem.getLocal(new Configuration());
    // List the files in the local directory as an array
    FileStatus[] fileStatuses = local.listStatus(new Path("file:///G:\\dd"));
    for (FileStatus fileStatus : fileStatuses) {
        FSDataInputStream open = local.open(fileStatus.getPath());
        IOUtils.copy(open, outputStream);
        IOUtils.closeQuietly(open);
    }
    IOUtils.closeQuietly(outputStream);
    local.close();
    fileSystem.close();
}
III. HDFS access permission control
- Permission master switch: vim hdfs-site.xml
  Set it to true (the dfs.permissions property), distribute the file to all nodes, and restart HDFS.
- Impersonating a user:
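With HDFS's default simple authentication, the client-side identity is simply read from the environment, so "impersonating" a user can be as easy as setting the HADOOP_USER_NAME environment variable. A sketch (the path is a placeholder; this only works when Kerberos is not enabled):

```shell
# Act as user root for this one command; simple-auth clusters trust this value.
HADOOP_USER_NAME=root hadoop fs -mkdir /some/protected/dir

# Or export it for the whole shell session.
export HADOOP_USER_NAME=root
hadoop fs -ls /
```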
1. Copying data between different clusters:
1.1 Copying files inside a cluster with scp: scp (-r for a directory) file host:$PWD
1.2 Remote copy to local: scp root@host:remote-file-path local-destination
scp -r root@node2:zookeeper.out dir33/
1.3 Copying data across clusters: hadoop distcp hdfs://node1:8020/jdk-8u hdfs://cluste2:8020/
1.4 Using Archive (HAR) files: HDFS is not good at storing small files. Each file takes at least one block, and every block occupies memory in the NameNode, so a large number of small files eats up a lot of NameNode memory. Archive bundles many files into a single archive file, and after archiving each individual file can still be accessed transparently.
1.4.1 Creating an archive
Example: to archive all the files under the directory /config:
hadoop archive -archiveName <name> -p /config (directory to pack) <destination directory>
1.4.2 Inspecting an archive: hadoop fs -cat /output/test.har/part-0
1.4.3 Listing the small files inside: hadoop fs -ls har://hdfs-node1:8020/output/test.har
1.4.4 Accessing an individual small file: hadoop fs -cat har://hdfs-node1:8020/output/test.har/core-site.xml
1.4.5 Archives are immutable once created (they do not support compression either)
1.4.6 Extracting an archive: hadoop fs -cp har:///output/test.har/* /config2
- Snapshots: used for data backup and recovery from accidental operations
- Enable the snapshot feature on a directory: hdfs dfsadmin -allowSnapshot <path>
- Disable the snapshot feature on a directory: hdfs dfsadmin -disallowSnapshot <path>
- Create a snapshot of a directory: hdfs dfs -createSnapshot <path>
- Create a snapshot with a specified name: hdfs dfs -createSnapshot <path> <name>
- Rename a snapshot: hdfs dfs -renameSnapshot <path> <old name> <new name>
- List all snapshottable directories of the current user: hdfs lsSnapshottableDir
- Restore from a snapshot: hdfs dfs -cp -ptopax <snapshot path> <restore path>
- Delete a snapshot: hdfs dfs -deleteSnapshot <snapshot directory> <snapshot name>
- Trash: deleted files are moved to the /user/<username>/.Trash directory
- Related parameter: fs.trash.interval (retention time of trashed files, in minutes)
- Force delete, bypassing the trash: hadoop fs -rm -skipTrash /dir1/a.txt
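Since a normal hadoop fs -rm only moves the file into the current user's trash, an accidental delete can be undone by moving the file back. A sketch (the file name and user are illustrative):

```shell
# Delete goes to the trash (kept for fs.trash.interval minutes).
hadoop fs -rm /dir1/a.txt

# The file now lives under the current user's trash directory.
hadoop fs -ls /user/root/.Trash/Current/dir1

# Restore it by moving it back to its original location.
hadoop fs -mv /user/root/.Trash/Current/dir1/a.txt /dir1/a.txt
```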
2. HDFS high availability cluster
2.1 Edit core-site.xml
ha.zookeeper.quorum = node1:2181,node2:2181,node3:2181
fs.defaultFS = hdfs://ns
hadoop.tmp.dir = /opt/server/hadoop-2.7.5/data/tmp
fs.trash.interval = 10080
2.2 Edit hdfs-site.xml
dfs.nameservices = ns
dfs.ha.namenodes.ns = nn1,nn2
dfs.namenode.rpc-address.ns.nn1 = node1:8020
dfs.namenode.rpc-address.ns.nn2 = node2:8020
dfs.namenode.servicerpc-address.ns.nn1 = node1:8022
dfs.namenode.servicerpc-address.ns.nn2 = node2:8022
dfs.namenode.http-address.ns.nn1 = node1:50070
dfs.namenode.http-address.ns.nn2 = node2:50070
dfs.namenode.shared.edits.dir = qjournal://node1:8485;node2:8485;node3:8485/ns1
dfs.client.failover.proxy.provider.ns = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.fencing.methods = sshfence
dfs.ha.fencing.ssh.private-key-files = /root/.ssh/id_rsa
dfs.journalnode.edits.dir = /opt/server/hadoop-2.7.5/data/dfs/jn
dfs.ha.automatic-failover.enabled = true
dfs.namenode.name.dir = file:///opt/server/hadoop-2.7.5/data/dfs/nn/name
dfs.namenode.edits.dir = file:///opt/server/hadoop-2.7.5/data/dfs/nn/edits
dfs.datanode.data.dir = file:///opt/server/hadoop-2.7.5/data/dfs/dn
dfs.permissions = false
dfs.blocksize = 134217728
2.3 Edit yarn-site.xml; note that the configuration on node3 differs from node2
yarn.log-aggregation-enable = true
yarn.resourcemanager.ha.enabled = true
yarn.resourcemanager.cluster-id = mycluster
yarn.resourcemanager.ha.rm-ids = rm1,rm2
yarn.resourcemanager.hostname.rm1 = node2
yarn.resourcemanager.hostname.rm2 = node3
yarn.resourcemanager.address.rm1 = node2:8032
yarn.resourcemanager.scheduler.address.rm1 = node2:8030
yarn.resourcemanager.resource-tracker.address.rm1 = node2:8031
yarn.resourcemanager.admin.address.rm1 = node2:8033
yarn.resourcemanager.webapp.address.rm1 = node2:8088
yarn.resourcemanager.address.rm2 = node3:8032
yarn.resourcemanager.scheduler.address.rm2 = node3:8030
yarn.resourcemanager.resource-tracker.address.rm2 = node3:8031
yarn.resourcemanager.admin.address.rm2 = node3:8033
yarn.resourcemanager.webapp.address.rm2 = node3:8088
yarn.resourcemanager.recovery.enabled = true
yarn.resourcemanager.ha.id = rm1 (needed if we want to launch more than one RM in a single node)
yarn.resourcemanager.store.class = org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
yarn.resourcemanager.zk-address = node2:2181,node3:2181,node1:2181 (for multiple zk services, separate them with commas)
yarn.resourcemanager.ha.automatic-failover.enabled = true (enable automatic failover; by default it is enabled only when HA is enabled)
yarn.client.failover-proxy-provider = org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider
yarn.nodemanager.resource.cpu-vcores = 2
yarn.nodemanager.resource.memory-mb = 2048
yarn.scheduler.minimum-allocation-mb = 1024
yarn.scheduler.maximum-allocation-mb = 2048
yarn.log-aggregation.retain-seconds = 2592000
yarn.nodemanager.log.retain-seconds = 604800
yarn.nodemanager.log-aggregation.compression-type = gz
yarn.nodemanager.local-dirs = /opt/server/hadoop-2.7.5/yarn/local
yarn.resourcemanager.max-completed-applications = 1000
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.resourcemanager.connect.retry-interval.ms = 2000
2.4 Edit mapred-site.xml (first copy it from mapred-site.xml.template)
mapreduce.framework.name = yarn
mapreduce.jobhistory.address = node3:10020
mapreduce.jobhistory.webapp.address = node3:19888
mapreduce.jobtracker.system.dir = /opt/server/hadoop-2.7.5/data/system/jobtracker
mapreduce.map.memory.mb = 1024
mapreduce.reduce.memory.mb = 1024
mapreduce.task.io.sort.mb = 100
mapreduce.task.io.sort.factor = 10
mapreduce.reduce.shuffle.parallelcopies = 15
yarn.app.mapreduce.am.command-opts = -Xmx1024m
yarn.app.mapreduce.am.resource.mb = 1536
mapreduce.cluster.local.dir = /opt/server/hadoop-2.7.5/data/system/local
2.5 Edit slaves
node1
node2
node3
2.6 Edit hadoop-env.sh
export JAVA_HOME=/export/server/jdk1.8.0_241
2.7 Distribute to the other nodes
cd /opt/server
scp -r hadoop-2.7.5/ node2:$PWD
scp -r hadoop-2.7.5/ node3:$PWD
2.8 Run the following commands on all three machines:
mkdir -p /opt/server/hadoop-2.7.5/data/dfs/nn/name
mkdir -p /opt/server/hadoop-2.7.5/data/dfs/nn/edits
2.9 On node3, change yarn.resourcemanager.ha.id to rm2
vim yarn-site.xml
2.10 Start the cluster
On node1:
cd /opt/server/hadoop-2.7.5
bin/hdfs zkfc -formatZK
sbin/hadoop-daemons.sh start journalnode
bin/hdfs namenode -format
bin/hdfs namenode -initializeSharedEdits -force
sbin/start-dfs.sh
On node2:
cd /opt/server/hadoop-2.7.5
bin/hdfs namenode -bootstrapStandby
sbin/hadoop-daemon.sh start namenode
sbin/start-yarn.sh
bin/yarn rmadmin -getServiceState rm1 (check the ResourceManager state)
On node3:
cd /export/servers/hadoop-2.7.5
sbin/start-yarn.sh
bin/yarn rmadmin -getServiceState rm2
sbin/mr-jobhistory-daemon.sh start historyserver
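After start-up it is worth confirming that exactly one NameNode is active and the other is standby; the hdfs haadmin command ships with Hadoop for this (nn1/nn2 are the IDs configured in dfs.ha.namenodes.ns above):

```shell
# One of these should report "active", the other "standby".
bin/hdfs haadmin -getServiceState nn1
bin/hdfs haadmin -getServiceState nn2

# Web UIs for a visual check, on the HTTP addresses configured earlier:
# http://node1:50070 and http://node2:50070
```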



