- Role of the NameNode: manages metadata. FsImage file (image file: holds almost all metadata; not updated immediately). Edits file (log file: holds the most recent metadata operations; its format is different and slow to replay).
- SecondaryNameNode assists with metadata management: every so often it copies the fsimage and edits files to its own host and merges them; the merge produces a new fsimage.ckpt file that replaces the old fsimage, while the NameNode writes to a fresh edits.new file that finally becomes the new edits. Trigger conditions: every hour, or when the edits file exceeds 64 MB.
- Merging edits and fsimage costs the SecondaryNameNode about as much memory as the NameNode itself uses, so the NameNode and SecondaryNameNode are normally not placed on the same machine.
- NameNode metadata recovery: the metadata can be restored from the SecondaryNameNode's checkpoint.
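A minimal sketch of that recovery, assuming the NameNode's name directory was lost but the SecondaryNameNode's checkpoint directory is intact (the host name and paths here are illustrative, not from the notes):

```shell
# 1. Copy the latest checkpoint from the SecondaryNameNode host
#    (the dfs.namenode.checkpoint.dir / fs.checkpoint.dir directory)
#    back to the NameNode host. Host and paths are placeholders.
scp -r root@snn-host:/path/to/checkpoint/dir /path/to/checkpoint/dir

# 2. Start the NameNode with -importCheckpoint (a standard NameNode
#    startup option) so it loads the copied checkpoint into its
#    empty dfs.namenode.name.dir and saves it as the new fsimage.
hdfs namenode -importCheckpoint
```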
- Configuring Hadoop on Windows
1.1 Step 1: extract the pre-built Windows version of Hadoop to a path containing no Chinese characters and no spaces.
1.2 Step 2: configure the Hadoop environment variable HADOOP_HOME on Windows, and add %HADOOP_HOME%\bin to Path.
1.3 Step 3: copy the hadoop.dll file from the bin directory of the hadoop-2.7.5 folder into the system directory C:\Windows\System32.
1.4 Step 4: restart Windows.
- Import the Maven dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
</dependencies>
- Accessing data through the file system API
3.1 Main classes involved: Configuration encapsulates client or server configuration; FileSystem is a file system object whose methods operate on files (obtained via the static get() method).
package com.hlzq.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.junit.Test;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class TestDemo1 {
    @Test
    public void meth01GetFileSystem() throws IOException {
        // 1. Create a Configuration object
        Configuration configuration = new Configuration();
        // 2. Specify the file system type
        configuration.set("fs.defaultFS", "hdfs://node1:8020");
        // 3. Get the specified file system
        FileSystem fileSystem = FileSystem.get(configuration);
        System.out.println(fileSystem);
    }

    @Test
    public void meth02GetFileSystem() throws IOException, URISyntaxException {
        // Steps 1-3 above collapsed into a single call:
        // pass the URI and a fresh Configuration directly to get()
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
        System.out.println(fileSystem);
    }
}
// Traverse files on HDFS
// (also requires imports: org.apache.hadoop.fs.Path, RemoteIterator, LocatedFileStatus, BlockLocation)
@Test
public void bianLi() throws URISyntaxException, IOException {
    // Get the FileSystem object
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // Get details of all files under the given directory (true = recursive)
    RemoteIterator<LocatedFileStatus> iterator = fileSystem.listFiles(new Path("/"), true);
    // Iterate and print information for each file
    while (iterator.hasNext()) {
        LocatedFileStatus next = iterator.next();
        // The file's path
        Path path = next.getPath();
        System.out.println(path);
        // The file's block information
        BlockLocation[] blockLocations = next.getBlockLocations();
        System.out.println(blockLocations.length); // number of blocks the file is split into
        for (BlockLocation blockLocation : blockLocations) {
            String[] hosts = blockLocation.getHosts();
            for (String host : hosts) {
                System.out.println(host);
            }
            System.out.println("#########################################################");
        }
    }
    // Close the FileSystem object
    fileSystem.close();
}
Creating a directory:
@Test
public void mkdir() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    boolean exists = fileSystem.exists(new Path("/xx/yy/zz")); // exists() checks whether the directory exists
    if (!exists) {
        System.out.println("Does not exist; creating it");
        fileSystem.mkdirs(new Path("/xx/yy/zz"));
    } else {
        System.out.println("Already exists; not creating");
    }
    fileSystem.close();
}
Download
// Download, method one
@Test
public void dowlo() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // Open an input stream on the source file
    FSDataInputStream inputStream = fileSystem.open(new Path("/anaconda-ks.cfg"));
    // Open an output stream on the local file
    FileOutputStream outputStream = new FileOutputStream("F:\\dnxx");
    // Copy the file with the IOUtils utility (org.apache.commons.io.IOUtils)
    IOUtils.copy(inputStream, outputStream);
    // Close the streams
    outputStream.close();
    inputStream.close();
    fileSystem.close();
}
// Download, method two
@Test
public void dowlo2() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    fileSystem.copyToLocalFile(new Path("/anaconda-ks.cfg"), new Path("F:\\dnxx"));
    fileSystem.close();
}
Upload
// File upload
@Test
public void Sha() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    fileSystem.copyFromLocalFile(new Path("F:\\dnxx"), new Path("/xx/yy/zz"));
    fileSystem.close();
}
Merging small files (local small files, appended into one HDFS file)
@Test
public void Shaheb() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://node1:8020"), new Configuration());
    // Create the target file on HDFS
    FSDataOutputStream outputStream = fileSystem.create(new Path("/a.txt"));
    // Get the local file system
    LocalFileSystem local = FileSystem.getLocal(new Configuration());
    // List the files in the local directory as an array
    FileStatus[] fileStatuses = local.listStatus(new Path("file:///G:\\dd"));
    for (FileStatus fileStatus : fileStatuses) {
        FSDataInputStream open = local.open(fileStatus.getPath());
        IOUtils.copy(open, outputStream);
        IOUtils.closeQuietly(open);
    }
    IOUtils.closeQuietly(outputStream);
    local.close();
    fileSystem.close();
}
III. HDFS access permission control
- Permission master switch: vim hdfs-site.xml
  Set it to true (the dfs.permissions property), distribute the file to all nodes, and restart HDFS.
- Impersonating a user:
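With HDFS's default simple authentication, the client-side identity is simply read from the environment, so "impersonating" a user can be as easy as setting the HADOOP_USER_NAME environment variable. A sketch (the path is a placeholder; this only works when Kerberos is not enabled):

```shell
# Act as user root for this one command; simple-auth clusters trust this value.
HADOOP_USER_NAME=root hadoop fs -mkdir /some/protected/dir

# Or export it for the whole shell session.
export HADOOP_USER_NAME=root
hadoop fs -ls /
```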
1. Copying data between different clusters:
1.1 Copying files inside a cluster with scp: scp (-r for a directory) file host:$PWD
1.2 Remote copy to local: scp root@host:remote-file-path local-destination
scp -r root@node2:zookeeper.out dir33/
1.3 Copying data across clusters: hadoop distcp hdfs://node1:8020/jdk-8u hdfs://cluste2:8020/
1.4 Using Archive (HAR) files: HDFS is not good at storing small files. Each file takes at least one block, and every block occupies memory in the NameNode, so a large number of small files eats up a lot of NameNode memory. Archive bundles many files into a single archive file, and after archiving each individual file can still be accessed transparently.
1.4.1 Creating an archive
Example: to archive all the files under the directory /config:
hadoop archive -archiveName <name> -p /config (directory to pack) <destination directory>
1.4.2 Inspecting an archive: hadoop fs -cat /output/test.har/part-0
1.4.3 Listing the small files inside: hadoop fs -ls har://hdfs-node1:8020/output/test.har
1.4.4 Accessing an individual small file: hadoop fs -cat har://hdfs-node1:8020/output/test.har/core-site.xml
1.4.5 Archives are immutable once created (they do not support compression either)
1.4.6 Extracting an archive: hadoop fs -cp har:///output/test.har/* /config2
- Snapshots: used for data backup and recovery from accidental operations
- Enable the snapshot feature on a directory: hdfs dfsadmin -allowSnapshot <path>
- Disable the snapshot feature on a directory: hdfs dfsadmin -disallowSnapshot <path>
- Create a snapshot of a directory: hdfs dfs -createSnapshot <path>
- Create a snapshot with a specified name: hdfs dfs -createSnapshot <path> <name>
- Rename a snapshot: hdfs dfs -renameSnapshot <path> <old name> <new name>
- List all snapshottable directories of the current user: hdfs lsSnapshottableDir
- Restore from a snapshot: hdfs dfs -cp -ptopax <snapshot path> <restore path>
- Delete a snapshot: hdfs dfs -deleteSnapshot <snapshot directory> <snapshot name>
- Trash: deleted files are moved to the /user/<username>/.Trash directory
- Related parameter: fs.trash.interval (retention time of trashed files, in minutes)
- Force delete, bypassing the trash: hadoop fs -rm -skipTrash /dir1/a.txt
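Since a normal hadoop fs -rm only moves the file into the current user's trash, an accidental delete can be undone by moving the file back. A sketch (the file name and user are illustrative):

```shell
# Delete goes to the trash (kept for fs.trash.interval minutes).
hadoop fs -rm /dir1/a.txt

# The file now lives under the current user's trash directory.
hadoop fs -ls /user/root/.Trash/Current/dir1

# Restore it by moving it back to its original location.
hadoop fs -mv /user/root/.Trash/Current/dir1/a.txt /dir1/a.txt
```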
2. HDFS high availability cluster
2.1 Edit core-site.xml
ha.zookeeper.quorum = node1:2181,node2:2181,node3:2181
fs.defaultFS = hdfs://ns
hadoop.tmp.dir = /opt/server/hadoop-2.7.5/data/tmp
fs.trash.interval = 10080
2.2 Edit hdfs-site.xml
dfs.nameservices = ns
dfs.ha.namenodes.ns = nn1,nn2
dfs.namenode.rpc-address.ns.nn1 = node1:8020
dfs.namenode.rpc-address.ns.nn2 = node2:8020
dfs.namenode.servicerpc-address.ns.nn1 = node1:8022
dfs.namenode.servicerpc-address.ns.nn2 = node2:8022
dfs.namenode.http-address.ns.nn1 = node1:50070
dfs.namenode.http-address.ns.nn2 = node2:50070
dfs.namenode.shared.edits.dir = qjournal://node1:8485;node2:8485;node3:8485/ns1
dfs.client.failover.proxy.provider.ns = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.ha.fencing.methods = sshfence
dfs.ha.fencing.ssh.private-key-files = /root/.ssh/id_rsa
dfs.journalnode.edits.dir = /opt/server/hadoop-2.7.5/data/dfs/jn
dfs.ha.automatic-failover.enabled = true
dfs.namenode.name.dir = file:///opt/server/hadoop-2.7.5/data/dfs/nn/name
dfs.namenode.edits.dir = file:///opt/server/hadoop-2.7.5/data/dfs/nn/edits
dfs.datanode.data.dir = file:///opt/server/hadoop-2.7.5/data/dfs/dn
dfs.permissions = false
dfs.blocksize = 134217728
2.3 Edit yarn-site.xml; note that the configuration on node3 differs from node2
yarn.log-aggregation-enable = true
yarn.resourcemanager.ha.enabled = true
yarn.resourcemanager.cluster-id = mycluster
yarn.resourcemanager.ha.rm-ids = rm1,rm2
yarn.resourcemanager.hostname.rm1 = node2
yarn.resourcemanager.hostname.rm2 = node3
yarn.resourcemanager.address.rm1 = node2:8032
yarn.resourcemanager.scheduler.address.rm1 = node2:8030
yarn.resourcemanager.resource-tracker.address.rm1 = node2:8031
yarn.resourcemanager.admin.address.rm1 = node2:8033
yarn.resourcemanager.webapp.address.rm1 = node2:8088
yarn.resourcemanager.address.rm2 = node3:8032
yarn.resourcemanager.scheduler.address.rm2 = node3:8030
yarn.resourcemanager.resource-tracker.address.rm2 = node3:8031
yarn.resourcemanager.admin.address.rm2 = node3:8033
yarn.resourcemanager.webapp.address.rm2 = node3:8088
yarn.resourcemanager.recovery.enabled = true
yarn.resourcemanager.ha.id = rm1 (needed if we want to launch more than one RM in a single node)
yarn.resourcemanager.store.class = org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
yarn.resourcemanager.zk-address = node2:2181,node3:2181,node1:2181 (for multiple zk services, separate them with commas)
yarn.resourcemanager.ha.automatic-failover.enabled = true (enable automatic failover; by default it is enabled only when HA is enabled)
yarn.client.failover-proxy-provider = org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider
yarn.nodemanager.resource.cpu-vcores = 2
yarn.nodemanager.resource.memory-mb = 2048
yarn.scheduler.minimum-allocation-mb = 1024
yarn.scheduler.maximum-allocation-mb = 2048
yarn.log-aggregation.retain-seconds = 2592000
yarn.nodemanager.log.retain-seconds = 604800
yarn.nodemanager.log-aggregation.compression-type = gz
yarn.nodemanager.local-dirs = /opt/server/hadoop-2.7.5/yarn/local
yarn.resourcemanager.max-completed-applications = 1000
yarn.nodemanager.aux-services = mapreduce_shuffle
yarn.resourcemanager.connect.retry-interval.ms = 2000
2.4 Edit mapred-site.xml (first copy it from mapred-site.xml.template)
mapreduce.framework.name = yarn
mapreduce.jobhistory.address = node3:10020
mapreduce.jobhistory.webapp.address = node3:19888
mapreduce.jobtracker.system.dir = /opt/server/hadoop-2.7.5/data/system/jobtracker
mapreduce.map.memory.mb = 1024
mapreduce.reduce.memory.mb = 1024
mapreduce.task.io.sort.mb = 100
mapreduce.task.io.sort.factor = 10
mapreduce.reduce.shuffle.parallelcopies = 15
yarn.app.mapreduce.am.command-opts = -Xmx1024m
yarn.app.mapreduce.am.resource.mb = 1536
mapreduce.cluster.local.dir = /opt/server/hadoop-2.7.5/data/system/local
2.5 Edit slaves
node1
node2
node3
2.6 Edit hadoop-env.sh
export JAVA_HOME=/export/server/jdk1.8.0_241
2.7 Distribute to the other nodes
cd /opt/server
scp -r hadoop-2.7.5/ node2:$PWD
scp -r hadoop-2.7.5/ node3:$PWD
2.8 Run the following commands on all three machines:
mkdir -p /opt/server/hadoop-2.7.5/data/dfs/nn/name
mkdir -p /opt/server/hadoop-2.7.5/data/dfs/nn/edits
2.9 On node3, change yarn.resourcemanager.ha.id to rm2
vim yarn-site.xml
2.10 Start the cluster
On node1:
cd /opt/server/hadoop-2.7.5
bin/hdfs zkfc -formatZK
sbin/hadoop-daemons.sh start journalnode
bin/hdfs namenode -format
bin/hdfs namenode -initializeSharedEdits -force
sbin/start-dfs.sh
On node2:
cd /opt/server/hadoop-2.7.5
bin/hdfs namenode -bootstrapStandby
sbin/hadoop-daemon.sh start namenode
sbin/start-yarn.sh
bin/yarn rmadmin -getServiceState rm1 (check the ResourceManager state)
On node3:
cd /export/servers/hadoop-2.7.5
sbin/start-yarn.sh
bin/yarn rmadmin -getServiceState rm2
sbin/mr-jobhistory-daemon.sh start historyserver
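After start-up it is worth confirming that exactly one NameNode is active and the other is standby; the hdfs haadmin command ships with Hadoop for this (nn1/nn2 are the IDs configured in dfs.ha.namenodes.ns above):

```shell
# One of these should report "active", the other "standby".
bin/hdfs haadmin -getServiceState nn1
bin/hdfs haadmin -getServiceState nn2

# Web UIs for a visual check, on the HTTP addresses configured earlier:
# http://node1:50070 and http://node2:50070
```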



