I. Compiling the hadoop-3.2.2 source
  1. Reference: the official build guide
  2. Build environment
    2.1 Installing dependencies
  3. Compiling hadoop-3.2.2
II. LZO compression support
  1. Installing LZO and LZOP on Linux
  2. Getting and unpacking the hadoop-lzo source
  2. Compiling hadoop-lzo
  3. Configuring Hadoop's core-site.xml and mapred-site.xml
  4. Testing whether hadoop-3.2.2 supports LZO compression
I. Compiling the hadoop-3.2.2 source

1. Reference: https://github.com/apache/hadoop/blob/trunk/BUILDING.txt

2. Build environment
Virtual machine: VMware Workstation 15
Linux: CentOS 7
JDK: 1.8
cmake: 3.20.2
Hadoop: 3.2.2
Maven: 3.8.4
Protobuf: 2.5.0
findbugs (optional): findbugs-3.0.1
apache-ant (optional): apache-ant-1.10.12
Download links:

Hadoop: https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2-src.tar.gz
cmake: https://cmake.org/files/v3.20/cmake-3.20.2.tar.gz
Protobuf: https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
findbugs: http://prdownloads.sourceforge.net/findbugs/findbugs-3.0.1.tar.gz?download
apache-ant: https://dlcdn.apache.org//ant/binaries/apache-ant-1.10.12-bin.tar.gz

2.1 Installing dependencies
As root, run the following to install the build dependencies (duplicates from the original list removed; glibc-headers is the actual CentOS package name):

yum install -y gcc gcc-c++ glibc-headers make autoconf automake libtool curl zlib zlib-devel openssl openssl-devel ncurses-devel snappy snappy-devel bzip2 bzip2-devel lzo lzo-devel lzop libXtst
Java and Maven were installed earlier: Java under the root user, Maven under the regular user.
Install protobuf:

tar -zxvf protobuf-2.5.0.tar.gz -C /opt/
cd /opt/protobuf-2.5.0/
./configure && make && make check && make install
ldconfig
protoc --version    # verify the install succeeded
Install cmake:

tar -zxvf cmake-3.20.2.tar.gz -C /opt/
cd /opt/cmake-3.20.2/
./configure
make && make install
ldconfig
cmake --version    # verify the install succeeded
Install findbugs:
tar -zxvf findbugs-3.0.1.tar.gz?download -C /opt/
Install apache-ant:
tar -zxvf apache-ant-1.10.12-bin.tar.gz -C /opt/
Add the environment variables to /etc/profile:

export PROTOBUF_HOME=/opt/protobuf-2.5.0
export ANT_HOME=/opt/apache-ant-1.10.12
export CMAKE_HOME=/opt/cmake-3.20.2
export FIND_BUGS_HOME=/opt/findbugs-3.0.1
export PATH=$PROTOBUF_HOME:$CMAKE_HOME/bin:$ANT_HOME/bin:$FIND_BUGS_HOME/bin:$PATH

3. Compiling hadoop-3.2.2
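After reloading the profile (source /etc/profile), a quick loop can confirm the build tools actually resolve on PATH before starting the long Maven build. This is a sketch, not a step from the original post; the tool list is taken from this article and can be adjusted:

```shell
# Sketch: check that each build tool is resolvable on PATH.
# The tool list mirrors this post; adjust it to your setup.
checked=0
for tool in protoc cmake ant mvn; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: NOT on PATH - recheck /etc/profile"
  fi
  checked=$((checked + 1))
done
```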
As the regular user, unpack the source, enter the directory, and run the build:

tar -zxvf hadoop-3.2.2-src.tar.gz -C ~/sourcecode/
cd ~/sourcecode/hadoop-3.2.2-src/
mvn clean package -DskipTests -Pdist,native -Dtar
Result:

After a successful build, the hadoop-dist/target/ directory contains a hadoop-3.2.2.tar.gz file.

Deploying hadoop-3.2.2.tar.gz is not covered here; I wrote about it in an earlier post, which you can refer to.

II. LZO compression support: making Hadoop split LZO-compressed files
Hadoop does not support LZO out of the box. Running hadoop checknative shows no lzo entry:
[ruoze@hadoop001 hadoop]$ hadoop checknative
2022-01-27 13:41:59,618 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
2022-01-27 13:41:59,620 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2022-01-27 13:41:59,623 WARN zstd.ZStandardCompressor: Error loading zstandard native libraries: java.lang.InternalError: Cannot load libzstd.so.1 (libzstd.so.1: cannot open shared object file: No such file or directory)!
2022-01-27 13:41:59,629 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
2022-01-27 13:41:59,684 INFO nativeio.NativeIO: The native code was built without PMDK support.
Native library checking:
hadoop:  true /home/ruoze/app/hadoop-3.2.2/lib/native/libhadoop.so.1.0.0
zlib:    true /lib64/libz.so.1
zstd  :  false
snappy:  true /lib64/libsnappy.so.1
lz4:     true revision:10301
bzip2:   true /lib64/libbz2.so.1
openssl: true /lib64/libcrypto.so
ISA-L:   false libhadoop was built without ISA-L support
PMDK:    false The native code was built without PMDK support.

1. Installing LZO and LZOP on Linux
LZO and lzop are already installed here; install them yourself if they are missing:

[ruoze@hadoop001 data]$ which lzop
/bin/lzop

Compress:   lzop -v file
Decompress: lzop -dv file

2. Getting and unpacking the hadoop-lzo source
wget https://github.com/twitter/hadoop-lzo/archive/master.zip
unzip master.zip -d ~/sourcecode/

[ruoze@hadoop001 sourcecode]$ cd hadoop-lzo-master/
[ruoze@hadoop001 hadoop-lzo-master]$ ll
total 68
-rw-rw-r--. 1 ruoze ruoze 35151 Mar  5  2021 COPYING
-rw-rw-r--. 1 ruoze ruoze 19760 Mar  5  2021 pom.xml
-rw-rw-r--. 1 ruoze ruoze 10179 Mar  5  2021 README.md
drwxrwxr-x. 2 ruoze ruoze    34 Mar  5  2021 scripts
drwxrwxr-x. 4 ruoze ruoze    28 Mar  5  2021 src
[ruoze@hadoop001 hadoop-lzo-master]$
Edit pom.xml so the Hadoop version matches the one you run:

[ruoze@hadoop001 hadoop-lzo-master]$ vi pom.xml

<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<hadoop.current.version>3.2.2</hadoop.current.version>
<hadoop.old.version>1.0.4</hadoop.old.version>
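If you prefer a non-interactive edit over vi, a sed substitution can pin the version. The snippet below is only an illustration run on a throwaway one-line fragment (/tmp/pom-fragment.xml is a made-up path); point sed at the real pom.xml in practice:

```shell
# Sketch: set <hadoop.current.version> to 3.2.2 without opening an editor.
# Demonstrated on a temporary fragment; use the real pom.xml path instead.
printf '<hadoop.current.version>2.6.4</hadoop.current.version>\n' > /tmp/pom-fragment.xml
sed -i 's#<hadoop.current.version>.*</hadoop.current.version>#<hadoop.current.version>3.2.2</hadoop.current.version>#' /tmp/pom-fragment.xml
cat /tmp/pom-fragment.xml
```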
As root, run:

yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool
yum install -y git

2. Compiling hadoop-lzo
Switch back to the regular user, enter the hadoop-lzo-master directory, and build:
mvn clean package -Dmaven.test.skip=true
Copy the hadoop-lzo-0.4.21-SNAPSHOT.jar produced under target/ into $HADOOP_HOME/share/hadoop/common/:

cp target/hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/

3. Configuring Hadoop's core-site.xml and mapred-site.xml
Add the following to core-site.xml:

<!-- The key additions are the com.hadoop.compression.lzo.LzoCodec and LzopCodec classes. -->
<property>
  <name>io.compression.codecs</name>
  <value>
    org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.BZip2Codec,
    org.apache.hadoop.io.compress.SnappyCodec,
    com.hadoop.compression.lzo.LzoCodec,
    com.hadoop.compression.lzo.LzopCodec
  </value>
</property>
<!-- io.compression.codec.lzo.class must point to LzoCodec, not LzopCodec;
     otherwise the compressed output will not be splittable. -->
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Add the following to mapred-site.xml:

<!-- compress intermediate (map) output -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<!-- compress final job output -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
If this is a cluster, propagate the modified core-site.xml and mapred-site.xml to every machine as well, then start the cluster.
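Propagating the two files can be scripted. The loop below is a sketch that only prints the scp commands (a dry run); the worker hostnames hadoop002/hadoop003 are placeholders, not hosts from this post:

```shell
# Sketch: print the scp commands that would sync the two configs to each worker.
# Hostnames are placeholders; remove the echo to actually copy.
count=0
for host in hadoop002 hadoop003; do
  for f in core-site.xml mapred-site.xml; do
    echo "scp \$HADOOP_HOME/etc/hadoop/$f $host:\$HADOOP_HOME/etc/hadoop/$f"
    count=$((count + 1))
  done
done
```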
4. Testing whether hadoop-3.2.2 supports LZO compression

Here is a generated word dataset: 616 MB raw, 190 MB after LZO compression.
[ruoze@hadoop001 data]$ lzop -v makedatawordcount.txt
compressing makedatawordcount.txt into makedatawordcount.txt.lzo
[ruoze@hadoop001 data]$ ll -h
-rw-r--r--. 1 ruoze ruoze 616M Jan 27 10:39 makedatawordcount.txt
-rw-r--r--. 1 ruoze ruoze 190M Jan 27 10:39 makedatawordcount.txt.lzo
[ruoze@hadoop001 data]$ hdfs dfs -put makedatawordcount.txt.lzo /data/
At 190 MB, the compressed file is larger than one 128 MB block, so we can observe how Hadoop splits an LZO-compressed file.
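The expected split count follows from simple arithmetic: splits = ceil(file size / block size). A quick sketch with the sizes from this post and the default 128 MB block:

```shell
# Sketch: expected number of input splits for a splittable 190 MB file
# with the default 128 MB HDFS block size.
filesize=$((190 * 1024 * 1024))
blocksize=$((128 * 1024 * 1024))
splits=$(( (filesize + blocksize - 1) / blocksize ))   # ceiling division
echo "$splits"   # prints 2
```

So a splittable codec should yield two splits for this file, which is exactly what the indexed run later shows.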
Test one:

Before lzo was configured in core-site.xml and mapred-site.xml, the following command printed garbage:

hdfs dfs -text /data/makedatawordcount.txt.lzo

After configuring both files, the same command prints readable text, which shows that hadoop-3.2.2 can read LZO-compressed data once lzo is configured.
Test two:
[ruoze@hadoop001 hadoop]$ find ./ -name *example*.jar
./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar
./share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.2.2-test-sources.jar
./share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.2.2-sources.jar
[ruoze@hadoop001 hadoop]$
[ruoze@hadoop001 hadoop]$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount /data/makedatawordcount.txt.lzo /output
2022-01-27 13:25:58,656 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2022-01-27 13:25:58,973 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/ruoze/.staging/job_1643252767325_0001
2022-01-27 13:25:59,108 INFO input.FileInputFormat: Total input files to process : 1
2022-01-27 13:25:59,122 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
2022-01-27 13:25:59,123 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 26dc7b4620ff16bb6f1fdd48f915ce5fb8222d6f]
2022-01-27 13:25:59,999 INFO mapreduce.JobSubmitter: number of splits:1
2022-01-27 13:26:00,528 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1643252767325_0001
2022-01-27 13:26:00,529 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-01-27 13:26:00,696 INFO conf.Configuration: resource-types.xml not found
2022-01-27 13:26:00,696 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-01-27 13:26:00,854 INFO impl.YarnClientImpl: Submitted application application_1643252767325_0001
2022-01-27 13:26:00,920 INFO mapreduce.Job: The url to track the job: http://hadoop001:8123/proxy/application_1643252767325_0001/
2022-01-27 13:26:00,921 INFO mapreduce.Job: Running job: job_1643252767325_0001
2022-01-27 13:26:06,005 INFO mapreduce.Job: Job job_1643252767325_0001 running in uber mode : false
2022-01-27 13:26:06,006 INFO mapreduce.Job:  map 0% reduce 0%
2022-01-27 13:26:22,142 INFO mapreduce.Job:  map 38% reduce 0%
2022-01-27 13:26:28,166 INFO mapreduce.Job:  map 56% reduce 0%
2022-01-27 13:26:33,234 INFO mapreduce.Job:  map 100% reduce 0%
2022-01-27 13:26:40,280 INFO mapreduce.Job:  map 100% reduce 100%
2022-01-27 13:26:40,289 INFO mapreduce.Job: Job job_1643252767325_0001 completed successfully
2022-01-27 13:26:40,341 INFO mapreduce.Job: Counters: 54
    File System Counters
        FILE: Number of bytes read=46027310
        FILE: Number of bytes written=51351200
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=162500380
        HDFS: Number of bytes written=1639998
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
        HDFS: Number of bytes read erasure-coded=0
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=25572
        Total time spent by all reduces in occupied slots (ms)=3491
        Total time spent by all map tasks (ms)=25572
        Total time spent by all reduce tasks (ms)=3491
        Total vcore-milliseconds taken by all map tasks=25572
        Total vcore-milliseconds taken by all reduce tasks=3491
        Total megabyte-milliseconds taken by all map tasks=26185728
        Total megabyte-milliseconds taken by all reduce tasks=3574784
    Map-Reduce Framework
        Map input records=13100000
        Map output records=13100000
        Map output bytes=567665192
        Map output materialized bytes=4853171
        Input split bytes=117
        Combine input records=17818436
        Combine output records=5249877
        Reduce input groups=531441
        Reduce shuffle bytes=4853171
        Reduce input records=531441
        Reduce output records=531441
        Spilled Records=5781318
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=214
        CPU time spent (ms)=31570
        Physical memory (bytes) snapshot=763604992
        Virtual memory (bytes) snapshot=5603123200
        Total committed heap usage (bytes)=698351616
        Peak Map Physical memory (bytes)=505016320
        Peak Map Virtual memory (bytes)=2800398336
        Peak Reduce Physical memory (bytes)=258588672
        Peak Reduce Virtual memory (bytes)=2802724864
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=162500263
    File Output Format Counters
        Bytes Written=1639998
[ruoze@hadoop001 hadoop]$
[ruoze@hadoop001 hadoop]$ hdfs dfs -text /output/*
2022-01-27 13:40:55,134 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
2022-01-27 13:40:55,137 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 26dc7b4620ff16bb6f1fdd48f915ce5fb8222d6f]
2022-01-27 13:40:55,140 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
2022-01-27 13:40:55,141 INFO compress.CodecPool: Got brand-new decompressor [.bz2]
Apple     10663851
China     10664268
WORD      10674905
beijing   10667757
bigdata   10666953
china     10662630
shanghai  10666100
word      10670000
world     10663536
Test two shows that with an LZO-compressed input file of words, the wordcount example runs and produces correct results, so Hadoop configured with lzo can consume compressed input.

But look at the job log: although the .lzo file is 190 MB, larger than one block, it still produced only one split (number of splits:1), so splitting is not yet supported.

How do we enable splitting? As follows:
Build an index for the .lzo file above using the hadoop-lzo-0.4.21-SNAPSHOT.jar:
[ruoze@hadoop001 hadoop]$ hdfs dfs -ls /data/
Found 1 items
-rw-r--r--   1 ruoze supergroup  198480989 2022-01-27 13:38 /data/makedatawordcount.txt.lzo
[ruoze@hadoop001 hadoop]$
[ruoze@hadoop001 hadoop]$ hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar
> com.hadoop.compression.lzo.LzoIndexer /data/makedatawordcount.txt.lzo
2022-01-27 13:50:01,977 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
2022-01-27 13:50:01,978 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 26dc7b4620ff16bb6f1fdd48f915ce5fb8222d6f]
2022-01-27 13:50:02,422 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /data/makedatawordcount.txt.lzo, size 0.18 GB...
2022-01-27 13:50:02,683 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.26 seconds (725.23 MB/s).  Index size is 19.23 KB.
[ruoze@hadoop001 hadoop]$
[ruoze@hadoop001 hadoop]$ hdfs dfs -ls /data/
Found 2 items
-rw-r--r--   1 ruoze supergroup  198480989 2022-01-27 13:38 /data/makedatawordcount.txt.lzo
-rw-r--r--   1 ruoze supergroup      19696 2022-01-27 13:50 /data/makedatawordcount.txt.lzo.index
You can see that a makedatawordcount.txt.lzo.index file was generated alongside the data file.

Run wordcount again, and the split count is now two: number of splits:2
[ruoze@hadoop001 hadoop]$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar
> wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat
> /data/makedatawordcount.txt.lzo /output
2022-01-27 14:11:11,665 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2022-01-27 14:11:11,999 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/ruoze/.staging/job_1643252767325_0007
2022-01-27 14:11:12,529 INFO input.FileInputFormat: Total input files to process : 1
2022-01-27 14:11:13,450 INFO mapreduce.JobSubmitter: number of splits:2
2022-01-27 14:11:13,550 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1643252767325_0007
2022-01-27 14:11:13,551 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-01-27 14:11:13,645 INFO conf.Configuration: resource-types.xml not found
2022-01-27 14:11:13,646 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-01-27 14:11:13,687 INFO impl.YarnClientImpl: Submitted application application_1643252767325_0007
2022-01-27 14:11:13,722 INFO mapreduce.Job: The url to track the job: http://hadoop001:8123/proxy/application_1643252767325_0007/
2022-01-27 14:11:13,722 INFO mapreduce.Job: Running job: job_1643252767325_0007
2022-01-27 14:11:17,785 INFO mapreduce.Job: Job job_1643252767325_0007 running in uber mode : false
2022-01-27 14:11:17,785 INFO mapreduce.Job:  map 0% reduce 0%
2022-01-27 14:11:33,980 INFO mapreduce.Job:  map 66% reduce 0%
2022-01-27 14:11:40,066 INFO mapreduce.Job:  map 73% reduce 0%
2022-01-27 14:11:46,108 INFO mapreduce.Job:  map 81% reduce 0%
2022-01-27 14:11:48,119 INFO mapreduce.Job:  map 100% reduce 0%
2022-01-27 14:11:49,125 INFO mapreduce.Job:  map 100% reduce 100%
2022-01-27 14:11:51,144 INFO mapreduce.Job: Job job_1643252767325_0007 completed successfully
2022-01-27 14:11:51,196 INFO mapreduce.Job: Counters: 55
    File System Counters
        FILE: Number of bytes read=4992
        FILE: Number of bytes written=711952
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=198561903
        HDFS: Number of bytes written=137
        HDFS: Number of read operations=11
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
        HDFS: Number of bytes read erasure-coded=0
    Job Counters
        Killed map tasks=1
        Launched map tasks=3
        Launched reduce tasks=1
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=54295
        Total time spent by all reduces in occupied slots (ms)=11774
        Total time spent by all map tasks (ms)=54295
        Total time spent by all reduce tasks (ms)=11774
        Total vcore-milliseconds taken by all map tasks=54295
        Total vcore-milliseconds taken by all reduce tasks=11774
        Total megabyte-milliseconds taken by all map tasks=55598080
        Total megabyte-milliseconds taken by all reduce tasks=12056576
    Map-Reduce Framework
        Map input records=16000000
        Map output records=96000000
        Map output bytes=1013322815
        Map output materialized bytes=260
        Input split bytes=234
        Combine input records=96000279
        Combine output records=297
        Reduce input groups=9
        Reduce shuffle bytes=260
        Reduce input records=18
        Reduce output records=9
        Spilled Records=432
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=676
        CPU time spent (ms)=51250
        Physical memory (bytes) snapshot=1249132544
        Virtual memory (bytes) snapshot=8401858560
        Total committed heap usage (bytes)=1059061760
        Peak Map Physical memory (bytes)=506130432
        Peak Map Virtual memory (bytes)=2801958912
        Peak Reduce Physical memory (bytes)=239939584
        Peak Reduce Virtual memory (bytes)=2801532928
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=198561669
    File Output Format Counters
        Bytes Written=137
[ruoze@hadoop001 hadoop]$
Summary: you must index the .lzo file with one of the following commands before it can be split:

hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /data/makedatawordcount.txt.lzo

or

hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
  com.hadoop.compression.lzo.LzoIndexer /data/makedatawordcount.txt.lzo

In addition, the job must be run with the parameter -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat, for example:

hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar \
  wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
  /data/makedatawordcount.txt.lzo /output

Without that parameter, the split count stays at 1.
Summary: Hadoop's native libraries do not support LZO, and viewing an .lzo file yields garbage. You must build hadoop-lzo, place its jar into Hadoop, and add the lzo settings to core-site.xml before Hadoop can read LZO files. Even then, LZO files are not splittable by default: you must create an index for each .lzo file to enable splitting.



