Hadoop Distributed Cluster Setup Lab Report: Getting Familiar with Common Hadoop Operations


HiBench

1. Introduction

HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput, and system resource utilization.

It contains a set of Hadoop, Spark, and streaming workloads, including
Sort, WordCount, TeraSort, Repartition, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight, and an enhanced DFSIO.

It also contains several streaming workloads for Spark Streaming, Flink, Storm, and Gearpump.

Workload categories: micro, ml (machine learning), sql, graph, websearch, and streaming

Supported frameworks: Hadoop, Spark, Flink, Storm, Gearpump

2. Check the environment

Hadoop is already installed on my cluster (Ubuntu 16), so this article only tests hadoopbench.

Check whether your environment is ready. If anything is missing, install it following the "Prerequisites" section below; if everything is already in place, skip straight to "Install HiBench".
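The check itself can be scripted; a minimal sketch that only reports whether each required tool is on PATH (it does not verify versions):

```shell
# Report which of the required tools are already installed and on PATH
for cmd in hadoop java mvn python bc; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: found"
  else
    echo "$cmd: MISSING"
  fi
done
```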

My environment (for reference):

Software / Version
Hadoop 2.10 (officially supported: Apache Hadoop 3.0.x, 3.1.x, 3.2.x, 2.x, CDH5, HDP)
Maven 3.3.9
Java 8
Python 2.7
3. Prerequisites

(Note: I tested these installation steps on CentOS; if that doesn't match your machine, follow another guide for your distribution.)

Install Hadoop

You can follow this article to install it: https://www.jianshu.com/p/4e0dc91ad86e

Install Java

Download the Java 8 RPM:

wget https://mirrors.huaweicloud.com/java/jdk/8u181-b13/jdk-8u181-linux-x64.rpm

Install the RPM:

rpm -ivh jdk-8u181-linux-x64.rpm

Configure the Java environment:

vim /etc/profile
JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
CLASSPATH=$JAVA_HOME/lib:$JAVA_HOME/jre/lib
PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
export PATH CLASSPATH JAVA_HOME

Reload the environment variables:

source /etc/profile
Install Maven

wget https://dlcdn.apache.org/maven/maven-3/3.8.5/binaries/apache-maven-3.8.5-bin.zip --no-check-certificate
unzip apache-maven-3.8.5-bin.zip -d /usr/local/
cd
vim .bashrc

Add these lines to .bashrc (the path must match the version you unpacked, 3.8.5 here):

# set maven environment
export M3_HOME=/usr/local/apache-maven-3.8.5
export PATH=$M3_HOME/bin:$PATH

Apply the change and verify:

source .bashrc
mvn -v

Switch to the Aliyun mirror to speed up dependency downloads:

vi /usr/local/apache-maven-3.8.5/conf/settings.xml

Add the following inside the <mirrors> section:

<mirror>
    <id>alimaven</id>
    <name>aliyun maven</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <mirrorOf>central</mirrorOf>
</mirror>


Install Python

I originally had Python 3.7.3 and wanted to switch to 2.7, so I installed pyenv to manage multiple Python versions.

If you already have 2.7, there is no need to switch.

yum -y install git
git clone https://gitee.com/krypln/pyenv.git   ~/.pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n  eval "$(pyenv init -)"\nfi' >> ~/.bashrc
exec $SHELL
mkdir $PYENV_ROOT/cache && cd $PYENV_ROOT/cache
sudo yum install zlib-devel bzip2 bzip2-devel readline-devel sqlite sqlite-devel openssl-devel xz xz-devel libffi-devel git wget
wget https://mirrors.huaweicloud.com/python/2.7.2/Python-2.7.2.tar.xz
cd /root/.pyenv/plugins/python-build/share/python-build
vim 2.7.2
pyenv install 2.7.2

Contents of the 2.7.2 build definition (pointing it at the local file speeds up installation; downloading from python.org would otherwise be very slow):

#install_package "Python-2.7.2" "https://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz#1d54b7096c17902c3f40ffce7e5b84e0072d0144024184fff184a84d563abbb3" ldflags_dirs standard verify_py27 copy_python_gdb ensurepip

install_package "Python-2.7.2" /root/.pyenv/cache/Python-2.7.2.tar.xz ldflags_dirs standard verify_py27 copy_python_gdb ensurepip

Check and switch the Python version:

pyenv versions
pyenv global 2.7.2
Install bc
# bc is used to generate the report figures
yum install bc
4. Install HiBench

Download HiBench:

git clone https://github.com/Intel-bigdata/HiBench.git

Build only the required modules:

mvn -Phadoopbench -Dmodules -Psql -Dscala=2.11 clean package

Or build all modules (this takes much longer; it took me over an hour):

mvn -Dspark=2.4 -Dscala=2.11 clean package

5. Configure HiBench

Several configuration files under HiBench/conf need to be set up:

hibench.conf, hadoop.conf, frameworks.lst, benchmarks.lst

Configure them one at a time:

    hibench.conf: configure the dataset size and parallelism
# Data scale profile. Available value is tiny, small, large, huge, gigantic and bigdata.
# The definition of these profiles can be found in the workload's conf file i.e. conf/workloads/micro/wordcount.conf

hibench.scale.profile                tiny

# Mapper number in hadoop, partition number in Spark

hibench.default.map.parallelism         8

# Reducer number in hadoop, shuffle partition number in Spark

hibench.default.shuffle.parallelism     8
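To change the scale later you can edit the file by hand or script it. A small sketch using GNU sed against a throwaway copy (on the real setup the file is conf/hibench.conf):

```shell
# Make a throwaway copy of the relevant line, then switch the profile tiny -> large in place
cat > /tmp/hibench.conf <<'EOF'
hibench.scale.profile                tiny
EOF
sed -i 's/^\(hibench\.scale\.profile[[:space:]]*\).*/\1large/' /tmp/hibench.conf
cat /tmp/hibench.conf   # the profile line now reads "large"
```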

    hadoop.conf: configure your Hadoop cluster details. Make sure you know where Hadoop is installed on your own machine; do not copy these values verbatim.
cp conf/hadoop.conf.template conf/hadoop.conf

Then edit the hadoop.conf file:

vi hadoop.conf

Fill in the following (adjust for your own machine):

# Hadoop home directory
hibench.hadoop.home     /usr/local/hadoop

# The path of hadoop executable
hibench.hadoop.executable     ${hibench.hadoop.home}/bin/hadoop

# Hadoop configuration directory
hibench.hadoop.configure.dir  ${hibench.hadoop.home}/etc/hadoop

# The root HDFS path to store HiBench data
hibench.hdfs.master       hdfs://master:9000


# Hadoop release provider. Supported value: apache
hibench.hadoop.release    apache
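Before moving on, it is worth confirming that the paths you entered actually exist; a small sketch (the paths are the ones assumed above and will differ on your machine):

```shell
# Sanity-check the hadoop.conf values: each path should exist on the node running HiBench
for p in /usr/local/hadoop /usr/local/hadoop/bin/hadoop /usr/local/hadoop/etc/hadoop; do
  if [ -e "$p" ]; then
    echo "ok: $p"
  else
    echo "missing: $p"
  fi
done
```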

Where does the HDFS path above come from? Open etc/hadoop/core-site.xml under the Hadoop installation directory; the fs.defaultFS entry holds the HDFS namespace:

amax@master:/usr/local/hadoop/etc/hadoop$ vi core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- HDFS namespace: this is the value to use for hibench.hdfs.master -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
    <!-- I/O file buffer size -->
    <property>
        <name>io.file.buffer.size</name>
        <value>4096</value>
    </property>
    <!-- Base for temporary directories -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/tmp</value>
    </property>
</configuration>
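If you'd rather not open the file, the value can be pulled out on the command line. A grep/sed sketch against a sample core-site.xml written to /tmp (with the cluster up, `hdfs getconf -confKey fs.defaultFS` reports the same value directly):

```shell
# Write a sample core-site.xml, then extract the fs.defaultFS value from it
cat > /tmp/core-site-sample.xml <<'EOF'
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>
EOF
grep -A1 '<name>fs.defaultFS</name>' /tmp/core-site-sample.xml \
  | sed -n 's/.*<value>\(.*\)<\/value>.*/\1/p'   # prints hdfs://master:9000
```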

    frameworks.lst and benchmarks.lst: specify which benchmarks to run and on which platform

I use Hadoop:

amax@master:~/Hibench/Hibench-master/conf$ vi frameworks.lst
hadoop
# spark

Test wordcount first and comment out the rest:

amax@master:~/Hibench/Hibench-master/conf$ vi benchmarks.lst
#micro.sleep
#micro.sort
#micro.terasort
micro.wordcount
#micro.repartition
#micro.dfsioe

#sql.aggregation
#sql.join
#sql.scan

#websearch.nutchindexing
#websearch.pagerank

#ml.bayes
#ml.kmeans
#ml.lr
#ml.als
#ml.pca
#ml.gbt
#ml.rf
#ml.svd
#ml.linear
#ml.lda
#ml.svm
#ml.gmm
#ml.correlation
#ml.summarizer

#graph.nweight
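Lines starting with '#' are skipped, so you can preview which workloads are enabled before a run. A sketch against a throwaway list (point it at conf/benchmarks.lst on the real setup):

```shell
# List only the uncommented (enabled) workloads in a benchmarks.lst-style file
cat > /tmp/benchmarks.lst <<'EOF'
#micro.sort
micro.wordcount
#micro.terasort
EOF
grep -v '^[[:space:]]*#' /tmp/benchmarks.lst | grep -v '^[[:space:]]*$'   # prints micro.wordcount
```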

6. Run HiBench

Start Hadoop from the Hadoop installation directory (in Hadoop 2.x, start-all.sh lives under sbin):

./start-all.sh

Add execute permissions:

amax@master:~/Hibench/Hibench-master/bin$ chmod +x -R functions/
amax@master:~/Hibench/Hibench-master/bin$ chmod +x -R workloads/
amax@master:~/Hibench/Hibench-master/bin$ chmod +x run_all.sh

Start the run from HiBench's bin directory:

amax@master:~/Hibench/Hibench-master/bin$ ./run_all.sh
Prepare micro.wordcount ...
Exec script: /home/amax/Hibench/Hibench-master/bin/workloads/micro/wordcount/prepare/prepare.sh
patching args=
Parsing conf: /home/amax/Hibench/Hibench-master/conf/hadoop.conf
Parsing conf: /home/amax/Hibench/Hibench-master/conf/hibench.conf
Parsing conf: /home/amax/Hibench/Hibench-master/conf/workloads/micro/wordcount.conf
probe sleep jar: /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.10.1-tests.jar
start HadoopPrepareWordcount bench
hdfs rm -r: /usr/local/hadoop/bin/hadoop --config /usr/local/hadoop/etc/hadoop fs -rm -r -skipTrash hdfs://master:9000/HiBench/Wordcount/Input
Deleted hdfs://master:9000/HiBench/Wordcount/Input
Submit MapReduce Job: /usr/local/hadoop/bin/hadoop --config /usr/local/hadoop/etc/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=32000 -D mapreduce.randomtextwriter.bytespermap=4000 -D mapreduce.job.maps=8 -D mapreduce.job.reduces=8 hdfs://master:9000/HiBench/Wordcount/Input
The job took 14 seconds.
finish HadoopPrepareWordcount bench
Run micro/wordcount/hadoop
Exec script: /home/amax/Hibench/Hibench-master/bin/workloads/micro/wordcount/hadoop/run.sh
patching args=
Parsing conf: /home/amax/Hibench/Hibench-master/conf/hadoop.conf
Parsing conf: /home/amax/Hibench/Hibench-master/conf/hibench.conf
Parsing conf: /home/amax/Hibench/Hibench-master/conf/workloads/micro/wordcount.conf
probe sleep jar: /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.10.1-tests.jar
start HadoopWordcount bench
hdfs rm -r: /usr/local/hadoop/bin/hadoop --config /usr/local/hadoop/etc/hadoop fs -rm -r -skipTrash hdfs://master:9000/HiBench/Wordcount/Output
rm: `hdfs://master:9000/HiBench/Wordcount/Output': No such file or directory
hdfs du -s: /usr/local/hadoop/bin/hadoop --config /usr/local/hadoop/etc/hadoop fs -du -s hdfs://master:9000/HiBench/Wordcount/Input
Submit MapReduce Job: /usr/local/hadoop/bin/hadoop --config /usr/local/hadoop/etc/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount -D mapreduce.job.maps=8 -D mapreduce.job.reduces=8 -D mapreduce.inputformat.class=org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat -D mapreduce.outputformat.class=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat -D mapreduce.job.inputformat.class=org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat -D mapreduce.job.outputformat.class=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat hdfs://master:9000/HiBench/Wordcount/Input hdfs://master:9000/HiBench/Wordcount/Output
         Bytes Written=22168
finish HadoopWordcount bench
Run all done!

That's a successful run; you can swap in other benchmarks and try them yourself.

7. View the report

After a run finishes, the reports are all under HiBench's report folder:

amax@master:~/Hibench/Hibench-master/report$ vi hibench.report
Type         Date       Time     Input_data_size      Duration(s)          Throughput(bytes/s)  Throughput/node
HadoopWordcount 2022-03-27 15:17:33 35706                23.176               1540                 256

Check the log under report/wordcount/prepare:

amax@master:~/Hibench/Hibench-master/report/wordcount/prepare$ vi bench.log
2022-03-27 15:16:48 INFO Connecting to ResourceManager at master/172.31.58.2:8032
Running 8 maps.
Job started: Sun Mar 27 15:16:49 CST 2022
2022-03-27 15:16:49 INFO Connecting to ResourceManager at master/172.31.58.2:8032
2022-03-27 15:16:49 INFO number of splits:8
2022-03-27 15:16:50 INFO Submitting tokens for job: job_1641806957654_0004
2022-03-27 15:16:50 INFO resource-types.xml not found
2022-03-27 15:16:50 INFO Unable to find 'resource-types.xml'.
2022-03-27 15:16:50 INFO Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
2022-03-27 15:16:50 INFO Adding resource type - name = vcores, units = , type = COUNTABLE
2022-03-27 15:16:50 INFO Submitted application application_1641806957654_0004
2022-03-27 15:16:50 INFO The url to track the job: http://master:8088/proxy/application_1641806957654_0004/
2022-03-27 15:16:50 INFO Running job: job_1641806957654_0004
2022-03-27 15:16:57 INFO Job job_1641806957654_0004 running in uber mode : false
2022-03-27 15:16:57 INFO  map 0% reduce 0%
2022-03-27 15:17:02 INFO  map 100% reduce 0%
2022-03-27 15:17:03 INFO Job job_1641806957654_0004 completed successfully
2022-03-27 15:17:03 INFO Counters: 33
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=1675976
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=968
                HDFS: Number of bytes written=35706
                HDFS: Number of read operations=32
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=16
        Job Counters
                Killed map tasks=1
                Launched map tasks=8
                Other local map tasks=8
                Total time spent by all maps in occupied slots (ms)=237250
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=23725
                Total vcore-milliseconds taken by all map tasks=23725
                Total megabyte-milliseconds taken by all map tasks=242944000
        Map-Reduce framework
                Map input records=8
                Map output records=48
                Input split bytes=968

Check the log under report/wordcount/hadoop:

amax@master:~/Hibench/Hibench-master/report/wordcount/hadoop$ vi bench.log
2022-03-27 15:17:12 INFO Connecting to ResourceManager at master/172.31.58.2:8032
2022-03-27 15:17:13 INFO Total input files to process : 8
2022-03-27 15:17:13 INFO number of splits:8
2022-03-27 15:17:13 INFO mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
2022-03-27 15:17:13 INFO mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
2022-03-27 15:17:13 INFO Submitting tokens for job: job_1641806957654_0005
2022-03-27 15:17:13 INFO resource-types.xml not found
2022-03-27 15:17:13 INFO Unable to find 'resource-types.xml'.
2022-03-27 15:17:13 INFO Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
2022-03-27 15:17:13 INFO Adding resource type - name = vcores, units = , type = COUNTABLE
2022-03-27 15:17:13 INFO Submitted application application_1641806957654_0005
2022-03-27 15:17:13 INFO The url to track the job: http://master:8088/proxy/application_1641806957654_0005/
2022-03-27 15:17:13 INFO Running job: job_1641806957654_0005
2022-03-27 15:17:20 INFO Job job_1641806957654_0005 running in uber mode : false
2022-03-27 15:17:20 INFO  map 0% reduce 0%
2022-03-27 15:17:26 INFO  map 100% reduce 0%
2022-03-27 15:17:32 INFO  map 100% reduce 88%
2022-03-27 15:17:33 INFO  map 100% reduce 100%
2022-03-27 15:17:33 INFO Job job_1641806957654_0005 completed successfully
2022-03-27 15:17:33 INFO Counters: 51
        File System Counters
                FILE: Number of bytes read=40236
                FILE: Number of bytes written=3443888
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=36666
                HDFS: Number of bytes written=22168
                HDFS: Number of read operations=56
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=16
        Job Counters
                Killed reduce tasks=1
                Launched map tasks=8
                Launched reduce tasks=8
                Data-local map tasks=7
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=239320
                Total time spent by all reduces in occupied slots (ms)=481640
                Total time spent by all map tasks (ms)=23932
                Total time spent by all reduce tasks (ms)=24082
                Total vcore-milliseconds taken by all map tasks=23932

The fields in this log are explained in this reference:

http://hadoopmania.blogspot.com/2015/10/performance-monitoring-testing-and.html
