Hive TPC-DS 3000 Benchmark Test

1. Download the resources

I used the Hortonworks version; other versions should work similarly.

git clone git@github.com:hortonworks/hive-testbench.git

2. Build

./tpcds-build.sh

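If the target server has no internet access, or you do not want to set up the build environment again, you can pack the whole directory after building, upload it to the target server, and unpack it there.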
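The pack-and-unpack step can be sketched with tar. This is a minimal sketch: the function names pack_testbench and unpack_testbench are hypothetical helpers, not part of hive-testbench, and the copy to the target server (for example with scp) is left out.

```shell
# Pack a built directory into a gzip tarball.
# $1 = directory to pack, $2 = output tarball path
pack_testbench() {
  tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# Unpack the tarball on the target server.
# $1 = tarball path, $2 = destination directory (must already exist)
unpack_testbench() {
  tar -xzf "$1" -C "$2"
}
```

For example, run pack_testbench ./hive-testbench /tmp/hive-testbench.tar.gz on the build machine, copy the tarball over, then run unpack_testbench on the target server.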

3. Generate the data

sh tpcds-setup.sh 3000 /tmp/tpds-gen

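The first argument is the data scale: 3000 means 3000 GB. The minimum is 2, meaning 2 GB. This argument is required.
The second argument is the location for temporary data. It defaults to ${fs.defaultFS}/tmp/tpcds-generate.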
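The argument rules above can be checked before starting a long data-generation run. The sketch below is only an illustration: tpcds_setup_cmd is a hypothetical wrapper that validates the scale and prints the command it would run, without actually invoking tpcds-setup.sh.

```shell
# Validate arguments and print the tpcds-setup.sh command to run.
# Returns 1 when the scale is missing, non-numeric, or below 2.
# $1 = scale factor in GB (required, minimum 2)
# $2 = temporary data directory (optional)
tpcds_setup_cmd() {
  scale="$1"
  dir="$2"
  case "$scale" in
    ''|*[!0-9]*)
      echo "scale must be a positive integer" >&2
      return 1
      ;;
  esac
  if [ "$scale" -lt 2 ]; then
    echo "scale must be at least 2" >&2
    return 1
  fi
  if [ -n "$dir" ]; then
    echo "sh tpcds-setup.sh $scale $dir"
  else
    # No directory given: the script falls back to its default location.
    echo "sh tpcds-setup.sh $scale"
  fi
}
```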

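By default, data generation uses beeline to connect to HiveServer2, with ZooKeeper for service discovery. If ZooKeeper service discovery is not configured, edit tpcds-setup.sh to connect to HiveServer2 directly.

The default way of connecting to HiveServer2: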

HIVE="beeline -n hive -u 'jdbc:hive2://localhost:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default' "

Connecting to HiveServer2 directly:

HIVE="beeline -n hive -u 'jdbc:hive2://localhost:10000' "
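The two connection styles differ only in the JDBC URL. As a small sketch, the hypothetical helper hive_jdbc_url below builds either form from a mode, host, and port; the function name and its arguments are assumptions for illustration and do not exist in tpcds-setup.sh.

```shell
# Build the beeline JDBC URL for either connection mode.
# $1 = mode: "zookeeper" or "direct"
# $2 = host, $3 = port
hive_jdbc_url() {
  case "$1" in
    zookeeper)
      echo "jdbc:hive2://$2:$3/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2?tez.queue.name=default"
      ;;
    direct)
      echo "jdbc:hive2://$2:$3"
      ;;
    *)
      echo "unknown mode: $1" >&2
      return 1
      ;;
  esac
}

# The URL is wrapped in single quotes so that, when the command string is
# later evaluated, the shell does not treat ';' as a command separator.
HIVE="beeline -n hive -u '$(hive_jdbc_url direct localhost 10000)' "
```

Note that both the opening and the closing single quote around the URL are needed; a missing closing quote breaks the command when it is evaluated.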

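Create the cluster

1 master node: 8 vcores, 32 GB memory.
20 core nodes: 4 vcores, 16 GB memory, and a 500 GB high-performance cloud disk each. The NodeManager on each core node is configured with 4 vcores and 12 GB of memory.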

Run the queries

Go into the sample-queries-tpcds directory, which contains the 99 query files.
The following script, query-all.sh, runs all 99 of them.

#!/bin/bash
# Runs the 99 TPC-DS queries in order and stops at the first failure.
# Uses bash-only syntax (the C-style for loop), so run it with bash.
SCALE=3000
LOGFILE=query-all-$(date +"%Y%m%d-%H%M%S").log
for ((i = 1; i < 100; i++)); do
   printf '\n\n\n\n'
   echo "Begin exec query${i} at $(date +"%Y%m%d-%H:%M.%S")" >> "${LOGFILE}" 2>&1
   # Record the exact command in the log before running it.
   echo hive -e "use tpcds_bin_partitioned_orc_${SCALE}; source query${i}.sql" >> "${LOGFILE}" 2>&1
   if ! hive -e "use tpcds_bin_partitioned_orc_${SCALE}; source query${i}.sql" >> "${LOGFILE}" 2>&1; then
      echo "End exec query${i} failed" >> "${LOGFILE}" 2>&1
      exit 1
   fi
   echo "End exec query${i} successfully at $(date +"%Y%m%d-%H:%M.%S")" >> "${LOGFILE}" 2>&1
done

Modify tez-site.xml

tez.counters.max defaults to 1200, which is too small for this workload; change it to 200000.


<property>
      <name>tez.counters.max</name>
      <value>200000</value>
</property>
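To confirm that the edited file really carries the new value, a property can be read back from the XML. The helper below is a rough sketch, not a real XML parser: get_property is a hypothetical name, and it assumes each name and value element sits on a single line, with the value following its name.

```shell
# Print the value of a property from a Hadoop-style XML config file.
# $1 = XML file, $2 = property name
# Assumes <name>...</name> and <value>...</value> each fit on one line.
get_property() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1 }
    found && /<value>/ {
      sub(/.*<value>/, "")      # drop everything up to the value
      sub(/<\/value>.*/, "")    # drop the closing tag and the rest
      print
      exit
    }
  ' "$1"
}
```

For example, get_property tez-site.xml tez.counters.max should print the configured limit; the path to tez-site.xml depends on the installation.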

Run in Tez container mode

Each core node has 4 vcores and 16 GB of memory, so each container can be given 4 GB. Edit hive-site.xml to set the container size to 4 GB and adjust the JVM options.
Add the following to hive-site.xml:


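<property>
      <name>hive.tez.container.size</name>
      <value>4096</value>
</property>

<property>
      <name>hive.tez.java.opts</name>
      <value>-server -Xmx3545m -Djava.net.preferIPv4Stack=true -XX:NewRatio=8 -XX:+UseNUMA -XX:+UseParallelGC -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps</value>
</property>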
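The sizing above can be sanity-checked with a little shell arithmetic. This sketch only uses figures that appear in this article (12 GB per NodeManager, a 4096 MB container, a 3545 MB heap); the variable names are my own.

```shell
# Figures taken from this article, in MB.
NM_MEMORY_MB=$((12 * 1024))   # memory configured for each NodeManager
CONTAINER_MB=4096             # hive.tez.container.size
HEAP_MB=3545                  # -Xmx value in hive.tez.java.opts

# Containers that fit on one node when only memory is considered.
CONTAINERS_PER_NODE=$((NM_MEMORY_MB / CONTAINER_MB))

# JVM heap as a percentage of the container (integer division).
HEAP_PCT=$((HEAP_MB * 100 / CONTAINER_MB))

echo "containers per node by memory: ${CONTAINERS_PER_NODE}"
echo "heap share of container: ${HEAP_PCT}%"
```

The gap between the heap and the container size leaves room for non-heap JVM memory, which is why -Xmx is set below 4096m.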

Run the 99 queries.

nohup bash query-all.sh &

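LLAP test

Generate the LLAP package

With 20 core nodes at 4 vcores each, the cluster can run 80 containers in total. Hive automatically starts two Tez applications, so the overhead of their Application Masters and of the LLAP AM has to be subtracted from that total.
Because each server can run only one LLAP daemon, we start 20 instances, each with 8 GB of memory and 3 executors.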

hive --service llap --name llap-demo --instances 20 --cache 1280m --executors 3 --iothreads 1 --size 8000m --xmx 5000m --queue default --loglevel INFO
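The numbers passed to the command above can be cross-checked against the node size. This is a sketch under one assumption of mine that the article does not state: the daemon container (--size) has to hold at least the heap (--xmx) plus the cache (--cache). The variable names are illustrative only.

```shell
# Values from the hive --service llap command, in MB.
LLAP_INSTANCES=20
LLAP_EXECUTORS=3
LLAP_SIZE_MB=8000
LLAP_XMX_MB=5000
LLAP_CACHE_MB=1280
NM_MEMORY_MB=$((12 * 1024))   # NodeManager memory per core node

# Memory left inside each daemon container after heap and cache.
HEADROOM_MB=$((LLAP_SIZE_MB - LLAP_XMX_MB - LLAP_CACHE_MB))

# Memory left on each NodeManager for other containers such as AMs.
NODE_REMAINING_MB=$((NM_MEMORY_MB - LLAP_SIZE_MB))

# Executor slots across the whole cluster.
TOTAL_EXECUTORS=$((LLAP_INSTANCES * LLAP_EXECUTORS))

echo "headroom per daemon: ${HEADROOM_MB} MB"
echo "left per node: ${NODE_REMAINING_MB} MB"
echo "total executors: ${TOTAL_EXECUTORS}"
```

A negative headroom or remaining value would mean the chosen sizes do not fit the nodes described in this article.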

Start the LLAP service

llap-yarn-13Oct2021/run.sh

Run the queries

nohup bash query-all.sh &

As you can see, each run generates a new log file.
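To compare the two modes, the per-query run time can be pulled out of each log file. The sketch below assumes the log contains the "Begin exec" and "End exec ... successfully at" lines exactly as written by query-all.sh, with timestamps in the form YYYYMMDD-HH:MM.SS, and that no single query runs for 24 hours or more. The function name summarize_log is hypothetical.

```shell
# Print "queryN<TAB>seconds" for every query that finished successfully.
# $1 = log file produced by query-all.sh
summarize_log() {
  awk '
    # Convert "YYYYMMDD-HH:MM.SS" to seconds since midnight.
    function secs(ts,    t, hm) {
      split(ts, t, "-")          # t[2] holds "HH:MM.SS"
      split(t[2], hm, /[:.]/)    # hm[1]=HH, hm[2]=MM, hm[3]=SS
      return hm[1] * 3600 + hm[2] * 60 + hm[3]
    }
    # Line shape: Begin exec queryN at <timestamp>
    /^Begin exec query/ { start[$3] = secs($5) }
    # Line shape: End exec queryN successfully at <timestamp>
    /^End exec query.* successfully at/ {
      d = secs($6) - start[$3]
      if (d < 0) d += 86400      # the query ran across midnight
      printf "%s\t%d\n", $3, d
    }
  ' "$1"
}
```

Running it once on the Tez container mode log and once on the LLAP log gives the two columns needed for a side-by-side comparison.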

Query	Tez Container Mode	LLAP Mode
query1
