
A simple comparison of Hive textfile, ORC, and Parquet formats


Use datafaker to generate one million rows of data on a virtual machine; two of the columns have heavily repeated values, and two have few repeats.
Table structure (the same four-column schema is used for test_text, test_orc, and test_parquet; only the STORED AS clause differs):

create table if not exists test_parquet(
  advertiser_id string,
  ad_plan_id string,
  gen_time TIMESTAMP,
  cnt BIGINT
) stored as parquet;

First use datafaker to generate 100,000 rows; the meta file is as follows:

advertiser_id||string||advertiser_id[:ipv4]
ad_plan_id||string||ad_plan_id[:color_name]
gen_time||timestamp||gen_time[:datetime(1,'%Y-%m-%d %R:%S')]
cnt||int||cnt[:int(1, 10000)] 
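For readers unfamiliar with datafaker's meta syntax, the four field rules above can be mimicked with a short Python sketch (a hypothetical stand-in for illustration, not datafaker's actual API): a random IPv4 string, a color name drawn from a small domain (so it repeats heavily and compresses well), a timestamp, and an integer in [1, 10000].

```python
import random
from datetime import datetime, timedelta

# Small value domain for ad_plan_id, mirroring datafaker's color_name provider.
COLORS = ["red", "green", "blue", "black", "white", "yellow", "cyan", "magenta"]

def fake_row(base=datetime(2022, 1, 1)):
    # advertiser_id: random IPv4 address as a string
    advertiser_id = ".".join(str(random.randint(0, 255)) for _ in range(4))
    # ad_plan_id: color name, heavily repeated across rows
    ad_plan_id = random.choice(COLORS)
    # gen_time: timestamp within one day of the base, '%Y-%m-%d %H:%M:%S'
    gen_time = (base + timedelta(seconds=random.randint(0, 86400))).strftime("%Y-%m-%d %H:%M:%S")
    # cnt: integer in [1, 10000]
    cnt = random.randint(1, 10000)
    return (advertiser_id, ad_plan_id, gen_time, cnt)

for row in (fake_row() for _ in range(5)):
    print("\t".join(map(str, row)))  # tab-separated, like a textfile table row
```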

Then generate the remaining rows by varying the timestamp and using rand, insert everything in one pass into the textfile, ORC, and Parquet tables, and compare the resulting storage footprint:
[root@140 opt]# hdfs dfs -ls /user/hive/warehouse/test_orc
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
Found 1 items
-rw-r--r-- 3 root supergroup 3008038 2022-01-14 07:21 /user/hive/warehouse/test_orc/000000_0
[root@140 opt]# hdfs dfs -ls /user/hive/warehouse/test_text
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
Found 1 items
-rw-r--r-- 3 root supergroup 56741450 2022-01-14 07:20 /user/hive/warehouse/test_text/000000_0
[root@140 opt]# hdfs dfs -ls /user/hive/warehouse/test_parquet
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
Found 1 items
-rw-r--r-- 3 root supergroup 3555302 2022-01-14 07:23 /user/hive/warehouse/test_parquet/000000_0

From the file listings: ORC (3,008,038 bytes) < Parquet (3,555,302 bytes), and both are far smaller than textfile (56,741,450 bytes).
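A quick back-of-the-envelope calculation on the sizes listed above (numbers copied straight from the `hdfs dfs -ls` output) makes the compression gap concrete:

```python
# File sizes in bytes, copied from the `hdfs dfs -ls` listings above.
sizes = {"textfile": 56741450, "orc": 3008038, "parquet": 3555302}

for fmt in ("orc", "parquet"):
    ratio = sizes["textfile"] / sizes[fmt]
    print(f"{fmt}: {sizes[fmt]:,d} bytes, {ratio:.1f}x smaller than textfile")
```

For this dataset ORC comes out roughly 18.9x smaller than textfile, and Parquet roughly 16.0x smaller.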

Running the same SQL against each table also shows much smaller I/O for the columnar formats: the Parquet query reads about 3.5 MB from HDFS versus about 56.8 MB for textfile (see the HDFS Read counters in the job output below).

hive> select count(distinct advertiser_id ), count(distinct ad_plan_id) ,count(*),sum(cnt) from test_parquet;
Query ID = root_20220114072522_afd53a4a-b43a-42b6-a9d6-3a7931aba67f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1641954266240_0045, Tracking URL = http://resourcemanager:8088/proxy/application_1641954266240_0045/
Kill Command = /opt/hadoop-3.2.1/bin/mapred job -kill job_1641954266240_0045
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-01-14 07:25:34,363 Stage-1 map = 0%, reduce = 0%
2022-01-14 07:25:43,604 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.76 sec
2022-01-14 07:25:51,818 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.6 sec
MapReduce Total cumulative CPU time: 9 seconds 600 msec
Ended Job = job_1641954266240_0045
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 9.6 sec HDFS Read: 3491912 HDFS Write: 118 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 600 msec
OK
9893 140 1200000 0
Time taken: 30.467 seconds, Fetched: 1 row(s)
hive> select count(distinct advertiser_id ), count(distinct ad_plan_id) ,count(*),sum(cnt) from test_text;
Query ID = root_20220114072614_8e62cd3a-5d0b-46fc-9396-1e8a6c76ff6d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1641954266240_0046, Tracking URL = http://resourcemanager:8088/proxy/application_1641954266240_0046/
Kill Command = /opt/hadoop-3.2.1/bin/mapred job -kill job_1641954266240_0046
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-01-14 07:26:26,777 Stage-1 map = 0%, reduce = 0%
2022-01-14 07:26:34,992 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.03 sec
2022-01-14 07:26:43,196 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.74 sec
MapReduce Total cumulative CPU time: 8 seconds 740 msec
Ended Job = job_1641954266240_0046
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.74 sec HDFS Read: 56752177 HDFS Write: 118 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 740 msec
OK
9893 140 1200000 0
Time taken: 29.53 seconds, Fetched: 1 row(s)
hive> select count(distinct advertiser_id ), count(distinct ad_plan_id) ,count(*),sum(cnt) from test_orc;
Query ID = root_20220114072652_25eb02c7-bc0c-4940-a1cb-d8cce9225c39
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1641954266240_0047, Tracking URL = http://resourcemanager:8088/proxy/application_1641954266240_0047/
Kill Command = /opt/hadoop-3.2.1/bin/mapred job -kill job_1641954266240_0047
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-01-14 07:27:04,426 Stage-1 map = 0%, reduce = 0%
2022-01-14 07:27:12,638 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.1 sec
2022-01-14 07:27:21,872 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.49 sec
MapReduce Total cumulative CPU time: 9 seconds 490 msec
Ended Job = job_1641954266240_0047
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 9.49 sec HDFS Read: 3008544 HDFS Write: 118 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 490 msec
OK
9893 140 1200000 0
Time taken: 31.727 seconds, Fetched: 1 row(s)
hive> select sum(cnt) from test_parquet;
Query ID = root_20220114072742_99fec7cf-716f-4bd0-a555-b7797962bc03
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1641954266240_0048, Tracking URL = http://resourcemanager:8088/proxy/application_1641954266240_0048/
Kill Command = /opt/hadoop-3.2.1/bin/mapred job -kill job_1641954266240_0048
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2022-01-14 07:27:53,304 Stage-1 map = 0%, reduce = 0%
2022-01-14 07:28:01,557 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.67 sec
2022-01-14 07:28:09,745 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.45 sec
MapReduce Total cumulative CPU time: 7 seconds 450 msec
Ended Job = job_1641954266240_0048
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.45 sec HDFS Read: 15242 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 450 msec
OK
0
Time taken: 28.518 seconds, Fetched: 1 row(s)
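Pulling the HDFS Read counters out of the three identical four-aggregate runs above summarizes the I/O story:

```python
# HDFS Read bytes reported by the job counters above, for the identical
# four-aggregate query run against each table.
hdfs_read = {"parquet": 3491912, "textfile": 56752177, "orc": 3008544}

for fmt, nbytes in sorted(hdfs_read.items(), key=lambda kv: kv[1]):
    pct = nbytes / hdfs_read["textfile"]
    print(f"{fmt}: {nbytes:,d} bytes read ({pct:.1%} of textfile)")
```

Column pruning also explains the very last run: the single-column `sum(cnt)` query over the Parquet table reads only 15,242 bytes, because a columnar format can skip the other three columns entirely, while a textfile scan must read every row in full.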

Reprinted from www.mshxw.com; original article: https://www.mshxw.com/it/706808.html