First, a comparison of storage speed and size across Hive table formats:
Benchmark data from https://blog.csdn.net/qq_30868737/article/details/106146066
Clearly, ORC with compression wins by a wide margin on both speed and data size.
Next, let's look at the difference between Sqoop's native TEXTFILE import and an HCatalog import into ORC.
Differences between Sqoop native import and HCatalog import
| Aspect | Sqoop native import | HCatalog import |
|---|---|---|
| Supported formats | TEXTFILE | RCFile, ORCFile |
| Column matching | by delimiter position | by column name |
| `--hive-overwrite` | supported | not supported |
| `LOAD DATA INPATH` | supported | not supported |
| Parameter | `--hive-database` | `--hcatalog-database` |
| Parameter | `--hive-table` | `--hcatalog-table` |
| Parameter | `--hive-partition-key` | `--hcatalog-partition-keys` |
| Parameter | `--hive-partition-value` | `--hcatalog-partition-values` |
| Parameter | (no equivalent) | `--hcatalog-storage-stanza 'stored as orcfile tblproperties("orc.compress"="SNAPPY")'` |
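Because HCatalog matches columns by name, the ORC target table is usually created up front (or sqoop creates it from `--hcatalog-storage-stanza` when it is missing). A minimal sketch of a matching DDL follows; the column names are illustrative placeholders, not the real `T_AGENT` schema, and the statement is only assembled here since executing it needs a cluster node.

```shell
# Hive DDL for a partitioned, SNAPPY-compressed ORC target table.
# Column names are illustrative placeholders, not the real T_AGENT schema.
DDL='CREATE TABLE IF NOT EXISTS dw.t_agent (
  agent_id STRING,
  contact  STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORCFILE
TBLPROPERTIES ("orc.compress"="SNAPPY");'
# On a cluster node this would be executed as: hive -e "$DDL"
echo "$DDL"
```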
Two demos for the record:
Sqoop native approach
1. Oracle
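Both demos read the database password from a JCEKS credential store via `--password-alias`. A sketch of creating that alias with the `hadoop credential` CLI (provider path and alias name are copied from the Oracle demo; the command is only assembled here, since running it requires a Hadoop client on the cluster):

```shell
# Assemble the `hadoop credential` call that stores the Oracle password
# under the alias referenced by --password-alias in the demos below.
# (Sketch only: actually running it requires a cluster node, where it
# prompts interactively for the password value.)
PROVIDER="jceks://hdfs/user/password/oracledb_password.jceks"
ALIAS="oracledb_password.alias"
CREATE_CMD="hadoop credential create ${ALIAS} -provider ${PROVIDER}"
echo "${CREATE_CMD}"
```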
```shell
sqoop import \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/password/oracledb_password.jceks \
  --connect jdbc:oracle:thin:@10.1.88.88:1521:<oracle_sid> \
  --username root \
  --password-alias oracledb_password.alias \
  --table XXXXXX.T_AGENT \
  --hive-import \
  --hive-database dw \
  --hive-table t_agent \
  --null-string '\N' \
  --null-non-string '\N' \
  --num-mappers 1 \
  --delete-target-dir \
  --relaxed-isolation
```
2. PostgreSQL
Note that everything after a bare `--` is passed to the connector, so `--schema public` must come last or the remaining sqoop flags are silently ignored. The field delimiter is Hive's default `\001`:

```shell
sqoop import \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/password/gdb_password.jceks \
  --connect jdbc:postgresql://10.29.88.88:5432/<db_name> \
  --username root \
  --password-alias pgdb_password.alias \
  --hive-import \
  --hive-overwrite \
  --table t_agent \
  --map-column-java contact=String \
  --map-column-hive contact=string \
  --hive-table dw.t_agent \
  --num-mappers 1 \
  --hive-drop-import-delims \
  --hive-partition-key 'dt' \
  --hive-partition-value "2021-11-24" \
  --null-string '\N' \
  --null-non-string '\N' \
  --fields-terminated-by '\001' \
  --delete-target-dir \
  --target-dir /user/root/tmp/sqoop/ebaocore-mo-dw-cd/src/t_agent \
  --relaxed-isolation \
  -- --schema public
```
HCatalog approach
1. Oracle
```shell
sqoop import \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/password/oracledb_password.jceks \
  --connect jdbc:oracle:thin:@10.1.88.88:1521:xxxx \
  --username root \
  --password-alias oracledb_password.alias \
  --table XXXXXX.T_AGENT \
  --hcatalog-database dw \
  --hcatalog-table t_agent \
  --hcatalog-partition-keys dt \
  --hcatalog-partition-values 20211122 \
  --hcatalog-storage-stanza 'stored as orcfile tblproperties("orc.compress"="SNAPPY")' \
  --num-mappers 1
```
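To confirm the import actually produced SNAPPY-compressed ORC files, the partition directory can be inspected with Hive's ORC dump utility. The warehouse path below is an assumption (it varies per cluster), and the command is only assembled here:

```shell
# Build the inspection command for the dt=20211122 partition written above.
# PART_DIR is an assumed warehouse layout -- adjust to your cluster's path.
PART_DIR="/warehouse/tablespace/managed/hive/dw.db/t_agent/dt=20211122"
DUMP_CMD="hive --orcfiledump ${PART_DIR}"
# On a cluster node, the dump reports the compression codec of each ORC file.
echo "${DUMP_CMD}"
```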
2. PostgreSQL
As in the native demo, `-- --schema public` has to be the last argument so the connector, not sqoop, receives it:

```shell
sqoop import \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/password/pgdb_password.jceks \
  --connect jdbc:postgresql://10.29.88.88:5432/t_agent \
  --username root \
  --password-alias pgdb_password.alias \
  --table t_agent \
  --map-column-java contact=String \
  --map-column-hive contact=string \
  --hcatalog-database dw \
  --hcatalog-table t_agent \
  --num-mappers 1 \
  --hive-drop-import-delims \
  --hcatalog-partition-keys dt \
  --hcatalog-partition-values 2021-11-24 \
  --hcatalog-storage-stanza 'stored as orcfile tblproperties("orc.compress"="SNAPPY")' \
  --null-string '\N' \
  --null-non-string '\N' \
  --fields-terminated-by '\001' \
  --delete-target-dir \
  --target-dir /user/tplhkhdpuser01/tmp/sqoop/src_tplmo_pg/t_agent \
  --relaxed-isolation \
  -- --schema public
```
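After either import, a quick sanity check is to count the rows that landed in the target partition and compare against the source table. A sketch (the query is only assembled here, since running it needs a cluster node):

```shell
# Count rows in the 2021-11-24 partition loaded by the demo above.
# (Sketch only: the hive command is assembled, not executed.)
CHECK_SQL="SELECT COUNT(*) FROM dw.t_agent WHERE dt='2021-11-24';"
HIVE_CMD="hive -e \"${CHECK_SQL}\""
echo "${HIVE_CMD}"
```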
