Sqoop (1): Handling NULL Values When Importing from MySQL into Hive

Scenario:

When Sqoop imports a MySQL table into Hive, fields that were NULL in MySQL arrive in Hive as the literal string 'NULL' or 'null'.

Example:

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://(IP-address):3306/interface \
--username root \
--password root \
--split-by id \
--target-dir /user/hive/warehouse/cfdp.db/etl_test1 \
--delete-target-dir \
--fields-terminated-by "\t" \
--query "select *,now() as sync_date from etl_test1 where \$CONDITIONS"

This example is fixed further down using the second solution.
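To confirm the symptom before changing anything, it helps to count how many fields in the imported files hold the literal text null or NULL rather than Hive's \N marker. The following is a minimal sketch, not part of the original workflow: the function name `count_null_markers` and the sample rows are made up for illustration, and it assumes tab-delimited text files like the ones produced by the import above.

```shell
#!/bin/sh
# Count tab-separated fields holding the literal text null/NULL versus the
# \N marker. Reads the file named by $1, or standard input when no file is given.
count_null_markers() {
    awk -F'\t' '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "null" || $i == "NULL") literal++
                else if ($i == "\\N") marker++
            }
        }
        END { printf "literal=%d marker=%d\n", literal, marker }
    ' "${1:--}"
}

# Made-up sample rows standing in for an imported part file.
sample=$(mktemp)
printf '1\tnull\t2020-01-01\n2\tbob\tNULL\n3\t\\N\t\\N\n' > "$sample"
count_null_markers "$sample"   # prints: literal=2 marker=2
rm -f "$sample"
```

On a real cluster the same function could read from a pipe, for example the output of `hdfs dfs -cat` on the files under the target directory. A non-zero `literal` count suggests NULLs were written as plain text.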

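Solution:

There are two ways to keep NULL field values in the database from turning into a literal string after they are imported into Hive. With either one, a value stored as NULL in MySQL is still NULL in the Hive table after the import.

1. Modify the Hive table properties directly so that empty values in the Hive table display as NULL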

alter table ${table_name} SET SERDEPROPERTIES('serialization.null.format' = '');
Replace ${table_name} with the actual name of your Hive table.

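Limitation: if the source data contains values that are genuinely empty, they will also display as NULL in the Hive table. This follows from how Hive is designed and cannot be avoided: Hive must designate some representation for NULL. If empty values need to be preserved, switch to a different storage format as the situation requires.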
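The limitation is easier to see with a small simulation. The sketch below is a simplified model, not Hive itself: it assumes a STRING column in a tab-delimited text table and treats a field as NULL exactly when its raw text equals the configured null format. The function name `render_fields` and the sample rows are made up for illustration.

```shell
#!/bin/sh
# Simplified model of a text-format Hive table with STRING columns: a field
# whose raw text equals the null format (passed as $1) is shown as NULL,
# every other field is shown as a quoted string. Rows are read from stdin.
render_fields() {
    NULL_FORMAT="$1" awk -F'\t' '
        {
            line = ""
            for (i = 1; i <= NF; i++) {
                # Appending "" forces a string comparison.
                shown = (($i "") == (ENVIRON["NULL_FORMAT"] "")) ? "NULL" : "\"" $i "\""
                line = line (i > 1 ? " | " : "") shown
            }
            print line
        }
    '
}

# Null format set to the empty string: a genuinely empty field becomes NULL.
printf 'a\t\tb\n' | render_fields ''       # prints: "a" | NULL | "b"

# Null format \N: the empty field survives, only \N is read as NULL.
printf 'a\t\tb\n' | render_fields '\N'     # prints: "a" | "" | "b"
printf 'a\t\\N\tb\n' | render_fields '\N'  # prints: "a" | NULL | "b"
```

The first call shows why an empty null format cannot distinguish an empty string from NULL; the other two show that a dedicated marker keeps the two apart.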

2. Add Sqoop parameters

Add the following parameters:

--null-string '\N'  
--null-non-string '\N'

The original command from the example, modified accordingly:

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://(IP-address):3306/interface \
--username root \
--password root \
--split-by id \
--target-dir /user/hive/warehouse/cfdp.db/etl_test1 \
--delete-target-dir \
--null-string '\N' \
--null-non-string '\N' \
--fields-terminated-by "\t" \
--query "select *,now() as sync_date from etl_test1 where \$CONDITIONS"

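Limitation: the target Hive table must be created in advance. With this approach Sqoop writes '\N' to represent NULL. If the source data itself contains the string '\N', then '\N' cannot stand in for NULL, and --null-string, --null-non-string, and serialization.null.format must all be changed to some other substitute value.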
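When a different substitute value is needed, the three settings have to stay in sync. The sketch below generates the matching Sqoop flags and Hive statement from one value so they cannot drift apart. It is an illustration under stated assumptions: the function name `print_null_settings` and the marker `__NULL__` are hypothetical, and markers containing quotes or backslashes are rejected because their escaping differs between the shell and Hive.

```shell
#!/bin/sh
# Print the Sqoop flags and the Hive DDL for one shared NULL marker.
# $1 is the Hive table name, $2 is the marker text.
print_null_settings() {
    table="$1"
    marker="$2"
    case "$marker" in
        ''|*"'"*|*'"'*|*'\'*)
            echo "marker must be non-empty and free of quotes and backslashes" >&2
            return 1
            ;;
    esac
    printf "sqoop: --null-string '%s' --null-non-string '%s'\n" "$marker" "$marker"
    printf "hive: ALTER TABLE %s SET SERDEPROPERTIES('serialization.null.format' = '%s');\n" "$table" "$marker"
}

# __NULL__ is a made-up marker assumed never to occur in the source data.
print_null_settings etl_test1 '__NULL__'
```

The function only prints the settings; the flags still have to be added to the import command and the statement run against the pre-created Hive table.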
