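First, what is a Cartesian product? It pairs every row of one table with every row of the other. Most of the time it appears unintentionally, because a join condition is wrong or missing, as in the figure below.
Figure: Cartesian product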
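Hive's strict mode does not allow Cartesian products. In big-data workloads a Cartesian product makes the data volume explode (joining a table of m rows with a table of n rows yields m × n rows), which puts a heavy load on the cluster.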
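For reference, here is a minimal sketch of how strict mode might be switched on for a session. The two property names are the ones that appear in the Hive error message quoted later in this post; which of them takes effect depends on the Hive version, so treat this as an assumption about your setup rather than a definitive configuration.

```sql
-- Assumption: newer Hive releases control the Cartesian-product check with this property
set hive.strict.checks.cartesian.product=true;
-- Assumption: older releases rely on the global strict switch instead
set hive.mapred.mode=strict;
```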
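But what if you really need a Cartesian product while in strict mode?
For example, table A (called scri in the queries below) holds a single record, the head teacher of the class stored in table jn1, and has only one column. We want to attach that record to every row of jn1. How can that be done?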
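To make the example reproducible, the two tables could be set up roughly as follows. The rows are taken from the query results shown later in this post; the column types are not given anywhere in the post, so `string` and `int` are assumptions.

```sql
-- Hypothetical schema: the column types are assumed, not taken from the original post
create table jn1 (name string, id int);
create table scri (log string);

-- Sample rows matching the query results shown in this post
insert into table jn1 values ('tom', 1), ('lilly', 8), ('lilly', 7), ('jey', 2);
insert into table scri values ('i am your teacher! Be Careful!');
```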
Non-strict mode
In non-strict mode this is easy:
select * from jn1,scri;
jn1.name  jn1.id  scri.log
tom       1       i am your teacher! Be Careful!
lilly     8       i am your teacher! Be Careful!
lilly     7       i am your teacher! Be Careful!
jey       2       i am your teacher! Be Careful!
Alternatively:
select * from jn1 full join scri;
jn1.name  jn1.id  scri.log
tom       1       i am your teacher! Be Careful!
lilly     8       i am your teacher! Be Careful!
lilly     7       i am your teacher! Be Careful!
jey       2       i am your teacher! Be Careful!
Or like this:
select * from jn1 join scri on 1=1;
jn1.name  jn1.id  scri.log
tom       1       i am your teacher! Be Careful!
lilly     8       i am your teacher! Be Careful!
lilly     7       i am your teacher! Be Careful!
jey       2       i am your teacher! Be Careful!
Cartesian products in strict mode
In strict mode, a query that produces a Cartesian product fails with this error:
FAILED: SemanticException Cartesian products are disabled for safety reasons. If you know what you are doing, please sethive.strict.checks.cartesian.product to false and that hive.mapred.mode is not set to 'strict' to proceed. Note that if you may get errors or incorrect results if you make a mistake while using some of the unsafe features.
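But what if you have no choice? In a recent task I needed a UDF, and one of its arguments is derived from another table and is not available in the table being processed. Because the value is needed for every row processed, I decided to add it to that table as an extra column. That produces a Cartesian product, which Hive's strict mode forbids. Stuck, I asked a senior colleague, who sorted it out in minutes.
The solution is to add the same constant column to both sides and join on it, so the query carries an explicit equality condition: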
select j1.name
      ,j1.id
      ,s1.log
from (
    select *
          ,1 as joinkey
    from jn1
) j1
join (
    select *
          ,1 as joinkey
    from scri
) s1
on j1.joinkey = s1.joinkey;
j1.name  j1.id  s1.log
tom      1      i am your teacher! Be Careful!
lilly    8      i am your teacher! Be Careful!
lilly    7      i am your teacher! Be Careful!
jey      2      i am your teacher! Be Careful!
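For the UDF use case described above, the joined-in value can then be passed straight to the function on each row. This is only a sketch: `my_udf` is a hypothetical name for a UDF assumed to be registered already, and its two-argument signature is an assumption, since the post does not show the real function.

```sql
-- my_udf is a hypothetical, already-registered UDF; it is not from the original post
select j1.name
      ,j1.id
      ,my_udf(j1.name, s1.log) as result  -- s1.log is available on every row of jn1
from (
    select *
          ,1 as joinkey
    from jn1
) j1
join (
    select *
          ,1 as joinkey
    from scri
) s1
on j1.joinkey = s1.joinkey;
```

Because scri holds a single row, the result keeps the row count of jn1, so the constant-key join only widens each row and does not multiply the data.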



