
Title: Hive SQL optimization strategies

Author: 168主编    Time: 2019-6-26 20:54
Hive SQL optimization


If a table is partitioned by time, the query must include a partition filter; without one, Hive scans all of the table's data.
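A minimal sketch, using the same table t and partition column ds as the example below:
Select x from t where ds=d1   -- reads only the ds=d1 partition
Select x from t               -- no partition filter: reads every partition of t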

Be careful with or in SQL: the or expression must be wrapped in its own parentheses, otherwise the partition filter can silently stop applying. For example, suppose you want the gd or gx data for a single day:
Select x from t where ds=d1 and province='gd' or province='gx'
This statement scans every partition and does not return what was intended: because and binds tighter than or, it means "gd for day d1, or gx for all days". The correct form is:
Select x from t where ds=d1 and (province='gd' or province='gx')


Common pitfalls
1. Only equi-joins are supported; non-equi join conditions are not.
2. In an inner join, put the small table first and the large table last: a join b on a.x=b.x, where a is the small table.
3. When joining against a small table, use the map join hint:
SELECT /*+ MAPJOIN(b) */ a.key, a.value  FROM a join b on a.key = b.key
- The small table must fit in memory (a dimension table with fewer than 20,000 rows is a good rule of thumb).
- The small table must not be the driving table (e.g., the left table of a left outer join); see the sketch below.
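A minimal sketch of the driving-table constraint, reusing t_od_soft_wp7_dload (the large fact table) and t_rd_soft_wp7_app (the small app dimension table) from the example later in this post:
-- OK: the small table b sits on the non-driving (right) side of the left outer join
SELECT /*+ MAPJOIN(b) */ a.cookie_id, b.appid
FROM t_od_soft_wp7_dload a
LEFT OUTER JOIN t_rd_soft_wp7_app b ON a.ios_soft_id = b.appid;
-- Not OK: MAPJOIN(a) -- in a left outer join the left table a is the driving table,
-- so it cannot be the map-joined (in-memory) side.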


Full outer join: filtering on partitions inside the ON clause does not behave as expected. The fix is to move the partition filter into subqueries over the left and right tables.
For example:
Select a.x, b.x from a full|left|right outer join b on (a.key=b.key and a.ds=d1 and b.ds=d1)
should instead be written as:
Select t1.x, t2.x from
(select * from a where ds=d1) t1
full outer join
(select * from b where ds=d1) t2
on t1.key=t2.key


Inner join notes
Use a join b on a.key=b.key.
The form below is equivalent to an inner join in MySQL/Oracle, but in Hive it turns into a Cartesian product and the query will never finish:
From a, b where a.key=b.key


When one of the two joined tables should be distinct on the join key but the actual data is not, first do a select distinct in a subquery and then join.
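A minimal sketch, reusing the tables from the example later in this post and assuming, purely for illustration, that t_rd_soft_wp7_app could contain duplicate appid rows:
select a.cookie_id, b.is_self_rd
from t_od_soft_wp7_dload a
join (select distinct appid, is_self_rd from t_rd_soft_wp7_app where ds=20120820) b
  on a.ios_soft_id = b.appid
where a.ds=20120820;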


Use Order By with caution

Avoid order by whenever possible: it performs a global sort, so only a single reducer is used.
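When a global order is not actually required, sort by (optionally with distribute by) keeps multiple reducers; a minimal sketch, reusing the fact table from the example later in this post:
-- each reducer sorts only its own share of the data; there is no global order
select g_f, cookie_id
from t_od_soft_wp7_dload
where ds=20120820
distribute by g_f
sort by g_f, cookie_id;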


Some useful settings
1. Compress the output (off by default):
set hive.exec.compress.output=true;
2. Compression codec (normally left alone; the default is gzip):
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

3. Set the number of reducers in special cases:
set mapred.reduce.tasks=n

4. When the data is severely skewed, add this before the SQL:
set hive.groupby.skewindata=true
Severe skew usually shows up in a group by where the grouping column has few distinct values but some of them carry a huge share of the rows, e.g., computing the male/female ratio (see the sketch after this list).
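A minimal sketch putting these settings together, using a hypothetical user_log table with a gender column for the male/female-ratio case:
set hive.exec.compress.output=true;
set hive.groupby.skewindata=true;    -- splits the aggregation into two stages to spread skewed keys
select gender, count(1) as cnt       -- user_log and gender are hypothetical names
from user_log
where ds=20120820
group by gender;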


Optimization approach:
Reduce the number of MapReduce stages that get generated.
1. Use CASE WHEN instead of subqueries.
2. Since the log data volume is small, handle deduplication directly with distinct.
SQL before optimization:
Total MapReduce jobs = 20
insert overwrite table t_md_soft_wp7_dload partition(ds=20120820)
select g_f,dload_count,dload_user,tensoft_dload_count,tensoft_dload_user,outsoft_dload_count,outsoft_dload_user
from
(select
temp1.g_f,
temp1.dload_count,
temp1.dload_user,
temp2.tensoft_dload_count,
temp2.tensoft_dload_user,
temp3.outsoft_dload_count,
temp3.outsoft_dload_user
from
(select
g_f,
count(1) as dload_user,
sum(t1.pv) as dload_count
from
(select
g_f,
cookie_id,
count(1) as pv
from
t_od_soft_wp7_dload
where
ds=20120820
group by g_f,cookie_id) t1
group by g_f) temp1 left outer join
(select
g_f, count(1) as tensoft_dload_user,
sum(tt3.login_pv) as tensoft_dload_count
from
(select
g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt1 join t_rd_soft_wp7_app tt2 on tt1.ds=tt2.ds and tt1.ios_soft_id = tt2.appid
where
tt1.ds=20120820 and tt2.is_self_rd = 1
group by g_f,cookie_id) tt3
group by g_f) temp2 on temp1.g_f = temp2.g_f
left outer join
(select
g_f,
count(1) as outsoft_dload_user,
sum(tt6.login_pv) as outsoft_dload_count
from
(select
g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt4 join t_rd_soft_wp7_app tt5 on tt4.ds=tt5.ds and tt4.ios_soft_id = tt5.appid
where
tt4.ds=20120820 and tt5.is_self_rd = 0
group by g_f,cookie_id) tt6
group by g_f) temp3 on temp1.g_f = temp3.g_f
union all
select
temp4.g_f,
temp4.dload_count,
temp4.dload_user,
temp5.tensoft_dload_count,
temp5.tensoft_dload_user,
temp6.outsoft_dload_count,
temp6.outsoft_dload_user
from
(select
cast('-1' as bigint) as g_f,
count(1) as dload_user,
sum(tt7.pv) as dload_count
from
(select
cast('-1' as bigint) as g_f,
cookie_id,
count(1) as pv
from
t_od_soft_wp7_dload
where
ds=20120820
group by g_f,cookie_id) tt7
group by g_f) temp4 left outer join
(select
cast('-1' as bigint) as g_f,
count(1) as tensoft_dload_user,
sum(tt10.login_pv) as tensoft_dload_count
from
(select
cast('-1' as bigint) as g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt8 join t_rd_soft_wp7_app tt9 on tt8.ds=tt9.ds and tt8.ios_soft_id = tt9.appid
where
tt8.ds=20120820 and tt9.is_self_rd = 1
group by g_f,cookie_id) tt10
group by g_f) temp5 on temp4.g_f = temp5.g_f
left outer join
(select
cast('-1' as bigint) as g_f,
count(1) as outsoft_dload_user,
sum(tt13.login_pv) as outsoft_dload_count
from
(select
cast('-1' as bigint) as g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt11 join t_rd_soft_wp7_app tt12 on tt11.ds=tt12.ds and tt11.ios_soft_id = tt12.appid
where
tt11.ds=20120820 and tt12.is_self_rd = 0
group by g_f,cookie_id) tt13
group by g_f) temp6 on temp4.g_f = temp6.g_f ) t;


SQL after optimization:
Total MapReduce jobs = 3
insert overwrite table t_md_soft_wp7_dload partition(ds=20120820)
select
g_f,
count(cookie_id) dload_count,
count(distinct cookie_id) dload_user,
count(case when is_self_rd=1 then cookie_id end ) tensoft_dload_count,
count(distinct case when is_self_rd=1 then cookie_id end ) tensoft_dload_user,
count( case when is_self_rd=0 then cookie_id end ) outsoft_dload_count,
count(distinct case when is_self_rd=0 then cookie_id end ) outsoft_dload_user
from
(select g_f,cookie_id,is_self_rd from
t_od_soft_wp7_dload t1 left outer join (select appid,is_self_rd from t_rd_soft_wp7_app where ds=20120820)  t2 on  t1.ios_soft_id = t2.appid where t1.ds=20120820
union all
select cast('-1' as bigint) as g_f,cookie_id,is_self_rd
from
t_od_soft_wp7_dload t1 left outer join (select appid,is_self_rd from t_rd_soft_wp7_app where ds=20120820)  t2 on  t1.ios_soft_id = t2.appid where t1.ds=20120820
) t
group by g_f;
Result: two jobs were optimized in the same way, cutting runtime from 2-3 hours down to 2 minutes.
---------------------
Author: 宁哥说
Source: CSDN
Original: https://blog.csdn.net/qq_35036995/article/details/80298449





