最具影响力的数字化技术在线社区

168大数据

 找回密码
 立即注册

QQ登录

只需一步,快速开始

1 2 3 4 5
开启左侧

[Hive] hive中sql优化解决策略

[复制链接]
发表于 2019-6-26 20:54:58 | 显示全部楼层 |阅读模式

马上注册,结交更多数据大咖,获取更多知识干货,轻松玩转大数据

您需要 登录 才可以下载或查看,没有帐号?立即注册

x
Hive中的sql优化


如果是按时间分区的表,查询时一定要使用分区限制,如果没有分区限制,会从该表的所有数据里面遍历。

注意sql中or的使用,or 这个逻辑必须单独括起来,否则可能引起无分区限制,下面举个例子,想查询到的是gd或gx的某天的数据。
Select x from t where ds=d1 and province=’gd’ or province=’gx’
该语句会从所有的分区里面查询!也没有得到自己想要的数据!这里的意思是某天gd的或者所有日期里面gx的数据 ,正确的写法是:
Select x from t where ds=d1 and (province=’gd’ or province=’gx’)


很容易出错的地方
1、  只支持等值连接,不支持非等值连接
2、  内连接时小表放前面,大表放后面 a Join b on  a.x=b.x  a为小表
3、  连接小表时使用map join 条件
SELECT /*+ MAPJOIN(b) */ a.key, a.value  FROM a join b on a.key = b.key
l   小表可以放进内存(维度表、行数<2w 为宜)
l   小表不能为驱动表 (eg:left outer join 时的左表)


Full outer join 在on中过滤分区有问题,解决方法是将分区过滤条件放到左表和右表子查询里面
比如:
Select a.x,b.x from a full|left|right outer join b on (a.key=b.key and a.ds=d1 and b.ds=d1)
应该写成
Select t1.x ,t2.x from


(select * from a where ds=d1) t1
full outer join
(select * from  b where ds=d1) t2


on t1.key=t2.key


内连接注意事项
要使用 a join b on a.key=b.key,
下面写法在mysql/Oracle等价于内连接,但是在Hive中会导致笛卡尔积,会跑不出来的
From a ,b where a.key=b.key


  join的两个表中有一个表应该是distinct的,但如果实际的数据并非distinct的,这个时候应该做一次select distinct放到子查询,再join


慎用Order By


尽量不用order by,因为order by是全局的,会只有一个reduce


一些用法
1、结果压缩:(默认不压缩)
set hive.exec.compress.output=true;
2、压缩算法:(这个一般不用设,默认是gzip)
set mapred.output.compression.codec=org.apache.hadoop.io
.compress.GzipCodec;

3、特殊情况下设置reduce的个数
set mapred.reduce.tasks=n

4、数据严重倾斜时在sql前面加上
Set hive.groupby.skewindata = true
数据严重倾斜通常发生在 group by,其中维度值少,维度对应的记录数特别大的时候。如计算男女比例。


优化思路:
减少生成的mapreduce步骤。
1.      使用CASE WHEN代替子查询。
2.      由于日志数据量不大,直接用distinct处理。
优化前SQL:
Total MapReduce jobs = 20
insert overwrite table t_md_soft_wp7_dload partition(ds=20120820)
select g_f,dload_count,dload_user,tensoft_dload_count,tensoft_dload_user,outsoft_dload_count,outsoft_dload_user
from
(select
temp1.g_f,
temp1.dload_count,
temp1.dload_user,
temp2.tensoft_dload_count,
temp2.tensoft_dload_user,
temp3.outsoft_dload_count,
temp3.outsoft_dload_user
from
(select
g_f,
count(1) as dload_user,
sum(t1.pv) as dload_count
from
(select
g_f,
cookie_id,
count(1) as pv
from
t_od_soft_wp7_dload
where
ds=20120820
group by g_f,cookie_id) t1
group by g_f) temp1 left outer join
(select
g_f, count(1) as tensoft_dload_user,
sum(tt3.login_pv) as tensoft_dload_count
from
(select
g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt1 join t_rd_soft_wp7_app tt2 on tt1.ds=tt2.ds and tt1.ios_soft_id = tt2.appid
where
tt1.ds=20120820 and tt2.is_self_rd = 1
group by g_f,cookie_id) tt3
group by g_f) temp2 on temp1.g_f = temp2.g_f
left outer join
(select
g_f,
count(1) as outsoft_dload_user,
sum(tt6.login_pv) as outsoft_dload_count
from
(select
g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt4 join t_rd_soft_wp7_app tt5 on tt4.ds=tt5.ds and tt4.ios_soft_id = tt5.appid
where
tt4.ds=20120820 and tt5.is_self_rd = 0
group by g_f,cookie_id) tt6
group by g_f) temp3 on temp1.g_f = temp3.g_f
union all
select
temp4.g_f,
temp4.dload_count,
temp4.dload_user,
temp5.tensoft_dload_count,
temp5.tensoft_dload_user,
temp6.outsoft_dload_count,
temp6.outsoft_dload_user
from
(select
cast('-1' as bigint) as g_f,
count(1) as dload_user,
sum(tt7.pv) as dload_count
from
(select
cast('-1' as bigint) as g_f,
cookie_id,
count(1) as pv
from
t_od_soft_wp7_dload
where
ds=20120820
group by g_f,cookie_id) tt7
group by g_f) temp4 left outer join
(select
cast('-1' as bigint) as g_f,
count(1) as tensoft_dload_user,
sum(tt10.login_pv) as tensoft_dload_count
from
(select
cast('-1' as bigint) as g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt8 join t_rd_soft_wp7_app tt9 on tt8.ds=tt9.ds and tt8.ios_soft_id = tt9.appid
where
tt8.ds=20120820 and tt9.is_self_rd = 1
group by g_f,cookie_id) tt10
group by g_f) temp5 on temp4.g_f = temp5.g_f
left outer join
(select
cast('-1' as bigint) as g_f,
count(1) as outsoft_dload_user,
sum(tt13.login_pv) as outsoft_dload_count
from
(select
cast('-1' as bigint) as g_f,
cookie_id,
count(1) as login_pv
from
t_od_soft_wp7_dload tt11 join t_rd_soft_wp7_app tt12 on tt11.ds=tt12.ds and tt11.ios_soft_id = tt12.appid
where
tt11.ds=20120820 and tt12.is_self_rd = 0
group by g_f,cookie_id) tt13
group by g_f) temp6 on temp4.g_f = temp6.g_f ) t;


优化后SQL:
Total MapReduce jobs = 3
insert overwrite table t_md_soft_wp7_dload partition(ds=20120820)
select
g_f,
count(cookie_id) dload_count,
count(distinct cookie_id) dload_user,
count(case when is_self_rd=1 then cookie_id end ) tensoft_dload_count,
count(distinct case when is_self_rd=1 then cookie_id end ) tensoft_dload_user,
count( case when is_self_rd=0 then cookie_id end ) outsoft_dload_count,
count(distinct case when is_self_rd=0 then cookie_id end ) outsoft_dload_user
from
(select g_f,cookie_id,is_self_rd from
t_od_soft_wp7_dload t1 left outer join (select appid,is_self_rd from t_rd_soft_wp7_app where ds=20120820)  t2 on  t1.ios_soft_id = t2.appid where t1.ds=20120820
union all
select cast('-1' as bigint) as g_f,cookie_id,is_self_rd
from
t_od_soft_wp7_dload t1 left outer join (select appid,is_self_rd from t_rd_soft_wp7_app where ds=20120820)  t2 on  t1.ios_soft_id = t2.appid where t1.ds=20120820
) t
group by g_f;
优化效果:同样方式优化了2个作业,从 2-3小时 降到 2分钟。
---------------------
作者:宁哥说
来源:CSDN
原文:https://blog.csdn.net/qq_35036995/article/details/80298449

楼主热帖
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

关闭

站长推荐上一条 /1 下一条

关于我们|小黑屋|Archiver|168大数据 ( 京ICP备14035423号|申请友情链接

GMT+8, 2024-4-19 10:02

Powered by BI168大数据社区

© 2012-2014 168大数据

快速回复 返回顶部 返回列表