马上注册,结交更多数据大咖,获取更多知识干货,轻松玩转大数据
您需要 登录 才可以下载或查看,没有帐号?立即注册
x
本帖最后由 168主编 于 2019-8-11 18:40 编辑
总结工作中遇到的 Hive-sql 难点问题。 目录: Hive 查询性能优化 求两组数据的并集、交集、差集 Hive 中查询用户留存率 Hive 中的窗口函数
Hive查询性能优化写法: [AppleScript] 纯文本查看 复制代码 # 求客户端每日的去重 uv
with a1 as (
select
hit_date,
user_account
from
android_data
where
hit_date between '2018-10-01' and'2018-10-03'
group by
hit_date, user_account
)
select
hit_date,
count(user_account) as uv
from
a1
group by
hit_date
order by
hit_date
将条目少的表/子查询放在 Join 操作符的左边。原因是在 Join 操作的 Reduce 阶段,位于 Join 操作符左边的表的内容会被加载进内存,将条目少的表放在左边,可以有效减少发生错误的几率。
写法: [AppleScript] 纯文本查看 复制代码 # a, b表中, b 表最小
select
a.price_close, b.price_close
from
b join a on b.ymd = a.ymd AND b.symbol = a.symbol
where
a.symbol = 'apple'
写法: [AppleScript] 纯文本查看 复制代码 use computer_view;
with a1 as (
select
user_account,
hit_date
from
client_android_log_view
where
hit_date between '2018-12-01' and '2018-12-13'
and
nbtn_name like "%支付宝%"
union all
select
user_account,
hit_date
from
client_ios_log_view
where
hit_date between '2018-12-01' and '2018-12-13'
and
nbtn_name like "%支付宝%")
select
hit_date,
count(user_account) as pv
from
a1
group by
hit_date
使用 CAST 函数对数据类型进行转换,语法为 cast(value AS TYPE)
写法: [AppleScript] 纯文本查看 复制代码 select
a.price_close,
b.price_close
from
a join b on a.user_id = cast(b.user_id as string)
where
hit_date between '2018-11-01' and '2018-11-02'
and
a.symbol = 'apple'
把空值的 key 变成一个字符串加上随机数,就能把倾斜的数据分到不同的 reduce 上 ,解决数据倾斜问题。
需要用到 case When … Else…End 语法
写法 1: [AppleScript] 纯文本查看 复制代码 Select
*
From
a Join b
On
a.user_id is not null
And
a.user_id = b.user_id
Union all
Select
*
froM
a
where
a.user_id is null 写法 2: [AppleScript] 纯文本查看 复制代码 Select
*
From
a left out Join b
On
Case when
a.user_id is null
then
concat(‘dp_hive’,rand() )
else
a.user_id = b.user_id end;
写法 1: [AppleScript] 纯文本查看 复制代码 SELECT
a.price_close, b.price_close
FROM
b JOIN a ON b.ymd = a.ymd AND b.symbol = a.symbol
WHERE
s.symbol = 'APPLE' 写法 2: [AppleScript] 纯文本查看 复制代码 # 正确的写法是将 where 条件卸载 on 后面
SELECT
a.price_close, b.price_close
FROM
b JOIN a ON ( b.ymd = a.ymd AND b.symbol = a.symbol and s.symbol = 'APPLE') 求两组数据的交集, 并集, 差集union, 结果包含所有行, 并删除重复行
unoin all, 结果包含所有行, 但不删除重复行
union 语句用法: [AppleScript] 纯文本查看 复制代码 use computer_view;
with a1 as (
select
user_account
from
client_android_log_view
where
hit_date between '2018-12-01' and '2018-12-02'
and
nbtn_name like "%支付宝%"
union
select
user_account
from
client_android_log_view
where
hit_date between '2018-12-01' and '2018-12-02'
and
nbtn_name like "%手淘%")
select
count(user_account) as pv
from
a1 点击支付宝或者手淘活动的人数总共有 435499 人
union all 语句用法: [AppleScript] 纯文本查看 复制代码 use computer_view;
with a1 as (
select
user_account
from
client_android_log_view
where
hit_date between '2018-12-01' and '2018-12-02'
and
nbtn_name like "%支付宝%"
union all
select
user_account
from
client_android_log_view
where
hit_date between '2018-12-01' and '2018-12-02'
and
nbtn_name like "%手淘%")
select
count(user_account) as pv
from
a1 点击支付宝或者手淘活动的次数为 665935
[AppleScript] 纯文本查看 复制代码 use computer_view;
with a1 as (
select
user_account
from
client_android_log_view
where
hit_date between '2018-12-01' and '2018-12-02'
and
nbtn_name like "%支付宝%"
intersect
select
user_account
from
client_ios_log_view
where
hit_date between '2018-12-01' and '2018-12-02'
and
nbtn_name like "%手淘%")
select
count(user_account) as pv
from
a1 点击支付宝又点击手淘活动的人数为 66174
写法 1: [AppleScript] 纯文本查看 复制代码 use computer_view;
with a1 as (
select
user_account
from
client_android_log_view
where
hit_date between '2018-12-01' and '2018-12-25'
and
nbtn_name like "%支付宝%"
except
select
user_account
from
client_android_log_view
where
hit_date between '2018-12-01' and '2018-12-25'
and
nbtn_name like "%手淘%")
select
count(user_account) as pv
from
a1
写法 2: [AppleScript] 纯文本查看 复制代码 use computer_view;
with a1 as (
select
user_account
from
client_android_log_view
where
hit_date between '2018-12-01' and '2018-12-25'
and
nbtn_name like "%支付宝%"),
a2 as (
select
user_account
from
client_android_log_view
where
hit_date between '2018-12-20' and '2018-12-25'
and
nbtn_name like "%支付宝%")
select
count(distinct a1.user_account) as pv
from
a1 left outer join a2
on a1.user_account = a2.user_account
and a2.user_account is null 只参加支付宝活动, 没有参加手淘活动的人数为 369325
在求差集时, 需要注意前后顺序, 否则会出现逻辑错误
可以发现, 差集 + 交集 =并集, 369325 + 66174 = 435499
HIVE中查询留存率求11月10-15号每天的 1、3、7日留存率
方法 1. 统计每天的 uv 统计上一天与本天 uv 的交集用户数 算出留存率
方法 2: [AppleScript] 纯文本查看 复制代码 ---
# 统计10-15号每天 uv
SELECT
hit_date,
count(distinct user_account) as uv
FROM
computer_view.client_android_log_view
WHERE
hit_date between '2018-11-10' and '2018-11-15'
group BY
hit_date
order BY
hit_date
[AppleScript] 纯文本查看 复制代码 # 统计 10-15 号每天的次日留存数, 统计次 3、7 日留存只需将 1 换为 3、7
with a1 as (
select
user_account,
hit_date
from
computer_view.client_android_log_view
where
hit_date between '2018-11-10' and '2018-11-15'
),
a2 as (
select
user_account,
hit_date
from
computer_view.client_android_log_view
where
hit_date between '2018-11-10' and '2018-11-25'
)
select
a1.hit_date,
count(distinct a1.user_account) as uv
from
a1 join a2 on a1.user_account = a2.user_account
WHERE
a2.hit_date = date_add(a1.hit_date, 1)
group by
a1.hit_date
order BY
a1.hit_date HIVE中的窗口函数语法: over(partition by ….)
作用: 与聚合函数 sum(), count(), avg() 等结合使用, 实现分组聚合的功能
[AppleScript] 纯文本查看 复制代码 # 根据日期和 mac_id 进行分组求每组的数量和, 并按日期排序
select
hit_date,
mac_id,
mac_color,
day_num,
sum(day_num) over(partition by hit_date, mac_id order by hit_date) as sum_num
from
test.datas
结果: hit_date | mac_id | mac_color | day_num | sum_num | 20171011 | 1292 | 金色 | 11 | 89 | 20171011 | 1292 | 黑色 | 19 | 89 | 20171011 | 1292 | 粉金 | 58 | 89 | 20171011 | 1292 | 金色 | 1 | 89 | 20171011 | 2013 | 金色 | 9 | 22 | 20171011 | 2013 | 金色 | 3 | 22 | 20171012 | 1292 | 金色 | 5 | 18 | 20171012 | 1292 | 粉金 | 1 | 18 | 20171012 | 2013 | 粉金 | 1 | 7 | 20171012 | 2013 | 金色 | 6 | 7 | 20171013 | 1292 | 黑色 | 1 | 1 | 20171013 | 2013 | 粉金 | 2 | 2 |
与 group by 语句的区别: grou by 字段只能显示与分组聚合相关的字段, 而 over(partition by) 可以显示所有字段
[AppleScript] 纯文本查看 复制代码 # group by 语句
select
hit_date,
mac_id,
sum(day_num)
from
test.data
group by
hit_date,
mac_id
order by
hit_date
结果: day_id | mac_id | sum_num | 20171011 | 124609 | 1 | 20171011 | 20130 | 22 | 20171011 | 12922 | 89 | 20171012 | 12922 | 18 | 20171012 | 20130 | 7 | 20171013 | 12922 | 1 | 20171013 | 20130 | 2 |
语法:
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值;
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值 # 计算 11 月 1-10 号, 不同日期同一用户登陆客户端 pv 量对比
[AppleScript] 纯文本查看 复制代码
with a1 as (select
user_account,
count(user_account) as pv,
hit_date
from
computer_view.client_android_log_view
where
hit_date between '2018-11-01' and'2018-11-10'
group by
user_account, hit_date)
select
user_account,
a1.hit_date,
a1.pv,
lag(a1.pv, 1) over (partition by user_account order by user_account, a1.hit_date) as pv1,
lead(a1.pv, 1) over(partition by user_account order by user_account, a1.hit_date) as pv2
from
a1
limit 100
说明:
first_value() :比较每个用户浏览次数与第一天浏览次数进行比较,查询返回当前浏览次数以及第一天浏览次数
last_value() : 比较每个用户浏览次数与最新一天浏览次数进行比较,查询返回当前浏览次数以及最新一天浏览次数
[AppleScript] 纯文本查看 复制代码 with a1 as (select
distinct user_account,
count(user_account) as pv,
hit_date
from
computer_view.client_android_log_view
where
hit_date between '2018-11-01' and'2018-11-10'
group by
user_account, hit_date)
select
distinct user_account,
a1.hit_date,
a1.pv,
first_value(a1.pv) over (partition by user_account order by user_account, a1.hit_date) as pv1,
last_value(a1.pv) over(partition by user_account order by user_account, a1.hit_date) as pv2
from
a1
limit 100
说明:
rank 函数, 返回数据项在分组中的排名, 排名相等的会留下空位, 如1、2、2、4
dense_rank 函数, 返回数据项在分组中的排名, 排名相等的不会留下空位, 如1、2、2、3
row_number 函数, 返回数据项在分组中的排名, 排名不管数据是否相等, 如1、2、3、4
[AppleScript] 纯文本查看 复制代码 select
a,
row_number() over(order by b) row_number,
rank() over(order by b) rank,
dense_rank() over(order by b) dense_rank
from
lijie.test_rank
结果:
a | row_number | rank | dense_rank | A | 1 | 1 | 1 | C | 2 | 2 | 2 | D | 3 | 3 | 3 | B | 4 | 3 | 3 | E | 5 | 5 | 4 | F | 6 | 6 | 5 | G | 7 | 7 | 6 |
以上就是对自己工作中常见的 Hive-sql 查询语句总结, 希望能够给你带来一些启发。 参考资料:
博客:过往记忆 - hive 来源:怪兽宇微信公众号
|