# Hadoop Basics, Part 6: A Hive Example

Posted by jialiguo on 2017-12-17 19:34:20

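Preface
The previous posts covered how to deploy Hive on a Hadoop cluster. Here we work through a small example to get familiar with Hive QL. The data is a video playback log that includes the video ID, the playback device ID, the user account ID and so on, and we run some simple queries and statistics on top of it.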
　　
Click here to download the sample data

It is a portion of the playback log for the 14:00 hour of 2017-09-01.

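Getting hands-on

Syncing the data
In practice, I collect these logs into HDFS with Flume, and I will briefly cover how to do that in a later post. Alternatively, after downloading the sample data you can upload it yourself with the ${HADOOP_HOME}/bin/hdfs dfs -put command.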
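As a rough sketch of the upload step: the same dfs commands can also be issued from inside the Hive CLI, which forwards them to the Hadoop file system shell. The local file name /tmp/video_play_2017090114.log is a placeholder for wherever the sample file was saved, and the -p flag (create parent directories) assumes Hadoop 2 or later.

```sql
-- Run inside the Hive CLI; each dfs line is forwarded to the Hadoop FS shell.
-- Create the hour-level directory used in this post (day 20170901, hour 14).
dfs -mkdir -p /user/admin/logs/video_play/20170901/14;

-- Upload the sample log. The local path is a placeholder, not from the original post.
dfs -put /tmp/video_play_2017090114.log /user/admin/logs/video_play/20170901/14;

-- Check that the file arrived.
dfs -ls /user/admin/logs/video_play/20170901/14;
```

From a plain shell, the equivalent is the ${HADOOP_HOME}/bin/hdfs dfs form of the same three commands.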

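[*]Create the directories. Mine, for example, was created with ${HADOOP_HOME}/bin/hdfs dfs -mkdir /user/admin/logs/video_play/20170901/14. Create each level in turn (or pass -p to create the whole path at once). The last two levels correspond to the table partitions day and hour.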
[*]Create the table:
-- Note: the column comment on auiddigest is truncated in the source (comment 'account>), so it is left out here.
create external table log_video_play_request (
  logindex string, request_date string, video_auiddigest string, puiddigest string,
  ver int, auiddigest string,
  device_sign string, xy_app_key string, ip string, port bigint, user_agent string, fromparameter string,
  zone bigint, sns_name string, sns_type bigint, country_code string, consume_country_code string,
  play_duration bigint, video_duration bigint, trace_id string, review_state int)
partitioned by (day string, hour string)
row format delimited
fields terminated by '&'
stored as textfile
location '/user/admin/logs/video_play';

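[*]Next, load data into the Hive table. For background, see the blog post "Hive数据加载（内部表，外部表，分区表）" (Hive data loading: managed tables, external tables, partitioned tables).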
　　
Here, run the following in Hive: alter table log_video_play_request add partition(day='20170901',hour='14');
　　
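Note: try select * from .. limit 10;. If the result is empty, run load data inpath '/user/admin/logs/video_play/20170901/14' overwrite into table log_video_play_request partition(day='20170901',hour='14');. An empty result is expected when the partition was added without a location: Hive then looks for the data under /user/admin/logs/video_play/day=20170901/hour=14 rather than under /user/admin/logs/video_play/20170901/14, and load data moves the files into that partition directory.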
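If you would rather leave the files where they were uploaded instead of moving them with load data, one option is to register the partition with an explicit location. This is a sketch of an alternative, not the author's method, and it should be used instead of the load data step rather than after it. It relies on the table being external, so dropping a partition removes only metadata.

```sql
-- Remove the partition entry if it was already added without a location.
-- The table is external, so this drops metadata only and leaves the files in HDFS.
ALTER TABLE log_video_play_request DROP IF EXISTS PARTITION (day='20170901', hour='14');

-- Re-add the partition, pointing it at the directory that holds the uploaded log.
ALTER TABLE log_video_play_request ADD IF NOT EXISTS PARTITION (day='20170901', hour='14')
LOCATION '/user/admin/logs/video_play/20170901/14';

-- Confirm the partition exists and that rows are now visible.
SHOW PARTITIONS log_video_play_request;
SELECT * FROM log_video_play_request WHERE day='20170901' AND hour='14' LIMIT 10;
```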

Hive QL DDL statements

Table statements

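[*]General form of CREATE TABLE:
CREATE [EXTERNAL] TABLE table_name
[(col_name data_type, ...)]
[PARTITIONED BY (col_name data_type, col_name data_type, ...)]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]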

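[*]Rename a table: ALTER TABLE table_name RENAME TO new_table_name
[*]Add columns: ALTER TABLE table_name ADD COLUMNS(col_name data_type ,...)
[*]Add or drop a partition:
ALTER TABLE table_name ADD PARTITION(....)
ALTER TABLE table_name DROP PARTITION(....)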

[*]Drop a table: DROP TABLE table_name
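The table statements above can be tried end to end on a throwaway table. This is a minimal sketch: the table names demo_play and demo_play_v2 and their columns are made up for illustration and are not part of the sample data.

```sql
-- Create a small partitioned table (hypothetical name and columns).
CREATE TABLE IF NOT EXISTS demo_play (video_id STRING, play_duration BIGINT)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '&'
STORED AS TEXTFILE;

-- Rename the table.
ALTER TABLE demo_play RENAME TO demo_play_v2;

-- Add a column; new columns go after the existing ones and before the partition columns.
ALTER TABLE demo_play_v2 ADD COLUMNS (device_sign STRING COMMENT 'device code');

-- Add a partition, then drop it again.
ALTER TABLE demo_play_v2 ADD IF NOT EXISTS PARTITION (day='20170901');
ALTER TABLE demo_play_v2 DROP IF EXISTS PARTITION (day='20170901');

-- Clean up. This is a managed table, so its data directory is removed as well.
DROP TABLE IF EXISTS demo_play_v2;
```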

Other statements

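[*]Create/drop a view: CREATE VIEW view_name AS SELECT ... and DROP VIEW view_name. Hive did not support materialized views when this was written (they arrived later, in Hive 3.0), and from a data warehouse point of view plain views have very few use cases.
[*]Create/drop functions: UDFs, UDAFs and the like will get a dedicated post later.
[*]show/describe: SHOW PARTITIONS table_name; DESCRIBE table_name; DESCRIBE table_name PARTITION (...)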
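Applied to the table from this post, the view and show/describe statements might look like the sketch below. The view name v_play_20170901 is an assumption chosen for illustration.

```sql
-- A view over one day of the playback log (hypothetical view name).
CREATE VIEW IF NOT EXISTS v_play_20170901 AS
SELECT video_auiddigest, fromparameter, play_duration
FROM log_video_play_request
WHERE day = '20170901';

-- List the partitions that Hive knows about for the table.
SHOW PARTITIONS log_video_play_request;

-- Show the table's columns, then the details of a single partition.
DESCRIBE log_video_play_request;
DESCRIBE FORMATTED log_video_play_request PARTITION (day='20170901', hour='14');

-- Remove the view again; the underlying table is untouched.
DROP VIEW IF EXISTS v_play_20170901;
```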
Hive QL DML statements

Inserting data into tables

[*]Load a file into a table:
LOAD DATA INPATH 'filepath'
INTO TABLE table_name
　　

[*]Insert query results into a table:
INSERT OVERWRITE TABLE tablename
SELECT ....
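As a concrete sketch of inserting query results, the statement below writes per-module play counts for one day into a summary table. The table video_play_by_module is hypothetical and is created here only for the example.

```sql
-- Hypothetical summary table: one row per module per day.
CREATE TABLE IF NOT EXISTS video_play_by_module (fromparameter STRING, total BIGINT)
PARTITIONED BY (day STRING);

-- Replace the contents of the 20170901 partition with freshly computed counts.
-- The SELECT columns must line up with the target table's columns in order.
INSERT OVERWRITE TABLE video_play_by_module PARTITION (day='20170901')
SELECT fromparameter, count(1) AS total
FROM log_video_play_request
WHERE day = '20170901'
GROUP BY fromparameter;
```

INSERT OVERWRITE replaces whatever the target partition held before, so rerunning the statement for the same day is safe.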

SQL operations

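[*]Basic syntax: select, where, group by, distinct, having, join and so on
[*]Multi-way insert (multi insert):
FROM src
INSERT OVERWRITE TABLE table1 SELECT ... WHERE ...
INSERT OVERWRITE TABLE table2 SELECT ... WHERE ...
Multi insert is very common and very useful. A single log table often feeds several computations, and multi insert reads it once instead of once per computation, which saves the repeated I/O.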
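A multi insert over the playback log might look like the following sketch, which fills two summary tables from a single scan. Both target tables (video_play_by_module and video_play_by_creator) are hypothetical and are created here so the example is self-contained.

```sql
-- Hypothetical target tables for the two aggregations.
CREATE TABLE IF NOT EXISTS video_play_by_module (fromparameter STRING, total BIGINT)
PARTITIONED BY (day STRING);
CREATE TABLE IF NOT EXISTS video_play_by_creator (video_auiddigest STRING, total BIGINT)
PARTITIONED BY (day STRING);

-- One pass over the source table, two outputs: counts per module and counts per creator.
FROM log_video_play_request src
INSERT OVERWRITE TABLE video_play_by_module PARTITION (day='20170901')
  SELECT src.fromparameter, count(1)
  WHERE src.day = '20170901'
  GROUP BY src.fromparameter
INSERT OVERWRITE TABLE video_play_by_creator PARTITION (day='20170901')
  SELECT src.video_auiddigest, count(1)
  WHERE src.day = '20170901'
  GROUP BY src.video_auiddigest;
```

Each INSERT branch carries its own WHERE and GROUP BY, while the FROM clause is written once at the top.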

Examples
Using the log_video_play_request table from above:
select * from log_video_play_request where day = '20170901' limit 10;

-- Playback counts per module
select count(1) as total, fromparameter from log_video_play_request where day = '20170901' group by fromparameter order by total desc limit 100;

-- Top creators (users whose videos were played the most)
select count(1) as total, video_auiddigest from log_video_play_request where day = '20170901' group by video_auiddigest order by total desc limit 100;
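The same pattern extends to the distinct and having clauses listed under basic syntax. The sketch below counts distinct viewer accounts and the average play duration per country for the 14:00 hour. The threshold of 10 plays is arbitrary, and the unit of play_duration is not stated in the source, so the average is in whatever unit the log uses.

```sql
-- Distinct viewers and average play duration per country, for countries with at least 10 plays.
SELECT country_code,
       count(1) AS plays,
       count(DISTINCT auiddigest) AS viewers,
       avg(play_duration) AS avg_play_duration
FROM log_video_play_request
WHERE day = '20170901' AND hour = '14'
GROUP BY country_code
HAVING count(1) >= 10
ORDER BY viewers DESC
LIMIT 20;
```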
