Hive_3

this is about Hive

Posted by PsycheLee on 2018-09-08


PARTITIONED BY

Partitioned tables
Typical use: an operation-log table recording and querying operations (who, when, what). Without partitions you would end up with one table per day:
op_log_20300501 <==
op_log_20300502 <==

  • Regular table:

/user/hive/warehouse/ruozedata_hive.db/access
/1.log
/2.log
/…log

  • Partitioned table:

/user/hive/warehouse/ruozedata_hive.db/access/day=20300501
/1.log
/2.log
/…log

/user/hive/warehouse/ruozedata_hive.db/access/day=20300502
/1.log
/2.log
/…log

Full table scan vs. partition scan: with a partition predicate, Hive reads only the matching partition directories instead of the whole table.

Syntax:

Create a partitioned table: add PARTITIONED BY (col_name data_type [COMMENT col_comment], …)

Load data into a specific partition: PARTITION(col_name=value)

Query with the partition predicate: WHERE col_name=value

Drop/add a partition: ALTER TABLE dept_partition DROP/ADD PARTITION(day='20190203')
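
Putting the pieces together, the full lifecycle looks like this (a condensed sketch of the statements demonstrated step by step below):

create table dept_partition(deptno int, dname string, loc string)
PARTITIONED BY (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/home/bigdata/data/dept20190201.txt'
OVERWRITE INTO TABLE dept_partition PARTITION(day='20190201');

select * from dept_partition where day = '20190201';

alter table dept_partition add partition(day='20190204');
alter table dept_partition drop partition(day='20190201');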

Data preparation

hive (test)> create table dept_partition(
> deptno int,
> dname string,
> loc string
> )
> PARTITIONED BY (day string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.427 seconds
hive (test)>

Inspect the table structure

hive (test)> desc formatted dept_partition ;
OK
col_name data_type comment
# col_name data_type comment

deptno int
dname string
loc string

# Partition Information
# col_name data_type comment

day string 【partition column】

# Detailed Table Information
Database: test
OwnerType: USER
Owner: bigdata
CreateTime: Sat Dec 19 07:42:11 CST 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop001:9000/user/hive/warehouse/test.db/dept_partition
Table Type: MANAGED_TABLE
...

Load data

hive (test)> LOAD DATA LOCAL INPATH '/home/bigdata/data/dept20190201.txt' OVERWRITE INTO TABLE dept_partition
> partition(day=20190201);
Loading data to table test.dept_partition partition (day=20190201)
Partition test.dept_partition{day=20190201} stats: [numFiles=1, numRows=0, totalSize=79, rawDataSize=0]
OK
Time taken: 0.327 seconds
hive (test)> select * from dept_partition where day = '20190201'; 【with the partition predicate】
OK
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
10 ACCOUNTING NEW YORK 20190201
20 RESEARCH DALLAS 20190201
30 SALES CHICAGO 20190201
40 OPERATIONS BOSTON 20190201
Time taken: 0.224 seconds, Fetched: 4 row(s)

Load data into the other partitions, then check the directory on HDFS: dept_partition now contains multiple partition subdirectories.


You can also inspect the partition metadata in the MySQL metastore, in the PARTITIONS table.
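
A quick sketch of what is stored there; this assumes the metastore database in MySQL is named hive (adjust to your installation):

mysql> use hive;
mysql> select PART_ID, PART_NAME, TBL_ID from PARTITIONS;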

【Note】 Partition data uploaded to HDFS by hand has no metadata in the metastore, so querying it directly returns no rows:

# Create a directory on HDFS whose path matches the table's partition layout
[bigdata@hadoop001 data]$ hadoop fs -mkdir -p /user/hive/warehouse/test.db/dept_partition/day=20190203
# Upload the data
[bigdata@hadoop001 data]$ hadoop fs -put dept20190203.txt /user/hive/warehouse/test.db/dept_partition/day=20190203/
# Query in Hive
hive (test)> select * from dept_partition where day = '20190203';
OK
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
Time taken: 0.055 seconds

Repair the metadata

【Not used in production: it rescans the entire partitioned table, so performance is poor; prefer adding the partition explicitly with ALTER TABLE … ADD PARTITION, shown below】

MSCK REPAIR TABLE tablename;

hive (test)> MSCK REPAIR TABLE dept_partition;
OK
Partitions not in metastore: dept_partition:day=20190203
Repair: Added partition to metastore dept_partition:day=20190203
Time taken: 0.096 seconds, Fetched: 2 row(s)
hive (test)> select * from dept_partition where day = '20190203';
OK
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
10 ACCOUNTING NEW YORK 20190203
20 RESEARCH DALLAS 20190203
30 SALES CHICAGO 20190203
40 OPERATIONS BOSTON 20190203
Time taken: 0.053 seconds, Fetched: 4 row(s)
hive (test)>

Add a partition

Use ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION (col_name=value);

hive (test)> ALTER TABLE dept_partition ADD IF NOT EXISTS PARTITION (day='20190204') ;
OK
Time taken: 0.041 seconds
hive (test)> select * from dept_partition where day = '20190204';
OK
dept_partition.deptno dept_partition.dname dept_partition.loc dept_partition.day
10 ACCOUNTING NEW YORK 20190204
20 RESEARCH DALLAS 20190204
30 SALES CHICAGO 20190204
40 OPERATIONS BOSTON 20190204
Time taken: 0.061 seconds, Fetched: 4 row(s)

Drop a partition

ALTER TABLE dept_partition DROP PARTITION(day='20300503'), PARTITION(day='20300502');

hive (test)> alter table dept_partition drop partition(day='20190201');
Dropped the partition day=20190201
OK
Time taken: 0.057 seconds
hive (test)> show partitions dept_partition;
OK
partition
day=20190202
day=20190203
day=20190204
day=20190205

List partitions

hive (test)> show partitions dept_partition;
OK
partition
day=20190201
day=20190202
day=20190203
day=20190204
day=20190205
Time taken: 0.041 seconds, Fetched: 5 row(s)

Insert data into a new partition

insert into table dept_partition partition(day='20190202')
select * from dept;

Multi-level partitions

Create the table

hive (test)> create table dept_partition_d_h(
> deptno int,
> dname string,
> loc string
> )
> PARTITIONED BY (day string, hour string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data

hive (test)> LOAD DATA LOCAL INPATH '/home/bigdata/data/dept20190203.txt' OVERWRITE INTO TABLE dept_partition_d_h PARTITION (day='20190203', hour='21');
Loading data to table test.dept_partition_d_h partition (day=20190203, hour=21)
Partition test.dept_partition_d_h{day=20190203, hour=21} stats: [numFiles=1, numRows=0, totalSize=79, rawDataSize=0]
OK
Time taken: 0.585 seconds

Query

hive (test)> select * from dept_partition_d_h where day='20190203' and hour='21';
OK
dept_partition_d_h.deptno dept_partition_d_h.dname dept_partition_d_h.loc dept_partition_d_h.day dept_partition_d_h.hour
10 ACCOUNTING NEW YORK 20190203 21
20 RESEARCH DALLAS 20190203 21
30 SALES CHICAGO 20190203 21
40 OPERATIONS BOSTON 20190203 21
Time taken: 0.055 seconds, Fetched: 4 row(s)

Static partitions

Dynamic partitions

Use case: insert the rows of the emp table into a corresponding partitioned table, partitioned by deptno.

Inserting with static partitions

hive (test)> create table emp_partition(
> `empno` int,
> `ename` string,
> `job` string,
> `mgr` int,
> `hiredate` string,
> `sal` double,
> `comm` double)
> partitioned by(deptno int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
## Insert one partition at a time, filtering on the partition condition
hive (test)> insert into table emp_partition partition(deptno=30)
> select empno,ename,job,mgr,hiredate,sal,comm from emp where deptno=30;

Inserting with dynamic partitions

hive (test)> create table emp_dynamic_partition(
> `empno` int,
> `ename` string,
> `job` string,
> `mgr` int,
> `hiredate` string,
> `sal` double,
> `comm` double)
> partitioned by(deptno int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive (test)>
> insert into table emp_dynamic_partition partition(deptno)
> select empno,ename,job,mgr,hiredate,sal,comm, deptno from emp;
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
# Fix
hive (test)> set hive.exec.dynamic.partition.mode=nonstrict;
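
With nonstrict mode set, re-running the same INSERT should now succeed (a sketch; the transcript above omits the re-run):

hive (test)> insert into table emp_dynamic_partition partition(deptno)
           > select empno,ename,job,mgr,hiredate,sal,comm,deptno from emp;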

Verify

hive (test)> select * from emp_dynamic_partition;
OK
emp_dynamic_partition.empno emp_dynamic_partition.ename emp_dynamic_partition.job emp_dynamic_partition.mgr emp_dynamic_partition.hiredate emp_dynamic_partition.sal emp_dynamic_partition.comm emp_dynamic_partition.deptno
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
8888 HIVE PROGRAM 7839 1988-1-23 10300.0 NULL NULL

Functions

  • built-in
  • UDF: User-Defined Function

Date functions

  • unix_timestamp
  • from_unixtime
  • to_date
  • year month day hour minute second
  • weekofyear
  • dayofmonth
  • months_between
  • add_months
  • datediff
  • date_add
  • date_sub
  • last_day

Numeric functions

  • round
  • ceil
  • floor

String functions

  • upper/lower
  • length
  • trim
  • lpad/rpad
  • regexp_replace
  • substr
  • concat
  • concat_ws
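
A few of these in action (a quick sketch; the commented results reflect the documented behavior of these built-ins):

select to_date('2019-01-01 10:00:00');              -- 2019-01-01
select months_between('2019-02-01', '2019-01-01');  -- 1.0
select date_add('2019-01-01', 31);                  -- 2019-02-01
select round(3.567, 2), ceil(3.1), floor(3.9);      -- 3.57  4  3
select lpad('7', 3, '0'), concat_ws('-', 'a', 'b'); -- 007  a-b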

Conditional expressions

CASE WHEN

case sex when '1' then 1 else 0 end

IF

if(sex = '1', 1, 0)
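
For example, counting men and women per department (a sketch using the emp_info table loaded in the row-to-column section below, assuming sex '1' is male and '2' is female):

select dept,
       sum(case sex when '1' then 1 else 0 end) as male_cnt,
       sum(if(sex = '2', 1, 0)) as female_cnt
from emp_info
group by dept;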

Application

domain time traffic (unit: T)
gifshow.com 2019/01/01 5
yy.com 2019/01/01 4
huya.com 2019/01/01 1
gifshow.com 2019/01/20 6
gifshow.com 2019/02/01 8
yy.com 2019/01/20 5
gifshow.com 2019/02/02 7

Requirement: use SQL to compute each domain's monthly traffic and its running cumulative total, producing the result below:

domain month subtotal cumulative
gifshow.com 2019-01 11 11
gifshow.com 2019-02 15 26
yy.com 2019-01 9 9
huya.com 2019-01 1 1

select A.domain, A.month, max(A.traffic) as month_traffic, sum(B.traffic) as total_traffic
from
(select domain, substr(regexp_replace(time,'/','-'),1,7) as month, sum(traffic) as traffic from ip_info group by domain, substr(regexp_replace(time,'/','-'),1,7)) A
inner join
(select domain, substr(regexp_replace(time,'/','-'),1,7) as month, sum(traffic) as traffic from ip_info group by domain, substr(regexp_replace(time,'/','-'),1,7)) B
on A.domain = B.domain
where B.month <= A.month
group by A.domain, A.month
order by A.domain, A.month;
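
On Hive 0.11+ the same result can be computed without the self-join, using a windowed sum (an alternative sketch, same ip_info table):

select domain, month, traffic as month_traffic,
       sum(traffic) over (partition by domain order by month) as total_traffic
from
(select domain, substr(regexp_replace(time,'/','-'),1,7) as month, sum(traffic) as traffic
 from ip_info
 group by domain, substr(regexp_replace(time,'/','-'),1,7)) t;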

Rows to columns: collect_set / collect_list

  • Data preparation
hive (test)> create table emp_info(
> id string,
> name string,
> dept string,
> sex string
> )ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 0.424 seconds
hive (test)> LOAD DATA LOCAL INPATH '/home/bigdata/data/emp_info.txt' OVERWRITE INTO TABLE emp_info;
Loading data to table test.emp_info
Table test.emp_info stats: [numFiles=1, totalSize=92]
OK
Time taken: 0.965 seconds
hive (test)> select * from emp_info;
OK
emp_info.id emp_info.name emp_info.dept emp_info.sex
1 JIM RD 1
2 PETTER RD 1
3 XIAOHONG RD 2
4 XIAOZHANG QA 1
5 XIAOLI QA 2
6 XIAOFANG QA 2
Time taken: 0.218 seconds, Fetched: 6 row(s)
  • Requirement: group together people with the same department and sex
  1. Concatenate the department and sex fields

    hive (test)> select name,concat_ws(',',dept,sex)dept_sex from  emp_info ;
    OK
    name dept_sex
    JIM RD,1
    PETTER RD,1
    XIAOHONG RD,2
    XIAOZHANG QA,1
    XIAOLI QA,2
    XIAOFANG QA,2
    Time taken: 0.09 seconds, Fetched: 6 row(s)
  2. Group by the concatenated field and collect the names together

    hive (test)> select dept_sex,concat_ws('|',collect_set(t.name)) from
    > (select name,concat_ws(',',dept,sex) dept_sex from emp_info) t
    > group by dept_sex;
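
    With the sample data above, this should produce something like the following (collect_set removes duplicates and does not guarantee ordering, so the names may come out in a different order):

    dept_sex _c1
    QA,1 XIAOZHANG
    QA,2 XIAOLI|XIAOFANG
    RD,1 JIM|PETTER
    RD,2 XIAOHONG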

Columns to rows: explode(split(column_name, ','))

  • Data preparation
hive (test)> create table column2row(name string, courses string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.038 seconds
hive (test)> LOAD DATA LOCAL INPATH '/home/bigdata/data/column2row.txt' OVERWRITE INTO TABLE column2row;
Loading data to table test.column2row
Table test.column2row stats: [numFiles=1, totalSize=57]
OK
Time taken: 0.167 seconds
hive (test)> select * from column2row;
OK
column2row.name column2row.courses
JIM MapReduce,Hive,Spark,Flink
PETTER Hadoop,HBase,Kafka
  • Show each name with each of its courses on a separate row
hive (test)> select 
> name, course
> from
> column2row
> lateral view explode(split(courses,',')) course_tmp as course;
OK
name course
JIM MapReduce
JIM Hive
JIM Spark
JIM Flink
PETTER Hadoop
PETTER HBase
PETTER Kafka
Time taken: 0.07 seconds, Fetched: 7 row(s)

UDF

UDF: one row in, one row out, e.g. upper(ename), lower(ename)
UDAF: many rows in, one row out, e.g. sum, avg
UDTF: one row in, many rows out, e.g. explode
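
One of each, using built-ins (a sketch: emp is the table from the dynamic-partition example above, and wordcount is created just below):

select upper(ename), lower(ename) from emp;            -- UDF: one row per input row
select deptno, avg(sal) from emp group by deptno;      -- UDAF: one row per group
select explode(split(sentence, ',')) from wordcount;   -- UDTF: several rows per input row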

Requirement: implement word count (wc) in Hive

  • Prepare the data
hive (test)> create table wordcount(sentence string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.081 seconds
hive (test)> LOAD DATA LOCAL INPATH '/home/bigdata/data/wordcount.txt' OVERWRITE INTO TABLE wordcount;
Loading data to table test.wordcount
Table test.wordcount stats: [numFiles=1, totalSize=41]
OK
Time taken: 0.802 seconds
hive (test)> select * from wordcount;
OK
wordcount.sentence
hello,hello,hello
world,world
welcome
  • Split the field, then group and count
hive (test)> select 
> word, count(1) cnt
> from
> (
> select
> explode(split(sentence,',')) as word
> from
> wordcount
> ) t
> group by word
> order by cnt desc;
word cnt
hello 3
world 2
welcome 1