Hive Introduction

This is about Hive

Posted by PsycheLee on 2015-12-06


Hive

  • What is Hive

    Hive is a data warehouse tool built on top of Hadoop, used for data extraction, transformation, and loading (ETL). It provides a mechanism to store, query, and analyze large-scale data stored in Hadoop.

  • Hive vs. RDBMS

    1. Distributed: both can be distributed
    2. Number of nodes: an RDBMS cluster has fewer nodes than a Hive cluster
    3. Cost: an RDBMS costs more
    4. Data volume: Hive handles much larger data sets
    5. update and delete: both support them; Hive requires version 0.14 or later, and using them in Hive is not recommended
    6. Transactions: both support them, but transactions are of little practical use in Hive
    7. Latency: Hive has high latency; it is designed for batch / offline processing
    8. Execution speed: Hive is slower
    9. Hive SQL has no relationship to MySQL's SQL dialect

  • Core components

    (image: Hive core components diagram)

  • The four BYs

    1: order by

    order by performs a global sort of the input, so there is only one reducer (multiple reducers cannot guarantee global order). With a single reducer, a large input takes a long time to compute.

    2: sort by

    sort by is not a global sort: it sorts the data before it enters each reducer. If you sort with sort by and set mapred.reduce.tasks>1, sort by only guarantees that each reducer's output is ordered, not that the overall output is globally ordered. Unlike order by, sort by is not affected by the hive.mapred.mode property; it only guarantees that the rows within the same reducer are sorted by the specified fields. With sort by you can choose the number of reducers (via set mapred.reduce.tasks=n) and then merge-sort the output files to obtain the complete sorted result.
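As a sketch of the behavior above (the emp table and sal column are illustrative names), each reducer emits its own sorted file, but the concatenation of those files is not globally sorted:

```sql
-- Three reducers, each producing one independently sorted output file
SET mapred.reduce.tasks=3;
SELECT empno, sal FROM emp SORT BY sal;
```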

    3: distribute by

    distribute by controls how rows are split from the map side to the reduce side, similar to the partitioner in MapReduce. Hive distributes rows across the configured number of reducers according to the columns after distribute by, using a hash by default. sort by then produces one sorted file per reducer. Sometimes you need to control which reducer a particular row goes to, usually for a subsequent aggregation; distribute by does exactly that, so it is often combined with sort by.
    Note: see above for the scenarios in which distribute by and sort by are used together.

    4: cluster by

    cluster by combines the functionality of distribute by with that of sort by on the same columns. However, the sort order is ascending only; you cannot specify ASC or DESC.
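The relationship between the last three BYs can be sketched as follows (emp and deptno are illustrative names):

```sql
-- distribute by + sort by on the same column ...
SELECT * FROM emp DISTRIBUTE BY deptno SORT BY deptno;

-- ... is equivalent to cluster by (ascending order only):
SELECT * FROM emp CLUSTER BY deptno;
```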

Hive Installation

Extract

[bigdata@hadoop001 ~]$ cd software/
[bigdata@hadoop001 software]$ ll
total 677416
-rw-r--r-- 1 bigdata bigdata 434354462 Sep 15 21:05 hadoop-2.6.0-cdh5.16.2.tar.gz
-rw-r--r-- 1 bigdata bigdata 128800223 Sep 19 17:00 hive-1.1.0-cdh5.16.2.tar.gz
[bigdata@hadoop001 software]$ tar -xzvf hive-1.1.0-cdh5.16.2.tar.gz -C ../app
[bigdata@hadoop001 app]$ ln -s hive-1.1.0-cdh5.16.2 hive

Configure environment variables

[root@hadoop001 ~]# vi /etc/profile
export HIVE_HOME=/home/bigdata/app/hive
export PATH=${HIVE_HOME}/bin:$PATH
[root@hadoop001 ~]# source /etc/profile

Start

[bigdata@hadoop001 hive]$ $HIVE_HOME/bin/hive

Error

hive> show databases;
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
  • Hive has started, and two new files appear: derby.log and metastore_db
    The embedded Derby metastore supports only a single session

    To locate the error, check the log destination configured in hive-log4j.properties:
    ${java.io.tmpdir}/${user.name}
    /tmp/hadoop/hive.log

Integrating MySQL

Upload the MySQL connector jar to lib

[bigdata@hadoop001 lib]$ ll
-rw-r--r-- 1 bigdata bigdata 1007502 Sep 19 18:06 mysql-connector-java-5.1.47.jar

Configuration file

[bigdata@hadoop001 conf]$ vi hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/lxl_hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
</configuration>


If an error occurs


Run the following in MySQL

mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'222.209.157.41' IDENTIFIED BY '123456' ;  
Query OK, 0 rows affected, 1 warning (0.01 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

Inspecting the metadata

DBS: database information
COLUMNS_V2: table column information
TBLS: table information
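Assuming the metastore database created above (lxl_hive), these tables can be queried directly in MySQL. The column names below match the CDH 5.x metastore schema, but verify them against your own deployment:

```sql
USE lxl_hive;

-- Databases and their HDFS locations
SELECT DB_ID, NAME, DB_LOCATION_URI FROM DBS;

-- Tables and their types (MANAGED_TABLE / EXTERNAL_TABLE)
SELECT TBL_ID, TBL_NAME, TBL_TYPE, DB_ID FROM TBLS;

-- Column definitions
SELECT COLUMN_NAME, TYPE_NAME FROM COLUMNS_V2;
```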

Creating databases and tables in Hive

## A database named default exists out of the box
hive> show databases;
OK
default
Time taken: 2.181 seconds, Fetched: 1 row(s)
## Create database db1
hive> create database db1;
OK
Time taken: 2.84 seconds
hive> show databases;
OK
db1
default
Time taken: 0.157 seconds, Fetched: 2 row(s)
hive> use db1;
OK
Time taken: 0.013 seconds
## Create a user table in db1
hive> create table user(id int,name string);
OK
Time taken: 0.24 seconds
hive> insert into user values(1,'xiaoming');
Query ID = bigdata_20201206224646_549765c3-8cb8-4138-8d36-b2a7f6f639ee
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606543264596_0001, Tracking URL = http://hadoop001:7776/proxy/application_1606543264596_0001/
Kill Command = /home/bigdata/app/hadoop/bin/hadoop job -kill job_1606543264596_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-06 22:46:39,228 Stage-1 map = 0%, reduce = 0%
2020-12-06 22:46:43,506 Stage-1 map = 100%, reduce = 0%
MapReduce Total cumulative CPU time: 1 seconds 600 msec
Ended Job = job_1606543264596_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/db1.db/user/.hive-staging_hive_2020-12-06_22-46-29_182_5728351694700927030-1/-ext-10000
Loading data to table db1.user
Table db1.user stats: [numFiles=1, numRows=1, totalSize=11, rawDataSize=10]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.6 sec HDFS Read: 3753 HDFS Write: 75 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 600 msec
OK
Time taken: 16.701 seconds
hive> select * from user;
OK
1 xiaoming
Time taken: 0.059 seconds, Fetched: 1 row(s)

View on HDFS

http://hadoop001:50071/explorer.html#/user/hive/warehouse/db1.db/user

Hive parameter settings

Setting via hive-site.xml

  • hive.metastore.warehouse.dir [the production default; no need to change]
    • Default Value: /user/hive/warehouse
    • Added In: Hive 0.2.0

    Location of default database for the warehouse.

  • hive.cli.print.header [print column names in query output; set to true]
    • Default Value: false
    • Added In: Hive 0.7.0

    Whether to print the names of the columns in query output.

  • hive.cli.print.current.db [show the current database in the prompt; set to true]
    • Default Value: false
    • Added In: Hive 0.8.1

    Whether to include the current database in the Hive prompt.

  • hive.exec.mode.local.auto: defaults to false; can be set to true to enable local mode, which speeds things up for testing
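Local mode can also be tried per session with set (session-scoped, reverts on exit); the query below assumes the db1.user table created earlier:

```sql
-- Enable automatic local mode for small jobs in this session only
SET hive.exec.mode.local.auto=true;

-- Small queries now run locally instead of launching a cluster job
SELECT count(*) FROM db1.user;
```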

 [bigdata@hadoop001 conf]$ vi hive-site.xml 
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
[bigdata@hadoop001 ~]$ hive    [restart hive]
hive (default)> show databases;
OK
database_name
db1
default
Time taken: 0.029 seconds, Fetched: 2 row(s)
hive (default)> use db1;
OK
Time taken: 0.025 seconds
hive (db1)> show tables;
OK
tab_name
user
Time taken: 0.018 seconds, Fetched: 1 row(s)
hive (db1)> select * from user;
OK
user.id user.name
1 xiaoming
Time taken: 0.339 seconds, Fetched: 1 row(s)

Setting via hive --hiveconf

[bigdata@hadoop001 hive]$ hive --hiveconf hive.cli.print.current.db=false  
which: no hbase in (/home/bigdata/app/hive/bin:/home/bigdata/app/hadoop/bin:/home/bigdata/app/hadoop/sbin:/usr/java/jdk1.8.0_181/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/bigdata/.local/bin:/home/bigdata/bin)
20/12/13 17:55:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Logging initialized using configuration in jar:file:/home/bigdata/app/hive-1.1.0-cdh5.16.2/lib/hive-common-1.1.0-cdh5.16.2.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>

Setting via set

[bigdata@hadoop001 ~]$ hive
which: no hbase in (/home/bigdata/app/hive/bin:/home/bigdata/app/hadoop/bin:/home/bigdata/app/hadoop/sbin:/usr/java/jdk1.8.0_181/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/bigdata/.local/bin:/home/bigdata/bin)
20/12/16 18:57:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Logging initialized using configuration in jar:file:/home/bigdata/app/hive-1.1.0-cdh5.16.2/lib/hive-common-1.1.0-cdh5.16.2.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive (default)> set hive.cli.print.current.db; [without an equals sign, this reads the value]
hive.cli.print.current.db=true
hive (default)> set hive.cli.print.current.db=false; [with an equals sign, this sets the value]
hive> [bigdata@hadoop001 ~]$

Priority

hive-site.xml < hive --hiveconf < set a=b (each later mechanism overrides the earlier one; set applies only to the current session)

Hive command-line options

--hiveconf: set a configuration property

-i: run an initialization SQL file (commonly used to register UDFs)

-H, --help: print help information

[bigdata@hadoop001 hive]$ hive -help 
which: no hbase in (/home/bigdata/app/hive/bin:/home/bigdata/app/hadoop/bin:/home/bigdata/app/hadoop/sbin:/usr/java/jdk1.8.0_181/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/bigdata/.local/bin:/home/bigdata/bin)
20/12/13 17:58:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
usage: hive
-d,--define <key=value> Variable subsitution to apply to hive
commands. e.g. -d A=B or --define A=B
--database <databasename> Specify the database to use
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable subsitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the
console)

-e: run a SQL statement from the command line [no need to enter the hive shell]

[bigdata@hadoop001 hive]$ hive -e "select * from db1.user"
which: no hbase in (/home/bigdata/app/hive/bin:/home/bigdata/app/hadoop/bin:/home/bigdata/app/hadoop/sbin:/usr/java/jdk1.8.0_181/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/bigdata/.local/bin:/home/bigdata/bin)
20/12/13 18:03:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Logging initialized using configuration in jar:file:/home/bigdata/app/hive-1.1.0-cdh5.16.2/lib/hive-common-1.1.0-cdh5.16.2.jar!/hive-log4j.properties
OK
user.id user.name
1 xiaoming
Time taken: 3.573 seconds, Fetched: 1 row(s)

-f: run a SQL file

[bigdata@hadoop001 hive]$ vi hive.sql
select * from db1.user
[bigdata@hadoop001 hive]$ hive -f hive.sql
which: no hbase in (/home/bigdata/app/hive/bin:/home/bigdata/app/hadoop/bin:/home/bigdata/app/hadoop/sbin:/usr/java/jdk1.8.0_181/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/bigdata/.local/bin:/home/bigdata/bin)
20/12/13 18:08:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Logging initialized using configuration in jar:file:/home/bigdata/app/hive-1.1.0-cdh5.16.2/lib/hive-common-1.1.0-cdh5.16.2.jar!/hive-log4j.properties
OK
user.id user.name
1 xiaoming
Time taken: 3.384 seconds, Fetched: 1 row(s)

Viewing HDFS files from inside Hive

hive (default)> dfs -ls /user/hive/warehouse;
Found 1 items
drwxr-xr-x - bigdata supergroup 0 2020-12-13 18:46 /user/hive/warehouse/db1.db
hive (default)> dfs -ls /user/hive/warehouse/db1.db/user;
Found 1 items
-rwxr-xr-x 1 bigdata supergroup 11 2020-12-06 22:46 /user/hive/warehouse/db1.db/user/000000_0

DDL: Database Operations

CREATE

Syntax: CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[MANAGEDLOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, …)];

hive (default)> show databases;
OK
database_name
db1
default
Time taken: 0.236 seconds, Fetched: 2 row(s)
hive (default)> create database db2;
OK
Time taken: 0.073 seconds
hive (default)> show databases;
OK
database_name
db1
db2
default
Time taken: 0.018 seconds, Fetched: 3 row(s)
  • The default database default: /user/hive/warehouse
  • A database you create: /user/hive/warehouse/db2.db
hive (default)> create database db2;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database db2 already exists
[the database already exists]
[add IF NOT EXISTS]
hive (default)> create database if not exists db2;
OK
  • A database created with an explicit LOCATION has no .db suffix on its directory

    hive (default)> create database if not exists db3 location '/user/hive/warehouse/db3'; [quote the HDFS path]
    OK
    Time taken: 0.016 seconds
    hive (default)> show databases;
    OK
    database_name
    db1
    db2
    db3
    default
    Time taken: 0.016 seconds, Fetched: 4 row(s)
  • Adding COMMENT and DBPROPERTIES

    hive (default)> CREATE DATABASE IF NOT EXISTS test2 COMMENT 'it is my hive database' WITH DBPROPERTIES('creator'='psyche','date'='2100-10-11');
    OK
    Time taken: 0.027 seconds

    This metadata is stored in the MySQL metastore tables DBS and DATABASE_PARAMS.

  • View database info: desc database extended <database name>

    hive (default)> desc database extended test2;
    OK
    db_name comment location owner_name owner_type parameters
    test2 it is my hive database hdfs://hadoop001:9000/user/hive/warehouse/test2.db bigdata USER {date=2100-10-11, creator=psyche}
  • Switch database: use <database name>

    hive (default)> use test2;
    OK
    Time taken: 0.01 seconds
    hive (test2)>

ALTER [rarely used in production]

Syntax:

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, …); -- (Note: SCHEMA added in Hive 0.14.0)
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role; -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)
ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path; -- (Note: Hive 2.2.1, 2.4.0 and later)
ALTER (DATABASE|SCHEMA) database_name SET MANAGEDLOCATION hdfs_path; -- (Note: Hive 4.0.0 and later)

hive (default)> ALTER DATABASE test2 SET DBPROPERTIES('year'='2088');

DROP

Syntax: DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

hive (default)> DROP DATABASE IF EXISTS db2;

If the database is not empty, a direct drop fails; [CASCADE] deletes the database and its tables recursively [use with caution]

hive (test2)> create table user(id int);
OK
Time taken: 0.07 seconds
hive (test2)> drop database test2;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database test2 is not empty. One or more tables exist.)
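The failed drop above can be forced with CASCADE; as a sketch, note that this irreversibly deletes the database together with every table in it:

```sql
-- CASCADE drops the database and all of its tables -- use with care
DROP DATABASE IF EXISTS test2 CASCADE;
```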

DDL: Table Operations

Hive data types

int: e.g. 10, 20, 30
bigint: equivalent to long
double
float
string

Hive delimiters

Default: \001 (Ctrl-A, displayed as ^A)

Column delimiters commonly used in practice: \t, space, or a custom character such as |, e.g.:
3|xiaohong|18
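To load rows like the pipe-delimited example above, the delimiter is declared in ROW FORMAT; the table and column names here are illustrative:

```sql
-- Matches rows of the form: 3|xiaohong|18
CREATE TABLE IF NOT EXISTS student(
  id   int,
  name string,
  age  int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
```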

CREATE

Syntax:

CREATE [TEMPORARY][EXTERNAL] TABLE [IF NOT EXISTS][db_name.]table_name

[(col_name data_type [column_constraint_specification][COMMENT col_comment], … [constraint_specification])]

[COMMENT table_comment]

[PARTITIONED BY (col_name data_type [COMMENT col_comment], …)]

[CLUSTERED BY (col_name, col_name, …) [SORTED BY (col_name [ASC|DESC], …)] INTO num_buckets BUCKETS]

[ROW FORMAT row_format]

[STORED AS file_format]

[LOCATION hdfs_path]

[AS select_statement];

Notes:

CREATE TABLE: the name of the table to create
col_name data_type: column names and their data types, separated by commas
PARTITIONED BY: partition specification
CLUSTERED BY: sorting and bucketing
ROW FORMAT: field delimiters and other serde information
STORED AS: storage format of the table: textfile, rc, orc, parquet
LOCATION: storage path of the table on the file system
AS select_statement: create the table from the result of a SELECT statement

Creating a table (1)

hive (test2)> 
> create table emp_managed(
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal double,
> comm double,
> deptno int
> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.051 seconds

View table structure: desc [formatted] emp_managed

hive (test2)> desc formatted emp_managed;
OK
col_name data_type comment
# col_name data_type comment

empno int
ename string
job string
mgr int
hiredate string
sal double
comm double
deptno int

# Detailed Table Information
Database: test2
OwnerType: USER
Owner: bigdata
CreateTime: Wed Dec 16 20:21:37 CST 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoop001:9000/user/hive/warehouse/test2.db/emp_managed
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1608121297

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.107 seconds, Fetched: 35 row(s)

Creating a table (2): create table emp2 like emp;

[copies only the table structure; no MapReduce job runs]

Creating a table (3): create table emp3 as select * from emp;

[copies both structure and data; runs a MapReduce job]

Managed (internal) vs. external tables

Managed table (MANAGED_TABLE): created with create table xxx; [the default]
After dropping: both the HDFS data and the MySQL metadata are deleted
External table: created with create EXTERNAL table xxx;
After dropping: the MySQL metadata is deleted, but the HDFS data remains

create external table emp_external(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/dw_directory/emp_external'
;

Converting between managed and external tables

ALTER TABLE emp_external SET TBLPROPERTIES('EXTERNAL'='FALSE');
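Both directions, as a sketch; older Hive versions treat the property value case-sensitively, so the uppercase literals TRUE/FALSE are the safest choice:

```sql
-- External -> managed
ALTER TABLE emp_external SET TBLPROPERTIES('EXTERNAL'='FALSE');

-- Managed -> external
ALTER TABLE emp_external SET TBLPROPERTIES('EXTERNAL'='TRUE');
```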

ALTER

Syntax:

ALTER TABLE table_name
ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], …)
[CASCADE|RESTRICT]

ALTER TABLE dept ADD COLUMNS(info string);

ALTER TABLE table_name CHANGE col_old_name col_new_name column_type
[COMMENT col_comment][FIRST|AFTER column_name] [CASCADE|RESTRICT];

ALTER TABLE dept CHANGE info infos string;

ALTER TABLE dept REPLACE COLUMNS(deptno int, dname string, loc string);