Sqoop
Overview
Sqoop builds a bridge between relational databases and Hadoop,
letting us import and export data conveniently in both directions.
Under the hood, Sqoop runs MapReduce jobs to do the transfer.
RDBMS ==> Hadoop (HDFS/Hive/HBase...)
e.g. importing the emp table: Sqoop generates the MapReduce code for us
Hadoop (HDFS/Hive/HBase...) ==> RDBMS
Versions
1.4.7  Sqoop1
1.99.7 Sqoop2
V1 and V2 are completely unrelated and not compatible with each other
(similar to Struts1 vs Struts2)
The CDH5 build: sqoop-1.4.6-cdh5.16.2.tar.gz
(just as hive-1.1.0-cdh5.16.2 is not the same thing as Apache Hive 1.1.0)
Installation
[bigdata@hadoop001 software]$ ll
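The listing above is just the downloaded tarballs; a minimal sketch of the unpack-and-symlink step, assuming the CDH tarball from the version section and the ~/app layout used later in these notes:
# unpack the Sqoop tarball and create a version-free symlink
[bigdata@hadoop001 software]$ tar -zxvf sqoop-1.4.6-cdh5.16.2.tar.gz -C ~/app/
[bigdata@hadoop001 software]$ ln -s ~/app/sqoop-1.4.6-cdh5.16.2 ~/app/sqoop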
Modify the configuration
[bigdata@hadoop001 sqoop]$ cd conf
Configure these three paths (in sqoop-env.sh):
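The original snippet is missing here; a minimal sketch of what sqoop-env.sh typically needs, assuming Hadoop and Hive live under /home/bigdata/app (adjust to your own layout):
[bigdata@hadoop001 conf]$ cp sqoop-env-template.sh sqoop-env.sh
[bigdata@hadoop001 conf]$ vi sqoop-env.sh
# the three paths: Hadoop common, Hadoop MapReduce and Hive installation directories
export HADOOP_COMMON_HOME=/home/bigdata/app/hadoop
export HADOOP_MAPRED_HOME=/home/bigdata/app/hadoop
export HIVE_HOME=/home/bigdata/app/hive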
Configure the system environment variables
[root@hadoop001 ~]# vi /etc/profile
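A minimal sketch of the /etc/profile entries, assuming Sqoop is linked at /home/bigdata/app/sqoop (the path used elsewhere in these notes):
# add Sqoop to the environment, then reload the profile
export SQOOP_HOME=/home/bigdata/app/sqoop
export PATH=$SQOOP_HOME/bin:$PATH
[root@hadoop001 ~]# source /etc/profile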
To connect to MySQL, copy the JDBC driver jar into the lib directory
[bigdata@hadoop001 lib]$ rz
View the command help
[bigdata@hadoop001 ~]$ sqoop help
List databases
[bigdata@hadoop001 ~]$ sqoop list-databases --connect jdbc:mysql://localhost:3306 --username root --password '123456'
List tables
sqoop list-tables \
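The list-tables command above is truncated; a full invocation would presumably look like this, reusing the lxl database from the import examples:
sqoop list-tables \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456'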
IMPORT
Syntax
sqoop import \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
-m 1 \
--table emp \                               【cannot be used together with -e】
--where 'SAL>2000' \
--delete-target-dir \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \      【select the columns to import】
--mapreduce-job-name EmpFromMySQL2HDFS \    【set the MapReduce job name】
--null-string '' \
--null-non-string 0 \
--fields-terminated-by '\t' \               【set the field delimiter for the import】
--target-dir EMP_COLUMN                     【specify the target HDFS directory】
- The default HDFS import path is /user/bigdata[username]/<table name>
- By default the import produces 4 output files (4 map tasks running in parallel); change this with -m
Run:
sqoop import \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
-m 1 \
--table emp \
--delete-target-dir \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \
--mapreduce-job-name EmpFromMySQL2HDFS \
--null-string '' \
--null-non-string 0 \
--fields-terminated-by '\t' \
--where 'SAL>2000' \
--target-dir EMP_COLUMN
[bigdata@hadoop001 lib]$ hadoop fs -text EMP_COLUMN/*
20/12/19 23:39:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7566 JONES MANAGER 2975.00 0
7698 BLAKE MANAGER 2850.00 0
7782 CLARK MANAGER 2450.00 0
7788 SCOTT ANALYST 3000.00 0
7839 KING PRESIDENT 5000.00 0
7902 FORD ANALYST 3000.00 0
Problems
- 【No primary key】
[bigdata@hadoop001 ~]$ sqoop import \
- Fix: add -m 1;
- Fix: or add --split-by product_id (see the sketch below)
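A minimal sketch of the --split-by fix, assuming a product_info table with no primary key (the connection options mirror the earlier examples; --split-by names the column Sqoop uses to divide the rows among the map tasks):
sqoop import \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table product_info \
--delete-target-dir \
--split-by product_id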
- 【Missing jar】
[bigdata@hadoop001 ~]$ sqoop import \
- Fix: upload the java-json.jar package to the lib directory
[bigdata@hadoop001 ~]$ cd app/sqoop/lib/
[bigdata@hadoop001 lib]$ rz
rz waiting to receive.
Starting zmodem transfer. Press Ctrl+C to cancel.
Transferring java-json.jar...
100% 82 KB 82 KB/sec 00:00:01 0 Errors
- 【HDFS target path already exists】
Importing again into the same target directory
- Fix: add --delete-target-dir to delete the directory first and then import
- 【--table/--where cannot be used together with -e】
--table emp \ --where 'SAL>2000' \
cannot be combined with
-e "SELECT * FROM emp WHERE EMPNO>=7900 AND $CONDITIONS" \
(also make sure the table name has the correct case)
Cannot specify --query and --table together.
- 【WHERE clause missing '$CONDITIONS'】
[bigdata@hadoop001 ~]$ sqoop import \
> --connect jdbc:mysql://localhost:3306/lxl \
> --username root \
> --password '123456' \
> --delete-target-dir \
> -m 1 \
> --mapreduce-job-name EmpFromMySQL2HDFS \
> --null-string '' \
> --null-non-string 0 \
> --fields-terminated-by '\t' \
> -e "SELECT * FROM emp WHERE EMPNO>=7900" \
> --target-dir EMP_COLUMN_QUERY
20/12/19 23:48:38 ERROR tool.ImportTool: Import failed: java.io.IOException: Query [SELECT * FROM emp WHERE EMPNO>=7900] must contain '$CONDITIONS' in WHERE clause.
- Fix:
-e "SELECT * FROM emp WHERE EMPNO>=7900 AND \$CONDITIONS"
- With double quotes, you must put a backslash before $CONDITIONS to escape it
- With single quotes, no escaping is needed
EVAL
[bigdata@hadoop001 ~]$ sqoop eval \
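eval runs a SQL statement directly against the RDBMS and prints the result, which is handy for checking a query before importing. The line above is truncated; a minimal sketch of a full call (the query itself is only an example):
[bigdata@hadoop001 ~]$ sqoop eval \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--query "select * from emp limit 5"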
The --options-file command
Prepare a file in the required format (a sketch follows)
[bigdata@hadoop001 data]$ vi emp.opt
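The exact contents of the original emp.opt are not shown; a minimal sketch, assuming it wraps the earlier emp import. In an options file the tool name comes first and every option and value sits on its own line; lines starting with # are comments:
# emp.opt
import
--connect
jdbc:mysql://localhost:3306/lxl
--username
root
--password
123456
--table
emp
--delete-target-dir
-m
1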
Running it fails
[bigdata@hadoop001 ~]$ sqoop --options-file emp.opt
Grant permissions and run it again
[bigdata@hadoop001 data]$ chmod 744 emp.opt
The job command
Create a job
[bigdata@hadoop001 ~]$ sqoop job --create lxl_job -- \
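The job definition above is truncated; a minimal sketch of a full --create call, assuming it saves the earlier emp import (everything after the standalone -- is an ordinary import command):
[bigdata@hadoop001 ~]$ sqoop job --create lxl_job -- \
import \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table emp \
--delete-target-dir \
-m 1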
List
[bigdata@hadoop001 ~]$ sqoop job --list
Delete
[bigdata@hadoop001 ~]$ sqoop job --delete lxl_job
Execute
[bigdata@hadoop001 ~]$ sqoop job --exec lxl_job
RDBMS ==> Hive
- Prepare the data
hive (test)> create table emp_column like emp;
- Import (a full sketch follows these notes)
[bigdata@hadoop001 ~]$ sqoop import \
Note: --create-hive-table \ 【not used in production; it can change the column types】
If the target is a partitioned table, also add:
--hive-partition-key 'day' \
--hive-partition-value 'yyyyMMdd' \
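The import command above is truncated; a minimal sketch of what the full RDBMS-to-Hive import presumably looked like, reusing the test database and the emp_column table created above (the column list is an assumption based on the result shown further down):
sqoop import \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table emp \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \
--delete-target-dir \
--hive-import \
--hive-database test \
--hive-table emp_column \
--hive-overwrite \
--fields-terminated-by '\t' \
-m 1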
- Running it fails
20/12/20 09:44:25 ERROR util.SqlTypeMap: exist in the table. Please ensure that you've specified
Note: column names are case-sensitive
- After fixing the case and running again, it fails with
20/12/20 09:46:58 ERROR hive.HiveConfig: Could not load org.apache.hadoop.hive.conf.HiveConf. Make sure HIVE_CONF_DIR is set correctly.
Fix:
Check whether HIVE_CONF_DIR is set:
[bigdata@hadoop001 ~]$ echo $HIVE_CONF_DIR
If the output is empty, set it in the system environment variables
[root@hadoop001 ~]# vi /etc/profile
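A minimal sketch of the entry to add, assuming Hive is installed at /home/bigdata/app/hive as in the hive-site.xml path below:
export HIVE_CONF_DIR=/home/bigdata/app/hive/conf
[root@hadoop001 ~]# source /etc/profile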
- Running it again still fails
FAILED: SemanticException [Error 10072]: Database does not exist: test
Fix:
[bigdata@hadoop001 conf]$ cp /home/bigdata/app/hive/conf/hive-site.xml /home/bigdata/app/sqoop/conf
- The import succeeds; view the result
hive (test)> select * from emp_column;
OK
emp_column.empno emp_column.ename emp_column.job emp_column.mgr emp_column.hiredate emp_column.sal emp_column.comm emp_column.deptno
7369 SMITH CLERK 800 0 NULL NULL NULL
7499 ALLEN SALESMAN 1600 300.00 NULL NULL NULL
7521 WARD SALESMAN 1250 500.00 NULL NULL NULL
7566 JONES MANAGER 2975 0 NULL NULL NULL
7654 MARTIN SALESMAN 1250 1400.00 NULL NULL NULL
7698 BLAKE MANAGER 2850 0 NULL NULL NULL
7782 CLARK MANAGER 2450 0 NULL NULL NULL
7788 SCOTT ANALYST 3000 0 NULL NULL NULL
7839 KING PRESIDENT 5000 0 NULL NULL NULL
7844 TURNER SALESMAN 1500 0.00 NULL NULL NULL
7876 ADAMS CLERK 1100 0 NULL NULL NULL
7900 JAMES CLERK 950 0 NULL NULL NULL
7902 FORD ANALYST 3000 0 NULL NULL NULL
7934 MILLER CLERK 1300 0 NULL NULL NULL
Time taken: 0.048 seconds, Fetched: 14 row(s)
Hive ==> RDBMS
Prepare the data
mysql> select * from emp3;
Running the export fails
sqoop export \
Fix: the export needs the field delimiter specified (a full sketch follows)
sqoop export \
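The export command above is truncated; a minimal sketch of an export that specifies the delimiter, assuming the source is the Hive table's warehouse directory (the --export-dir path is an assumption; --input-fields-terminated-by tells the export how to parse the files it reads):
# the --export-dir below is an assumed Hive warehouse location
sqoop export \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table emp3 \
--export-dir /user/hive/warehouse/test.db/emp \
--input-fields-terminated-by '\t' \
-m 1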
View the result
mysql> select * from emp3;
The TopN problem
City table: MySQL city_info.sql
Product table: MySQL product_info.sql
User click log: Hive user_click.txt
- Prepare the data
[bigdata@hadoop001 data]$ rz
rz waiting to receive.
Starting zmodem transfer. Press Ctrl+C to cancel.
Transferring user_click.txt...
100% 674 KB 674 KB/sec 00:00:01 0 Errors
Run the two .sql files in MySQL
Create three tables in Hive
create table city_info(
city_id int,
city_name string,
area string
)
row format delimited fields terminated by '\t';
create table product_info(
product_id int,
product_name string,
extend_info string
)
row format delimited fields terminated by '\t';
create table user_click(
user_id string,
session_id string,
time string,
city_id int,
product_id int
) partitioned by(day string)
row format delimited fields terminated by ',';
Load the txt file into Hive
load data local inpath '/home/bigdata/data/user_click.txt' overwrite into table user_click partition(day='2016-05-05');
- Import the MySQL data into Hive
sqoop import --connect jdbc:mysql://localhost:3306/lxl \
--username root --password '123456' \
--table product_info \
--delete-target-dir \
--hive-import --hive-database test --hive-table product_info --hive-overwrite \
--fields-terminated-by '\t' \
--split-by product_id

sqoop import --connect jdbc:mysql://localhost:3306/lxl \
--username root --password '123456' \
--table city_info \
--delete-target-dir \
--hive-import --hive-database test \
--hive-table city_info --hive-overwrite \
--fields-terminated-by '\t' \
--split-by city_id
This fails:
20/12/20 11:49:30 INFO mapreduce.Job: Task Id : attempt_1606543264596_0067_m_000003_1, Status : FAILED
Error: java.io.IOException: Cannot run program "mysqldump": error=2, No such file or directory
Fix:
- Find mysqldump
[root@hadoop001 ~]# find / -name mysqldump
/usr/local/mysql/bin/mysqldump
- Copy it to sqoop/bin on the Sqoop node
[bigdata@hadoop001 ~]$ cp /usr/local/mysql/bin/mysqldump /home/bigdata/app/sqoop/bin
- Give mysqldump execute (x) permission
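The command itself is missing from the notes; presumably something like:
[bigdata@hadoop001 bin]$ chmod +x /home/bigdata/app/sqoop/bin/mysqldump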