Sqoop

this is about Sqoop

Posted by PsycheLee on 2019-09-08

Sqoop

Overview

Sqoop builds a bridge between relational databases and Hadoop,
letting us import and export data conveniently.
Under the hood, Sqoop runs MapReduce jobs.

RDBMS ==> Hadoop (HDFS/Hive/HBase…), e.g. importing the emp table;
doing the same with hand-written MR means writing the code yourself.

Hadoop (HDFS/Hive/HBase…) ==> RDBMS

Versions

1.4.7  Sqoop1
1.99.7 Sqoop2

Sqoop1 and Sqoop2 are unrelated and incompatible with each other
(much like struts1 vs struts2).

CDH5 release used here: sqoop-1.4.6-cdh5.16.2.tar.gz

Note that hive-1.1.0-cdh5.16.2 is not the same as Apache Hive 1.1.0.

Installation

[bigdata@hadoop001 software]$ ll
total 707564
...
-rw-r--r-- 1 bigdata bigdata 30867985 Dec 18 22:51 sqoop-1.4.6-cdh5.16.2.tar.gz
[bigdata@hadoop001 software]$ tar -xzvf sqoop-1.4.6-cdh5.16.2.tar.gz -C /home/bigdata/app
[bigdata@hadoop001 app]$ ll
total 16
lrwxrwxrwx 1 bigdata bigdata 23 Nov 28 10:28 hadoop -> hadoop-2.6.0-cdh5.16.2/
drwxr-xr-x 18 bigdata bigdata 4096 Nov 28 13:30 hadoop-2.6.0-cdh5.16.2
lrwxrwxrwx 1 bigdata bigdata 20 Dec 6 17:22 hive -> hive-1.1.0-cdh5.16.2
drwxr-xr-x 12 bigdata bigdata 4096 Dec 13 23:09 hive-1.1.0-cdh5.16.2
drwxrwxr-x 9 bigdata bigdata 4096 Dec 19 18:22 rundeck
lrwxrwxrwx 1 bigdata bigdata 21 Dec 19 19:03 sqoop -> sqoop-1.4.6-cdh5.16.2
drwxr-xr-x 10 bigdata bigdata 4096 Jun 3 2019 sqoop-1.4.6-cdh5.16.2

Edit the configuration

[bigdata@hadoop001 sqoop]$ cd conf
[bigdata@hadoop001 conf]$ ll
total 28
-rw-rw-r-- 1 bigdata bigdata 3895 Jun 3 2019 oraoop-site-template.xml
-rw-rw-r-- 1 bigdata bigdata 1404 Jun 3 2019 sqoop-env-template.cmd
-rwxr-xr-x 1 bigdata bigdata 1345 Jun 3 2019 sqoop-env-template.sh
-rw-rw-r-- 1 bigdata bigdata 6044 Jun 3 2019 sqoop-site-template.xml
-rw-rw-r-- 1 bigdata bigdata 6044 Jun 3 2019 sqoop-site.xml
[bigdata@hadoop001 conf]$ cp sqoop-env-template.sh sqoop-env.sh

Configure these three paths in sqoop-env.sh:


#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/home/bigdata/app/hadoop

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/home/bigdata/app/hadoop

#Set the path to where bin/hive is available
export HIVE_HOME=/home/bigdata/app/hive

Configure the system environment variables

[root@hadoop001 ~]# vi /etc/profile 
export SQOOP_HOME=/home/bigdata/app/sqoop
export PATH=${SQOOP_HOME}/bin:$PATH
[root@hadoop001 ~]# source /etc/profile
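
To verify the installation, you can run sqoop version (a quick sanity check; it may print warnings about HBASE_HOME and similar variables not being set, which can be ignored here):

[bigdata@hadoop001 ~]$ sqoop version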

To connect to MySQL, copy the JDBC driver jar into Sqoop's lib directory

[bigdata@hadoop001 lib]$ rz
rz waiting to receive.
Starting zmodem transfer. Press Ctrl+C to cancel.
Transferring mysql-connector-java-5.1.47.jar...
100% 983 KB 983 KB/sec 00:00:01 0 Errors

Check the command help

[bigdata@hadoop001 ~]$ sqoop help
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore

List databases

[bigdata@hadoop001 ~]$ sqoop list-databases --connect jdbc:mysql://localhost:3306 --username root --password '123456'
...
20/12/19 19:37:48 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.16.2
20/12/19 19:37:48 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/12/19 19:37:48 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Sat Dec 19 19:37:48 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
information_schema
lxl_hive
mysql
performance_schema
sys
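
As the warning above notes, putting the password on the command line is insecure; the same listing can prompt for the password interactively with -P (a minor variant of the command above):

sqoop list-databases \
--connect jdbc:mysql://localhost:3306 \
--username root \
-P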

List tables

sqoop list-tables \
--connect jdbc:mysql://localhost:3306/lxl_hive \
--username root \
--password '123456'

IMPORT

Syntax

sqoop import \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
-m 1 \
--table emp \                              [cannot be combined with -e]
--where 'SAL>2000' \
--delete-target-dir \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \     [columns to import]
--mapreduce-job-name EmpFromMySQL2HDFS \   [name of the MapReduce job]
--null-string '' \
--null-non-string 0 \
--fields-terminated-by '\t' \              [field delimiter for the imported data]
--target-dir EMP_COLUMN                    [target directory on HDFS]

  1. By default the data lands in the HDFS path /user/<username>/<table name>, here /user/bigdata/emp.

  2. By default the import produces 4 files, because 4 map tasks run in parallel; change the parallelism with -m.

    Run:
    sqoop import \
    --connect jdbc:mysql://localhost:3306/lxl \
    --username root \
    --password '123456' \
    -m 1 \
    --table emp \
    --delete-target-dir \
    --columns "EMPNO,ENAME,JOB,SAL,COMM" \
    --mapreduce-job-name EmpFromMySQL2HDFS \
    --null-string '' \
    --null-non-string 0 \
    --fields-terminated-by '\t' \
    --where 'SAL>2000' \
    --target-dir EMP_COLUMN

    [bigdata@hadoop001 lib]$ hadoop fs -text EMP_COLUMN/*
    20/12/19 23:39:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    7566 JONES MANAGER 2975.00 0
    7698 BLAKE MANAGER 2850.00 0
    7782 CLARK MANAGER 2450.00 0
    7788 SCOTT ANALYST 3000.00 0
    7839 KING PRESIDENT 5000.00 0
    7902 FORD ANALYST 3000.00 0

Common problems

  1. [No primary key]
[bigdata@hadoop001 ~]$ sqoop import \
> --connect jdbc:mysql://localhost:3306/lxl \
> --username root \
> --password '123456' \
> --table emp
20/12/19 19:48:12 ERROR tool.ImportTool: Import failed: No primary key could be found for table emp. Please specify one with --split-by or perform a sequential import with '-m 1'.
  • Fix: add -m 1, or
  • Fix: add --split-by <split column>, e.g. --split-by product_id (a sketch follows below)
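
A minimal sketch of the --split-by variant for the emp table, assuming we split on its EMPNO column (any reasonably evenly distributed numeric column will do):

sqoop import \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table emp \
--delete-target-dir \
--split-by EMPNO
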
  2. [Missing jar]
[bigdata@hadoop001 ~]$ sqoop import \
> --connect jdbc:mysql://localhost:3306/lxl \
> --username root \
> --password '123456' \
> -m 1 \
> --table emp
20/12/19 19:52:19 INFO mapreduce.ImportJobBase: Beginning import of emp
Exception in thread "main" java.lang.NoClassDefFoundError: org/json/JSONObject
  • Fix: upload java-json.jar into Sqoop's lib directory

    [bigdata@hadoop001 ~]$ cd app/sqoop/lib/
    [bigdata@hadoop001 lib]$ rz
    rz waiting to receive.
    Starting zmodem transfer. Press Ctrl+C to cancel.
    Transferring java-json.jar...
    100% 82 KB 82 KB/sec 00:00:01 0 Errors
  3. [HDFS target path already exists]
Running the import again:
20/12/19 23:33:06 ERROR tool.ImportTool: Import failed: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://hadoop001:9000/user/bigdata/emp already exists
  • Fix: add --delete-target-dir so the existing directory is deleted before the import
  4. [--table / --where cannot be combined with -e]

    --table emp \ and --where 'SAL>2000' \ cannot be used together with

    -e "SELECT * FROM emp WHERE EMPNO>=7900 AND $CONDITIONS" \

    (also make sure the table name uses the correct case)

    Cannot specify --query and --table together.
  5. [WHERE clause missing '$CONDITIONS']
    [bigdata@hadoop001 ~]$ sqoop import \
    > --connect jdbc:mysql://localhost:3306/lxl \
    > --username root \
    > --password '123456' \
    > --delete-target-dir \
    > -m 1 \
    > --mapreduce-job-name EmpFromMySQL2HDFS \
    > --null-string '' \
    > --null-non-string 0 \
    > --fields-terminated-by '\t' \
    > -e "SELECT * FROM emp WHERE EMPNO>=7900" \
    > --target-dir EMP_COLUMN_QUERY

    20/12/19 23:48:38 ERROR tool.ImportTool: Import failed: java.io.IOException: Query [SELECT * FROM emp WHERE EMPNO>=7900] must contain '$CONDITIONS' in WHERE clause.
  • Fix:

    -e "SELECT * FROM emp WHERE EMPNO>=7900 AND \$CONDITIONS"
    • If the query is double-quoted, $CONDITIONS must be escaped with a backslash (\$CONDITIONS)
    • With single quotes, no escaping is needed (the full corrected command is sketched below)
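
    Putting it together, a corrected version of the failing command above might look like this (the same options, with \$CONDITIONS added to the query):

    sqoop import \
    --connect jdbc:mysql://localhost:3306/lxl \
    --username root \
    --password '123456' \
    --delete-target-dir \
    -m 1 \
    --mapreduce-job-name EmpFromMySQL2HDFS \
    --null-string '' \
    --null-non-string 0 \
    --fields-terminated-by '\t' \
    -e "SELECT * FROM emp WHERE EMPNO>=7900 AND \$CONDITIONS" \
    --target-dir EMP_COLUMN_QUERY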

EVAL

eval runs a SQL statement directly against the database and prints the result; nothing is imported.

[bigdata@hadoop001 ~]$ sqoop eval \
> --connect jdbc:mysql://localhost:3306/lxl \
> --username root \
> --password '123456' \
> --query 'select * from product_info'
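
eval is handy for quick checks before or after an import, for example verifying a row count (a trivial sketch):

sqoop eval \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--query 'select count(*) from emp'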

The --options-file option

Prepare a file in the required format (one option or value per line):

[bigdata@hadoop001 data]$ vi emp.opt
import
--connect
jdbc:mysql://localhost:3306/lxl
--username
root
--password
'123456'
--delete-target-dir
--columns
"EMPNO,ENAME,JOB,SAL,COMM"
--mapreduce-job-name
EmpFromMySQL2HDFS
--table
emp
--target-dir
EMP_COLUMN
--split-by
empno

Running it fails:

[bigdata@hadoop001 ~]$ sqoop --options-file emp.opt
...
20/12/20 09:13:51 ERROR sqoop.Sqoop: Error while expanding arguments
java.lang.Exception: Unable to read options file: emp.opt

Fix the file permissions and run again:

[bigdata@hadoop001 data]$ chmod 744 emp.opt 
[bigdata@hadoop001 data]$ sqoop --options-file emp.opt
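
Options files may also contain comment lines starting with #, and extra arguments can still be passed on the command line together with the file, for example (a small sketch):

sqoop --options-file emp.opt -m 1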

The job command

Create a job

[bigdata@hadoop001 ~]$ sqoop job --create lxl_job -- \
> import --connect jdbc:mysql://localhost:3306/lxl \
> --username root \
> -P \
> --delete-target-dir \
> --table product_info \
> --split-by product_id
...
Enter password:

List jobs

[bigdata@hadoop001 ~]$ sqoop job --list
...
Available jobs:
lxl_job
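
To inspect the definition stored in a saved job, sqoop job also supports --show:

[bigdata@hadoop001 ~]$ sqoop job --show lxl_job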

Delete a job

[bigdata@hadoop001 ~]$ sqoop job --delete lxl_job

Execute a job

[bigdata@hadoop001 ~]$ sqoop job --exec lxl_job

RDBMS ==> Hive

  • Prepare the data
hive (test)> create table emp_column like emp;
OK
  • Import
[bigdata@hadoop001 ~]$ sqoop import \
> --connect jdbc:mysql://localhost:3306/lxl \
> --username root \
> --password '123456' \
> --delete-target-dir \
> --hive-database lxl_hive \
> --hive-import \
> --hive-overwrite \
> --hive-table emp_column \
> --columns "EMPNO,ENAME,JOB,SAL,COMM" \
> --mapreduce-job-name EmpFromMySQL2Hive \
> --table emp \
> --null-string '' \
> --null-non-string 0 \
> --m 1 \
> --fields-terminated-by '\t'

Note: --create-hive-table \ is not used in production, because it can change the column types.

For a partitioned table, also add (a full sketch follows below):

--hive-partition-key 'day' \
--hive-partition-value 'yyyyMMdd' \
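
A minimal sketch of a partitioned import, assuming a Hive table emp_partition partitioned by day; both the table name and the partition value 20201220 are made up for illustration:

sqoop import \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--delete-target-dir \
--table emp \
-m 1 \
--hive-import \
--hive-database lxl_hive \
--hive-table emp_partition \
--hive-overwrite \
--hive-partition-key 'day' \
--hive-partition-value '20201220' \
--fields-terminated-by '\t'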

  • The import fails:
20/12/20 09:44:25 ERROR util.SqlTypeMap: exist in the table. Please ensure that you've specified
20/12/20 09:44:25 ERROR util.SqlTypeMap: correct column names in Sqoop options.
20/12/20 09:44:25 ERROR tool.ImportTool: Import failed: column not found: EMPNO

Note: column names are case-sensitive

  • After fixing the case and running again, another error:
20/12/20 09:46:58 ERROR hive.HiveConfig: Could not load org.apache.hadoop.hive.conf.HiveConf. Make sure HIVE_CONF_DIR is set correctly.
20/12/20 09:46:58 ERROR tool.ImportTool: Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf

Fix:

Check whether HIVE_CONF_DIR is set:

[bigdata@hadoop001 ~]$ echo $HIVE_CONF_DIR

If the output is empty, configure it in the system environment variables:

[root@hadoop001 ~]# vi /etc/profile
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*
  • Another error on execution:
FAILED: SemanticException [Error 10072]: Database does not exist: test

Fix:

[bigdata@hadoop001 conf]$ cp /home/bigdata/app/hive/conf/hive-site.xml /home/bigdata/app/sqoop/conf
  • The import succeeds; check the result:

    hive (test)> select * from emp_column;
    OK
    emp_column.empno emp_column.ename emp_column.job emp_column.mgr emp_column.hiredate emp_column.sal emp_column.comm emp_column.deptno
    7369 SMITH CLERK 800 0 NULL NULL NULL
    7499 ALLEN SALESMAN 1600 300.00 NULL NULL NULL
    7521 WARD SALESMAN 1250 500.00 NULL NULL NULL
    7566 JONES MANAGER 2975 0 NULL NULL NULL
    7654 MARTIN SALESMAN 1250 1400.00 NULL NULL NULL
    7698 BLAKE MANAGER 2850 0 NULL NULL NULL
    7782 CLARK MANAGER 2450 0 NULL NULL NULL
    7788 SCOTT ANALYST 3000 0 NULL NULL NULL
    7839 KING PRESIDENT 5000 0 NULL NULL NULL
    7844 TURNER SALESMAN 1500 0.00 NULL NULL NULL
    7876 ADAMS CLERK 1100 0 NULL NULL NULL
    7900 JAMES CLERK 950 0 NULL NULL NULL
    7902 FORD ANALYST 3000 0 NULL NULL NULL
    7934 MILLER CLERK 1300 0 NULL NULL NULL
    Time taken: 0.048 seconds, Fetched: 14 row(s)

Hive ==> RDBMS

Prepare the data

mysql> select  * from emp3;
Empty set (0.00 sec)

The export fails:

sqoop export \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table emp3 \
--export-dir /user/bigdata/emp

20/12/20 10:19:51 ERROR tool.ExportTool: Error during export:
Export job failed!

Fix: the export must specify the field delimiter used in the HDFS files:

sqoop export \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table emp3 \
--fields-terminated-by '\t' \
--export-dir /user/bigdata/emp
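
If the exported HDFS data contains Hive's NULL marker (\N), the export may also need the null-handling options; a hedged variant of the command above:

sqoop export \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table emp3 \
--fields-terminated-by '\t' \
--input-null-string '\\N' \
--input-null-non-string '\\N' \
--export-dir /user/bigdata/emp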

Check the result:

mysql> select * from emp3;
+-------+--------+-----------+------+---------------------+---------+---------+--------+
| empno | ename | job | mgr | hiredate | sal | comm | deptno |
+-------+--------+-----------+------+---------------------+---------+---------+--------+
| 7900 | JAMES | CLERK | 7698 | 1981-12-03 00:00:00 | 950.00 | 0.00 | 30 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-03 00:00:00 | 3000.00 | 0.00 | 20 |
| 7934 | MILLER | CLERK | 7782 | 1982-01-23 00:00:00 | 1300.00 | 0.00 | 10 |
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 00:00:00 | 800.00 | 0.00 | 20 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 00:00:00 | 1600.00 | 300.00 | 30 |
| 7521 | WARD | SALESMAN | 7698 | 1981-02-22 00:00:00 | 1250.00 | 500.00 | 30 |
| 7566 | JONES | MANAGER | 7839 | 1981-04-02 00:00:00 | 2975.00 | 0.00 | 20 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 00:00:00 | 1250.00 | 1400.00 | 30 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 00:00:00 | 2850.00 | 0.00 | 30 |
| 7782 | CLARK | MANAGER | 7839 | 1981-06-09 00:00:00 | 2450.00 | 0.00 | 10 |
| 7788 | SCOTT | ANALYST | 7566 | 1982-12-09 00:00:00 | 3000.00 | 0.00 | 20 |
| 7839 | KING | PRESIDENT | 0 | 1981-11-17 00:00:00 | 5000.00 | 0.00 | 10 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 00:00:00 | 1500.00 | 0.00 | 30 |
| 7876 | ADAMS | CLERK | 7788 | 1983-01-12 00:00:00 | 1100.00 | 0.00 | 20 |
+-------+--------+-----------+------+---------------------+---------+---------+--------+
14 rows in set (0.00 sec)

The TopN problem

City table:     MySQL  city_info.sql
Product table:  MySQL  product_info.sql
User click log: Hive   user_click.txt

  1. Prepare the data

    [bigdata@hadoop001 data]$ rz
    rz waiting to receive.
    Starting zmodem transfer. Press Ctrl+C to cancel.
    Transferring user_click.txt...
    100% 674 KB 674 KB/sec 00:00:01 0 Errors

    Run the two .sql files in MySQL.

    Create three tables in Hive:

    create table city_info(
    city_id int,
    city_name string,
    area string
    )
    row format delimited fields terminated by '\t';


    create table product_info(
    product_id int,
    product_name string,
    extend_info string
    )
    row format delimited fields terminated by '\t';

    create table user_click(
    user_id string,
    session_id string,
    time string,
    city_id int,
    product_id int
    ) partitioned by(day string)
    row format delimited fields terminated by ',';
  2. Load the txt file into Hive

    load data local inpath '/home/bigdata/data/user_click.txt'	overwrite into table user_click partition(day='2016-05-05');
  3. Import the MySQL data into Hive

    sqoop import --connect jdbc:mysql://localhost:3306/lxl \
    --username root --password '123456' \
    --table product_info \
    --delete-target-dir \
    --hive-import --hive-database test --hive-table product_info --hive-overwrite \
    --fields-terminated-by '\t' \
    --split-by product_id
    sqoop import --connect jdbc:mysql://localhost:3306/lxl \
    --username root --password '123456' \
    --table city_info \
    --delete-target-dir \
    --hive-import --hive-database test \
    --hive-table city_info --hive-overwrite \
    --fields-terminated-by '\t' \
    --split-by city_id

    Error:

    20/12/20 11:49:30 INFO mapreduce.Job: Task Id : attempt_1606543264596_0067_m_000003_1, Status : FAILED
    Error: java.io.IOException: Cannot run program "mysqldump": error=2, No such file or directory

    Fix

    • Find mysqldump:

      [root@hadoop001 ~]# find / -name mysqldump
      /usr/local/mysql/bin/mysqldump
    • Copy it into sqoop/bin on the Sqoop node:

      [bigdata@hadoop001 ~]$ cp /usr/local/mysql/bin/mysqldump /home/bigdata/app/sqoop/bin
    • Give mysqldump the execute (x) permission:
