Sqoop

this is about Sqoop

Posted by PsycheLee on 2019-09-08

Sqoop

Overview

Sqoop builds a bridge between relational databases and Hadoop,
letting us import and export data conveniently.
Under the hood, Sqoop runs MapReduce jobs.

RDBMS ==> Hadoop (HDFS/Hive/HBase…), e.g. importing the emp table;
doing the same with hand-written MR means writing the code yourself.

Hadoop (HDFS/Hive/HBase…) ==> RDBMS

Versions

1.4.7  Sqoop1
1.99.7 Sqoop2

Sqoop1 and Sqoop2 are unrelated and incompatible with each other
(much like struts1 vs struts2).

CDH5 release used here: sqoop-1.4.6-cdh5.16.2.tar.gz

Note that hive-1.1.0-cdh5.16.2 is not the same as Apache Hive 1.1.0.

Installation

[bigdata@hadoop001 software]$ ll
total 707564
...
-rw-r--r-- 1 bigdata bigdata 30867985 Dec 18 22:51 sqoop-1.4.6-cdh5.16.2.tar.gz
[bigdata@hadoop001 software]$ tar -xzvf sqoop-1.4.6-cdh5.16.2.tar.gz -C /home/bigdata/app
[bigdata@hadoop001 app]$ ll
total 16
lrwxrwxrwx 1 bigdata bigdata 23 Nov 28 10:28 hadoop -> hadoop-2.6.0-cdh5.16.2/
drwxr-xr-x 18 bigdata bigdata 4096 Nov 28 13:30 hadoop-2.6.0-cdh5.16.2
lrwxrwxrwx 1 bigdata bigdata 20 Dec 6 17:22 hive -> hive-1.1.0-cdh5.16.2
drwxr-xr-x 12 bigdata bigdata 4096 Dec 13 23:09 hive-1.1.0-cdh5.16.2
drwxrwxr-x 9 bigdata bigdata 4096 Dec 19 18:22 rundeck
lrwxrwxrwx 1 bigdata bigdata 21 Dec 19 19:03 sqoop -> sqoop-1.4.6-cdh5.16.2
drwxr-xr-x 10 bigdata bigdata 4096 Jun 3 2019 sqoop-1.4.6-cdh5.16.2

Edit the configuration

[bigdata@hadoop001 sqoop]$ cd conf
[bigdata@hadoop001 conf]$ ll
total 28
-rw-rw-r-- 1 bigdata bigdata 3895 Jun 3 2019 oraoop-site-template.xml
-rw-rw-r-- 1 bigdata bigdata 1404 Jun 3 2019 sqoop-env-template.cmd
-rwxr-xr-x 1 bigdata bigdata 1345 Jun 3 2019 sqoop-env-template.sh
-rw-rw-r-- 1 bigdata bigdata 6044 Jun 3 2019 sqoop-site-template.xml
-rw-rw-r-- 1 bigdata bigdata 6044 Jun 3 2019 sqoop-site.xml
[bigdata@hadoop001 conf]$ cp sqoop-env-template.sh sqoop-env.sh

Configure these three paths in sqoop-env.sh:


#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/home/bigdata/app/hadoop

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/home/bigdata/app/hadoop

#Set the path to where bin/hive is available
export HIVE_HOME=/home/bigdata/app/hive

Configure the system environment variables

[root@hadoop001 ~]# vi /etc/profile 
export SQOOP_HOME=/home/bigdata/app/sqoop
export PATH=${SQOOP_HOME}/bin:$PATH
[root@hadoop001 ~]# source /etc/profile
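
To verify the installation, you can run sqoop version (a quick sanity check; it may print warnings about HBASE_HOME and similar variables not being set, which can be ignored here):

[bigdata@hadoop001 ~]$ sqoop version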

To connect to MySQL, copy the JDBC driver jar into Sqoop's lib directory

[bigdata@hadoop001 lib]$ rz
rz waiting to receive.
Starting zmodem transfer. Press Ctrl+C to cancel.
Transferring mysql-connector-java-5.1.47.jar...
100% 983 KB 983 KB/sec 00:00:01 0 Errors

Check the command help

[bigdata@hadoop001 ~]$ sqoop help
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore

List databases

[bigdata@hadoop001 ~]$ sqoop list-databases --connect jdbc:mysql://localhost:3306 --username root --password '123456'
...
20/12/19 19:37:48 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.16.2
20/12/19 19:37:48 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
20/12/19 19:37:48 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Sat Dec 19 19:37:48 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
information_schema
lxl_hive
mysql
performance_schema
sys
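
As the warning above notes, putting the password on the command line is insecure; the same listing can prompt for the password interactively with -P (a minor variant of the command above):

sqoop list-databases \
--connect jdbc:mysql://localhost:3306 \
--username root \
-P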

List tables

sqoop list-tables \
--connect jdbc:mysql://localhost:3306/lxl_hive \
--username root \
--password '123456'

IMPORT

Syntax

sqoop import \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
-m 1 \
--table emp \                              [cannot be combined with -e]
--where 'SAL>2000' \
--delete-target-dir \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \     [columns to import]
--mapreduce-job-name EmpFromMySQL2HDFS \   [name of the MapReduce job]
--null-string '' \
--null-non-string 0 \
--fields-terminated-by '\t' \              [field delimiter for the imported data]
--target-dir EMP_COLUMN                    [target directory on HDFS]

  1. By default the data lands in the HDFS path /user/<username>/<table name>, here /user/bigdata/emp.

  2. By default the import produces 4 files, because 4 map tasks run in parallel; change the parallelism with -m.

    Run:
    sqoop import \
    --connect jdbc:mysql://localhost:3306/lxl \
    --username root \
    --password '123456' \
    -m 1 \
    --table emp \
    --delete-target-dir \
    --columns "EMPNO,ENAME,JOB,SAL,COMM" \
    --mapreduce-job-name EmpFromMySQL2HDFS \
    --null-string '' \
    --null-non-string 0 \
    --fields-terminated-by '\t' \
    --where 'SAL>2000' \
    --target-dir EMP_COLUMN

    [bigdata@hadoop001 lib]$ hadoop fs -text EMP_COLUMN/*
    20/12/19 23:39:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    7566 JONES MANAGER 2975.00 0
    7698 BLAKE MANAGER 2850.00 0
    7782 CLARK MANAGER 2450.00 0
    7788 SCOTT ANALYST 3000.00 0
    7839 KING PRESIDENT 5000.00 0
    7902 FORD ANALYST 3000.00 0

Common problems

  1. [No primary key]
[bigdata@hadoop001 ~]$ sqoop import \
> --connect jdbc:mysql://localhost:3306/lxl \
> --username root \
> --password '123456' \
> --table emp
20/12/19 19:48:12 ERROR tool.ImportTool: Import failed: No primary key could be found for table emp. Please specify one with --split-by or perform a sequential import with '-m 1'.
  • Fix: add -m 1, or
  • Fix: add --split-by <split column>, e.g. --split-by product_id (a sketch follows below)
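
A minimal sketch of the --split-by variant for the emp table, assuming we split on its EMPNO column (any reasonably evenly distributed numeric column will do):

sqoop import \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table emp \
--delete-target-dir \
--split-by EMPNO
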
  2. [Missing jar]
[bigdata@hadoop001 ~]$ sqoop import \
> --connect jdbc:mysql://localhost:3306/lxl \
> --username root \
> --password '123456' \
> -m 1 \
> --table emp
20/12/19 19:52:19 INFO mapreduce.ImportJobBase: Beginning import of emp
Exception in thread "main" java.lang.NoClassDefFoundError: org/json/JSONObject
  • Fix: upload java-json.jar into Sqoop's lib directory

    [bigdata@hadoop001 ~]$ cd app/sqoop/lib/
    [bigdata@hadoop001 lib]$ rz
    rz waiting to receive.
    Starting zmodem transfer. Press Ctrl+C to cancel.
    Transferring java-json.jar...
    100% 82 KB 82 KB/sec 00:00:01 0 Errors
  3. [HDFS target path already exists]
Running the import again:
20/12/19 23:33:06 ERROR tool.ImportTool: Import failed: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://hadoop001:9000/user/bigdata/emp already exists
  • Fix: add --delete-target-dir so the existing directory is deleted before the import
  4. [--table / --where cannot be combined with -e]

    --table emp \ and --where 'SAL>2000' \ cannot be used together with

    -e "SELECT * FROM emp WHERE EMPNO>=7900 AND $CONDITIONS" \

    (also make sure the table name uses the correct case)

    Cannot specify --query and --table together.
  5. [WHERE clause missing '$CONDITIONS']
    [bigdata@hadoop001 ~]$ sqoop import \
    > --connect jdbc:mysql://localhost:3306/lxl \
    > --username root \
    > --password '123456' \
    > --delete-target-dir \
    > -m 1 \
    > --mapreduce-job-name EmpFromMySQL2HDFS \
    > --null-string '' \
    > --null-non-string 0 \
    > --fields-terminated-by '\t' \
    > -e "SELECT * FROM emp WHERE EMPNO>=7900" \
    > --target-dir EMP_COLUMN_QUERY

    20/12/19 23:48:38 ERROR tool.ImportTool: Import failed: java.io.IOException: Query [SELECT * FROM emp WHERE EMPNO>=7900] must contain '$CONDITIONS' in WHERE clause.
  • Fix:

    -e "SELECT * FROM emp WHERE EMPNO>=7900 AND \$CONDITIONS"
    • If the query is double-quoted, $CONDITIONS must be escaped with a backslash (\$CONDITIONS)
    • With single quotes, no escaping is needed (the full corrected command is sketched below)
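
    Putting it together, a corrected version of the failing command above might look like this (the same options, with \$CONDITIONS added to the query):

    sqoop import \
    --connect jdbc:mysql://localhost:3306/lxl \
    --username root \
    --password '123456' \
    --delete-target-dir \
    -m 1 \
    --mapreduce-job-name EmpFromMySQL2HDFS \
    --null-string '' \
    --null-non-string 0 \
    --fields-terminated-by '\t' \
    -e "SELECT * FROM emp WHERE EMPNO>=7900 AND \$CONDITIONS" \
    --target-dir EMP_COLUMN_QUERY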

EVAL

eval runs a SQL statement directly against the database and prints the result; nothing is imported.

[bigdata@hadoop001 ~]$ sqoop eval \
> --connect jdbc:mysql://localhost:3306/lxl \
> --username root \
> --password '123456' \
> --query 'select * from product_info'
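
eval is handy for quick checks before or after an import, for example verifying a row count (a trivial sketch):

sqoop eval \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--query 'select count(*) from emp'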

The --options-file option

Prepare a file in the required format (one option or value per line):

[bigdata@hadoop001 data]$ vi emp.opt
import
--connect
jdbc:mysql://localhost:3306/lxl
--username
root
--password
'123456'
--delete-target-dir
--columns
"EMPNO,ENAME,JOB,SAL,COMM"
--mapreduce-job-name
EmpFromMySQL2HDFS
--table
emp
--target-dir
EMP_COLUMN
--split-by
empno

Running it fails:

[bigdata@hadoop001 ~]$ sqoop --options-file emp.opt
...
20/12/20 09:13:51 ERROR sqoop.Sqoop: Error while expanding arguments
java.lang.Exception: Unable to read options file: emp.opt

Fix the file permissions and run again:

[bigdata@hadoop001 data]$ chmod 744 emp.opt 
[bigdata@hadoop001 data]$ sqoop --options-file emp.opt
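
Options files may also contain comment lines starting with #, and extra arguments can still be passed on the command line together with the file, for example (a small sketch):

sqoop --options-file emp.opt -m 1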

The job command

Create a job

[bigdata@hadoop001 ~]$ sqoop job --create lxl_job -- \
> import --connect jdbc:mysql://localhost:3306/lxl \
> --username root \
> -P \
> --delete-target-dir \
> --table product_info \
> --split-by product_id
...
Enter password:

List jobs

[bigdata@hadoop001 ~]$ sqoop job --list
...
Available jobs:
lxl_job
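
To inspect the definition stored in a saved job, sqoop job also supports --show:

[bigdata@hadoop001 ~]$ sqoop job --show lxl_job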

Delete a job

[bigdata@hadoop001 ~]$ sqoop job --delete lxl_job

Execute a job

[bigdata@hadoop001 ~]$ sqoop job --exec lxl_job

RDBMS ==> Hive

  • Prepare the data
hive (test)> create table emp_column like emp;
OK
  • Import
[bigdata@hadoop001 ~]$ sqoop import \
> --connect jdbc:mysql://localhost:3306/lxl \
> --username root \
> --password '123456' \
> --delete-target-dir \
> --hive-database lxl_hive \
> --hive-import \
> --hive-overwrite \
> --hive-table emp_column \
> --columns "EMPNO,ENAME,JOB,SAL,COMM" \
> --mapreduce-job-name EmpFromMySQL2Hive \
> --table emp \
> --null-string '' \
> --null-non-string 0 \
> --m 1 \
> --fields-terminated-by '\t'

Note: --create-hive-table \ is not used in production, because it can change the column types.

For a partitioned table, also add (a full sketch follows below):

--hive-partition-key 'day' \
--hive-partition-value 'yyyyMMdd' \
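
A minimal sketch of a partitioned import, assuming a Hive table emp_partition partitioned by day; both the table name and the partition value 20201220 are made up for illustration:

sqoop import \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--delete-target-dir \
--table emp \
-m 1 \
--hive-import \
--hive-database lxl_hive \
--hive-table emp_partition \
--hive-overwrite \
--hive-partition-key 'day' \
--hive-partition-value '20201220' \
--fields-terminated-by '\t'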

  • The import fails:
20/12/20 09:44:25 ERROR util.SqlTypeMap: exist in the table. Please ensure that you've specified
20/12/20 09:44:25 ERROR util.SqlTypeMap: correct column names in Sqoop options.
20/12/20 09:44:25 ERROR tool.ImportTool: Import failed: column not found: EMPNO

Note: column names are case-sensitive

  • After fixing the case and running again, another error:
20/12/20 09:46:58 ERROR hive.HiveConfig: Could not load org.apache.hadoop.hive.conf.HiveConf. Make sure HIVE_CONF_DIR is set correctly.
20/12/20 09:46:58 ERROR tool.ImportTool: Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf

Fix:

Check whether HIVE_CONF_DIR is set:

[bigdata@hadoop001 ~]$ echo $HIVE_CONF_DIR

If the output is empty, configure it in the system environment variables:

[root@hadoop001 ~]# vi /etc/profile
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*
  • Another error on execution:
FAILED: SemanticException [Error 10072]: Database does not exist: test

Fix:

[bigdata@hadoop001 conf]$ cp /home/bigdata/app/hive/conf/hive-site.xml /home/bigdata/app/sqoop/conf
  • The import succeeds; check the result:

    hive (test)> select * from emp_column;
    OK
    emp_column.empno emp_column.ename emp_column.job emp_column.mgr emp_column.hiredate emp_column.sal emp_column.comm emp_column.deptno
    7369 SMITH CLERK 800 0 NULL NULL NULL
    7499 ALLEN SALESMAN 1600 300.00 NULL NULL NULL
    7521 WARD SALESMAN 1250 500.00 NULL NULL NULL
    7566 JONES MANAGER 2975 0 NULL NULL NULL
    7654 MARTIN SALESMAN 1250 1400.00 NULL NULL NULL
    7698 BLAKE MANAGER 2850 0 NULL NULL NULL
    7782 CLARK MANAGER 2450 0 NULL NULL NULL
    7788 SCOTT ANALYST 3000 0 NULL NULL NULL
    7839 KING PRESIDENT 5000 0 NULL NULL NULL
    7844 TURNER SALESMAN 1500 0.00 NULL NULL NULL
    7876 ADAMS CLERK 1100 0 NULL NULL NULL
    7900 JAMES CLERK 950 0 NULL NULL NULL
    7902 FORD ANALYST 3000 0 NULL NULL NULL
    7934 MILLER CLERK 1300 0 NULL NULL NULL
    Time taken: 0.048 seconds, Fetched: 14 row(s)

Hive ==> RDBMS

Prepare the data

mysql> select  * from emp3;
Empty set (0.00 sec)

The export fails:

sqoop export \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table emp3 \
--export-dir /user/bigdata/emp

20/12/20 10:19:51 ERROR tool.ExportTool: Error during export:
Export job failed!

Fix: the export must specify the field delimiter used in the HDFS files:

sqoop export \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table emp3 \
--fields-terminated-by '\t' \
--export-dir /user/bigdata/emp
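
If the exported HDFS data contains Hive's NULL marker (\N), the export may also need the null-handling options; a hedged variant of the command above:

sqoop export \
--connect jdbc:mysql://localhost:3306/lxl \
--username root \
--password '123456' \
--table emp3 \
--fields-terminated-by '\t' \
--input-null-string '\\N' \
--input-null-non-string '\\N' \
--export-dir /user/bigdata/emp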

Check the result:

mysql> select * from emp3;
+-------+--------+-----------+------+---------------------+---------+---------+--------+
| empno | ename | job | mgr | hiredate | sal | comm | deptno |
+-------+--------+-----------+------+---------------------+---------+---------+--------+
| 7900 | JAMES | CLERK | 7698 | 1981-12-03 00:00:00 | 950.00 | 0.00 | 30 |
| 7902 | FORD | ANALYST | 7566 | 1981-12-03 00:00:00 | 3000.00 | 0.00 | 20 |
| 7934 | MILLER | CLERK | 7782 | 1982-01-23 00:00:00 | 1300.00 | 0.00 | 10 |
| 7369 | SMITH | CLERK | 7902 | 1980-12-17 00:00:00 | 800.00 | 0.00 | 20 |
| 7499 | ALLEN | SALESMAN | 7698 | 1981-02-20 00:00:00 | 1600.00 | 300.00 | 30 |
| 7521 | WARD | SALESMAN | 7698 | 1981-02-22 00:00:00 | 1250.00 | 500.00 | 30 |
| 7566 | JONES | MANAGER | 7839 | 1981-04-02 00:00:00 | 2975.00 | 0.00 | 20 |
| 7654 | MARTIN | SALESMAN | 7698 | 1981-09-28 00:00:00 | 1250.00 | 1400.00 | 30 |
| 7698 | BLAKE | MANAGER | 7839 | 1981-05-01 00:00:00 | 2850.00 | 0.00 | 30 |
| 7782 | CLARK | MANAGER | 7839 | 1981-06-09 00:00:00 | 2450.00 | 0.00 | 10 |
| 7788 | SCOTT | ANALYST | 7566 | 1982-12-09 00:00:00 | 3000.00 | 0.00 | 20 |
| 7839 | KING | PRESIDENT | 0 | 1981-11-17 00:00:00 | 5000.00 | 0.00 | 10 |
| 7844 | TURNER | SALESMAN | 7698 | 1981-09-08 00:00:00 | 1500.00 | 0.00 | 30 |
| 7876 | ADAMS | CLERK | 7788 | 1983-01-12 00:00:00 | 1100.00 | 0.00 | 20 |
+-------+--------+-----------+------+---------------------+---------+---------+--------+
14 rows in set (0.00 sec)

The TopN problem

City table:     MySQL  city_info.sql
Product table:  MySQL  product_info.sql
User click log: Hive   user_click.txt

  1. Prepare the data

    [bigdata@hadoop001 data]$ rz
    rz waiting to receive.
    Starting zmodem transfer. Press Ctrl+C to cancel.
    Transferring user_click.txt...
    100% 674 KB 674 KB/sec 00:00:01 0 Errors

    Run the two .sql files in MySQL.

    Create three tables in Hive:

    create table city_info(
    city_id int,
    city_name string,
    area string
    )
    row format delimited fields terminated by '\t';


    create table product_info(
    product_id int,
    product_name string,
    extend_info string
    )
    row format delimited fields terminated by '\t';

    create table user_click(
    user_id string,
    session_id string,
    time string,
    city_id int,
    product_id int
    ) partitioned by(day string)
    row format delimited fields terminated by ',';
  2. Load the txt file into Hive

    load data local inpath '/home/bigdata/data/user_click.txt'	overwrite into table user_click partition(day='2016-05-05');
  3. Import the MySQL data into Hive

    sqoop import --connect jdbc:mysql://localhost:3306/lxl \
    --username root --password '123456' \
    --table product_info \
    --delete-target-dir \
    --hive-import --hive-database test --hive-table product_info --hive-overwrite \
    --fields-terminated-by '\t' \
    --split-by product_id
    sqoop import --connect jdbc:mysql://localhost:3306/lxl \
    --username root --password '123456' \
    --table city_info \
    --delete-target-dir \
    --hive-import --hive-database test \
    --hive-table city_info --hive-overwrite \
    --fields-terminated-by '\t' \
    --split-by city_id

    Error:

    20/12/20 11:49:30 INFO mapreduce.Job: Task Id : attempt_1606543264596_0067_m_000003_1, Status : FAILED
    Error: java.io.IOException: Cannot run program "mysqldump": error=2, No such file or directory

    Fix

    • Find mysqldump:

      [root@hadoop001 ~]# find / -name mysqldump
      /usr/local/mysql/bin/mysqldump
    • Copy it into sqoop/bin on the Sqoop node:

      [bigdata@hadoop001 ~]$ cp /usr/local/mysql/bin/mysqldump /home/bigdata/app/sqoop/bin
    • Give mysqldump the execute (x) permission:
