Apache Sqoop Tutorial

4 November 2017

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. This blog post will teach you basic tasks that you can perform with Sqoop.

Check Sqoop Version

hduser@localhost:~$ sqoop version

17/11/04 16:35:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Sqoop 1.4.6

Connect Sqoop to a MySQL database

Copy the database connector jar (mysql-connector-java-5.1.44.jar) into the $SQOOP_HOME/lib directory.

Ref: http://mvnrepository.com/artifact/mysql/mysql-connector-java

In my case, I copied the connector jar to $SQOOP_HOME/lib/:

cp /home/hduser/mysql-connector-java-5.1.44.jar /usr/local/sqoop/lib

List the databases using Sqoop

sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password admin

List the tables of a particular database using Sqoop

sqoop list-tables --connect jdbc:mysql://localhost:3306/employees --username root --password admin

Import a MySQL table into an HDFS directory

The employees database consists of several tables:

mysql> use employees;
Database changed

mysql> show tables;
+----------------------+
| Tables_in_employees  |
+----------------------+
| current_dept_emp     |
| departments          |
| dept_emp             |
| dept_emp_latest_date |
| dept_manager         |
| employees            |
| salaries             |
| titles               |
+----------------------+
8 rows in set (0.00 sec)

Import the employees table from the employees database:

sqoop import --connect jdbc:mysql://localhost:3306/employees --username root --password admin --table employees --target-dir /database/employees/employees

Once the import is completed, you can verify it in the HDFS web UI or with the hadoop fs command.

Using the GUI

(Screenshot: the imported employees files listed in the HDFS web UI.)

Using the Hadoop file system command

hduser@localhost:~$ hadoop fs -ls /database/employees/employees/
17/11/04 23:22:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 5 items
-rw-r--r--   1 hduser supergroup          0 2017-11-04 23:09 /database/employees/employees/_SUCCESS
-rw-r--r--   1 hduser supergroup    4548041 2017-11-04 23:09 /database/employees/employees/part-m-00000
-rw-r--r--   1 hduser supergroup    2550561 2017-11-04 23:09 /database/employees/employees/part-m-00001
-rw-r--r--   1 hduser supergroup    2086360 2017-11-04 23:09 /database/employees/employees/part-m-00002
-rw-r--r--   1 hduser supergroup    4637031 2017-11-04 23:09 /database/employees/employees/part-m-00003

Read the first 10 lines of the file to check the records:

hduser@localhost:~$ hadoop fs -cat /database/employees/employees/part-m-00000 | head -n 10


17/11/04 23:27:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10001,1953-09-02,Georgi,Facello,M,1986-06-26
10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
10003,1959-12-03,Parto,Bamford,M,1986-08-28
10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12
10006,1953-04-20,Anneke,Preusig,F,1989-06-02
10007,1957-05-23,Tzvetan,Zielinski,F,1989-02-10
10008,1958-02-19,Saniya,Kalloufi,M,1994-09-15
10009,1952-04-19,Sumant,Peac,F,1985-02-18
10010,1963-06-01,Duangkaew,Piveteau,F,1989-08-24
cat: Unable to write to output stream.
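The imported files are plain CSV, one line per row of the source table. A minimal Python sketch (using one of the sample rows above; the field names are assumed from the employees table schema) shows how the columns map:

```python
import csv
from io import StringIO

# One sample row from the imported part-m-00000 file above
sample = "10001,1953-09-02,Georgi,Facello,M,1986-06-26"

# Column names assumed from the employees table schema
fields = ["emp_no", "birth_date", "first_name", "last_name", "gender", "hire_date"]

# Parse the CSV line and pair each value with its column name
row = next(csv.reader(StringIO(sample)))
record = dict(zip(fields, row))
print(record["first_name"], record["hire_date"])  # → Georgi 1986-06-26
```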

Controlling Parallelism in Sqoop

Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument.

sqoop import --connect jdbc:mysql://localhost:3306/employees --username root --password admin --table salaries -m 2 --target-dir /database/employees/salaries
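Under the hood, Sqoop first runs a boundary query against the split column (by default the primary key) to find its minimum and maximum, then carves that range into one slice per mapper, each slice becoming a WHERE clause for one map task. A rough Python sketch of this splitting for an integer column (split_ranges is a hypothetical helper; Sqoop's actual splitter logic differs in detail):

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide [min_val, max_val] into num_mappers contiguous,
    non-overlapping ranges, one per map task."""
    total = max_val - min_val + 1
    ranges = []
    for i in range(num_mappers):
        lo = min_val + (total * i) // num_mappers
        hi = min_val + (total * (i + 1)) // num_mappers - 1
        ranges.append((lo, hi))
    return ranges

# emp_no in the sample employees database runs from 10001 to 499999
print(split_ranges(10001, 499999, 2))
```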

Below are the MapReduce counters and other information from running this job, showing that 2 map tasks were used:

17/11/04 23:19:11 INFO mapreduce.Job: Running job: job_1509849001630_0002
17/11/04 23:19:17 INFO mapreduce.Job: Job job_1509849001630_0002 running in uber mode : false
17/11/04 23:19:17 INFO mapreduce.Job:  map 0% reduce 0%
17/11/04 23:19:27 INFO mapreduce.Job:  map 50% reduce 0%
17/11/04 23:19:28 INFO mapreduce.Job:  map 100% reduce 0%
17/11/04 23:19:28 INFO mapreduce.Job: Job job_1509849001630_0002 completed successfully
17/11/04 23:19:28 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=311846
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=232
                HDFS: Number of bytes written=98781181
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Job Counters
                Launched map tasks=2
                Other local map tasks=2
                Total time spent by all maps in occupied slots (ms)=15309
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=15309
                Total vcore-milliseconds taken by all map tasks=15309
                Total megabyte-milliseconds taken by all map tasks=15676416
        Map-Reduce Framework
                Map input records=2844047
                Map output records=2844047
                Input split bytes=232
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=452
                CPU time spent (ms)=13760
                Physical memory (bytes) snapshot=402960384
                Virtual memory (bytes) snapshot=3858165760
                Total committed heap usage (bytes)=310378496
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=98781181
17/11/04 23:19:28 INFO mapreduce.ImportJobBase: Transferred 94.2051 MB in 19.3542 seconds (4.8674 MB/sec)
17/11/04 23:19:28 INFO mapreduce.ImportJobBase: Retrieved 2844047 records.
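The transfer rate in the last log line is simply the bytes written to HDFS divided by the elapsed time; the figures above can be checked with a couple of lines of Python:

```python
bytes_written = 98781181   # "HDFS: Number of bytes written" from the counters above
seconds = 19.3542          # elapsed time reported by ImportJobBase

mb = bytes_written / (1024 * 1024)   # convert bytes to MB
rate = mb / seconds                  # throughput in MB/sec
print(f"{mb:.4f} MB at {rate:.4f} MB/sec")  # → 94.2051 MB at 4.8674 MB/sec
```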

To verify the output, first check the files in HDFS:

hduser@localhost:~$ hadoop fs -ls /database/employees/salaries
17/11/04 23:39:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 hduser supergroup          0 2017-11-04 23:19 /database/employees/salaries/_SUCCESS
-rw-r--r--   1 hduser supergroup   50698979 2017-11-04 23:19 /database/employees/salaries/part-m-00000
-rw-r--r--   1 hduser supergroup   48082202 2017-11-04 23:19 /database/employees/salaries/part-m-00001

Once the files are verified in HDFS, we can use the following command to check sample records:

hduser@localhost:~$ hadoop fs -cat /database/employees/salaries/part-m-00000 | head -n 10

17/11/04 23:40:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10001,60117,1986-06-26,1987-06-26
10001,62102,1987-06-26,1988-06-25
10001,66074,1988-06-25,1989-06-25
10001,66596,1989-06-25,1990-06-25
10001,66961,1990-06-25,1991-06-25
10001,71046,1991-06-25,1992-06-24
10001,74333,1992-06-24,1993-06-24
10001,75286,1993-06-24,1994-06-24
10001,75994,1994-06-24,1995-06-24
10001,76884,1995-06-24,1996-06-23
cat: Unable to write to output stream.

As we can see, the data from the MySQL salaries table was imported into HDFS.


Importing RDBMS table in Hadoop using Sqoop when the table doesn’t have a primary key column

Usually, we import an RDBMS table into Hadoop using Sqoop Import when it has a primary key column. If the table doesn't have a primary key column, Sqoop gives the error below:

ERROR tool.ImportTool: Error during import: No primary key could be found for table <table_name>. Please specify one with --split-by or perform a sequential import with '-m 1'

If your table doesn't have a primary key column, you need to specify the -m 1 option when importing the data, or you have to provide the --split-by argument with some column name.

Here are the scripts you can use to import an RDBMS table into Hadoop using Sqoop when you don't have a primary key column.

sqoop import \
--connect jdbc:mysql://localhost/dbname \
--username root \
--password root \
--table user \
--target-dir /user/root/user_data \
--columns "first_name, last_name, created_date" \
-m 1

or

sqoop import \
--connect jdbc:mysql://localhost/dbname \
--username root \
--password root \
--table user \
--target-dir /user/root/user_data \
--columns "first_name, last_name, created_date" \
--split-by created_date

References

[1] Employees sample database

[2] Apache Sqoop
