Sqoop
1. Introduction:
Sqoop is a tool designed to transfer data between Hadoop and relational databases.
You can use Sqoop to
-import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS),
-transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
2. Uses:
With Sqoop, you can import data from a relational database system into HDFS. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table. The import process is performed in parallel. For this reason, the output will be in multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.
After manipulating the imported records (for example, with MapReduce or Hive) you may have a result data set which you can then export back to the relational database. Sqoop’s export process will read a set of delimited text files from HDFS in parallel, parse them into records, and insert them as new rows in a target database table, for consumption by external applications or users.
Sqoop includes some other commands which allow you to inspect the database you are working with. For example, you can
-list the available database schemas (with the sqoop-list-databases tool) and tables within a schema (with the sqoop-list-tables tool).
To list databases: sqoop-list-databases --connect jdbc:mysql://localhost/ --username root --password 123
To list tables: sqoop-list-tables --connect jdbc:mysql://localhost/employees --username root --password 123
-Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).
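For example, a quick sanity check of a source table with sqoop-eval (the database, credentials, and query below are illustrative):
sqoop-eval --connect jdbc:mysql://localhost/employees --username root --password 123 --query "SELECT * FROM employees LIMIT 10"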
Most aspects of the import, code generation, and export processes can be customized.
You can control the specific row range or columns imported.
You can specify particular delimiters and escape characters for the file-based representation of the data, as well as the file format used.
You can also control the class or package names used in generated code. Subsequent sections of this document explain how to specify these and other arguments to Sqoop.
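For example, a minimal sketch that places the generated record class in a package of your choosing (the package name is illustrative):
sqoop import --connect jdbc:mysql://localhost/employees --username root --password 123 --table employees --package-name com.example.sqoop.records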
3. Sqoop ships with a help tool.
sqoop help
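You can also display the usage of a specific tool by naming it:
sqoop help import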
4. The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line) or in binary representation as Avro or SequenceFiles.
5. It is important that you do not use the URL localhost if you intend to use Sqoop with a distributed Hadoop cluster, because the connect string is used by every node that runs an import task. Use the full hostname or IP address of the database server instead.
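For example, on a cluster you would point the connect string at the database server by name (the hostname below is illustrative):
sqoop import --connect jdbc:mysql://database.example.com/employees --table employees --username root --password 123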
6. Connecting to a MySQL database:
sqoop import --table employees --connect jdbc:mysql://localhost/employees --username root --password 123 --driver com.mysql.jdbc.Driver
7. To connect to a SQL Server database, first download the JDBC driver from microsoft.com and install it in your Sqoop lib path. Then run Sqoop. For example:
$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver --connect ...
8. Selecting the Data to Import
You can select a subset of columns and control their ordering by using the --columns argument.
This should include a comma-delimited list of columns to import. For example: --columns "name,employee_id,jobtitle"
sqoop import --table employees --connect jdbc:mysql://localhost/employees --username root --password 123 --columns "employeeid,employeename" --driver com.mysql.jdbc.Driver
You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column list> FROM <table name>. You can append a WHERE clause to this with the --where argument. For example: --where "id > 400". Only rows where the id column has a value greater than 400 will be imported.
sqoop import --table employees --connect jdbc:mysql://localhost/employees --username root --password 123 --columns "employeeid,employeename" --where "id > 400" --driver com.mysql.jdbc.Driver
9. Controlling parallelism
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument.
For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
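For example, a minimal sketch that runs the import with four map tasks and names the split column explicitly (the id column is assumed to be the table's key, as above):
sqoop import --connect jdbc:mysql://localhost/employees --username root --password 123 --table employees --split-by id -m 4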
10. Controlling type mapping
$ sqoop import ... --map-column-java id=String,value=Integer
11. Import data in one of two file formats: delimited text or SequenceFiles.
Delimited text is appropriate for most non-binary data types. It also readily supports further manipulation by other tools, such as Hive.
SequenceFiles are a binary format that store individual records in custom record-specific data types.
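For example, to import as SequenceFiles rather than the default delimited text (a sketch reusing the earlier connection settings):
sqoop import --connect jdbc:mysql://localhost/employees --username root --password 123 --table employees --as-sequencefile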
12. When importing to delimited files, the choice of delimiter is important. Delimiters which appear inside string-based fields may cause ambiguous parsing of the imported data by subsequent analysis passes. For example, the string "Hello, pleased to meet you" should not be imported with the end-of-field delimiter set to a comma.
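For example, a sketch that switches to tab-separated fields and optionally encloses string values in double quotes (the delimiter choices are illustrative):
sqoop import --connect jdbc:mysql://localhost/employees --username root --password 123 --table employees --fields-terminated-by '\t' --optionally-enclosed-by '\"'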
13. Hive arguments
--hive-import Import tables into Hive (Uses Hive’s default delimiters if none are set.)
--hive-overwrite Overwrite existing data in the Hive table.
--create-hive-table If set, then the job will fail if the target Hive table already exists.
--hive-table Sets the table name to use when importing to Hive.
--map-column-hive Override the default mapping from SQL type to Hive type for configured columns.
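A sketch combining several of these arguments; the Hive table name and the column-to-type override are illustrative:
sqoop import --connect jdbc:mysql://localhost/employees --username root --password 123 --table employees --hive-import --hive-table emps --map-column-hive employeeid=STRING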
14. Sqoop’s import tool’s main function is to upload your data into files in HDFS. If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.
sqoop import --table employees --connect jdbc:mysql://localhost/employees --username root --password 123 --driver com.mysql.jdbc.Driver --hive-import
Define in Hive a table named emps with a definition based on a database table named employees:
$ sqoop create-hive-table --connect jdbc:mysql://db.example.com/corp \
--table employees --hive-table emps
If the Hive table already exists, you can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced.
After your data is imported into HDFS (or if this step is omitted), Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive’s types, and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory.
If you do not set any delimiters and do use --hive-import, the field delimiter will be set to ^A and the record delimiter will be set to \n to be consistent with Hive’s defaults.
15. The table name used in Hive is, by default, the same as that of the source table. You can control the output table name with the --hive-table option.
16. Importing Data Into HBase
Sqoop supports additional import targets beyond HDFS and Hive. Sqoop can also import records into a table in HBase.
HBase arguments:
--column-family Sets the target column family for the import
--hbase-create-table If specified, create missing HBase tables
--hbase-row-key Specifies which input column to use as the row key
--hbase-table Specifies an HBase table to use as the target instead of HDFS
-By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS.
-Sqoop will import data to the table specified as the argument to --hbase-table.
If the target table and column family do not exist, the Sqoop job will exit with an error. You should create the target table and column family before running an import. If you specify --hbase-create-table, Sqoop will create the target table and column family if they do not exist, using the default parameters from your HBase configuration.
Each row of the input table will be transformed into an HBase Put operation to a row of the output table. The key for each row is taken from a column of the input. By default Sqoop will use the split-by column as the row key column. If that is not specified, it will try to identify the primary key column, if any, of the source table. You can manually specify the row key column with --hbase-row-key. Each output column will be placed in the same column family, which must be specified with --column-family.
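For example, a sketch that loads the employees table into HBase (the HBase table name, column family, and row key column are illustrative):
sqoop import --connect jdbc:mysql://localhost/employees --username root --password 123 --table employees --hbase-table employees --column-family info --hbase-row-key employeeid --hbase-create-table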
17. sqoop-export
The export tool exports a set of files from HDFS back to an RDBMS.
The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
The default operation is to transform these into a set of INSERT statements that inject the records into the database. In "update mode," Sqoop will generate UPDATE statements that replace existing records in the database.
--export-dir HDFS source path for the export
--table Table to populate
--update-key Anchor column to use for updates. Use a comma-separated list if more than one column is needed.
18. A basic export to populate a table named bar:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data
This example takes the files in /results/bar_data and injects their contents into the bar table in the foo database on db.example.com. The target table must already exist in the database. Sqoop performs a set of INSERT INTO operations, without regard for existing content. If Sqoop attempts to insert rows which violate constraints in the database (for example, a particular primary key value already exists), then the export fails.
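For contrast, a sketch of an update-mode export, which updates existing rows keyed on the --update-key column instead of inserting new ones (the id column is illustrative):
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --update-key id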
Reference:
https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html