Sqoop: Import PostgreSQL Tables to HDFS

Download the PostgreSQL JDBC driver:
# curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o postgresql-9.2-1002.jdbc4.jar
# cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop/

Make sure the JAR is copied to all nodes that run Sqoop (that is, every node with /var/lib/sqoop), and set its ownership:

# chown sqoop:sqoop /var/lib/sqoop/postgresql-9.2-1002.jdbc4.jar

Also copy the driver into the CDH Sqoop library directories:
# cp postgresql-9.2-1002.jdbc4.jar /opt/cloudera/parcels/CDH/lib/sqoop
# cp postgresql-9.2-1002.jdbc4.jar /opt/cloudera/parcels/CDH/lib/sqoop2
Import into the HDFS using Sqoop
With the data source defined, we are now ready to import the data into the HDFS. 
The sqoop command we will examine is listed below, and we will break down each 
argument in the bullet points that follow. Note that the command must be entered 
as one complete line or, as shown below, with a backslash (the Linux command-line 
continuation character) at the end of each line except the last.
sqoop import --connect 'jdbc:postgresql://aaa.bbb.ccc.ddd:5432/toptal?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory' \
--username 'postgres' -P \
--table 'sales' \
--target-dir 'sales' \
--split-by 'pksales'
sqoop import - The executable is named sqoop, and we are instructing it to import 
the data from a table or view from a database to the HDFS.
--connect - With the --connect argument, we are passing in the JDBC connect string 
for PostgreSQL. In this case, we use the IP address, port number, and database name. 
We also need to specify that SSL is being utilized and need to supply the 
SSLSocketFactory class to be used.
--username - In this example, the username is a PostgreSQL login, not an operating 
system login. The user must have permission to connect to the specified database 
and to select from the specified table.
-P - This will prompt the command line user for the password. If Sqoop is rarely 
executed, this might be a good option. There are multiple other ways to pass the 
password to the command automatically, but we are trying to keep it simple for 
this article.
--table - This is where we pass in the name of the PostgreSQL table.
--target-dir - This argument specifies the HDFS directory where the data is to be 
stored.
--split-by - We must provide Sqoop with a unique identifier to help it distribute 
the workload. Later in the job output, we will see where Sqoop selects the minimum 
and maximum values to help set split boundaries.
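As a rough sketch of what Sqoop does with the --split-by column, the plain-Python function below divides the key range into one contiguous slice per mapper. The min/max values and the function itself are illustrative, not Sqoop's actual code:

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide [min_val, max_val] into contiguous slices, one per mapper,
    roughly the way Sqoop partitions an integer --split-by column."""
    size = (max_val - min_val + 1) / num_mappers
    ranges = []
    lo = min_val
    for i in range(num_mappers):
        # Last mapper takes everything up to the maximum.
        hi = max_val if i == num_mappers - 1 else min_val + int(size * (i + 1)) - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# With 20 pksales values and Sqoop's default of 4 mappers:
print(split_ranges(1, 20, 4))  # [(1, 5), (6, 10), (11, 15), (16, 20)]
```

Each mapper then issues a SELECT restricted to its own slice, which is why a reasonably uniform, unique split column matters: a skewed column leaves some mappers with far more rows than others.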
It is a good idea to put the command in a script for repeatability and editing 
purposes, as shown below:
$ cat sqoopCommand.sh
sqoop import --connect 'jdbc:postgresql://aaa.bbb.ccc.ddd:5432/toptal?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory' \
--username 'postgres' -P \
--table 'sales' \
--target-dir 'sales' \
--split-by 'pksales' 
After executing the Sqoop command, we can run hdfs dfs -ls to see the 'sales' 
directory on the HDFS (the name we supplied with --target-dir; without that 
argument, Sqoop defaults to a directory named after the table).
$ hdfs dfs -ls sales
Found 6 items
-rw-rw-rw-   1 toptal data          0 2016-10-02 18:59 sales/_SUCCESS
drwxrwxrwx   - toptal data          0 2016-10-02 18:58 sales/_logs
-rw-rw-rw-   1 toptal data        110 2016-10-02 18:59 sales/part-m-00000
-rw-rw-rw-   1 toptal data        111 2016-10-02 18:59 sales/part-m-00001
-rw-rw-rw-   1 toptal data        115 2016-10-02 18:59 sales/part-m-00002
-rw-rw-rw-   1 toptal data        115 2016-10-02 18:59 sales/part-m-00003
The hdfs dfs -cat command will display the records in a given part file:
$ hdfs dfs -cat sales/part-m-00000
1,2016-09-27,1.23,1,1
2,2016-09-27,2.34,1,2
3,2016-09-27,1.23,2,1
4,2016-09-27,2.34,2,2
5,2016-09-27,3.45,2,3
Notice that the default field delimiter is a comma. Also notice that there 
are only five rows in each partition. If you need to extract data from multiple 
tables in the PostgreSQL database, you can import the results of a free-form 
query instead of a single table:
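Because the part files are plain comma-delimited text, downstream tools can read them directly. A minimal sketch using Python's csv module is shown below; the column names are assumptions based on the free-form query later in this article, and the last column is purely hypothetical:

```python
import csv
import io

# One record from sales/part-m-00000, in Sqoop's default comma-delimited form.
line = "1,2016-09-27,1.23,1,1"
fields = next(csv.reader(io.StringIO(line)))

# Column names are guesses from the join query used later in this article;
# "quantity" in particular is a hypothetical placeholder.
pksales, saledate, saleamount, orderid, quantity = fields
print(pksales, saledate, saleamount)
```

In practice you would open the part files from HDFS (for example via hdfs dfs -cat piped into a script) rather than hard-coding a line as done here.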
$ cat sqoopCommand.sh
sqoop import --connect 'jdbc:postgresql://aaa.bbb.ccc.ddd:5432/toptal?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory' \
--username 'postgres' -P \
--target-dir 'creditCardOrders' \
--split-by 'pksales' \
--query "select s.pksales, s.saledate, s.saleamount, o.shippingtype, o.methodofpayment from sales s inner join orders o on s.orderid=o.orderid where o.methodofpayment='credit card' and \$CONDITIONS"
--target-dir - The target directory tells Sqoop in which directory on the HDFS 
to store the selected data. This argument is required by Sqoop when using a 
free-form query.
--split-by - Even though we are selecting the primary key of the sales table, 
we still have to provide Sqoop with a unique identifier to help it distribute 
the workload.
--query - This is the argument in which we supply the SQL query. The query above
is enclosed in double quotes. Notice that there is no backslash (the line 
continuation character) within the quoted query, even though it may wrap across 
multiple lines on screen. Also notice the and \$CONDITIONS at the end of the 
WHERE clause: the backslash keeps the shell from expanding the variable, and the 
token itself is required because Sqoop replaces $CONDITIONS with a range 
expression unique to each mapper at run time.
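Conceptually, the substitution works like the sketch below (plain Python; the query is a simplified version of the one above, and the boundary values are illustrative, since Sqoop derives the real ones from the --split-by column's minimum and maximum):

```python
# Simplified version of the free-form query, with the required token.
query = ("select s.pksales, s.saleamount from sales s "
         "where s.saleamount > 1.00 and $CONDITIONS")

# One range predicate per mapper; the boundaries here are made up.
mapper_conditions = [
    "s.pksales >= 1 AND s.pksales < 11",
    "s.pksales >= 11 AND s.pksales <= 20",
]

# Each mapper runs the query with $CONDITIONS replaced by its own range.
mapper_queries = [query.replace("$CONDITIONS", cond)
                  for cond in mapper_conditions]
for q in mapper_queries:
    print(q)
```

This is why the token must appear verbatim in the WHERE clause: without it, Sqoop has nowhere to inject the per-mapper range, and the import fails.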