PROC SQOOP

October 28, 2015
 

SAS 9.4 M3 introduces a new procedure named PROC SQOOP. This procedure enables users to access the Apache Sqoop utility from a SAS session to transfer data between a database and HDFS. With PROC SQOOP, you can submit Sqoop commands from within your SAS application to your Hadoop cluster.
PROC SQOOP is licensed with SAS/ACCESS® Interface for Hadoop; it is not part of the Base SAS® license. PROC SQOOP is supported on both UNIX and Windows SAS.

Sqoop commands are passed to the cluster using the Apache Oozie Workflow Scheduler for Hadoop. PROC SQOOP defines an Oozie workflow for your Sqoop task, which is then submitted to an Oozie server using a RESTful API.

PROC SQOOP works similarly to the Apache Sqoop command-line interface (CLI) and uses the same syntax. The procedure reports whether the job completed successfully and where to find more details in your Hadoop cluster.
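To illustrate, an import that you would type at the Sqoop CLI on a cluster node (the host, database, and table names below are placeholders) looks like this:

    sqoop import --connect jdbc:mysql://dbhost/mydb --table department --target-dir /user/sasdemo/department

In PROC SQOOP, the same string, without the leading sqoop keyword, becomes the value of the COMMAND= option:

    command='import --connect jdbc:mysql://dbhost/mydb --table department --target-dir /user/sasdemo/department';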

Database Connector

Sqoop can be used with any Java Database Connectivity (JDBC) compliant database and automatically supports several databases. In some cases, the database vendor's JDBC driver (JAR file) might need to be installed in the $SQOOP_HOME/lib path on the Sqoop client machine.

For example, if you are connecting to a MySQL database, you place the MySQL connector JAR file under the Oozie lib folder (for example, /usr/lib/oozie/lib):

    [root@sascdh01 ~]# ls -l /usr/lib/oozie/lib/my*
    -rw-r--r-- 1 root root 972009 Oct 15 02:48 /usr/lib/oozie/lib/mysql-connector-java-5.1.36-bin.jar

SQOOP setup

  • Download Sqoop-1 (1.4.x) from the Apache Sqoop website. Sqoop 1.4.5 is recommended. Be sure to get a Sqoop JAR file that is compatible with your version of Hadoop.
  • For each database that you plan to use with Sqoop, you must download a compatible JDBC driver or Sqoop connector from the associated database vendor. The connectors (JAR files) should be placed in the Sqoop lib directory and in the Oozie share lib in HDFS (see the sketch after this list).
  • PROC SQOOP uses your Hadoop cluster configuration files. Set the environment variable SAS_HADOOP_CONFIG_PATH to point to the location of your Hadoop configuration directory.
  • Sqoop JAR files are not required on the SAS client machine. PROC SQOOP uses Apache Oozie, which provides REST API communication to the Hadoop cluster without local JAR files. Set the environment variable SAS_HADOOP_RESTFUL=1 to connect to the Hadoop server by using the WebHDFS or HttpFS REST API.
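For example, on a CDH cluster you might stage a MySQL JDBC driver as sketched below. The local lib directories and the Oozie share lib location in HDFS are typical CDH paths and are only assumptions; verify the layout for your distribution before copying.

    # copy the vendor JDBC driver into the local Sqoop and Oozie lib directories
    cp mysql-connector-java-5.1.36-bin.jar /usr/lib/sqoop/lib/
    cp mysql-connector-java-5.1.36-bin.jar /usr/lib/oozie/lib/
    # add it to the Oozie share lib in HDFS (list the share lib first to confirm its exact path)
    hadoop fs -ls /user/oozie/share/lib
    hadoop fs -put mysql-connector-java-5.1.36-bin.jar /user/oozie/share/lib/sqoop/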

To use PROC SQOOP, you must have the following information:

  • Database connection information; each vendor has its own connection options
  • Database user ID and password
  • HDFS file that contains the database password (required for some databases)
  • Hadoop user ID and password
  • Oozie URL (host and port number)
  • NameNode service URL (host and port number)
  • JobTracker/Resource Manager service URL (host and port number)
  • Oozie workflow output path
  • Sqoop command

Example
The following code example illustrates a data transfer from a MySQL database table to HDFS. The data transfer operation is executed on the Hadoop cluster using an Oozie workflow and MapReduce steps. The data is streamed directly from the database server to the Hadoop cluster without routing through the SAS workspace server. In the PROC SQOOP statement, you provide the environment properties that describe where the data is located and the target location. In the COMMAND= option, you provide the native Sqoop statement for the specific actions required.

OPTIONS SET=SAS_HADOOP_CONFIG_PATH="/opt/sas/thirdparty/Hadoop_Conf/CDH524";
OPTIONS SET=SAS_HADOOP_RESTFUL=1 ;

proc sqoop
 hadoopuser='sasdemo'
 dbuser='hdp' dbpwd='xxxxxx'
 oozieurl='http://xxxxxxx.xxxx.sas.com:11000/oozie'
 namenode='hdfs://xxxxxxx.xxxx.sas.com:8020'
 jobtracker='xxxxxxx.xxxx.sas.com:8032'
 wfhdfspath='hdfs://xxxxxxx.xxxx.sas.com:8020/user/sasdemo/myworkflow.xml'
 deletewf
 command=' import --connect jdbc:mysql://XXXXX.XXXX.sas.com/hdpdata --append -m 1 --table department --target-dir /user/sasdemo/department ';
 run;

Log extract from the above code execution
………….
…………………..
NOTE: SAS initialization used:
real time 0.03 seconds
cpu time 0.02 seconds

1 OPTIONS SET=SAS_HADOOP_CONFIG_PATH="/opt/sas/thirdparty/Hadoop_Conf/CDH524";
2 OPTIONS SET=SAS_HADOOP_RESTFUL=1 ;
3
4
5
6 proc sqoop
7 hadoopuser='sasdemo'
8 dbuser='hdp' dbpwd=XXXXXXXXX
9 oozieurl='http://xxxxx.xxxx.sas.com:11000/oozie'
10 namenode='hdfs://xxxxx.xxxx.sas.com:8020'
11 jobtracker='xxxxx.xxxx.sas.com:8032'
12 wfhdfspath='hdfs://xxxxx.xxxx.sas.com:8020/user/sasdemo/myworkflow.xml'
13 deletewf
14 command=' import --connect jdbc:mysql://xxxxx.xxxx.sas.com/hdpdata --append -m 1 --table department
--target-dir
14 ! /user/sasdemo/department ';
15 run;

NOTE: Job ID : 0000004-151015031507797-oozie-oozi-W
NOTE: Status : SUCCEEDED
NOTE: PROCEDURE SQOOP used (Total process time):
real time 55.89 seconds
cpu time 0.05 seconds
……….
………………..

On the Oozie web console, you can see the Oozie jobs that have been submitted; their status shows as running, killed, or succeeded.
[Screenshot: Oozie web console]
On the YARN Resource Manager user interface, you can see the MapReduce tasks that have been submitted and are executing.
[Screenshot: YARN Resource Manager UI]
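If you prefer the command line to the web consoles, the Oozie client reports the same status for the job ID that PROC SQOOP writes to the SAS log. This assumes the Oozie client is installed on the node where you run the command; the URL and job ID below are the ones from the example above.

    oozie job -oozie http://xxxxxxx.xxxx.sas.com:11000/oozie -info 0000004-151015031507797-oozie-oozi-W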

On HDFS, you can see the data files that were created in the directory specified by the --target-dir option in the Sqoop command.

 

[root@sascdh01 ~]# hadoop fs -ls /user/sasdemo/department
Found 3 items
-rw-r--r-- 3 sasdemo supergroup 22 2015-10-15 10:56 /user/sasdemo/department/part-m-00000
-rw-r--r-- 3 sasdemo supergroup 22 2015-10-15 11:54 /user/sasdemo/department/part-m-00001
-rw-r--r-- 3 sasdemo supergroup 22 2015-10-15 12:03 /user/sasdemo/department/part-m-00002
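Sqoop also supports transfers in the other direction. The following sketch exports the same HDFS directory back to MySQL using the standard Sqoop export syntax. It assumes a table named department_copy already exists in the hdpdata database, because Sqoop export requires the target table to exist; adjust the connection details for your site.

proc sqoop
 hadoopuser='sasdemo'
 dbuser='hdp' dbpwd='xxxxxx'
 oozieurl='http://xxxxxxx.xxxx.sas.com:11000/oozie'
 namenode='hdfs://xxxxxxx.xxxx.sas.com:8020'
 jobtracker='xxxxxxx.xxxx.sas.com:8032'
 wfhdfspath='hdfs://xxxxxxx.xxxx.sas.com:8020/user/sasdemo/myworkflow.xml'
 deletewf
 command='export --connect jdbc:mysql://XXXXX.XXXX.sas.com/hdpdata -m 1 --table department_copy --export-dir /user/sasdemo/department';
run;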

Dependencies on SAS_HADOOP_RESTFUL environment variable
The SAS_HADOOP_RESTFUL environment variable determines whether to connect to the Hadoop server through JAR files, HttpFS, or WebHDFS. The default setting for this variable is 0, which connects to the Hadoop cluster using JAR files. PROC SQOOP uses Apache Oozie, which provides REST API communication to a Hadoop cluster without JAR files, so when you run PROC SQOOP, you must set this variable to 1.
If you do not set the environment variable SAS_HADOOP_RESTFUL=1 in your SAS session, you might see a confusing error message when the PROC SQOOP statement executes. The following SAS log reports an issue with the HADOOPUSER parameter because the procedure assumes that the Hadoop cluster has Kerberos security enabled. However, security is disabled on this cluster, so the reference to Kerberos in the error message is misleading.

..........
.....
NOTE: SAS initialization used:
real time 0.02 seconds
cpu time 0.02 seconds

1 OPTIONS SET=SAS_HADOOP_CONFIG_PATH="/opt/sas/thirdparty/Hadoop_Conf/CDH524";
2 /* OPTIONS SET=SAS_HADOOP_RESTFUL=1; */
3
4
5 proc sqoop
6 hadoopuser='sasdemo'
7 dbuser='hdp' dbpwd=XXXXXXXXX
8 oozieurl='http://xxxxx.xxxx.sas.com:11000/oozie'
9 namenode='hdfs://xxxxx.xxxx.sas.com:8020'
10 jobtracker='xxxxx.xxxx.sas.com:8032'
11 wfhdfspath='hdfs://xxxxx.xxxx.sas.com:8020/user/sasdemo/myworkflow.xml'
12 deletewf
13 command=' import --connect jdbc:mysql://xxxxx.xxxx.sas.com/hdpdata --append -m 1 --table department
--target-dir
13 ! /user/sasdemo/department ';
14 run;

ERROR: HADOOPUSER should not be provided for Kerberos enabled clusters.
ERROR: The path was not found: /user/sasdemo/SAS_SQOOPaq7ddo96.
NOTE: PROCEDURE SQOOP used (Total process time):
real time 3.41 seconds
cpu time 0.09 seconds
...........
......
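To resolve this, set the variable before the PROC SQOOP step and resubmit. The following quick sketch sets the option and echoes it back to the log with the %SYSGET macro function, which reads environment variables, so you can confirm the value your session is using.

OPTIONS SET=SAS_HADOOP_RESTFUL=1;
%put SAS_HADOOP_RESTFUL=%sysget(SAS_HADOOP_RESTFUL);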

Reference document

Base SAS® Procedures Guide, Fourth Edition

tags: Hadoop, PROC SQOOP, SAS Professional Services
