3月 292018
 

With the release of SAS Viya 3.3, you now have the ability to pass implicit SQL queries to a variety of SQL data sources, including Hive. Under an implicit pass-through, users can write SAS compliant SQL code, and SAS will:

  1. Convert as much code as possible into database native SQL.
  2. Execute the resulting query in-database.
  3. Bring the result back into SAS Viya.

My SAS Viya is co-located within a Hortonworks Hadoop environment. Within this environment, I have set up multiple tables within Hive, which provides structure and a query-like environment for Hadoop data. Using the SAS Data Explorer in SAS Viya, I can easily see the different tables in the Hive environment, and visually inspect them without having to load the data into SAS. The screenshot below shows the Hive table va_service_detail, which contains anonymous data related to recent hospital stays.

SQL Pass-through to Hive in SAS Viya

In my Hive environment, I have a second table called va_member_detail, which contains information about the individuals who were hospitalized in the above table, va_service_detail. A summary of this Hive table can be found in the screenshot below.

Using this data, I would like to perform an analysis to determine why patients are readmitted to the hospital, and understand how we can preventatively keep patients healthy. I will need to join these two tables to allow me to have visit-level and patient-level information in one table. Since medical data is large and messy, I would like to only import the needed information into SAS for my analysis.  The simplest way to do this is through an implicit SQL pass-through to Hive, as shown below:

With an implicit pass-through, I write normal SAS FedSQL code on top of a SAS Library called “Hadoop” pointing to my Hive Server. Once the code is submitted, the SAS System performs the following steps:

  1. Translates the SAS FedSQL code into HiveQL.
  2. Executes the HiveQL script in Hive.
  3. Loads the resulting data in parallel into SAS.

Looking at the log, we can see that the SQL statement was “Fully offloaded to the underlying data source via fill pass-through”, meaning that SAS successfully executed the query, in its entirety, in Hive. With the SAS Embedded Process for Hadoop, the resulting table is then lifted in-parallel from Hive into SAS Viya, making it available for analysis.

As we can see in the log, it took 42 seconds to execute the query in Hive, and bring the result into SAS. To compare efficiency, I redid the analysis, loading va_service_detail and va_member_detail into the memory of the SAS server and performed the join there. The execution took 58 seconds, but required three in-memory tables to do so, along with much more data passing through the network. The implicit pass-through has the benefits of increased speed and decreased latency in data transfer by pushing the query to its source, in this case Hive.

Conclusion

The Implicit SQL Pass-through to Hive in SAS Viya is a must have tool for any analyst working with Hadoop data. With normal SQL syntax in a familiar SAS interface, analysts can push down powerful queries into Hive, speeding up their analysis while limiting data transfer. Analysts can effectively work with large ever-growing data sizes, and speed up the time to value on solving key business challenges.

Implicit SQL Pass-through to Hive in SAS Viya was published on SAS Users.

6月 202014
 
This is a simple tutorial showing how to use SQL to subset data, and then create a histogram using Proc Sgplot, in SAS University Edition. So you've downloaded SAS University Edition, and you're wondering "What now?" -- I would recommend exploring some of the sample data, and creating some simple […]
3月 252014
 

The new book Routine SAS SQL, out today, is especially for people getting started with SQL in SAS. When I wrote it, I started from the “SQL” chapter in Professional SAS Programming Logic, but I added many more details that a SAS SQL coder is likely to need.

This book is also my first e-book. I’ll be watching to see how popular the e-book formats are in comparison to the print format.

Links:
Routine SAS SQL at Amazon (paperback)
http://www.amazon.com/gp/product/ASIN/189195721X
Routine SAS SQL at Kobo
http://store.kobobooks.com/en-US/ebook/routine-sas-sql
Routine SAS SQL for Amazon Kindle
http://www.amazon.com/dp/B00J7QKFM6
Routine SAS SQL for Apple iPad
https://itunes.apple.com/us/book/routine-sas-sql/id845242502
11月 132013
 

Wow did the summer fly by! Now that grad school is over, it’s about time to resume the “it’s all about the data” series. In the last several posts, I tried to lay a foundation for understanding how SAS stores and manages data for use in business intelligence and analytic applications. Now, I’d like to turn your attention to third-party database engines, which is the term (or SAS-ism) for when data is housed in non-SAS data structures. Examples include the obvious choices like Microsoft SQL Server, Oracle, MySQL, Teradata and a dozen or so additional supported third party database interfaces.

This article provides a quick overview of how SAS “talks” to third-party databases. If you are an experienced SAS programmer, you probably know much of what’s covered here. This topic has been written about extensively over the years, so while I wont go into detail on every nuance, I thought (thot!) it would be useful to provide a high level overview for those SAS administrators who are not SAS programmers but have to support the data needs of their user community. The focus here is the necessary preconditions for connectivity and the differences between explicitly passing through your query to the database versus letting SAS handle the negotiations. Let’s get started!

I can’t seem to connect to my database from SAS, what’s wrong?
Whether you are on Windows or UNIX, you need to both an ODBC driver manager and an ODBC driver to be installed. The error in the log below is for SQL Server and shows that the connection could not be made. You need to look at your ODBC.ini file (UNIX) or ODBC Administrator (Windows) and confirm that you have specified the details correctly. For more information on how to connect, take a look at the articles here for UNIX and Windows

 

For Oracle, you need an Oracle client driver installed on the server, and the driver needs to know where to look for the database. This driver is usually defined in a file called tnsnames.ora that lives within the client driver directory. For more information, take a look at the articles here for connecting to Oracle databases on UNIX and Windows. Note that you can use either SAS/Access to Oracle (native driver) or SAS/Access to ODBC to make these connections from within SAS.

What is SQL Pass-thru and why should I care?
In SAS, you also have the option of using two features to access your third-party database:  libname access or explicit pass-thru capabilities.

Libname access is just like it sounds – you simply use the LIBNAME statement in SAS and specify the engine along with your credentials and database specific attributes. Once the libref is instantiated, then you can use references to the library just as if it were a SAS library. Here is a quick example:

Explicit pass-thru, on the other hand, is different in that you are not letting SAS do any of the database negotiations for you. Instead, you tell SAS to pass through everything in your program directly to the database for processing. Here is the same connection to sqlsvr, this time using explicit pass-thru.

This is perhaps one of the better-documented SAS capabilities as it relates to best practice. For example, Jessica Hampton discusses SQL pass-thru in her NESUG 2011 paper "SQL Pass-Through and the ODBC Interface". Here, she compares the libname method (implicit) versus explicit pass-thru method. 

There are so many options for connecting SAS to my database, which is the best?
This is a common question. After all, you can use Microsoft SQL Server using ODBC, SAS/Access to Microsoft SQL Server and OLE DB. For Oracle, you also have the option of using the native driver as well as ODBC. You’ll also see that this is true for many other database such as Teradata, mySQL, and so on.  Generally the choice comes down to performance or flexibility:

  • The best performance will always be the native driver for that database.
  • However, if you want the flexibility of talking to lots of databases and don’t mind having to find and purchase individual ODBC drivers, then the more general SAS/Access to ODBC may be right for you.

One word of caution, however:  if you use explicit SQL pass-thru, then you have to comply with the rules of the database driver that you are using. For example, there are lots of capabilities within the Oracle SQL language that are not available in ODBC. So if you want the full power of using database-specific language elements, then you will want to go with the native drivers.

What is the best option for connecting to Microsoft SQL Server?
For Microsoft SQL Server, people often ask if they should use SAS/Access to ODBC or the native SAS/Access to SQL Server engine? One of the best “how-to” articles on SAS and Microsoft SQL Server can be found in the SAS Tech Note "Accessing a Microsoft SQL Server Database from SAS on Microsoft Windows".  Although this note is a bit dated, it provides a good foundation for most users.

What might be surprising to some is that the native engine is actually a Microsoft SQL Server ODBC driver that is shipped with the SAS/Access product. SAS licenses the DataDirect ODBC driver for you so that you don’t have to find and eventually purchase another ODBC driver for your environment.

 The DataDirect drivers have been proven to work efficiently with SAS, and they go through extensive testing before each version of SAS is released. On the other hand, if you want the flexibility of using your own ODBC drivers or have a need to connect to other database via ODBC, then SAS/Access to ODBC might be the route for you.

In the SAS Community, you may see papers that talk about Microsoft SQL Server that use ODBC and some use the native access engine:  sqlsvr. Be sure to use the engine that you have licensed as well as the syntax appropriate to that engine.

There are other methods that can be used to get data in and out of Microsoft SQL Server from SAS such as OLE DB, but I’ll leave that to you to read "Getting Started with SAS/ACCESS Interface to OLE DB" as homework.

 Can I use SAS/Access to ODBC and SAS/Access to Microsoft SQL Server on the same machine?
A slightly more confusing answer comes when people want to use both SAS/Access to ODBC and SAS/Access to Microsoft SQL Server on the same machine. Without going into a detailed explanation, the short answer is “yes – you can, but you might not want to do that.”

Supporting SAS/Access products on the same machine requires that you set up the environment so that the drivers are defined correctly in the odbc.ini file and may necessitate using two startup scripts when launching SAS depending on how you want to install and configure. As long as you are precise about where you install the software and how the environment variables are instantiated, it should work.

Where do I go from here?
Obviously this topic is huge! Since there are already so many good references around this topic, I’ll just highlight a few more:

For those new SAS administrators, are there other questions that you have about accessing data from third party relational database management systems?

Remember--Happy Data, Happy Users!

--greg

tags: database engines, Oracle, SAS Administrators, sql
6月 282013
 

Many SAS Enterprise Guide users practically live in the Query Builder. For those who understand their data tables, the Query Builder provides a tremendous amount of flexibility to pull and manipulate data. The Query Builder produces SQL programs behind the scenes, which translates well for database-centric work.

Sometimes a complex query requires a two-pass process. For example, suppose that you want to create a table of CARS that contains only those models with a higher-than-average price. First you need a query to calculate the average price across all cars. Then you need a second query to filter the records with a price that's higher than that result.

A SAS Enterprise Guide user might build something like this:

This method has obvious downsides: it requires two passes through the data, and two sets of results are brought into your SAS session. If the source table is in a database, this can be especially inefficient.

SAS Enterprise Guide 5.1 supports the idea of subqueries, a concept familiar to SQL programmers. With a subquery (sometimes called "subselect" or "inner query"), you can express the entire operation in one SQL step. That's one trip to the database, and only one set of results brought back to SAS.

I'll briefly describe how this works. For comprehensive documentation and examples about the subquery feature in SAS Enterprise Guide 5.1 and later, read Michael Burke's excellent conference paper on Finding Your Inner Query with SAS Enterprise Guide.

The subquery feature relies on the ability to define query templates (also new in SAS Enterprise Guide 5.1), which are query definitions that you can reuse as building blocks for other query steps. (Query templates are useful on their own, as a way to store a query within your task menus so that you can use it across different projects.)

Using the subquery feature

First, use the Query Builder to define the "inner query". In my example, that query contains just one item, the calculated average of MSRP.

Instead of running the query, use the drop-down arrow on the Run button to select Create Template. The Create Template window allows you to assign a name to this definition, and it's automatically added to your Tasks->Task Templates menu. Note that you can create the query template without even running this part of the query. That can be a time saver if you're a Query Builder Ninja with high confidence in your result.

Next, design the outer query with a new Query Builder task. Select all of the fields that you want to include in the result. In the Filter tab, add a new filter for MSRP. Instead of comparing MSRP to a constant or other expression, you'll select from a list of available Subquery templates.

In your Filter tab, you'll see a partial view of the inner SELECT expression.

When you complete your selections and have the other query options just the way that you want them, click Run to execute the query.

If you examine the SAS program that the task creates, you'll see the inner SELECT clause:

PROC SQL;
   CREATE TABLE WORK.expensive AS 
   SELECT t1.Make, 
          t1.Model, 
          t1.MSRP, 
      FROM SASHELP.CARS t1
      WHERE t1.MSRP >= (
         SELECT /* AVG_of_MSRP */
           (AVG(t1.MSRP)) AS AVG_of_MSRP
             FROM SASHELP.CARS t1
      );
QUIT;

You can also use subqueries as a way to filter/compare a column to a list of values. In that case, your inner query simply needs to have a single-column list of results (instead of a single number, as in my example). Read Michael's subquery paper for all of the possibilities and limitations. Here's an executive summary from Michael's paper:

Here are the types of subqueries that you can create using query-based templates:

  • Subqueries that return a single value
  • Subqueries that return multiple values (multiple rows of a single field)
  • Subqueries that appear as part of a filter of the raw data (on the WHERE clause)
  • Subqueries that appear as part of a filter on the grouped data (on the HAVING clause)
  • Subqueries that appear as part of a recode condition as part of recoding a column in a computed column (on the SELECT clause)

Here are the types of subqueries that you cannot create using query-based templates:

  • Subqueries that form a derived table (subqueries that appear on the FROM clause)
  • Subqueries that refer to columns on the outer query (correlated subqueries)

More than one way to skin this data

There are many other ways to accomplish this simple example in SAS. For example, you may be able to use the HAVING clause feature in the Query Builder, or use SQL joins and special join conditions to filter a table based on another result set. But the ability to create a subquery has been long-requested from SAS Enterprise Guide users who understand databases. I hope you'll agree that it's an important addition to your query and reporting toolbox.

tags: SAS Enterprise Guide, sql, subquery
2月 092013
 

Computing_for_data_analysis_Coursera

I spent almost all of my blogging time last month to follow an online course in Coursera, Computing for Data Analysis (with R) by Dr. Roger  Peng of Johns Hopkins, Biostatistics Department. I already checked out bunch of Coursera courses just to take a look at what else MOOC look like, this R course was the first and only one I took seriously: I reviewed all slides (although didn’t watch the videos since I didn’t have enough time) and most important, finished all programming assignments(and got full score!).

Not a frequent R user, I used some R packages for learning purpose (mostly in data mining). This Computing for Data Analysis course focused mostly on data manipulation and that’s why I chose it to dig into R. R by design is not a tool for data management (where SAS excels), but it’s nice to have (no hurt since it’s free). Those days I put most of my efforts on R data structures(vector, matrix, array, list, data frame) and by grouping processing(*ply functions). Few resource I found extremely useful as a starter:

  • http://stackoverflow.com/questions/tagged/r  when I google a specific R programming technique, mostly I will reach to stackoverflow website, a new generation Q&A hub to replace user forum, mailing list.
  • An introduction to R by Longhow Lam, a free book. I like its Chapter 2, Data Objects, including R data types and data structures.
  • Data Manipulation with R by Phil Spector, one of the Use R! books. I used its Chapter 8, Data Aggregation on R by group processing.
  • RStudio, a must have R IDE, the best of best.
  • Package plyr, tools for splitting, applying and combining data. Actually, it is awful to use R to manipulate data from a SAS programmer point of view. Use plyr to alleviate the pain.
  • Package sqldf, like SAS Proc SQL against data frame. Reason, ibid. Use sqldf won’t enhance your R programming skill, but sometime when you feel down at R, you may want to launch piece of SQL scripts you really like.