proc sql

Mar 28, 2017
 

Did you know that PROC SQL captures the record count for a result set in a special automatic macro variable? When you create a subset of data to include in a report, it's a nice touch to add a record count and other summaries as an eye-catcher to the report title. I often see the following pattern in SAS programs, which adds an extra step to get a record count:

proc sql noprint;
 
 create table result 
  as select * from sashelp.cars
  where origin='Asia';
 
 /* count the records in the result */
 select count(model) into :resultcount
  from result;
quit;
 
title "Summary of Cars from Asia: &resultcount. records";
proc means data=result;
run;

This creates a report with an informative title like this:

Here's the tip. Instead of including a SELECT INTO step that makes another pass through the data, you can rely on the &SQLOBS automatic macro variable. This variable holds the record count of the result set from the most recent SQL statement.

proc sql noprint;
 create table result 
  as select * from sashelp.cars
  where origin='Asia';
 
 %let resultcount=&sqlobs;
quit;
 
title "Summary of Cars from Asia: &resultcount. records";
proc means data=result;
run;

Because SAS replaces the value with each subsequent SQL statement, it's important to assign it to another macro variable immediately if you intend to use it later. Here's the result:

Not only is this more efficient, but SAS automatically trims the whitespace from the SQLOBS variable so that it looks better in a TITLE statement. If you're using SELECT INTO to populate macro variables for other reasons, you can use the TRIMMED keyword to achieve the same effect.

proc sql noprint;
 
 create table result 
  as select * from sashelp.cars
  where origin='Asia';
 
 %let resultcount=&sqlobs;
 
 select avg(mpg_highway) into: AvgMpg TRIMMED
  from result;
 
quit;
 
title "Summary of Cars from Asia: &resultcount. records, &AvgMpg. MPG Average";
proc means data=result;
run;
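One caveat worth a sketch: every PROC SQL statement resets &SQLOBS, which is why the immediate %LET matters. This minimal example (my own illustration, not from the original post) shows the value being overwritten:

```sas
proc sql noprint;
 create table result 
  as select * from sashelp.cars
  where origin='Asia';
 %let resultcount=&sqlobs;   /* capture the row count right away */
 
 select distinct make from result;
 /* &sqlobs now reflects the DISTINCT query, not the CREATE TABLE */
quit;
 
%put resultcount=&resultcount but sqlobs=&sqlobs;
```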

See also

The post How many records are in that PROC SQL result? appeared first on The SAS Dummy.

May 19, 2016
 

Yesterday a frustrated SAS user complained on Twitter. He's working with a database that stores an ID field as a big long number (perhaps using the database BIGINT type), and SAS can't display a number greater than 15 digits. Well, it's actually 16 digits, depending on the value:

%put Biggest Exact Int = %sysfunc(constant(EXACTINT,8));
>> Biggest Exact Int = 9007199254740992

It's a controversial design decision to use an integer to represent an ID value in a database. You might save a few bytes of storage, but it limits your ability to write programs (not just SAS programs) that have to store and manipulate that value. And if you don't need to do math operations with the ID, your data consumers would rather see a nice character value there.
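Here's a quick sketch (using a made-up 19-digit value) of what that precision loss looks like when such an ID lands in a SAS numeric variable:

```sas
data _null_;
 /* hypothetical 19-digit BIGINT ID -- larger than the exact-integer limit */
 id = 1234567890123456789;
 /* the stored double is rounded to the nearest representable value */
 put id= 20.;
run;
```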

Fortunately, when working with databases, you can tell SAS to read numeric values as character values into your SAS data sets. In addition to solving the precision problem I've just described, this can also help when you need to join database fields with other source systems that store their key fields differently. It's usually much easier to convert the field "on the way in" rather than try to mangle it after you've already read in the records. Use the DBSASTYPE= data set option to tell SAS how to read database fields. Here's a sample SAS program that shows how I access a table using ODBC, one step without and one step with the DBSASTYPE= option.

libname wpblogs odbc datasrc="wpblogs";
 
options obs=10;
data users_IDint (keep=ID display_name);
  set wpblogs.wp_users;
run;
 
data users_IDchar (keep=ID display_name);
  set wpblogs.wp_users 
    (dbsastype=(ID='char(20)'));
run;

Here are the resulting tables; you can see the simple difference. One has ID as a number, and one has it as a character. Magic!

The DBSASTYPE= option is supported for virtually all SAS/ACCESS database engines, including the ubiquitous SAS/ACCESS to ODBC.
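For comparison, converting after the fact is possible but clumsy, and it cannot restore digits that were already rounded on the way in. A sketch (reusing the table from the example above):

```sas
data users_IDchar2 (keep=ID display_name);
  set users_IDint (rename=(ID=ID_num));
  length ID $ 20;
  /* format the numeric as text; safe only if the value was still exact */
  ID = strip(put(ID_num, 20.));
run;
```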

Oh, and you might be wondering how things turned out for our frustrated user on Twitter. Our SAS Cares social media team heard his plea and responded -- as they always do. And our user not only found the information useful, he took it a step further by replying back with an additional syntax tip.

tags: DBSASTYPE option, PROC SQL, sas/access

The post Tell SAS to read a database field as CHAR instead of INT or BIGINT appeared first on The SAS Dummy.

May 18, 2016
 

As a parent of children who love books, I can tell you that there is something humorous about taking a first name, adding a "Mc" and then a rhyming surname to make up a brand new character name. My daughters always loved to read the adventures of Hairy Maclary from Donaldson's Dairy, and we loved to read it aloud to them. It was just fun.

The Boaty McBoatface phenomenon has taken this to the next level by adding "face" as a suffix, which often has a funny punctuating effect ("silly face," "Chu-Chi face," "doody face," etc. Hilarious!).

I thought that I was done writing blogs about Boaty McBoatface, but I've been hearing from so many people about this topic that I need at least this one more to finish it off.

Name our ship: final results

Spoiler: NERC is not going to christen the new vessel "Boaty McBoatface." Instead, the name comes from the 4th-highest vote-getter, "David Attenborough." The famous naturalist earned over 11,000 votes, or 2.78% of all votes cast. However, as a crowd-pleasing nod to the plebeians, NERC will name one of the ship's remotely operated submarines "Boaty McBoatface." Hooray! I grabbed the final voting results from the NameOurShip website and re-ran my analysis. Here are the final top-10 standings.


Many imitators, but original stays on top

The original entry of "Boaty McBoatface" inspired many copycats who submitted names with a similar formula. None of them seemed to have the wide appeal of Boaty, probably because they weren't first and original, but here they are with their vote counts.

I found these in the data with a simple SQL LIKE operator, matching names that contain the "-y Mc-" pattern.

proc sql;
   create table work.TheMCs as 
   select t1.title, 
            (sum(t1.likes)) format=comma20. as totalVotes
      from work.votes t1
      where t1.title like '%y Mc%'
      group by t1.title
      order by totalVotes desc;
quit;

Boaty Mac: start of a popular movement

Silly names are not limited to research vessels. The world has embraced the Boaty McBoatface pattern with much enthusiasm. A colleague sent me news about Parsey McParseface, an open-source project from Google. Grumpy McNoisybutt was proposed as a name for a rattlesnake. Even my own daughter has created Rocky McRockface, a major character in her rock cycle project.

I won't say that this is my final Boaty post. Who knows? In a couple of years I might be reporting on "Boaty McBoatface"-inspired baby names. I'm confident that at least one poor child will bear the name; that's the sort of world we live in. Fortunately, children usually find a way to have revenge on their parents (which is why I have nothing but praise for Rocky McRockface).

tags: Boaty McBoatface, PROC SQL

The post Copy McCopyface and the new naming revolution appeared first on The SAS Dummy.

May 27, 2015
 

“Phew! That tip alone was a life saver,” said a student in one of my SAS SQL classes. “Before, I would have to read about ten Google search results before I could find that content of the sort you shared in class.” That student was referring to the tip I […]

The post Life saver tip for comparing PROC SQL join with SAS data step merge appeared first on The SAS Training Post.

May 11, 2015
 
In this post, I show a trick to do moving average calculation (can be extended to other operations requiring windowing functions) that is super fast.

Often, SAS analysts need to conduct moving average calculations, and there are several options, in order of preference:

1. PROC EXPAND
2. DATA STEP
3. PROC SQL

But many sites haven't licensed SAS/ETS to use PROC EXPAND, and doing a moving average in a DATA step requires some coding and is error prone. PROC SQL is a natural choice for junior programmers and in many business cases the only solution, but SAS's PROC SQL lacks the windowing functions that are available in many databases to facilitate moving average calculation. One technique people commonly use is a CROSS JOIN, which is very expensive and not a viable solution for even a medium-sized data set.

Consider the simplest moving average calculation, where the trailing K observations are included in the calculation, namely MA(K); here we set K=5. We first generate a 20-observation sample data set, where the variable ID is used for windowing and the variable X is used in the MA calculation. Then we apply the standard CROSS JOIN and examine the resulting (non-grouped) data, just to understand how to leverage the data structure.

%let nobs=20;
%let ndiff=-5;
data list;
  do id=1 to &nobs;
    x=id*2;
    output;
  end;
run;

options notes;
options fullstimer;
proc sql;
  create table ma as
  select a.id as aid, b.id as bid, a.id-b.id as diff, a.x as ax, b.x as bx
    from list as a, list as b
    where a.id>=b.id and (a.id-b.id) <= abs(&ndiff)-1
    having aid-bid >= (&ndiff+1)
    order by aid, bid
  ;
quit;



From the resulting data set, it is hard to find a clue. Now let's sort this data set by the "bid" column:

From the sorted data, it is clear that we actually don't have to CROSS JOIN the whole original data set with itself. Instead, we can generate an "operation" data set that contains the difference values, CROSS JOIN the original data set with this much smaller "operation" data set, and all the data we need for the MA calculation will be there. Now let's do it: CROSS JOIN the original data with the "operation" data, and sort by (a.id+ops), which is the "bid" column in the sorted data set:


%let ndiff=5;
data operation;
  do ops = 0 to &ndiff-1 by 1;
    weight=1;
    output;
  end;
run;

proc sql;
  create table ma2 as
  select a.id as aid, b.ops, a.id+b.ops as id2, a.x*b.weight as ax
    from list as a, operation as b
    order by id2, aid
  ;
quit;
Note that in the code above, it is necessary to multiply a.x by b.weight so that the data can be interleaved; otherwise the same X value from the original table would be output and the MA calculation would fail. The explicit weight variable also adds flexibility to the whole MA calculation. Setting it to 1 for all observations produces a simple MA, while assigning different weights supports more complex MA computations, such as giving more distant observations less weight for a decayed MA. If a different K parameter in the MA(K) calculation is required, only the operation data set needs to be updated, which is a trivial job. The actual code template for the MA(K) calculation is:

%let ndiff=5;
data operation;
  do ops = 0 to &ndiff-1 by 1;
    weight=1;
    output;
  end;
run;

proc sql noprint;
  select max(id) into :maxid
    from list;
quit;

proc sql;
  create table ma2 as
  select a.id+b.ops as id2, avg(a.x*b.weight) as MA
    from list as a, operation as b
    group by id2
    having id2 >= &ndiff
       and id2 <= &maxid
    order by id2
  ;
quit;
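To illustrate the weighting idea, here is a sketch (my own variation, with an assumed decay factor) of an operation data set for a decayed MA(5), where more distant observations get geometrically smaller weights. Note that with unequal weights, the AVG in the template should be replaced by the SUM of the weighted values divided by the total weight to get a proper weighted average:

```sas
%let ndiff=5;
%let decay=0.8;   /* assumed decay factor */
data operation;
  do ops = 0 to &ndiff-1 by 1;
    /* larger ops = more distant observation = smaller weight */
    weight = &decay**ops;
    output;
  end;
run;
```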
With this new method, it is interesting to compare it to the expensive self CROSS JOIN as well as to PROC EXPAND. On my workstation (Intel i5 at 3.8 GHz, 32 GB memory, 1 TB 7200 RPM HDD), the self CROSS JOIN takes a prohibitively long time when the data is large, while the new method uses only about 2X as much time as PROC EXPAND; both time consumptions are trivial compared to the self CROSS JOIN. The time consumption shown below is in seconds.




Below is the code readers can run to compare for themselves.


%macro test(nobs, ndiff);
options nonotes;
/* normalize the window size so that negative values also work */
%let ndiff = %sysfunc(abs(&ndiff));

data list;
  do id=1 to &nobs;
    x=id*2;
    output;
  end;
run;

%let t0 = %sysfunc(time());
options fullstimer;

/* Method 1: self CROSS JOIN */
proc sql;
  create table ma as
  select a.id, avg(b.x) as ma
    from list as a, list as b
    where a.id>=b.id and (a.id-b.id) <= &ndiff-1
    group by a.id
    having id >= &ndiff
  ;
quit;
%let t1 = %sysfunc(time());

/* Method 2: PROC EXPAND (requires SAS/ETS) */
proc expand data=list out=ma2 method=none;
  convert x=ma / transformout=(movave &ndiff);
run;
%let t2 = %sysfunc(time());

/* Method 3: CROSS JOIN with the small "operation" data set */
data operation;
  do ops = 0 to &ndiff-1 by 1;
    weight=1;
    output;
  end;
run;
proc sql noprint;
  select max(id) into :maxid
    from list;
quit;
proc sql;
  create table ma3 as
  select a.id+b.ops as id2, avg(a.x*b.weight) as ma
    from list as a, operation as b
    group by id2
    having id2 >= &ndiff and id2 <= &maxid
  ;
quit;
%let t3 = %sysfunc(time());

%let d1 = %sysfunc(round( %sysevalf(&t1 - &t0), 0.001));
%let d2 = %sysfunc(round( %sysevalf(&t2 - &t1), 0.001));
%let d3 = %sysfunc(round( %sysevalf(&t3 - &t2), 0.001));
%put SelfXJoin: &d1, EXPAND: &d2, Operation: &d3;
options notes;
%mend;



dm log 'clear';
%test(5000, -40);

%test(10000, -40);

%test(15000, -40);

%test(20000, -40);

%test(25000, -40);

%test(30000, -40);


 Posted by at 9:09 AM
Jan 26, 2015
 

splitdataBack in the day when the prison system forced inmates to perform "hard labor", folks would say (of someone in prison): "He's busy making little ones out of big ones." This evokes the cliché image of inmates who are chained together, forced to swing a chisel to break large rocks into smaller rocks. (Yes, it seems like a pointless chore. Here's a Johnny Cash/Tony Orlando collaboration that sets it to music.)

SAS programmers are often asked to break large data sets into smaller ones. Conventional wisdom says that this is also a pointless chore, since you can usually achieve what you want (that is, process a certain subset of data) by applying a WHERE= option or FIRSTOBS=/OBS= combination. Splitting a data set creates more files, which occupy more disk space and forces more I/O operations. I/O and disk access is often the most expensive part of your SAS processing, performance-wise.

But if the boss asks for broken-up data sets, you might as well spend the least possible effort on the task. Let's suppose that you need to break up a single data set into many based on the value of one of the data columns. For example, if you need to break SASHELP.CARS into different tables based on the value of Origin, the SAS program would look like:

DATA out_Asia;
 set sashelp.cars(where=(origin='Asia'));
run;
DATA out_Europe;
 set sashelp.cars(where=(origin='Europe'));
run;
DATA out_USA;
 set sashelp.cars(where=(origin='USA'));
run;

I'm going to admit right now that this isn't the most efficient or elegant method, but it's something that most beginning SAS programmers could easily come up with.

Writing the above program is easy, especially since there are only 3 different values for Origin and I've memorized their values. But if there are more discrete values for the "split-by" column, the task could involve much more typing and has a high possibility for error. This is when I usually use PROC SQL to generate the code for me.

If you've read my article about implementing BY processing for an entire SAS program, you know that you can use PROC SQL and SELECT INTO to place data values from a data set into a macro variable. For example, consider this simple program:

proc sql;
 select distinct ORIGIN into :valList separated by ',' from SASHELP.CARS;
quit;

It creates a macro variable VALLIST that contains the comma-separated list: "Asia,Europe,USA".

But we can use SAS functions to embellish that output, and create additional code statements that weave the data values into SAS program logic. For example, we can use the CAT function to combine the values that we query from the data set with SAS keywords. The results are complete program statements, which can then be referenced/executed in a SAS macro program. I'll share my final program, and then I'll break it down a little bit for you. Here it is:

/* define which libname.member table, and by which column */
%let TABLE=sashelp.cars;
%let COLUMN=origin;
 
proc sql noprint;
/* build a mini program for each value */
/* create a table with valid chars from data value */
select distinct 
   cat("DATA out_",compress(&COLUMN.,,'kad'),
   "; set &TABLE.(where=(&COLUMN.='", &COLUMN.,
   "')); run;") into :allsteps separated by ';' 
  from &TABLE.;
quit;
 
/* macro that includes the program we just generated */
%macro runSteps;
 &allsteps.;
%mend;
 
/* and...run the macro when ready */
%runSteps;

Here are the highlights from the PROC SQL portion of the program:

  • SELECT DISTINCT ensures that the results include just one record for each unique value of the variable.
  • The CAT function concatenates a set of string values together. Note that CATX and CATS and CATT -- other variations of this function -- will trim out white space from the various string elements. In this case I want to keep any blank characters that occur in the data values because we're using those values in an equality check.
  • The program calculates a name for each output data set by using each data value as a suffix ("OUT_dataValue"). SAS data set names can contain only numbers and letters, so I use the COMPRESS function to purge any invalid characters from the data set name. The 'kad' options on COMPRESS tell it to keep only alpha and digit characters.
  • The resulting program statements all end up in the &ALLSTEPS macro variable. I could just reference the &ALLSTEPS variable in the body of the SAS program, and SAS would run it as-is. Instead I chose to wrap it in the macro %runSteps. This makes it a little bit easier to control the scope and placement of the executable SAS program statements.
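A handy way to check the generated code before running it is to write the macro variable to the log. For SASHELP.CARS, the log shows something along these lines (the extra semicolons from the separator are harmless):

```sas
%put &allsteps.;
/* DATA out_Asia; set sashelp.cars(where=(origin='Asia')); run;;
   DATA out_Europe; set sashelp.cars(where=(origin='Europe')); run;;
   DATA out_USA; set sashelp.cars(where=(origin='USA')); run; */
```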

"By each value of a variable" is just one criterion that you might use for splitting a data set. I've seen cases where people want to split the data based on other rules, such as:

  • Quantity of observations (split a 3-million-record table into 3 1-million-record tables)
  • Rank or percentiles (based on some measure, put the top 20% in its own data set)
  • Time span (break up a data set by year or month, assuming the data records contain a date or datetime variable)

With a small modification, my example program can be adapted to serve any of these purposes. What about you? Are you ever asked to split up SAS data sets, and if so, based on what criteria? Leave a comment and tell us about it.

tags: data management, macro programming, PROC SQL, SAS programming
Jun 12, 2013
 
You might have seen my previous blog, where I plotted some interesting tourist attractions on a geographical map of the British Isles (which may be of interest to Analytics 2013 Conference delegates)... Well this blog uses even more powerful analytics, and shows how SAS can calculate the "optimal tour" to visit all [...]
Jan 26, 2013
 
This week's SAS tip is from Kirk Lafler and his book PROC SQL: Beyond the Basics Using SAS. A SAS user since 1979, Kirk often takes his expertise on the road and is a frequent speaker at SAS conferences. He's also currently working on a second edition of the book. If [...]
Aug 18, 2012
 
This week's SAS tip is from master user Phil Holland and his popular book Saving Time and Money Using SAS. If you're not yet familiar with Phil and his extensive work in the user community, start out by visiting his author page. If  you're on LinkedIn, look for him on SAS Professional Forum [...]