January 14, 2019
 

When you overlay two series in PROC SGPLOT, you can either plot both series on the same axis or you can assign one series to the main axis (Y) and another to a secondary axis (Y2). If you use the Y and Y2 axes, they are scaled independently by default, which is usually what you want. However, if the measurements for the two series are linearly related to each other, then you might want to specify the tick values for the Y2 axis so that they align with the corresponding tick marks for the Y axis. This article shows how to align the Y and Y2 axes in PROC SGPLOT in SAS for two common situations.

Different scales for one set of measurements

The simplest situation is a single set of data that you want to display in two different units. For example, you might use one axis to display the data in imperial units (pounds, gallons, degrees Fahrenheit, etc.) and the other axis to display the data in metric units (kilograms, liters, degrees Celsius, etc.).

To plot the data, define one variable for each unit. For example, the Sashelp.Class data records the weight for 19 students in pounds. The following DATA view creates a new variable that records the same data in kilograms. The subsequent call to PROC SGPLOT plots the pounds on the Y axis (left axis) and the kilograms on the Y2 axis (right axis). However, as you will see, there is a problem with the default scaling of the two axes:

data PoundsKilos / view=PoundsKilos;
   set Sashelp.Class(rename=(Weight=Pounds));
   Kilograms = 0.453592 * Pounds;            /* convert pounds to kilos */
run;
 
title "Independent Axes";
title2 "Markers Do Not Align Correctly!";   /* the tick marks on each axis are independent */
proc sgplot data=PoundsKilos;
   scatter x=Height y=Pounds;
   scatter x=Height y=Kilograms / Y2Axis;
run;

The markers for the kilogram measurements should exactly overlap the markers for pounds, but they don't. The Y and Y2 axes are independently scaled because PROC SGPLOT does not know that pounds and kilograms are linearly related. The SGPLOT procedure displays each variable by using a range of round numbers (multiples of 10 or 20). The range for the Y2 axis is [20, 70] kilograms, which corresponds to a range of [44.1, 154.3] pounds. However, the range for the Y axis is approximately [50, 150] pounds. Because the axes display different ranges, the markers do not overlap.

To improve this graph, use the VALUES= and VALUESDISPLAY= options on the YAXIS statement (or Y2AXIS statement) to force the tick marks on one axis to align with the corresponding tick marks on the other axis. In the following DATA step, I use the kilogram scale as the standard and compute the corresponding pounds.

data Ticks;
do Kilograms = 20 to 70 by 10;     /* for each Y2 tick */
   Pounds = Kilograms / 0.453592;  /* convert kilos to pounds */
   Approx = round(Pounds, 0.1);    /* use rounded values to display tick values */
   output;
end;
run;
proc print; run;

You can use the Pounds column in the table to set the VALUES= list on the YAXIS statement. You can use the Approx column to set the VALUESDISPLAY= list, as follows:

/* align tick marks on each axis */
title "Both Axes Use the Same Scale";
proc sgplot data=PoundsKilos noautolegend;
   scatter x=Height y=Pounds;
   /* Make sure the plots overlay exactly! Then you can set SIZE=0 */
   scatter x=Height y=Kilograms / markerattrs=(size=0) Y2Axis;
   yaxis grid values=(44.092 66.139 88.185 110.231 132.277 154.324)
       valuesdisplay=('44.1' '66.1' '88.2' '110.2' '132.3' '154.3');
run;

Success! The markers for the two variables align exactly. After verifying that they align, you can use the MARKERATTRS=(SIZE=0) option to suppress the display of one of the markers.

Notice that the Y axis (pounds) no longer displays "nice numbers" because I put the tick marks at the same vertical heights on both axes. A different way to solve the misalignment problem is to use the MIN=, MAX=, THRESHOLDMIN=, and THRESHOLDMAX= options on both axes. This will enable both axes to use "nice numbers" while still aligning the data. If you want to try this approach, here are the YAXIS and Y2AXIS statements:

   /* set the axes ranges to corresponding values */
   yaxis  grid thresholdmin=0 thresholdmax=0 min=44.1 max=154.3;
   y2axis grid thresholdmin=0 thresholdmax=0 min=20   max=70;

Different scales for different measurements

Another situation that requires two Y axes is the case of two series that use different units. For example, you might want to plot the revenue for a US company (in dollars) and the revenue for a Japanese company (in yen) for a certain time period. You can use the conversion rate between yen and dollars to align the values on the axes. Of course, the conversion rate from Japanese yen to US dollars changes each day, but you can use an average conversion rate to set the correspondence between the axes.
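
For example, here is a sketch of computing tick values for a yen (Y2) axis from round dollar ticks, assuming a hypothetical average rate of 110 yen per dollar:

data YenTicks;
   do Dollars = 0 to 500 by 100;   /* round tick values for the Y (dollar) axis */
      Yen = 110 * Dollars;         /* 110 = assumed average yen-per-dollar rate */
      output;
   end;
run;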

This situation also occurs when two devices use different methods to measure the same quantity. The following example shows measurements for a patient who receives a certain treatment. The quantity of a substance in the patient's blood is measured at baseline and for every hour thereafter. The quantity is measured in two ways: by using a traditional blood test and by using a new noninvasive device that measures electrical impedance. The following statements define and plot the data. The two axes are scaled by using the default method:

data BloodTest1;
label t="Hours after Medication"  x="micrograms per deciliter"  y="kiloOhms";
input x y @@;
t = _N_ - 1;
datalines;
169.0 45.5 130.8 33.4 109.0 23.8 94.1 19.8 86.3 20.4 78.4 18.7 
 76.1 16.1  72.2 16.7  70.0 11.9 69.8 14.6 69.5 10.6 68.7 12.7 67.3 16.9 
;
 
title "Overlay Measurements for Two Medical Devices";
title2 "Default Scaling";
proc sgplot data=BloodTest1;
   series x=t y=x / markers legendlabel="Standard Lab Value";
   series x=t y=y / markers Y2Axis legendlabel="New Device";
   xaxis values=(0 to 12 by 2);
   yaxis grid label="micrograms per deciliter";
   y2axis grid label="kiloOhms";
run;

In this graph, the Y axes are scaled independently. However, the company that manufactures the device used Deming regression to establish that the measurements from the two devices are linearly related by the equation Y = –10.56415 + 0.354463*X, where X is the measurement from the blood test. You can use this linear equation to set the scales for the two axes.

The following DATA step uses the Deming regression estimates to convert the tick marks on the Y axis into values for the Y2 axis. The call to PROC SGPLOT creates a graph in which the Y2 axis is aligned with the Y axis according to the Deming regression estimates.

data Ticks;
do Y1 = 60 to 160 by 20;
   /* use Deming regression to find one set of ticks in terms of the other */
   Y2 =  -10.56415 + 0.354463 * Y1;  /* kiloOhms as a function of micrograms/dL */
   Approx = round(Y2, 0.1);
   output;
end;
run;
 
proc print; run;
 
title "Align Y Axes for Different Series";
title2 "Measurements are Linearly Related";
proc sgplot data=BloodTest1;
   series x=t y=x / markers legendlabel="Standard Lab Value";
   series x=t y=y / markers Y2Axis legendlabel="New Device";
   xaxis values=(0 to 12 by 2);
   yaxis grid label="micrograms per deciliter" offsetmax=0.1
      values=(60 to 160 by 20);
   /* the same offsets must be used in both YAXIS and Y2AXIS stmts */
   y2axis grid label="kiloOhms" offsetmax=0.1
      values=(10.7036 17.7929 24.8822 31.9714 39.0607 46.1499)
      valuesdisplay=('10.7' '17.8' '24.9' '32.0' '39.1' '46.1'); 
run;

In this new graph, the measurements are displayed on compatible scales and the reference lines connect round numbers on one axis to the corresponding values on the other axis.

The post How to align the Y and Y2 axes in PROC SGPLOT appeared first on The DO Loop.

January 14, 2019
 

New to SAS?  Here are tips from the translator of The Little SAS Book, Fifth Edition.

Hongqiu Gu, Ph.D. works at the China National Clinical Research Center for Neurological Diseases at the National Center for Healthcare Quality Management in Neurological Diseases at Beijing Tiantan Hospital, Capital Medical University.

He shared these important tips to learn SAS well:

1.  Read SAS Reference Books

I have not counted the number of SAS books I have read; I would estimate over 50 or 60.  The books that gave me the deepest understanding of SAS are the SAS reference books, including SAS Language Reference: Concepts, SAS Functions and CALL Routines: Reference, SAS Macro Language: Reference, and so on.  There are lots of excellent books published by SAS Press, and usually they are concise and suitable for quick learners.  However, when I realized that SAS could give me a powerful career advantage, I needed to learn SAS systematically and deeply.  I believe the SAS reference books are the most authoritative and comprehensive learning materials.  Besides, all the updated SAS reference books are freely available to all readers.

2.  Use the SAS Help and Documentation frequently

No one can remember all the syntax and options in SAS.  However, don’t worry: SAS Help and Documentation is our best friend.  I use the SAS Help and Documentation quite often.  Even as an experienced SAS user, I still run into many situations in which I need to ask SAS Help and Documentation for help.  Every time I use it, I learn something new.

3.  Solve SAS related questions in SAS communities

As the saying goes, practice makes perfect.  Answering SAS-related questions is a good way to practice.  Questions can come from daily work, from friends around you, or from other SAS users on the web.  From 2013 to 2015, I spent a lot of time in the largest Chinese SAS online community answering SAS-related questions, and I learned many practical skills in a short period.

4.  Make friends with skilled SAS programmers

Learning alone without interacting with others will lead to ignorance.  I have learned a lot from other experienced SAS users and SAS developers.  We share our ideas from time to time, and benefit a lot from the exchange.

 

 

January 14, 2019
 

Recently The Little SAS Book reached a major milestone.  For the first time ever, it was translated into another language.  The language in this case was Chinese, and the translator was Hongqiu Gu, Ph.D. from the China National Clinical Research Center for Neurological Diseases at the National Center for Healthcare Quality Management in Neurological Diseases at Beijing Tiantan Hospital, Capital Medical University.

To mark this achievement, I asked Hongqiu a few questions.

Susan:  First I want to say how honored I am that you translated our book.  It must have been a lot of work.  Receiving a copy of the translation was a highlight of the year for me.  How did you learn SAS?

Hongqiu:  How did I learn SAS?  That is a long story.  I had not heard of SAS before I took an undergraduate statistics course in 2005.  The first time I heard the name “SAS,” I mistook it for SARS (Severe Acute Respiratory Syndrome).  Although the pronunciations of these two words are entirely different for native English speakers, most Chinese people pronounce both as /sa:s/.  At that time, I was not trying to learn SAS well, and I simply wanted to pass the exam.  After the exam, all I had learned about SAS was entirely forgotten.  However, during the preparation of my master’s thesis, I had to do a lot of data cleaning and data analysis work with SAS, and I began to learn SAS enthusiastically.

Susan:  Why did you decide to translate The Little SAS Book?

Hongqiu:  Although I highly recommend the SAS reference books for learning SAS, most beginners need a concise SAS book to give them a quick overview of what SAS is and what SAS can do.  There is no doubt that The Little SAS Book is the best first SAS book for SAS beginners.  However, it was not easy for a Chinese SAS beginner to get a hardcopy of The Little SAS Book because it was not available in the Chinese market, and the price was too high if they shopped overseas.  Another barrier is the language.  Most beginners still want an elementary book in their mother language.  Besides, lots of R books had been introduced and translated into Chinese.  Therefore, I believed there was an urgent need to translate this book into Chinese.  So I tried several times to contact SAS Press to get permission to translate it into Chinese, but received no reply.  Things changed when manager Frank Jiang from SAS China found me after my book, The Romance of SAS Programming, was published by Tsinghua University Press.

Susan:  How long did it take you to translate the book?

Hongqiu:  First, I must state that the Chinese version of The Little SAS Book is a collaborative work.  Manager Frank Jiang from SAS China together with managing editor Yang Liu from Tsinghua University Press did much early-stage work to start this project.  We began the translation in early April 2017 and finished the translation in July 2017.  After that, we took more than three months to complete the two rounds of cross-audit to make sure the translation was correct and typo errors were minimized.

Members of the translation team include Hongqiu Gu, Adrian Liu, Louanna Kong, Molly Li, Slash Xin, Nick Li, Zhixin Yang, Amy Qian, Wei Wang, and Ke Yang.

Members of the audit team include Silence Zeng, Mary Ma, Wei Wang, Jianping Xue, and Sikan Luan.

Susan:  What was the hardest part of translating it?

Hongqiu:  The book is written in plain English and easy to understand.  We did not find any particular part that was hard to translate.

Susan:  Are there a lot of SAS users in China?

Hongqiu:  There are a lot of SAS users in China.  I have no idea what the exact number is.  With the increasing need for SAS skills in the medicine, life science, finance, and banking industries, SAS users will become more and more prevalent.

Susan:  Thank you for sharing your experiences.  Perhaps someday we can meet in person at SAS Global Forum.

January 10, 2019
 

Everyone’s excited about artificial intelligence. But most people, in most jobs, struggle to see how AI can be used in the day-to-day work they do. This post, and others to come, are all about practical AI. We’ll dial the coolness factor down a notch, but we’ll explore some real gains to be made with AI technology in solving business problems in different industries.

This post demonstrates a practical use of AI in banking. We’ll use machine learning, specifically neural networks, to enable on-demand portfolio valuation, stress testing, and risk metrics.

Background

I spend a lot of time talking with bankers about AI. It’s fun, but the conversation inevitably turns to concerns around leveraging AI models, which can have some transparency issues, in a highly-regulated and highly-scrutinized industry. It’s a valid concern. However, there are a lot of ways the technology can be used to help banks – even in regulated areas like risk – without disrupting production models and processes.

Banks often need to compute the value of their portfolios. This could be a trading portfolio or a loan portfolio. They compute the value of the portfolio based on the current market conditions, but also under stressed conditions or under a range of simulated market conditions. These valuations give an indication of the portfolio’s risk and can inform investment decisions. Bankers need to do these valuations quickly on-demand or in real-time so that they have this information at the time they need to make decisions.

However, this isn’t always a fast process. Banks have a lot of instruments (trades, loans) in their portfolios and the functions used to revalue the instruments under the various market conditions can be complex. To address this, many banks will approximate the true value with a simpler function that runs very quickly. This is often done with first- or second-order Taylor series approximation (also called quadratic approximation or delta-gamma approximation) or via interpolation in a matrix of pre-computed values. Approximation is a great idea, but first- and second-order approximations can be terrible substitutes for the true function, especially in stress conditions. Interpolation can suffer the same drawback in stress.
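
To make that baseline concrete, the delta-gamma approach replaces the true pricing function with a quadratic expansion around the current market state. For a single market factor S with base value S0, the sketch is:

   V(S) ≈ V(S0) + Δ·(S – S0) + ½·Γ·(S – S0)²

where Δ and Γ are the first and second derivatives of the value V with respect to S, evaluated at S0. The quadratic shape is what makes the fit degrade when S moves far from S0.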

An American put option is shown for simplicity. The put option value is non-linear with respect to the underlying asset price. Traditional approximation methods, including this common second-order approximation, can fail to fit well, particularly when we stress asset prices.

Improving approximation with machine learning

Machine learning is a technology commonly used in AI. Machine learning is what enables computers to find relationships and patterns among data. Technically, traditional first- and second-order approximation is a form of classical machine learning, akin to linear regression. But in this post we’ll leverage more modern machine learning, like neural networks, to get a better fit with ease.

Neural networks can fit functions with remarkable accuracy. You can read about the universal approximation theorem for more about this. We won’t get into why this is true or how neural networks work, but the motivation for this exercise is to use a neural network's excellent fitting ability to improve our approximation.

Each instrument type in the portfolio will get its own neural network. For example, in a trading portfolio, our American options will have their own network and interest rate swaps, their own network.

The fitted neural networks have a small computational footprint so they’ll run very quickly, much faster than computing the true value of the instruments. Also, we should see accuracy comparable to having run the actual valuation methods.

The data, and lots of it

Neural networks require a lot of data to train the models well. The good thing is we have a lot of data in this case, and we can generate any data we need. We’ll train the network with values of the instruments for many different combinations of the market factors. For example, if we just look at the American put option, we’ll need values of that put option for various levels of moneyness, volatility, interest rate, and time to maturity.

Most banks already have their own pricing libraries to generate this data and they may already have much of it generated from risk simulations. If you don’t have a pricing library, you may work through this example using the Quantlib open source pricing library. That’s what I’ve done here.

Now, start small so you don’t waste time generating tons of data up front. Use relatively sparse data points on each of the market factors but be sure to cover the full range of values so that the model holds up under stress testing. If the model was only trained with interest rates of 3 to 5 percent, it’s not going to do well if you stress interest rates to 10 percent. Value the instruments under each combination of values.
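
As a sketch, the combinations of market factors can be generated with nested DO loops in a DATA step. The factor names and ranges below are hypothetical; the true option price for each row would come from your pricing library, such as QuantLib:

data OptionGrid;
   do moneyness = 0.5 to 1.5 by 0.1;           /* underlying price / strike price */
      do volatility = 0.10 to 0.60 by 0.05;    /* annualized volatility           */
         do rate = 0.00 to 0.10 by 0.01;       /* risk-free interest rate         */
            do maturity = 0.05 to 2 by 0.05;   /* time to maturity in years       */
               output;    /* each row is then priced with the pricing library     */
            end;
         end;
      end;
   end;
run;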

Here is my input table for an American put option. It’s about 800k rows. I’ve normalized my strike price, so I can use the same model on options of varying strike prices. I’ve added moneyness in addition to underlying.

This is the input table to the model. It contains the true option prices as well as the pricing inputs. I used around 800K observations to get coverage across a wide range of values for the various pricing inputs. I did this so that my model will hold up well to stress testing.

The model

I use SAS Visual Data Mining and Machine Learning to fit the neural network to my pricing data. I can use either the visual interface or a programmatic interface. I’ll use SAS Studio and its programmatic interface to fit the model. The pre-defined neural network task in SAS Studio is a great place to start.

Before running the model, I do standardize my inputs further. Neural networks do best if you’ve adjusted the inputs to a similar range. I enable hyper-parameter auto-tuning so that SAS will select the best model parameters for me. I ask SAS to output the SAS code to run the fitted model so that I can later test and use the model.

The SAS Studio Neural Network task provides a wizard to specify the data and model hyper parameters. The task wizard generates the SAS code on the right. I’ve allowed auto-tuning so that SAS will find the best model configuration for me.
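
The generated program is similar in spirit to the following sketch. This is only an illustration: the CAS library, data set, and variable names are hypothetical, and the exact PROC NNET options available depend on your SAS Visual Data Mining and Machine Learning release.

proc nnet data=mycas.optiongrid;
   input moneyness volatility rate maturity / level=interval;  /* standardized pricing inputs */
   target price / level=interval;                              /* true option price           */
   autotune;                                                   /* hyperparameter auto-tuning  */
   train outmodel=mycas.option_nnet;                           /* store the fitted network    */
   code file='/path/to/nnet_score.sas';                        /* DATA step score code        */
run;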

I train the model. It only takes a few seconds. I try the model on some new test data and it looks really good. The picture below compares the neural network approximation with the true value.

The neural network (the solid red line) fits very well to the actual option prices (solid blue line). This holds up even when asset prices are far from their base values. The base value for the underlying asset price is 1.

If your model’s done well at this point, then you can stop. If it’s not doing well, you may need to try a deeper model, a different model, or more data. SAS offers model interpretability tools, like partial dependence, to help you gauge how the model fits for different variables.

Deploying the model

If you like the way this model is approximating your trade or other financial instrument values, you can deploy the model so that it can be used to run on-demand stress tests or to speed up intra-day risk estimations. There are many ways to do this in SAS. The neural network can be published to run in SAS, in-database, in Hadoop, or in-stream with a single click. I can also access my model via REST API, which gives me lots of deployment options. What I’ll do, though, is use these models in SAS High-Performance Risk (HPRisk) so that I can leverage the risk environment for stress testing and simulation and use its nice GUI.

HPRisk lets you specify any function, or method, to value an instrument. Given the mapping of the functions to the instruments, it coordinates a massively parallel run of the portfolio valuation for stress testing or simulation.

Remember the SAS score code file we generated when we trained the neural network? I can throw that code into HPRisk’s method, and now HPRisk will run the neural network I just trained.

I can specify a scenario through the HPRisk UI and instantly get the results of my approximation.
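
The same generated score code can also be applied outside of HPRisk in an ordinary DATA step. Here is a minimal sketch; the file path and the input data set of scenario values are hypothetical:

data scored_scenarios;
   set stressed_market_data;              /* scenario-level pricing inputs for each instrument */
   %include '/path/to/nnet_score.sas';    /* runs the fitted neural network on each row        */
run;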

Considerations

I introduced this as a practical example of AI, specifically machine learning in banking, so let’s make sure we keep it practical, by considering the following:
 
    • Only approximate instruments that need it. For example, if it's a European option, don’t approximate. The function to calculate its true price, the Black-Scholes equation, already runs really fast. The whole point is that you’re trying to speed up the estimation.
 
    • Keep in mind that this is still an approximation, so only use this when you’re willing to accept some inaccuracy.
 
    • In practice, you could be training hundreds of networks depending on the types of instruments you have. You’ll want to optimize the training time of the networks by training multiple networks at once. You can do this with SAS.
 
    • The good news is that if you train the networks on a wide range of data, you probably won’t have to retrain often. They should be pretty resilient. This is a nice perk of the neural networks over the second-order approximation whereby parameters need to be recomputed often.
 
    • I’ve chosen neural networks for this example but be open to other algorithms. Note that different instruments may benefit from different algorithms. Gradient boosting and others may offer simpler, more intuitive models that get similar accuracy.

When it comes to AI in business, you’re most likely to succeed when you have a well-defined problem, like our stress testing that takes too long or isn’t accurate. You also need good data to work with. This example had both, which made it a good candidate to demonstrate practical AI.

More resources

Interested in other machine learning algorithms or AI technologies in general? Here are a few resources to keep learning.

Article: A guide to machine learning algorithms and their applications
Blog post: Which machine learning algorithm should I use?
Video: Supervised vs. Unsupervised Learning
Article: Five AI technologies that you need to know

Practical AI in banking was published on SAS Users.

January 10, 2019
 

Does this situation sound familiar? You have a complex analysis that must be finished urgently. The data was delivered late and its quality and structure are far from the expected standard. The time pressure to present the results is huge, and your SAS program is not giving you the expected [...]

The post Favorite SAS Press books from a SAS Press author appeared first on SAS Learning Post.

January 10, 2019
 


My New Year's resolution: “Unclutter your life.” I hope this post will help you do the same.

Here I share with you a data preparation approach and SAS coding technique that will significantly simplify, unclutter and streamline your SAS programming life by using data templates.

Dictionary.com defines template as “anything that determines or serves as a pattern; a model.” However, I was flabbergasted when my “prior art research” for the topic of this blog post ended rather abruptly: “No results found for data template.”

What do you mean “no results?!” (Yes, sometimes I talk to the Internet. Do you?) We have templates for everything in the world: MS Word templates, C++ templates, Photoshop templates, website templates, holiday templates, we even have our own PROC TEMPLATE. But no templates for data?

At that point I paused, struggling to accept reality, but then felt compelled to come up with my own definition:

A data template is a well-defined data structure containing a data descriptor but no data.

Therefore, a SAS data template is a SAS dataset (data table) containing the descriptor portion with all necessary attributes defined (variable types, labels, lengths, formats, and informats) and an empty (zero-observation) data portion.

Less clutter = greater efficiency

When you construct SAS data tables using SAS code or data management tools, a data template lets you define the table structure once and reuse it everywhere that structure is needed, for example by using data design documentation as a feed to a code-generating SAS program. Defining variable attributes in a single place keeps your programs uncluttered and keeps table definitions consistent.

Unfortunately, despite all these benefits the data template concept is not explicitly and consistently employed and is noticeably absent from data development methodologies and practices.

Let’s try to change that!

How to create SAS data templates from scratch

It is very easy to create a SAS data template. Here is an example:

 
libname PARMSDL 'c:\projects\datatemplates';
data PARMSDL.MYTEMPLATE;
   label
      newvar1 = 'Label for new variable 1'
      newvar2 = 'Label for new variable 2'
      /* ... */
      newvarN = 'Label for new variable N'
      ;
  length
      newvar1 newvar2 $40
      newvarN 8
      ;
   format newvarN mmddyy10.;
   informat newvarN date9.;
   stop;
run;

First, you need to assign a permanent library (e.g. PARMSDL) where you are going to store your SAS dataset template. I usually do not store data templates in the same library as data. Nor do I store it in the same directory/folder where you store your SAS code. Ordinarily, I store data templates in a so-called parameter data library (that is why I use PARMSDL as a libref), along with other data defining SAS code structure.

In the data step, the very first statement, LABEL, defines each variable’s label as well as its position, which is determined by the order in which the variables are listed.

Statement LENGTH defines variables’ types (numeric or character) and their length in bytes. Here you may group variables of the same length to shorten your code or define them individually to be more explicit.

Statement FORMAT defines variables’ formats as needed. You don’t have to define formats for all the variables; define them only if necessary.

Statement INFORMAT (also optional) defines informats that come in handy if you use this data template for creating SAS datasets by reading external raw files. With informats defined on the data template, you won’t have to specify informats in your INPUT statement while reading the external file, as the informats will be inherently associated with the variable names. That is why SAS data sets have an informat attribute for their variables in the first place (if you ever wondered why).

Finally, don’t forget the STOP statement at the end of your data step, just before the RUN statement. Otherwise, instead of zero observations, you will end up with a data table that has a single observation with all missing variable values. Not what we want.

It is worth noting that the OBS=0 system option will not work in place of the STOP statement, because it applies only to data being read, and we read no data here. For the same reason, the (obs=0) data set option will not work either. Try it, and the SAS log will dispel your doubts:

 
data PARMSDL.MYTEMPLATE (obs=0);
                        ---
                        70
WARNING 70-63: The option OBS is not valid in this context.  Option ignored.

How to create SAS data templates by inheritance

If you already have some data table with well-defined variable attributes, you may easily create a data template out of that data table by inheriting its descriptor portion:

 
data PARMSDL.MYTEMPLATE;
   set SASDL.MYDATA (obs=0);
run;

Option (obs=0) does work here as it is applied to the dataset being read, and therefore the STOP statement is not necessary.

You can also combine inheritance with defining new variables, as in the following example:

 
data MYTEMPLATE;
   set SASDL.MYDATA (obs=0); *<-- inherited template;
   * variables definition: ;
   label
      newvar1 = 'Label for new variable 1'
      newvarN = 'Label for new variable N'
      oldvar =  'New Label for OLD variable'
      ;
   length
      newvar1 $40
      newvarN 8
      oldvar  $100 /* careful here, see notes below */
      ;
   format newvarN mmddyy10.;
   informat newvarN date9.;
run;

A word of warning

Be careful when a new variable definition's type or length contradicts the inherited definition.
You can overwrite/re-define inherited variable attributes such as labels, formats, and informats with no problem, but you cannot overwrite the type and, in some cases, the length. If you do need to have a different variable type for a specific variable name on your data template, you should first drop that variable on the SET statement and then re-define it in the data step.

With the length attribute the picture is a bit different. If you try defining a different length for some variable, SAS will produce the following WARNING in the LOG:

WARNING: Length of character variable  has already been set.
Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.

You can either follow the advice in the WARNING message and place the LENGTH statement as the very first statement, or at least before the SET statement. In this case, you will find that you can increase the length without a problem, but if you try to reduce the length relative to the one on the parent dataset, SAS will produce the following WARNING in the LOG:

WARNING: Multiple lengths were specified for the variable  by input data set(s). This can cause truncation of data.

In this case, a cleaner way will also be to drop that variable on the SET statement and redefine it with the LENGTH statement in the data step.

Keep in mind that when you drop these variables from the parent data set, besides losing their type and length attributes, you will obviously lose the rest of the attributes too. Therefore, you will need to re-define all the attributes (type, length, label, format, and informat) for the variables you drop. At least, this technique will allow you to selectively inherit some variables from a parent data set and explicitly define others.
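
Here is a minimal sketch of that selective-inheritance technique; the variable OLDVAR and its new attributes are hypothetical:

data PARMSDL.MYTEMPLATE;
   set SASDL.MYDATA (obs=0 drop=oldvar);   /* inherit all attributes except OLDVAR */
   length oldvar $10;                      /* re-define OLDVAR with the new length */
   label  oldvar = 'New label for OLDVAR';
   format oldvar $10.;
run;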

How to use SAS data templates

One way to apply your data template to a newly created dataset is to: 1) Copy your data template in that new dataset; 2) Append your data table to that new data set. Here is an example:

 
/* copying data template into dataset */
data SASDL.MYNEWDATA;
   set PARMSDL.MYTEMPLATE;
run;
 
/* append data to your dataset with descriptor */
proc append base=SASDL.MYNEWDATA data=WORK.MYDATA;
run;

Your variable types and lengths should be the same on the BASE= and DATA= tables; labels, formats and informats will be carried over from the BASE= dataset/template.

It is simple, but could be simplified even more to reduce your code to just a single data step:

 data SASDL.MYNEWDATA;
   if 0 then set PARMSDL.MYTEMPLATE;
   set WORK.MYDATA;
   /* other statements */
run;

Even though the set PARMSDL.MYTEMPLATE; statement never executes because of the explicitly false condition (0 means FALSE) in the IF statement, the resulting dataset SASDL.MYNEWDATA gets all its variable attributes carried over from the PARMSDL.MYTEMPLATE data template during data step compilation.

This same coding technique can be used to implicitly apply variable attributes from a well-defined data set by inheritance, even though that data set is not technically a data template (it has more than 0 observations). Run the following code to confirm that the MYDATA table has all the variables and attributes of the SASHELP.CARS data table while its data values come from the ABC data set:

 
data ABC;
  make='Toyota';
run;
 
data MYDATA;
   if 0 then set SASHELP.CARS;
   set ABC;
run;

Perhaps the benefits of SAS data templates are best demonstrated when you read external data into a SAS data table. Here is an example for you to run (of course, in real life MYTEMPLATE should be a permanent data set and instead of datalines it should be an external file):

 
data MYTEMPLATE;
   label
      fdate = 'Flight Date'
      count = 'Flight Count'
      fdesc = 'Flight Description'
      reven = 'Revenue, $';
   length fdate count reven 8 fdesc $22;
   format fdate date9. count comma12. reven dollar12.2;
   informat fdate mmddyy10. count comma8. fdesc $22. reven comma10.;
   stop;
run;
 
data FLIGHTS;
   if 0 then set MYTEMPLATE;
   input fdate count fdesc & reven;
   datalines;
12/05/2018 500   Flight from DCA to BOS  120,034
10/01/2018 1,200 Flight from BOS to DCA  90,534
09/15/2018 2,234 Flight from DCA to MCO  1,350
;

Here is how the output data set looks:

Notice how simple the last data step is. No labels, no lengths, no formats, no informats – no clutter. Yet, the raw data is read in nicely, with proper informats applied, and the resulting data set has all the proper labels and variable formatting. And when you repeat this process for another sample of similar data you can still use the same data template, and your read-in data step stays the same – simple and concise.

Your thoughts

Do you find SAS data templates useful? Do you use them in any shape or form in your SAS data development projects? Please share your thoughts.

Simplify data preparation using SAS data templates was published on SAS Users.

January 9, 2019
 

Numbers don't lie, but sometimes they don't reveal the full story. Last week I wrote about the most popular articles from The DO Loop in 2018. The popular articles are inevitably about elementary topics in SAS programming or statistics because those topics have broad appeal. However, I also write about advanced topics, which are less popular but fill an important niche in the SAS community. Not everyone needs to know how to fit a Pareto distribution in SAS or how to compute distance-based measures of correlation in SAS. Nevertheless, these topics are interesting to think about.

I believe that learning should not stop when we leave school. If you, too, are a lifelong learner, the following topics deserve a second look. I've included articles from four different categories.

Data Visualization

  • Fringe plot: When fitting a logistic model, you can plot the predicted probabilities versus a continuous covariate or versus the empirical probability. You can use a fringe plot to overlay the data on the plot of predicted probabilities. The SAS developer of PROC LOGISTIC liked this article a lot, so look for fringe plots in a future release of SAS/STAT software!
  • Order variables in a correlation matrix or scatter plot matrix: When displaying a graph that shows many variables (such as a scatter plot matrix), you can make the graph more understandable by ordering the variables so that similar variables are adjacent to each other. The article uses single-link clustering to order the variables, as suggested by Hurley (2004).
  • A stacked band plot, created in SAS by using PROC SGPLOT
  • Stacked band plot: You can use PROC SGPLOT to automatically create a stacked bar plot. However, when the bars represent an ordered categorical variable (such as months or years), you might want to create a stacked band plot instead. This article shows how to create a stacked band plot in SAS.

Statistics and Data Analysis

Random numbers and resampling methods

Process flow diagram shows how to resample data to create a bootstrap distribution.

Optimization

These articles are technical but provide tips and techniques that you might find useful. Choose a few topics that are unfamiliar and teach yourself something new in this New Year!

Do you have a favorite article from 2018 that I did not include on the list? Share it in a comment!

The post 10 posts from 2018 that deserve a second look appeared first on The DO Loop.

January 8, 2019
 

I love my job, but I am not a morning person so I need a bit of inspiration to get out of bed. I’ve been a marketer for more than 15 years, and that inspiration has never been easier to find than since I joined SAS as an industry marketer [...]

Health care convergence gets me up in the morning. Yes, really. was published on SAS Voices by Cameron McLauchlin