sas programming

4月 202022
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. This post builds upon CAS-Action! Create Columns in CAS Tables - Part 1 by showing how to add formats and modify the length of computed columns.

I'll start by building off the following code where I create three new columns in the PRODUCTS table:

proc cas;
    source createColumns;
        Total_Price = Price * Quantity;
        Product_fix = upcase(Product);
        if Return = "" then Return_fix = "No"; 
           else Return_fix = "Yes";
    endsource;
 
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVarsProgram = createColumns
    }; 
 
    table.fetch / table = productsTbl;
quit;

And the results:

Notice in the results the Total_Price column does not have a format, and the Return_fix column truncates the value Yes. Let's fix these issues.

Add Formats to a Computed Column

First, I'll add a format to the Total_Price computed column. To add formats to a computed column use the computedVars sub-parameter. ComputedVars enables you to specify column metadata (labels and formats) for computed columns. The parameter expects a list of dictionaries, even if one column is specified.

For example, in the computedVars sub-parameter I'll add list with a single dictionary that specifies the name of a column Total_Price and apply the SAS dollar format. Next, I'll execute the fetch action to preview the table.

proc cas;
    source createColumns;
        Total_Price = Price * Quantity;
        Product_fix = upcase(Product);
        if Return = "" then Return_fix = "No"; 
           else Return_fix = "Yes";
    endsource;
 
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVars = {
                            {name = 'Total_Price', format = 'dollar16.2'}
                   },
                   computedVarsProgram = createColumns
    }; 
 
    table.fetch / table = productsTbl;
quit;

And the results:

Notice in the results above the format was applied to the Total_Price column. However, now the additional computed columns are not displayed in the output. To fix this issue, add all of the computed columns to the computedVars sub-parameter as a dictionary with the name key, even if you do not modify the column metadata.

In the code below, the Product_fix and Return_fix columns were added to the list in the computedVars sub-parameter.

proc cas;
    source createColumns;
        Total_Price = Price * Quantity;
        Product_fix = upcase(Product);
        if Return = "" then Return_fix = "No"; 
           else Return_fix = "Yes";
    endsource;
 
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVars = {
                            {name = 'Total_Price', format = 'dollar16.2'},
                            {name = 'Product_fix'},
                            {name = 'Return_fix'}
                   },
                   computedVarsProgram = createColumns
    }; 
 
    table.fetch / table = productsTbl;
quit;

And the results:

Notice in the results that adding each computed column in the computedVars sub-parameter fixed the issue and the computed columns are shown.

However, we still have one more issue. Look at the truncated values in the Return_fix column. How can we fix that?

Modify Computed Column Lengths

When using an IF/THEN/ELSE statement to create a new column in SAS, it sets the column width by using the length of the first assignment statement. In this example, the first assignment statement sets the value of the Return_fix column to No, so the column length is set to 2.

To modify the length of the computed column add a SAS LENGTH statement in the computedVarsProgram parameter prior to the first assignment statement. Here I'll set the Return_fix column to length of 3.

proc cas;
    source createColumns;
        Total_Price = Price * Quantity;
        Product_fix = upcase(Product);
 
        length Return_fix varchar(3);
        if Return = "" then Return_fix = "No"; 
           else Return_fix = "Yes";
    endsource;
 
    productsTbl = {name = 'products', 
                   caslib = 'casuser',
                   computedVars = {
                            {name = 'Total_Price', format = 'dollar16.2'},
                            {name = 'Product_fix'},
                            {name = 'Return_fix'}
                   },
                   computedVarsProgram = createColumns
    }; 
 
    table.fetch / table = productsTbl;
quit;

And the results:

In the results the truncation issues in the Return_fix column are corrected.

Summary

In summary, using the computedVars sub parameter allows you to add formats to the computed columns, and the computedVarsProgram parameter enables you to set the column lengths. A few key points:

  • When using the computedVars sub-parameter, all computed columns from the computedVarsProgram sub-parameter must be added. Otherwise, the computed columns are not returned.
  • The computedVars sub-parameter takes a list of dictionaries as it's value.

Additional resources

fetch Action
SAS® Cloud Analytic Services: Fundamentals
Code
Create Data Code

CAS-Action! Create Columns in CAS Tables - Part 2 was published on SAS Users.

3月 302022
 

In an article about how to visualize missing data in a heat map, I noted that the SAS SG procedures (such as PROC SGPLOT) use the GraphMissing style element to color a bar or tile that represents a missing value. In the HTMLBlue ODS style, the color for missing values is gray. This article shows how to override the GraphMissing color by using a range attribute map in SAS. The appendix of this article includes links to articles that discuss range attribute maps in more detail.

A range attribute map is usually used to define a color model (also called a color ramp) and to associate each color with a value for a variable. However, a range attribute map also supports assigning the color of the missing category, as shown in this article.

Creating a range attribute map enables you to specify the missing-value color for any graph that uses the map. A future article shows how to override the GraphMissing color by modifying an ODS style. Modifying an ODS style enables you to change the missing-value color for all graphs.

Example Data

The following data and heat map are from a previous article. The data are for five patients in a clinical study. After the initial baseline measurement (Week=0), the patients were supposed to be measured weekly for 10 weeks. Only one patient kept all 10 appointments. The remaining patients missed at least two appointments. The following heat map (sometimes called a lasagna plot) shows the clinical measurement for each patient and for each week of the study.

data Clinical;
input patientID @;
do Week = 0 to 10;
   input Value @;
   output;
end;
/* ID Wk1  Wk2  Wk3  Wk4 ... Wk10*/
datalines;
1001  12.0 13.0 13.0   .   .   .  13.0 14.0 14.5 15.0 13.5 
1002  11.5 12.5   .  11.0  .   .    .    .    .   9.5  8.0 
1003  12.0   .    .  11.0  . 10.5 11.0   .    .  10.5  9.0 
1004  11.0 11.0 11.0   .  7.5 6.5   .   7.0  7.5  5.5  4.0 
1005  10.0 10.5 11.0  9.0 7.0 7.5  7.0  7.5  4.0  6.5  5.5 
;
 
%let WhiteYeOrRed = (CXFFFFFF CXFFFFB2 CXFECC5C CXFD8D3C CXE31A1C);
 
title "Missing Values Displayed in GraphMissing Color";
proc sgplot data=Clinical;
   heatmapparm x=Week y=PatientID colorresponse=Value / outline outlineattrs=(color=gray)
        colormodel=&WhiteYeOrRed; 
   gradlegend;
   refline (1000.5 to 1005.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(0 to 10) valueshint;
   legenditem type=fill name='missItem' / fillattrs=GraphMissing label="Missing Data";
   keylegend 'missItem';
run;

The value of the clinical measurement is indicated by using a white-yellow-orange-red color model. Missed appointments are displayed in gray, which is the color of the GraphMissing style element in the ODS style that I am using. Suppose you want to use a color other than gray. You can override the color used in the ODS style, which will affect all graphs, or you can create a range attribute map and use it for only this one graph. The next section shows how to define a range attribute map.

Define a range attribute map

The references in the appendix provide details, but the primary purpose of a range attribute map is to map a set of continuous values onto a spectrum of colors. In short, a range attribute map is a special SAS data set that enables you to define the colors in a custom color ramp and the values that the ramp represents.

The data set must contain variables named MIN and MAX, which you use to associate a range of values to colors. But there are special values that you can use in the MIN or MAX columns:

  1. MIN = _MIN_ specifies the smallest data value in a variable.
  2. MAX = _MAX_ specifies the largest data value in a variable.
  3. MIN = _MISSING_ specifies how to assign attributes to missing values for the variable.

To make sure we can clearly see the missing values, let's choose a bright and obnoxious color, such as cyan (bright blue). I wouldn't choose this color for serious work, but it will make it easy to see the missing values in this example.

/* create a range attribute data set */
data MyRangeAttrs;
retain ID "MapMissing";
length min $10 max $10 
       color altcolor colormodel1 colormodel2 colormodel3 colormodel4 colormodel5 $15;
input min max color altcolor colormodel1 colormodel2 colormodel3 colormodel4 colormodel5;
datalines;
_MISSING_  .     CYAN CYAN .        .        .        .        .
_MIN_      _MAX_ .    .   CXFFFFFF CXFFFFB2 CXFECC5C CXFD8D3C CXE31A1C
;

The variables in the data set must have certain names, as specified in the the documentation. The first observation specifies the colors for the missing values (MIN=_MISSING_). The COLOR variable specifies the color for areas and bars. The ALTCOLOR variable specifies the color for markers and lines. The second observation specifies a color model to use for nonmissing observations. For this example, I've used the same white-yellow-orange-red color model.

To use the range attribute map, specify the name of the data set by using the RATTRMAP= option on the PROC SGPLOT statement. A data set can include many different mappings, each defined by a unique ID. In this case, the data set contains only one mapping, and the ID value is 'MapMissing.' Use the RATTRID=MapMissing option to specify the ID value for the map. The following statements use the range attribute data set to assign colors to the heat map tiles. I also had to modify the LEGENDITEM statement so that a cyan-colored swatch appears in the legend.

title "Missing Values Displayed in Custom Color";
proc sgplot data=Clinical RATTRMAP=MyRangeAttrs;  /* <== HERE */
   heatmapparm x=Week y=PatientID colorresponse=Value / outline outlineattrs=(color=gray)
        RATTRID=MapMissing;                       /* <== AND HERE */
   gradlegend;
   refline (1000.5 to 1005.5) / axis=Y lineattrs=(color=black thickness=2);
   xaxis integer values=(0 to 10) valueshint;
   /* if you use FILLATTRS=GraphMissing, you will get gray */
   legenditem type=fill name='missItem' / fillattrs=(color=CYAN) label="Missing Data";
   keylegend 'missItem';
run;

Success! The missed appointments are now displayed by using the (very bright!) cyan color. By using the range attribute map, I have complete control over the colors of the tiles, including the tiles that show missing values.

Summary

In summary, this article shows how to create a range attribute map for a heat map. The primary purpose of a range attribute map is to map a set of continuous values onto a spectrum of colors. However, by using the special keyword "_MISSING_" as a value for the MIN variable, you can control the color that is used to display missing values.

Further reading

The post Change the missing-value color by using a range attribute map appeared first on The DO Loop.

3月 172022
 

I created this project as a fun exercise to emulate the popular game Wordle in the SAS language. I was inspired by this story about a GitHub user who implemented Wordle in a Bash shell. (See the Bash script here. Read the comments -- it's an amazing stream of improved versions, takes in different programming languages, and "code golf" to reduce the lines-of-code.)

While developing my SAS solution, I created a Twitter poll to ask how other SAS programmers might approach it.

twitter poll
For me it was always going to be arrays, since that's what I know best. I'd love to be able to dash out a hash object approach or even use SAS/IML, but it would take me too long to wrap my brain around these. The PRX* regex function choice is bit of a feint -- regular expressions seem like a natural fit (pattern matching!), but Wordle play has nuances that makes regex less elegant than you might guess. Prove me wrong!

Two SAS users from Japan, apparently inspired by my poll, each implemented their own games! I've added links to their work at the end of this article.

How to code Wordle in SAS

You can see my complete SAS code for the game here: sascommunities/wordle-sas.

The interesting aspects of my version of the game include:

  • Uses the "official" word lists from New York Times as curated by cfreshman. I found these while examining the Bash script version. I used PROC HTTP to pull this list dynamically.
  • Also verifies guesses as "valid" using the list of allowed guesses, again curated by cfreshman. You know you can't submit just any 5 characters as a guess, right? That's an important part of the game.
  • Uses DATA step array to verify guesses against solution word.
  • Use DATA step object method to create a gridded output of the game board. Credit to my SAS friends in Japan for this idea!

I've added a screenshot of example game play below. This was captured in SAS Studio running in SAS Viya.

Example game with output

Example game play for my Wordle in SAS

How to play Wordle in SAS

To play:

  1. Submit the wordle-sas.sas program in your SAS session. This program should work in PC SAS, SAS OnDemand for Academics, SAS Enterprise Guide, and SAS Viya.

    The program will fetch word lists from GitHub and populate into data sets. It will also define two macros you will use to play the game.

  2. Start a game by running:
     %startGame;
    

    This will select a random word from the word list as the "puzzle" word and store it in a SAS macro variable (don't peek!)
  3. Optionally seed a game with a known word by using and optional 5-character word parameter:
     %startGame(crane);
    

    This will seed the puzzle word ("crane" in this example). It's useful for testing. See a battery of test "sessions" in wordle-sas-tests.sas
  4. Submit a first guess by running:
    %guess(adieu);
    

    This will check the guess against the puzzle word, and it will output a report with the familiar "status" - letters that appear in the word (yellow) and that are in the correct position (green). It will also report if the guess is not a valid guess word, and it won't count that against you as one of your 6 permitted guesses.

Use the %guess() macro (one at a time) to submit additional guesses. The program keeps track of your guesses and when you solve it, it shares the familiar congratulatory message that marks the end of a Wordle session. Ready to play again? Use the %startGame macro to reset.

Start a fresh game using Git functions

If you don't want to look at or copy/paste the game code, you can use Git functions in SAS to bring the program into your SAS session and play. (These Git functions require at least SAS 9.4 Maint 6 or SAS Viya.)

options dlcreatedir;
%let repopath=%sysfunc(getoption(WORK))/wordle-sas;
libname repo "&repopath.";
data _null_;
    rc = gitfn_clone( 
      "https://github.com/sascommunities/wordle-sas", 
      "&repoPath." 
    ); 
    put 'Git repo cloned ' rc=; 
run;
%include "&repopath./wordle-sas.sas";
 
/* start a game and submit first guess */
%startGame;
%guess(adieu);

Suggestions?

I know that my program could be more efficient...perhaps at the cost of readability. Also it's possible that I have some lingering bugs, although I did quite a bit of testing and bug-fixing along the way. Pro tip: The DATA step debugger (available in SAS Enterprise Guide and in SAS Viya version of SAS Studio) was a very useful tool for me!

Your feedback/improvements are welcome. Feel free to comment here or on the GitHub project.

Wordle games in SAS by other people

SAS users in Japan quickly implemented their own versions of Wordle-like games. Check them out:

The post Programming the Wordle game in SAS appeared first on The SAS Dummy.

2月 282022
 

An experienced SAS programmer recently switched to SAS Viya and asked how to discover what products are available on his version of Viya. We discussed a few older SAS 9 procedures, and I showed him a new Viya-specific way to get information about his version of SAS and his licensed products.

This article discusses the getLicensedProductInfo action, which is a new way to obtain information about all licensed products in SAS Viya. In addition, it discusses what you will see if you run the following SAS 9 procedures or statements in Viya:

  • The SYSVER and SYSVLONG macro variables. Did you know that there are newer macro variables that provide information about your SAS Viya version? Below, I discuss the SYSVIYAVERSION macro variable.
  • The SETINIT procedure.
  • The PRODUCT_STATUS procedure, which is not available in SAS Viya.

Since I am not an expert in SAS Administration, I invite comments from those who are more knowledgeable than I am.

The getLicensedProductInfo action

SAS Viya is built to work with many programming clients: SAS, Python, R, Lua, and so on. Accordingly, you can get the product information without using SAS-specific calls such as the SETINIT procedure. The getLicensedProductInfo action provides a client-agnostic way to obtain information about licensed products. You can call an action from any client language. From the SAS client, you can call actions by using PROC CAS. For example, if you want to display the products that are available in your Viya session, you can use the following statements:

cas;                 /* connect to CAS session */
 
proc cas;
   getLicensedProductInfo;
quit;

The output will contain every product that is available in your session. There are a total of eight columns in the output table, but only the first four columns are shown.

What version of SAS are you running?

A common inquiry on discussion forums is "What version of SAS are you running?" This is important because the person asking a question might be running an old version of SAS that does not contain newer features. In SAS 9, you can use the SYSVLONG system macro variables to discover the version of SAS 9 that you are running. (The SYSVLONG and SYSVLONG4 variables are similar: SYSVLONG displays a two-digit year such as 21 whereas SYSVLONG4 displays a four-digit year such as 2021.) For example, a SAS 9 programmer might submit the following statement, which displays the version information in the log:

%put In SAS 9: &=sysvlong4;

In SAS 9, the SAS log shows the following information:

In SAS 9: SYSVLONG4=9.04.01M6P11072018

The SYSVLONG4 string tells you that the SAS release is 9.4 ("9.04") and you are running the M6 release, which was built on 07NOV2018. You might wonder what the system macro variable contains if you submit the same statement on the SAS client that comes with SAS Viya:

%put In SAS Viya: &=sysvlong4;

In SAS Viya, the SAS log shows the following information:

In SAS Viya: SYSVLONG4=V.04.00M0P02142022

The output shows that the version is "V.04" and the build date was 14FEB2022. You can interpret "V.04" as "Viya 4," which is a cloud-native version of Viya.

If you know that you are running SAS Viya, there are two additional SAS macros that provide information about your Viya release: SYSVIYARELEASE and SYSVIYAVERSION, as follows:

%put In SAS Viya: &=SYSVIYARELEASE;
%put In SAS Viya: &=SYSVIYAVERSION;

The SAS log shows the following information:

In SAS Viya: SYSVIYARELEASE=20220221.1645427074741
In SAS Viya: SYSVIYAVERSION=Stable 2021.2.4

The value of SYSVIYARELEASE is a date (21FEB2022) followed by additional numbers that are related to the release. The value of SYSVIYAVERSION can be in two forms:

  • Stable YYYY.Major.Minor. For example, "Stable 2021.2.4" tells you that your version of Viya updates monthly (a "Stable" release) and that the cadence is version 6 of the 2nd release that occurred in 2021.
  • LTS YYYY.Release. For example, "LTS 2021.2" tells you that your version of Viya updates every six months (a "long-term stable" or LTS release) and that this is the 2nd release that occurred in 2021.

SAS 9 procedures

If you are a SAS 9 programmer, you might have used two SAS procedures that provide information about your SAS release. The first procedure is PROC SETINIT, which continues to be supported in SAS Viya. The following call to PROC SETINIT shows information that is similar to the output from the getLicensedProductInfo action:

proc setinit; 
run;

The SAS log displays the following information:

---Base SAS Software                             08MAY2022 (CPU A) 
---SAS/STAT                                      08MAY2022 (CPU A) 
---SAS/GRAPH                                     08MAY2022 (CPU A) 
---SAS/ETS                                       08MAY2022 (CPU A) 
---SAS/OR                                        08MAY2022 (CPU A) 
---SAS/IML                                       08MAY2022 (CPU A) 
<--- and many other lines --->

Although PROC SETINIT continues to work, it provides less information than the getLicensedProductInfo action, so I recommend using the action instead.

A second SAS 9 procedure that some customers use is PROC PRODUCT_STATUS. In SAS 9, the output from this procedure is similar to the PROC SETINIT output, but the output also includes version information for each product. For example, the output might tell you that the SAS/STAT version is 15.2 and the SAS/GRAPH version is 9.4_M5. This method of naming versions is not used in SAS Viya, so PROC PRODUCT_STATUS is no longer supported. If you try to run the procedure in SAS Viya, the log will contain the following error message:

ERROR: Procedure PRODUCT_STATUS not found.

Using the SYSVER macro variable in SAS macro programming

One use of the SYSVER macro variable is to check the SAS version to ensure that certain language features (procedures or options) are present. For example, I have seen SAS macros that include the following logic:

/* macro logic to test whether the version of SAS is 9.4 or greater */
%if %sysevalf(&sysver < 9.4) %then %do;
   %put SAS 9.4 or later is required.  Terminating.;
   %goto exit;
%end;

This snippet of code will exit the macro unless the SAS version is at least 9.4 or later. What will happen if you run this macro code in SAS Viya? Thankfully, the code continues to work properly!

To understand why the logic continues to work when SYSVER contains a value such as "V.04," first recall that the %IF condition compares STRINGS, not numbers. If the SYSVER variable is V.04 (or any string that begins with "V"), the logical expression %sysevalf(&sysver < 9.4) is false. This is because the ASCII value for "V" is 86 whereas the ASCII value for "9" is 57. You can confirm this logic yourself by running the following macro statements on any version of SAS 9 or SAS Viya. In SAS 9.4M5 or later, you don't even need to embed the code inside a macro. You can run %IF-%THEN statements in open code:

/* prior to SAS 9.4M5, you need to wrap this code inside a macro call */
%if %sysevalf(V.04 < 9.4) %then %do;
   %put Your version of SAS is less than 9.4;   /* What happens if &SYSVER resolves to V.xx? */
%end;
%else %do;
   %put Your version of SAS is SAS 9.4 or later;
%end;

This code snippet always prints "Your version of SAS is SAS 9.4 or later." Consequently, the SYSVER macro variable in SAS Viya has a value that is greater than 9.4 (or any other set of numbers).

Summary

In SAS Viya, you can use the getLicensedProductInfo action to obtain information about licensed products. Some of the older SAS 9 procedures (such as PROC SETINIT) continue to work in Viya, although others (such as PROC PRODUCT_STATUS) are deprecated. The SYSVER and SYSVLONG macro variables are supported, although in SAS Viya their values begin with "V." To discover your version of SAS Viya, use the SYSVIYAVERSION macro variable.

The post How to find release and licensing information in SAS Viya? appeared first on The DO Loop.

2月 112022
 

Ever had to work with data that had crazy column names, custom formats you didn’t have access to, or text columns sized way too big for the actual data they contained? Annoying, isn’t it? Well, you’re not alone and, as it turns out, it’s not that hard to fix. Let’s take a peek at renaming, reformatting, relabeling and resizing columns in your data using PROC SQL and PROC DATASETS. PROC SQL could handle most of these tasks, but you can’t rename a column in SQL without creating a new copy of the table. PROC DATASETS can do it all, but the syntax can be trickier. When we resize a column, we're going to have to re-write the data anyway, so I'll use PROC SQL for resizing, and PROC DATASETS for everything else. For example, we have this existing table:

Year Make Model MSRP Invoice Engine Size (L)
2004 Acura MDX $36,945 $33,337 3.5
2004 Audi A4 1.8T 4dr $25,940 $23,508 1.8
2004 Buick Rainier $37,895 $34,357 4.2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP num 8 DOLLAR8.
Invoice num 8 DOLLAR8.
EngineSize Engine Size (L) num 8

 
I want to modify the table to use European formats for the numeric values, to rename the EngineSize column as Size leaving its label unchanged, and to add descriptive labels for MSRP and Invoice.
 
The general syntax for this in PROC DATASETS is:
 
PROC DATASETS LIBRARY=;
MODIFY sas-dataset <(options)>;
FORMAT variable-1 <format-1> <variable-2 <format-2> ...>;
LABEL variable-1=<'label-1' | ' '> <variable-2=<'label-2' | ' '> ...>;
RENAME variable-1=new-name-1 <variable-2=new-name-2 ...>;

After running this code:

proc datasets library=work nolist nodetails;
   modify cars;
      format MSRP eurox8. Invoice eurox8. EngineSize commax5.1;
      label MSRP="Sticker Price" Invoice="Wholesale Price" ;
      rename EngineSize=Size;
run; quit;

 
the data now looks like this:
 

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
In the report, the longest value in the Make column is 5 characters long but the table metadata shows a column width of 13 characters. I can shorten the Make column to 5 characters without truncating the values, making my table size smaller on disk and in memory.

PROC SQL with an ALTER TABLE statement and a MODIFY clause will change a table’s descriptor portion (metadata) for one or more columns. Changing the length of a column will automatically rewrite the data portion of the table, too. The syntax looks like this:

ALTER TABLE table-name MODIFY column-definition-1 <, column-definition-2, ..., column-definition-n>;

After running this code:

proc sql;
alter table work.cars
   modify Make char(5)
;
quit;

The data is still correct:

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 
and the column size is now smaller:

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 5
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
Now, unless you want to write this type of code by hand every time you need it, you’ll want macros to do the work for you. And macros we’ve got! Check out my SAS tutorial on YouTube titled “Resize, Rename and Reformat Data with SAS Macro” for more details on how this works. The video is also embedded below. Links below the embedded video lead to the data and programs used in the video and to my personal macro trove on GitHub, including a macro that quickly strips labels, formats and informats from a data set.

Until next time, may the SAS be with you!
 
Mark

 

DOWNLOAD NOW | VIDEO DATA PACKAGE ON GITHUB
GET MACROS NOW | MY MACRO STASH ON GITHUB

Jedi SAS tricks: resizing, renaming and reformatting your data was published on SAS Users.

2月 112022
 

Ever had to work with data that had crazy column names, custom formats you didn’t have access to, or text columns sized way too big for the actual data they contained? Annoying, isn’t it? Well, you’re not alone and, as it turns out, it’s not that hard to fix. Let’s take a peek at renaming, reformatting, relabeling and resizing columns in your data using PROC SQL and PROC DATASETS. PROC SQL could handle most of these tasks, but you can’t rename a column in SQL without creating a new copy of the table. PROC DATASETS can do it all, but the syntax can be trickier. When we resize a column, we're going to have to re-write the data anyway, so I'll use PROC SQL for resizing, and PROC DATASETS for everything else. For example, we have this existing table:

Year Make Model MSRP Invoice Engine Size (L)
2004 Acura MDX $36,945 $33,337 3.5
2004 Audi A4 1.8T 4dr $25,940 $23,508 1.8
2004 Buick Rainier $37,895 $34,357 4.2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP num 8 DOLLAR8.
Invoice num 8 DOLLAR8.
EngineSize Engine Size (L) num 8

 
I want to modify the table to use European formats for the numeric values, to rename the EngineSize column as Size leaving its label unchanged, and to add descriptive labels for MSRP and Invoice.
 
The general syntax for this in PROC DATASETS is:
 
PROC DATASETS LIBRARY=;
MODIFY sas-dataset <(options)>;
FORMAT variable-1 <format-1> <variable-2 <format-2> ...>;
LABEL variable-1=<'label-1' | ' '> <variable-2=<'label-2' | ' '> ...>;
RENAME variable-1=new-name-1 <variable-2=new-name-2 ...>;

After running this code:

proc datasets library=work nolist nodetails;
   modify cars;
      format MSRP eurox8. Invoice eurox8. EngineSize commax5.1;
      label MSRP="Sticker Price" Invoice="Wholesale Price" ;
      rename EngineSize=Size;
run; quit;

 
the data now looks like this:
 

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
In the report, the longest value in the Make column is 5 characters long but the table metadata shows a column width of 13 characters. I can shorten the Make column to 5 characters without truncating the values, making my table size smaller on disk and in memory.

PROC SQL with an ALTER TABLE statement and a MODIFY clause will change a table’s descriptor portion (metadata) for one or more columns. Changing the length of a column will automatically rewrite the data portion of the table, too. The syntax looks like this:

ALTER TABLE table-name MODIFY column-definition-1 <, column-definition-2, ..., column-definition-n>;

After running this code:

proc sql;
alter table work.cars
   modify Make char(5)
;
quit;

The data is still correct:

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 
and the column size is now smaller:

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 5
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
Now, unless you want to write this type of code by hand every time you need it, you’ll want macros to do the work for you. And macros we’ve got! Check out my SAS tutorial on YouTube titled “Resize, Rename and Reformat Data with SAS Macro” for more details on how this works. The video is also embedded below. Links below the embedded video lead to the data and programs used in the video and to my personal macro trove on GitHub, including a macro that quickly strips labels, formats and informats from a data set.

Until next time, may the SAS be with you!
 
Mark

 

DOWNLOAD NOW | VIDEO DATA PACKAGE ON GITHUB
GET MACROS NOW | MY MACRO STASH ON GITHUB

Jedi SAS tricks: resizing, renaming and reformatting your data was published on SAS Users.

1月 182022
 

Several probability distributions model the outcomes of various trials when the probabilities of certain events are given. For some distributions, the definitions make sense even when a probability is 0. For other distributions, the definitions do not make sense unless all probabilities are strictly positive. This article examines how zero probabilities affect some common probability distributions, and how to simulate events in SAS when a probability is 0. However, the same techniques can be used to detect and handle other situations where an invalid parameter is used as an argument to a SAS function.

Elementary distributions and event-trial scenarios

The following discrete probability distributions are usually presented in an introductory statistics course. They describe how likely it is to observe one or more events when you run one or more trials. In the following, X is a random variable and p is the probability of success in one Bernoulli trial.

  • X follows the Bernoulli(p) distribution if P(X=1) = p and P(X=0) = 1 - p.
  • X follows the Binomial(p, k) distribution if it is the sum of k independent Bernoulli random variable.
  • X follows a Table(p1, p2, ..., pd) distribution if P(X=Xi) = pi is the probability that X takes on the i_th value in a set of d values, where the probability of observing the i_th value is pi.
  • X follows a Geometric(p) distribution if X is the number of failures until the first success in a sequence of independent Bernoulli trials, each with probability of success p.
  • X follows a NegativeBinomial(p, k) distribution if X is the number of failures before k successes in a sequence of independent Bernoulli trials, each with probability of success p.

Among these distributions, the Bernoulli and binomial distributions make sense if p=0. In that case, the random variable always has the value 0 because no successes are ever observed. The Table distribution can handle pi=0, but it usually is advisable to remove the categories that have zero probabilities. However, the geometric and negative binomial distributions are not well-defined if p=0.

Simulation in the SAS DATA step

What happens if you use p=0 for these five distributions? The result might depend on the software you use. The following SAS DATA shows what happens in SAS:

%let NSim = 200;         /* sample size */
data Sample; 
/* define array so that sum(prob)=1 */
array prob[10] _temporary_ (0 0.2 0.125 0.2 0 0.25 0 0.125 0 0.1); 
call streaminit(54321); 
p = 0;                   /* p=0 might cause errors */
k = 10;
do i = 1 to &NSim;
   /* For some distributions, p=0 is okay. */
   /* Success (1) or failure (0) of one trial when Prob(success)=p. */
   Bern = rand("Bernoulli", p);
   /* The number of successes in k indep trials when Prob(success)=p. */
   Bin = rand("Binomial", p, 10);
 
   /* For some distributions, p=0 is questionable, but supported. */
   /* The Table distribution returns a number between 1 and dim(pro)
      when the probability of observing the i_th category is prob[i]. */
   T = rand("Table", of prob[*]);
 
   /* For some distributions, p=0 does not make sense. */
   /* The number of trials that are needed to obtain one success when Prob(success)=p. */
   Geo = rand("Geometric", p);      /* ERROR */
   /* The number of failures before k successes when Prob(success)=p. */
   NB = rand("NegBinomial", p, k);  /* ERROR */
   output; 
end; 
run;

When you run this program, the log will display many notes like the following:

NOTE: Argument 2 to function RAND('Geometric',0) at line 1735
      column 10 is invalid.
NOTE: Argument 2 to function RAND('NegBinomial',0,10) at line
      1737 column 9 is invalid.
<many more notes>
NOTE: Mathematical operations could not be performed at the
      following places. The results of the operations have been
      set to missing values.
<more notes>

The log is telling you that the RAND function considers p=0 to be an invalid argument for the geometric and negative binomial distributions? Why? A geometric random variable shows the number of trials that are needed to obtain one success. But if the probability of success is exactly zero, then we will never observe a success. The number of trials will be infinite. Similarly, a negative binomial random variable shows the number of failures before k successes, but there will never be any successes when p=0. Again, the number of trials will be infinite.

Because the geometric and negative binomial distributions are not defined when p=0, SAS probability routines will note the error and return a missing value, as shown by printing the output data:

proc print data=Sample(obs=5); run;

Notice that the Bernoulli and binomial variables are well-defined. Every Bernoulli trial results in X=0, which means that the sum of arbitrarily many trials is also 0. The Table distribution will return only the categories for which the probabilities are nonzero. However, the geometric and negative binomial values are assigned missing values.

Eliminate the notes during a simulation

If you are running a large simulation in which p=0 is a valid possibility, the log will contain many notes. Eventually, SAS will print the following note:

NOTE: Over 100 NOTES, additional NOTES suppressed.

In general, it is a bad idea to suppress notes because there might be additional notes later in the program that you need to see. A better approach is to get rid of all the notes that are telling you that p=0 is an invalid argument. You can do that by using IF-THEN logic to trap the cases where p is very small. For those cases, you can manually assign a missing value. SAS will not issue a note in this case because all calls to SAS functions have valid arguments. This technique is shown in the following DATA step program:

/* workaround: use different logic when p is too small */
%let cutoff = 1e-16;
data Sample2; 
format p Geo NB Best7.;
call streaminit(54321); 
input p @@;
/* The number of trials that are needed to obtain one success when Prob(success)=p. */
if p > &cutoff then 
   Geo = rand("Geometric", p);
else
   Geo = .;
/* The number of failures before k successes when Prob(success)=p. */
if p > &cutoff then 
   NB = rand("NegBinomial", p, 10);  /* k=10 */
else
   NB = .;
datalines;
-0.1  0  1e-20  1e-15 1e-10  1e-6 0.001 0.1
;
 
proc print data=Sample2; run;

When you run the program, it "runs clean." There are no notes in the log about invalid parameters. The output shows that values of p that are below the cutoff value result in a missing value. Very small values of p that are above the cutoff value are handled by the RAND function, which returns large integers.

Summary

This article discusses how to handle extremely small probabilities in functions that compute probability distributions. The trick is to use IF-THEN logic to prevent the invalid probability from being used as an argument to the RAND function (or other probability functions, such as the PDF, CDF, or QUANTILE functions).

Although this article discusses zero probabilities, this situation is actually a specific case of a larger programming issue: how to prevent notes from appearing in the SAS log when a parameter to a function is invalid. Thus, you can think of this article as yet another example of how to use the "trap and cap" method of defensive programming.

The post Simulate events when some probabilities are zero appeared first on The DO Loop.

1月 182022
 

Several probability distributions model the outcomes of various trials when the probabilities of certain events are given. For some distributions, the definitions make sense even when a probability is 0. For other distributions, the definitions do not make sense unless all probabilities are strictly positive. This article examines how zero probabilities affect some common probability distributions, and how to simulate events in SAS when a probability is 0. However, the same techniques can be used to detect and handle other situations where an invalid parameter is used as an argument to a SAS function.

Elementary distributions and event-trial scenarios

The following discrete probability distributions are usually presented in an introductory statistics course. They describe how likely it is to observe one or more events when you run one or more trials. In the following, X is a random variable and p is the probability of success in one Bernoulli trial.

  • X follows the Bernoulli(p) distribution if P(X=1) = p and P(X=0) = 1 - p.
  • X follows the Binomial(p, k) distribution if it is the sum of k independent Bernoulli random variable.
  • X follows a Table(p1, p2, ..., pd) distribution if P(X=Xi) = pi is the probability that X takes on the i_th value in a set of d values, where the probability of observing the i_th value is pi.
  • X follows a Geometric(p) distribution if X is the number of failures until the first success in a sequence of independent Bernoulli trials, each with probability of success p.
  • X follows a NegativeBinomial(p, k) distribution if X is the number of failures before k successes in a sequence of independent Bernoulli trials, each with probability of success p.

Among these distributions, the Bernoulli and binomial distributions make sense if p=0. In that case, the random variable always has the value 0 because no successes are ever observed. The Table distribution can handle pi=0, but it usually is advisable to remove the categories that have zero probabilities. However, the geometric and negative binomial distributions are not well-defined if p=0.

Simulation in the SAS DATA step

What happens if you use p=0 for these five distributions? The result might depend on the software you use. The following SAS DATA shows what happens in SAS:

%let NSim = 200;         /* sample size */
data Sample; 
/* define array so that sum(prob)=1 */
array prob[10] _temporary_ (0 0.2 0.125 0.2 0 0.25 0 0.125 0 0.1); 
call streaminit(54321); 
p = 0;                   /* p=0 might cause errors */
k = 10;
do i = 1 to &NSim;
   /* For some distributions, p=0 is okay. */
   /* Success (1) or failure (0) of one trial when Prob(success)=p. */
   Bern = rand("Bernoulli", p);
   /* The number of successes in k indep trials when Prob(success)=p. */
   Bin = rand("Binomial", p, 10);
 
   /* For some distributions, p=0 is questionable, but supported. */
   /* The Table distribution returns a number between 1 and dim(pro)
      when the probability of observing the i_th category is prob[i]. */
   T = rand("Table", of prob[*]);
 
   /* For some distributions, p=0 does not make sense. */
   /* The number of trials that are needed to obtain one success when Prob(success)=p. */
   Geo = rand("Geometric", p);      /* ERROR */
   /* The number of failures before k successes when Prob(success)=p. */
   NB = rand("NegBinomial", p, k);  /* ERROR */
   output; 
end; 
run;

When you run this program, the log will display many notes like the following:

NOTE: Argument 2 to function RAND('Geometric',0) at line 1735
      column 10 is invalid.
NOTE: Argument 2 to function RAND('NegBinomial',0,10) at line
      1737 column 9 is invalid.
<many more notes>
NOTE: Mathematical operations could not be performed at the
      following places. The results of the operations have been
      set to missing values.
<more notes>

The log is telling you that the RAND function considers p=0 to be an invalid argument for the geometric and negative binomial distributions? Why? A geometric random variable shows the number of trials that are needed to obtain one success. But if the probability of success is exactly zero, then we will never observe a success. The number of trials will be infinite. Similarly, a negative binomial random variable shows the number of failures before k successes, but there will never be any successes when p=0. Again, the number of trials will be infinite.

Because the geometric and negative binomial distributions are not defined when p=0, SAS probability routines will note the error and return a missing value, as shown by printing the output data:

proc print data=Sample(obs=5); run;

Notice that the Bernoulli and binomial variables are well-defined. Every Bernoulli trial results in X=0, which means that the sum of arbitrarily many trials is also 0. The Table distribution will return only the categories for which the probabilities are nonzero. However, the geometric and negative binomial values are assigned missing values.

Eliminate the notes during a simulation

If you are running a large simulation in which p=0 is a valid possibility, the log will contain many notes. Eventually, SAS will print the following note:

NOTE: Over 100 NOTES, additional NOTES suppressed.

In general, it is a bad idea to suppress notes because there might be additional notes later in the program that you need to see. A better approach is to get rid of all the notes that are telling you that p=0 is an invalid argument. You can do that by using IF-THEN logic to trap the cases where p is very small. For those cases, you can manually assign a missing value. SAS will not issue a note in this case because all calls to SAS functions have valid arguments. This technique is shown in the following DATA step program:

/* workaround: use different logic when p is too small */
%let cutoff = 1e-16;
data Sample2; 
format p Geo NB Best7.;
call streaminit(54321); 
input p @@;
/* The number of trials that are needed to obtain one success when Prob(success)=p. */
if p > &cutoff then 
   Geo = rand("Geometric", p);
else
   Geo = .;
/* The number of failures before k successes when Prob(success)=p. */
if p > &cutoff then 
   NB = rand("NegBinomial", p, 10);  /* k=10 */
else
   NB = .;
datalines;
-0.1  0  1e-20  1e-15 1e-10  1e-6 0.001 0.1
;
 
proc print data=Sample2; run;

When you run the program, it "runs clean." There are no notes in the log about invalid parameters. The output shows that values of p that are below the cutoff value result in a missing value. Very small values of p that are above the cutoff value are handled by the RAND function, which returns large integers.

Summary

This article discusses how to handle extremely small probabilities in functions that compute probability distributions. The trick is to use IF-THEN logic to prevent the invalid probability from being used as an argument to the RAND function (or other probability functions, such as the PDF, CDF, or QUANTILE functions).

Although this article discusses zero probabilities, this situation is actually a specific case of a larger programming issue: how to prevent notes from appearing in the SAS log when a parameter to a function is invalid. Thus, you can think of this article as yet another example of how to use the "trap and cap" method of defensive programming.

The post Simulate events when some probabilities are zero appeared first on The DO Loop.

1月 122022
 

Some colors have names, such as "Red," "Magenta," and "Dark Olive Green." But the most common way to specify a color is to use a hexadecimal value such as CX556B2F. It is not obvious that "Dark Olive Green" and CX556B2F represent the same color, but they do! I like to use color names (when possible) instead of hexadecimal values because the names make the program more readable than the hexadecimal values. For example, a color ramp that is defined by using the names ("DarkSeaGreen" "SandyBrown" "Tomato" "Sienna") is easier to interpret than the equivalent color ramp that is defined by using the hexadecimal values (CX8FBC8F CXF4A460 CXFF6347, CXA0522D).

This article shows how to find a "named color" that is close to any color that you specify. Shakespeare asked, "What's in a name?" To paraphrase his response, this article shows that the name of "Rose" looks just as sweet as CXFF6060 but is easier to use!

Colors in SAS

When you create a graph in SAS, there are three ways to specify colors: use a pre-defined color name from the SAS registry, use the SAS-naming convention to specify hues, or use hexadecimal values to specify an 8-bit color for the RGB color model. An example of a pre-defined color name is "DarkOliveGreen," an example of a hue-based color is "Dark Moderate Green," and an example of a hexadecimal value is CX556B2F. Each hexadecimal value encodes the three RGB values for the color. For example, the hexadecimal values 55, 6B, and 2F correspond to the decimal integers 85, 107, and 47, so CX556B2F can be thought of as the RGB triplet (85, 107, 47).

In my SAS registry, there are 151 pre-defined color names, whereas there are 2563 = 16.7 million 8-bit RGB colors. Clearly, there are many RGB colors that do not have names! I thought it would be interesting to write a program that finds the closest pre-defined name to any RGB color that you specify. You can think of each color as a three-dimensional point (R, G, B), where R, G, and B are integers and 0 ≤ R,G,B ≤ 255. Thus, the space of all 8-bit colors is a three-dimensional integer lattice. Colors that are close to each other (in the Euclidean metric) have similar shades. Consequently, you can find named color that is closest to another color by using the following steps:

  1. Load the pre-defined color names and their RGB values.
  2. For any specified hexadecimal value, convert it to an RGB value.
  3. In RGB coordinates, find the pre-defined color name that is closest (in the Euclidean metric) to the specified value.

For example, if you specify an unnamed color such as CXE99F62=RGB(244, 107, 53), the program can tell you that "SandyBrown"=RGB(244, 164, 96) is the closest pre-defined color to CXE99F62. If you want to make your program more readable (and don't mind modifying the hues a little), you can replace CXE99F62 with "SandyBrown" in your program.

Read colors from the SAS registry

For the reference set, I will use the pre-defined colors in the SAS registry, but you could use any other set of names and RGB values. The SAS documentation shows how to use PROC REGISTRY to list the colors in your SAS registry.

The following program modifies the documentation example and writes the registry colors to a temporary text file:

filename _colors TEMP;    /* create a temporary text file */
 
/* write text file with colors from the registry */
proc registry startat='HKEY_SYSTEM_ROOT\COLORNAMES' list export=_colors;
run; 
 
/* In the flat file, the colors look like this:
"AliceBlue"= hex: F0,F8,FF
"AntiqueWhite"= hex: FA,EB,D7
"Aqua"= hex: 00,FF,FF
"Aquamarine"= hex: 7F,FD,D4
"Azure"= hex: F0,FF,FF
"Beige"= hex: F5,F5,DC
"Bisque"= hex: FF,E4,C4
"Black"= hex: 00,00,00
...
*/

You can use a DATA step to read the registry values and create a SAS data set that contains the names, the hexadecimal representation, and the RGB coordinates for each pre-defined color:

data RegistryRGB;
   infile _colors end=eof;         /* read from text file; last line sets EOF flag to true */
   input;                          /* read one line at a time into _infile_ */
 
   length ColorName $32 hex $8; 
   retain hex "CX000000";
   s = _infile_;
   k = findw(s, 'hex:');          /* does the string 'hex' appear? */
   if k then do;                  /* this line contains a color */
      i = findc(s, '=', 2);       /* find the second quotation mark (") */
      ColorName = substr(s, 2, i-3);            /* name is between the quotes */
      /* build up the hex value from a comma-delimited value like 'FA,EB,D7' */
      substr(hex, 3, 2) = substr(s, k+5 , 2);
      substr(hex, 5, 2) = substr(s, k+8 , 2);
      substr(hex, 7, 2) = substr(s, k+11, 2);
 
      R = inputn(substr(hex, 3, 2), "HEX2."); /* get RGB coordinates from hex */
      G = inputn(substr(hex, 5, 2), "HEX2.");
      B = inputn(substr(hex, 7, 2), "HEX2.");
   end;
   if k;
   drop k i s;
run;
 
proc print data=RegistryRGB(obs=8); run;

The above program works in SAS 9 and also in SAS Viya if you submit the program through SAS Studio. My versions of SAS each have 151 pre-defined colors. The output from PROC PRINT shows that the RegistryRGB data set contains the ColorName, Hex, R, G, and B variables, which describe each pre-defined color.

Find the closest "named color"

The RegistryRGB data set enables you to answer the following question: Given an RGB color, what "named color" is it closest to?

For example, in the article "A statistical palette of Christmas colors," I created a palette of colors that had the values {CX545733, CX498B60, CX94AF77, CXE99F62, CXF46B35, CXAA471D}. These colors are shown to the right, but it would be challenging to look solely at the hexadecimal values and know what colors they represent. However, if told you that the colors were close to other colors such as "DarkOliveGreen," "SandyBrown," and "Tomato," you would have a clue about what colors are represented by the hexadecimal values.

The following program uses two functions from previous articles:

proc iml;
/* function to convert an array of colors from hexadecimal to RGB 
   https://blogs.sas.com/content/iml/2014/10/06/hexadecimal-to-rgb.html */
start Hex2RGB(_hex);
   hex = colvec(_hex);        /* convert to column vector */
   rgb = j(nrow(hex),3);      /* allocate three-column matrix for results */
   do i = 1 to nrow(hex);     /* for each color, translate hex to decimal */
      rgb[i,] = inputn(substr(hex[i], {3 5 7}, 2), "HEX2.");
   end;
   return( rgb);
finish;
 
/* Compute indices (row numbers) of k nearest neighbors.
   INPUT:  S    an (n x d) data matrix
           R    an (m x d) matrix of reference points
           k    specifies the number of nearest neighbors (k>=1) 
   OUTPUT: idx  an (n x k) matrix of row numbers. idx[,j] contains the
                row numbers (in R) of the j_th closest elements to S
           dist an (n x k) matrix. dist[,j] contains the distances
                between S and the j_th closest elements in R
   https://blogs.sas.com/content/iml/2016/09/28/distance-between-two-group.html
*/
start PairwiseNearestNbr(idx, dist, S, R, k=1);
   n = nrow(S);
   idx = j(n, k, .);
   dist = j(n, k, .);
   D = distance(S, R);          /* n x m */
   do j = 1 to k;
      dist[,j] = D[ ,><];       /* smallest distance in each row */
      idx[,j] = D[ ,>:<];       /* column of smallest distance in each row */
      if j < k then do;         /* prepare for next closest neighbors */
         ndx = sub2ndx(dimension(D), T(1:n)||idx[,j]);
         D[ndx] = .;            /* set elements to missing */
      end;      
   end;
finish;

With those two functions defined, the remainder of the program is easy: read the reference RGB colors, define a palette of colors, and find the closest reference color to each specified color.

/* read the set of reference colors, which have names */
use RegistryRGB;
read all var {ColorName hex};
close;
RegRGB = Hex2RGB(hex);         /* RGB values for the colors in the SAS registry */
 
/* define the hex values that you want to test */
HaveHex = {CX545733, CX498B60, CX94AF77, CXE99F62, CXF46B35, CXAA471D};
HaveRGB = Hex2RGB(HaveHex);    /* convert test values to RGB coordinates */
 
run PairwiseNearestNbr(ClosestIdx, Dist, HaveRGB, RegRGB);
ClosestName = ColorName[ClosestIdx];  /* names of closest reference colors */
ClosestHex = hex[ClosestIdx];         /* hex values for closest reference colors */
print HaveHex ClosestHex ClosestName Dist;

The table shows the closest reference color to each specified color. For example, the color CX545733 is closest to the reference color "DarkOliveGreen." How close are they? In three-dimensional RGB coordinates, they are about 20.4 units apart. If you want to see the difference in each coordinate direction, you can print the difference between the RGB values:

/* how different is each coordinate? */
Diff = HaveRGB - RegRGB[Closestidx,];
print Diff[c={R G B}];

You can see that the red and blue coordinates of CX545733 and "DarkOliveGreen" are almost identical. The green coordinates differ by 20 units or about 8%. The "SandyBrown" color is a very good approximation to CXE99F62 because the distance between those colors is about 12.2 units. Every RGB coordinate of "SandyBrown" is within 11 units of the corresponding coordinate of CXE99F62.

You can display both palettes adjacent to each other to compare how well the reference colors approximate the test colors:

/* visualize the palettes */
ods graphics / width=640px height=160px;
k = nrow(HaveHex);
run HeatmapDisc(1:k, HaveHex) title="Original Palette" ShowLegend=0 
               xvalues=HaveHex yvalues="Colors";
run HeatmapDisc(1:k, ClosestHex) title="Nearby Palette of Registry Colors" ShowLegend=0 
               xvalues=ClosestName yvalues="Colors";

The eye can detect small differences in shades, but the overall impression is that the palette of named colors is very similar to the original palette. The palette of named colors is more informative in the sense that people have can visualize "SeaGreen" and "Tomato" without seeing the palette.

Summary

This article discusses how to create a SAS data set that contains the names and RGB values of a set of "named colors." For this article, I used the named colors in the SAS registry. You can use these named colors as reference colors. Given any other color, you can find the reference color that is closest to the specified color. This enables you to describe the color as being "close to SeaGreen" or "close to SandyBrown," which might help you when you discuss colors with your colleagues.

This article is about approximating colors by using a set of reference colors. If you want to visualize the reference colors themselves, Robert Allison has shown how to display a color swatch for each color in a SAS data set.

The post How to assign a name to a color appeared first on The DO Loop.

11月 292021
 

When SAS 9 programmers transition to SAS Viya, there are inevitably questions about how new concepts in Cloud Analytic Services (CAS) relate to similar concepts in SAS. This article discusses the question, "What is the difference between a libref and a caslib?" Both are used to access data, but they are used in different situations. Briefly, a caslib can be used by any language that calls an action (SAS, Python, Lua, R, and so forth), whereas librefs are used only by procedures in the SAS language.

This article also discusses how to determine the active caslib, how to load data into a caslib, and how to read data from a caslib by using a CAS-enabled procedure in SAS.

An overview of the CAS server and client languages

I use the term "SAS 9" to refer to the traditional SAS workspace server that runs procedures in products such as Base SAS, SAS/STAT, and SAS/ETS. So "SAS 9" refers to the "classic" SAS programming environment that existed before SAS Viya.

For an overview of the CAS server and client languages, see my article about CAS-enabled procedures. I use the SAS client (called the SAS compute server) to connect to the CAS server. By using a SAS client to communicate with CAS, I can leverage my 25 years of SAS programming knowledge and skills.

What is a libref?

SAS 9 programmers are used to working with librefs, which are defined by using the LIBNAME statement in SAS. In SAS 9, you use a libref to point to a data source, which can be a file directory (such as C:/MyData) or a database such as DB2, SQL Server, Oracle, and so forth. A libref enables you to read or write data by using a SAS procedure.

Some librefs are automatically available, such as WORK or SASUSER. Others might be pre-defined by your SAS administrator so that everyone on a team can access the same data. A SAS programmer can use the LIBNAME statement to create a personal librefs, which is a convenient way to organize data that belong to different projects.

The LIBNAME statement enables you to use a SAS "engine" to access data from a wide variety of data sources. An engine enables SAS procedures to process data from a non-SAS data format such as XML, JSON, Excel, or a database. This is one of the reasons companies invest in SAS: it enables programmers to concentrate on analyzing the data rather than worrying about how to read it.

Many SAS programmers are familiar with the idea of associating a libref with a database or schema. For example, to connect to an ORACLE database, you might run the following LIBNAME statement:

libname mydblib ORACLE user=wicklin;  /* librefs can connect to various data sources */

I mention this because there is a CAS LIBNAME engine that you can use to access CAS tables from SAS procedures in Viya. That means you can access CAS tables from any SAS procedure, just by assigning a libref. But what does the libref refer to? It refers to a caslib in a CAS session. Therefore, it is important to understand how to define and use a caslib, which you can think of as being analogous to a libref, except that a caslib exists on the CAS server, not on the SAS client. As you read through the next sections, you will see many similarities between caslibs and librefs.

What is a caslib?

Caslibs are discussed in the documentation SAS Cloud Analytic Services: Fundamentals. A caslib is associated with a data source and includes the connection information for the data source. It enables you to read and write in-memory tables from CAS. I like to think of it as a "server side" libref, except that you can use a caslib from any client language, not just from SAS.

What is the active caslib?

When you use the CAS statement to start a CAS session, CAS creates a personal caslib for you. The name of your personal caslib is usually CASUSER(userID), although you can usually abbreviate this as CASUSER. By default, this is the active caslib for the session. Here is a partial output of the log when I use the CAS statement to connect to a CAS server:

cas;                 /* connect to CAS session */
   NOTE: The session CASAUTO connected successfully to Cloud Analytic Services <name> using port 5570. <snip>
       The user is frwick and the active caslib is CASUSER(frwick).
Notice that when you connect to a CAS session, the log tells you the name of the active caslib. For me, the log states "the active caslib is CASUSER(frwick)." The log also tells me the session name, which is CASAUTO in this example.

The active caslib is the default location to read and write an in-memory data set. If you store data in the active caslib, you don't need to specify the name of the caslib when you call most actions. By default, the actions look in the active caslib to find the data.

What other caslibs are defined?

Your administrator might have given you access to other predefined caslibs. You can use the CASLIB statement to display all available caslibs in the log, as follows:

caslib _all_ list;   /* display all caslibs to log */
   NOTE: Session = CASAUTO Name = CASUSER(frwick)
         Type = PATH
         Description = Personal File System Caslib
         Path = /cas/data/caslibs/casuserlibraries/frwick/
         Definition = 
         Subdirs = Yes
         Local = No
         Active = Yes
         Personal = Yes
   NOTE: Session = CASAUTO Name = Public
         Type = PATH
         Description = Shared and writeable caslib, accessible to all users.
         Path = /cas/data/caslibs/public/
         Definition = 
         Subdirs = No
         Local = No
         Active = No
         Personal = No

The log contains a list of all caslibs in the current session. One of the caslibs has the text "Active = Yes" as part of its description. That caslib is the active caslib. In this example, the active caslib is CASUSER.

There are different types of caslibs, which provide access to various data sources, such as DB2, Hadoop, Oracle, and Teradata, just to name a few. A complete list of data sources is provided in the documentation. I don't have experience with most of these data sources. I primarily use "Type = PATH" caslibs, which can store audio files, video files, SAS data sets, CSV files, and more. I also primarily use in-memory tables, but for a PATH caslib the "Path" field tells you the location (directory) in which CAS will save the data upon request.

In addition to my personal caslib, I also have access to a global caslib named 'Public'. You can change the active caslib by using the CASLIB= suboption on the CASSESSOPTS= option on the CAS statement. For example, the following statements change the active caslib in the CASAUTO session to 'Public', then immediately changes it back to my personal caslib.

cas CASAUTO sessopts=(caslib='public');    /* make the 'public' caslib active */
cas CASAUTO sessopts=(caslib='casuser');   /* make my personal caslib active */
   NOTE: 'Public' is now the active caslib.
   NOTE: 'CASUSER(frwick)' is now the active caslib.

How does a libref differ from a caslib?

  • A libref and a caslib both point to a data source.
  • The default libref is typically WORK or SASUSER. The analogous concept is the "active" caslib. By default, the active caslib is CASUSER, which is your personal caslib.
  • A libref is defined in that SAS language and is used only by SAS procedures. A caslib exists on the CAS server and can be used by any client language that can access CAS.
  • SAS 9 programmers often use the LIBNAME statements to define new librefs. In CAS, only authorized users can define a caslib (by using the CASLIB statement).
  • If you want to read a CAS table from a SAS procedure, you need to define a libref that points to a caslib. This is shown later in this article.

Load data to a caslib

You can upload data from the SAS client into a caslib by using the CASUTIL procedure. (You can also use the table.loadTable action.) The following statements upload the data in Sashelp.Cars (which is a SAS data set) to the active caslib in the current CAS session. Optionally, you can use the LIST statement to list the tables in that caslib:

proc casutil;
   load data=Sashelp.Cars casout='Cars' replace; /* use active caslib */
   list tables;                                  /* show tables in active caslib */
quit;

As mentioned earlier, when you interact with the active caslib, you don't need to explicitly specify it in most actions and procedures. If you want to upload the data to a different caslib, you can use the OUTCASLIB= option on the LOAD statement.

Read a CAS table by using a libref

If you want to use a SAS procedure to read data from a CAS in-memory table, then you need to define a libref by using the CAS LIBNAME engine. For example, the following statement defines a libref that refers to the active caslib:

libname myLib cas;     /* refers to active caslib, whatever it is */

Alternatively, you can use the CASLIB= option to specify a caslib:

libname mycasu cas caslib=casuser; /* refers to CASUSER, even if not active */

The mycasu libref refers to the CASUSER personal caslib, regardless of whether it is the active caslib.

You can use a libref to download or upload a CAS table by using the SAS DATA step. For example, the following DATA step is an alternative way to upload data from Sashelp.Cars into a table in the active caslib:

data myLib.Cars;       /* upload to active caslib */
   set Sashelp.Cars;   /* these data are on the SAS client */
run;

You can use the libref to read data into any SAS procedure. For example, the following statement computes basic statistics for some of the variables in the CAS in-memory data:

proc means data=myLib.Cars;       /* use libref to read CAS table */
   var EngineSize MPG_City Weight;
run;

PROC MEANS is a CAS-enabled procedure, so it calls an action to perform the computations for data that are in a CAS table. The log tells you that it calls the aggregate action:

NOTE: The CAS aggregation.aggregate action will be used to perform the initial summarization.

Not every CAS-enabled procedure tells you the name of the action that was called. Sometimes the log will only notify you that CAS was involved by displaying a note such as the following:

NOTE: The Cloud Analytic Services server processed the request in 0.0097 seconds.

Summary

In summary, this article discusses the concept of a caslib, including how to learn the names of the defined caslibs and the active caslib. Caslibs are used by all actions and procedures that interact with data table on CAS server. In contrast, a libref is the familiar SAS-only reference, which you can use in SAS procedures to read data from a variety of data sources, including a CAS table.

References

The post Caslibs and librefs in SAS Viya appeared first on The DO Loop.