June 2, 2023
 

The analytics are mind boggling. But the story isn’t analytics. The designs will wow you. But the story isn’t design. The story, as it has been in each of the past 10 years, is that these amazing analytical designs sprang from the minds of students – North Carolina State University [...]

How college students are using analytics designs to reduce food waste was published on SAS Voices by Bill Abbate

May 31, 2023
 

As organizations embrace AI, they often handle large volumes of data that power AI systems. Handling data appropriately includes implementing adequate privacy policies and security measures to protect it. Doing so prevents accidental exposure and ensures ethical data use. AI technology often uses sensitive data for creating, training and utilizing models. [...]

Understanding the role of privacy and security in responsible innovation was published on SAS Voices by Kristi Boyd

May 31, 2023
 

SAS consistently expands its support for running non-SAS code inside SAS programs. It has been a while since SAS introduced the explicit SQL pass-through facility, which lets SAS users embed native database management system (DBMS) SQL code in PROC SQL and PROC DS2.

Similarly, now you can run R code within PROC IML.

SAS Viya added PROC PYTHON, enabling you to embed code in the increasingly popular Python programming language within your SAS programs.

However, while SAS developers welcome such cross-language functionality, its implementation often makes SAS programs bulky, overloaded, and difficult to read and follow. In the spirit of modular programming, it would be nice to simply reference those “foreign” modules without actually inserting their code into the SAS program, much as the %INCLUDE statement brings pieces of SAS code stored in files into a SAS program.

Limitations of %INCLUDE

The problem with using %INCLUDE to bring non-SAS code into a SAS program stems from the fact that %INCLUDE is not the SAS macro statement it appears to be. That’s right: despite the prominent % sign in front of it, it has nothing to do with the SAS macro facility. It is misleading, and it is a misnomer; let’s call it an exception. The fact is that %INCLUDE is a SAS global statement, and as such it is not processed by the SAS macro processor.

Being a global statement, %INCLUDE must be placed between other SAS statements (or be the very first or last statement in your SAS program). It cannot be positioned within and be part of other SAS statements.

In the case of native SQL, the code resides within a SAS statement, namely inside the parentheses of the FROM CONNECTION TO dbase ( ) clause of the explicit SQL pass-through construct. Therefore, placing %INCLUDE where the native SQL code is expected generates a syntax error. Bummer!

Were %INCLUDE a true macro object, the problem described above would not exist.

%EMBED macro function

Well, let’s leave %INCLUDE alone and instead create a macro function (we'll call it %EMBED) that does what %INCLUDE does: bring the contents of an external code file into a SAS program. The key difference is that we will make it a true macro function, which is run by the SAS macro processor during SAS program compilation (before execution). By execution time, the contents of the external code file will already be injected where the macro function is called. Therefore, you can invoke (place) it inside other SAS statements.

The %EMBED macro function does not care whether it embeds SAS code or non-SAS code; the contents of the external file determine what’s embedded. It is up to you, the SAS developer, to place the macro call strategically where the code from the external file is appropriate.

Here is how we can easily implement such a macro function:

/* -------------------------------------------------------------------
| DESCRIPTION | Macro to embed any code from a file into SAS program.
|---------------------------------------------------------------------
| INVOCATION  | %embed(full-file-name) or %embed(file-reference)
|---------------------------------------------------------------------
| AUTHOR      | Leonid Batkhan, May 2023
|-------------------------------------------------------------------*/
%macro embed(f);
  %local p ref rc fid;
  %let p = %sysfunc(findc(&f,/\:.));
  %if &p %then %let rc = %sysfunc(filename(ref, &f));
         %else %let ref = &f;
  %let fid = %sysfunc(fopen(&ref));
  %if &fid>0 %then
    %do %while(%sysfunc(fread(&fid))=0);
      %let rc = %sysfunc(fget(&fid, line, 32767));
      &line
    %end;
    %else %put ERROR: Macro &sysmacroname - file "&f" cannot be opened.;
  %let rc = %sysfunc(fclose(&fid));
  %if &p %then %let rc = %sysfunc(filename(ref));
%mend embed;

The %EMBED macro function has a single argument (parameter) representing either a full-path file name (including extension) or a fileref assigned with a FILENAME statement or function. It is “smart enough” to distinguish between a file name and a fileref and to process each accordingly.

It returns as its value the contents of the external file specified as its argument (parameter). This value may contain multiple lines of code.

%EMBED macro function code highlights

The first %let p= statement determines whether the argument f represents a full-path file name (p>0) or a fileref (p=0). We deduce this from the fact that a file name must contain at least one of the /\:. characters, while a fileref may not contain any of them.
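To illustrate the distinction, here is the same character test expressed in Python (purely illustrative; the function name is made up, and this is not part of the macro):

```python
def looks_like_path(s):
    """Mimic findc(&f, '/\\:.'): True if s contains any of / \\ : ."""
    return any(ch in s for ch in '/\\:.')

print(looks_like_path(r'C:\project\query1.sql'))  # True  -> treat as file path
print(looks_like_path('myref'))                   # False -> treat as fileref
```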

Then the %if-%then-%else statement either creates its own fileref using %sysfunc(filename(ref, &f)) (if &p>0) or assigns the macro variable ref the value of &f. Since we do not provide an initial value for the ref macro variable, the FILENAME function assigns a unique system-generated fileref, which starts with #LN followed by 5 digits (e.g. #LN00009). This ensures that the fileref does not interfere with other filerefs assigned outside of this macro.

Then we open the file. If opening the file fails (fid=0), we write an ERROR message to the SAS log.

If the file opens successfully (fid>0), we loop through it using the FREAD function (while fread=0), which loads one record per iteration into the file data buffer (FDB), and the FGET function, which copies data from the FDB into the macro variable line. Note that in %sysfunc(fget(&fid, line, 32767)) the second argument (line) does not need an & in front of it.

The key here is the following line of code:

&line

This macro variable reference (not a %put &line) just "injects" the value of the macro variable line into the return value of the macro function. For each iteration of the do-loop, we read a line of code from the external file and append it to the return value of the %embed macro function.
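The same idea in Python terms (an analogy only, with hypothetical names): a function whose return value is the file's text, spliced wherever the call appears:

```python
def embed(lines):
    # Analogous to %embed: the "return value" is just the file's
    # lines joined into one stream of code text
    return ' '.join(line.strip() for line in lines)

code = embed(['select *\n', 'from connection_table\n'])
print(code)  # select * from connection_table
```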

After the loop, we close the file and conditionally de-assign fileref (if we assigned it within the macro); we do not de-assign the fileref if it is assigned outside the macro in the calling program.

%EMBED macro function usage

While we developed the %EMBED macro function to address the limitations of %INCLUDE for inserting native SQL code into the SAS pass-through clause, its usage is much broader. One can use %EMBED for embedding/including/inserting/injecting/delivering any code, SAS or non-SAS, anywhere in a SAS program. In essence, %embed is a more advanced and robust alternative to %include.

Let’s explore several scenarios of the %embed macro function usage.

Embedding native DBMS SQL Code into SAS program

Suppose you have a native DBMS SQL query code stored in file C:\project\query1.sql. (File extension is arbitrary, and in this case just indicates what type of code we are going to embed.)

Then you can use the SQL Procedure Pass-Through Facility of SAS PROC SQL and the %embed macro function as in the following example:

proc sql;
   connect to odbc (dsn="db_name");
   create table WORK.ABC as
   select * 
      from connection to odbc ( %embed(C:\project\query1.sql) );
   disconnect from odbc;
quit;

During the compilation phase, the SAS macro processor replaces %embed(C:\project\query1.sql) with the contents of the external file containing DBMS-specific native SQL code (instead of SAS PROC SQL code). Then, at execution time, PROC SQL passes this query code to the DBMS for processing. The DBMS returns the query result to SAS, and SAS creates the data table WORK.ABC.

You can also use %embed macro function for "injecting" SQL code from a file into explicit FedSQL pass-through and explicit SQL pass-through in DS2.

Embedding Python or Lua code into SAS program

In PROC PYTHON and PROC LUA, you can use the optional INFILE= clause to reference an external file that contains statements to run within a SAS session. Alternatively, you can use the %embed macro function instead. For example, if your Python code is stored in a file C:\project\program.py, you can place the %embed(C:\project\program.py) macro call between the SUBMIT and ENDSUBMIT statements of PROC PYTHON:

proc python;
  submit;
    %embed(C:\project\program.py)
  endsubmit;
run;

Similarly, for PROC LUA:

proc lua;
  submit;
    %embed(C:\project\program.lua)
  endsubmit;
run;

Embedding R code into SAS program

Here is an example of %embed usage for inserting R code into SAS using PROC IML:

proc iml;
  submit / R;
    %embed(C:\project\program.r)
  endsubmit;
run;

Embedding SAS code into SAS program

Finally, you can embed SAS code from an external file into a SAS program. You can use this instead of the %include statement, for example:

data TWO;
   set ONE;
run;
 
%embed(c:\project\data_three.sas)

Unlike the %include global statement, which can be placed only between SAS statements, %embed can be placed anywhere in a SAS program. You can use it not only within DATA or PROC steps, but also within SAS executable statements. That means %embed can bring in SAS code that is executed conditionally. For example:

data a;
  set b;
  if x>0 then %embed(c:\project\code1.sas);
         else %embed(c:\project\code2.sas);
run;

Obviously, your embedded code must comply with the context of its surroundings.

Conclusion

As you can see, the %embed macro function presents an elegant solution for embedding both “foreign” and native SAS code without cluttering the SAS program. Would you agree that %embed is a more robust alternative to %include?

Questions? Thoughts? Comments?

Do you find this post useful? Do you have questions or your own tips and tricks for embedding other code into your programs? Please share with us below.


Embedding any code anywhere into SAS programs was published on SAS Users.

May 30, 2023
 

Did you know that over 300 people die daily from drug overdoses in the US? As a pharmacist, I have witnessed the harrowing effects of substance use disorder (SUD) and the resulting opioid crisis on individuals, families and communities. The complexity of addiction Who among us does not know someone [...]

Saving lives: Using data to combat the opioid crisis was published on SAS Voices by Ann Cutrell

May 30, 2023
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to load multiple CSV files into memory as a single table using the loadTable action.

Load and prepare data on the CAS server

To start, we need to create multiple CSV files in a folder. I created this script to load and prepare the CSV files in the CAS server for this post. This should work in your environment. The script does the following:

  • Loads the WARRANTY_CLAIMS_0117.sashdat files from the Samples caslib into memory.
  • Modifies the in-memory table by renaming and dropping columns.
  • Adds a subdirectory named csv_file_blogs in the Casuser caslib.
  • Saves a CSV file for each distinct year in the csv_file_blogs folder.
  • Drops the original CAS table.
import swat
import pandas as pd
 
## Connect to CAS
conn = swat.CAS()  ## enter your CAS connection information here
 
##
## Data prep functions 
##
def prep_data():
    """
    Load and prepare the WARRANTY_CLAIMS_0117.sashdat file in CAS
    """
    ## Load the WARRANTY_CLAIMS_0117.sashdat from the Samples caslib into memory in Casuser
    conn.loadTable(path='WARRANTY_CLAIMS_0117.sashdat', caslib='samples',
                   casout={'name':'warranty_claims', 
                           'caslib':'casuser',
                           'replace':True})
    ##
    ## DATA PREP
    ## 
 
    ## Reference the CAS table in an object
    castbl = conn.CASTable('warranty_claims', caslib = 'casuser')
 
    ## Store the column names and labels in a dataframe
    df_col_names = castbl.columnInfo()['ColumnInfo'].loc[:,['Column','Label']]
 
    ## Create a list of dictionaries of how to rename each column using the column labels
    renameColumns = []
    for row in df_col_names.iterrows():
        colName = row[1].values[0]
        labelName = row[1].values[1].replace(' ','_')
        renameColumns.append(dict(name=colName, rename=labelName))
 
    ## List of columns to keep in the CAS table
    keepColumns = {'Campaign_Type', 'Platform','Trim_Level','Make','Model_Year','Engine_Model',
                   'Vehicle_Assembly_Plant','Claim_Repair_Start_Date', 'Claim_Repair_End_Date'}
 
    ## Rename and drop columns to make the table easier to use
    castbl.alterTable(columns = renameColumns, keep = keepColumns)
 
    ## Return the CASTable object reference
    return castbl
 
 
 
def save_cas_table_as_csv_files(cas_table_reference):
    """
    Create a subdirectory in Casuser and save multiple CSV files in it.
    """
    ## Create a subdirectory in the Casuser caslib named csv_file_blogs
    conn.addCaslibSubdir(name = 'casuser', path = 'csv_file_blogs')
 
    ## Create a CSV file for each year
    for year in list(cas_table_reference.Model_Year.unique()):
        (cas_table_reference
         .query(f"Model_Year ='{year}'")
         .save(name = f'csv_file_blogs/warranty_claims_{year}.csv', 
               caslib = 'casuser',
               replace = True)
        )
 
    ## Drop the CAS Table
    cas_table_reference.dropTable()
 
    ## View files in the csv_file_blogs subdirectory
    fi = conn.fileInfo(allFiles = True, caslib = 'casuser')
    fi_subdir = conn.fileInfo(path = 'csv_file_blogs', caslib = 'casuser')
    display(fi, fi_subdir)
 
 
## Create the CAS table
castbl = prep_data()
 
## Save the CAS table as a CSV file for each year
save_cas_table_as_csv_files(castbl)
 
## and the results
NOTE: Cloud Analytic Services made the file WARRANTY_CLAIMS_0117.sashdat available as table WARRANTY_CLAIMS in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2015.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2016.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2017.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2018.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2019.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services dropped table warranty_claims from caslib CASUSER(Peter.Styliadis).

The results show that five CSV files named warranty_claims<year>.csv were created in the subdirectory csv_file_blogs in the Casuser caslib.
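As an aside, the rename step in the script builds its list of {'name': ..., 'rename': ...} dictionaries from the column labels. The pattern in isolation, with made-up column names and labels, looks like this:

```python
# Hypothetical (column name, label) pairs standing in for columnInfo() output
col_info = [('prdline', 'Campaign Type'), ('vmodyr', 'Model Year')]

# Build one rename dictionary per column, replacing spaces with underscores
renameColumns = [dict(name=name, rename=label.replace(' ', '_'))
                 for name, label in col_info]
print(renameColumns)
```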

Next, I'll use the fileInfo action to view the new csv_file_blogs subdirectory in the Casuser caslib. In the fileInfo CAS action use the includeDirectories parameter to view subdirectories.

conn.fileInfo(includeDirectories = True, caslib = 'casuser')

Lastly, I'll view the available files in the csv_file_blogs subdirectory. To view files in a subdirectory in a caslib add the folder name in the path parameter.

conn.fileInfo(path = 'csv_file_blogs', caslib = 'casuser')

The results show that the subdirectory has five CSV files.

Load all of the CSV files as a single table

To easily load multiple CSV files as a single in-memory CAS table, use the table.loadTable CAS action. The only requirements when loading multiple CSV files are:

  • All of the CSV files must end with .csv.
  • Set the multiFile option to True in the importOptions parameter.
  • The CSV files must have the same number of columns and the columns must have the same data type.
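The same-columns requirement is worth internalizing. Here is what the concatenation amounts to, sketched with the Python standard library on toy in-memory "files" (an analogy only, not the CAS action itself):

```python
import csv
import io

# Toy in-memory "CSV files" standing in for the files in csv_file_blogs
files = [io.StringIO('Make,Model_Year\nA,2015\nB,2015\n'),
         io.StringIO('Make,Model_Year\nC,2016\n')]

header, rows = None, []
for f in files:
    reader = csv.reader(f)
    cols = next(reader)
    if header is None:
        header = cols            # the first file defines the column names
    assert cols == header, 'columns must match across files'
    rows.extend(reader)          # concatenate the data rows
print(len(rows))  # 3
```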

Here, I'll use the loadTable action with the path parameter to specify the subdirectory csv_file_blogs and the caslib parameter to specify the Casuser caslib. The importOptions parameter uses the fileType option to determine the other parameters that apply. The CSV file type has a multiFile option. If you set the multiFile option to True, it will try to load all CSV files from the path into a single in-memory table. The casOut parameter simply specifies output CAS table information.

conn.loadTable(path="csv_file_blogs", caslib = 'casuser',  ## Specify the subdirectory name (csv_file_blogs) and the input caslib name
               importOptions = {                           ## Specify the import options
                   'fileType' : 'CSV',
                   'multiFile' : True
               },
               casOut = {                                  ## Specify output CAS table information
                   'name' : 'allCSVFiles',
                   'caslib' : 'casuser',
                   'replace' : True
               })
 
## and the results
NOTE: The file, '/cas/data/caslibs/casuserlibraries/peter/csv_file_blogs/warranty_claims_2015.csv' was used to create the CAS Table column names.
NOTE: The CSV file table load for table, 'allCSVFiles' produced 153217 rows from 5 files.
NOTE: Cloud Analytic Services made the file csv_file_blogs available as table ALLCSVFILES in caslib CASUSER(Peter).

The action concatenated each CSV file and loaded them as a single distributed CAS table named ALLCSVFILES.

Next, I'll run the tableInfo action to view available in-memory tables in the Casuser caslib.

conn.tableInfo(caslib = 'casuser')

The action results show one CAS table is in memory.

Then I'll make a client-side reference to the distributed CAS table and run the head method from the SWAT package.

allcsvfilesTbl = conn.CASTable('allcsvfiles', caslib = 'casuser')
allcsvfilesTbl.head()

Finally, I'll run the SWAT value_counts method on the Model_Year column.

(allcsvfilesTbl   ## CAS table
 .Model_Year      ## CAS column
 .value_counts()  ## SWAT value_counts method
)
## and the results
2017.0    70479
2018.0    43975
2016.0    32707
2019.0     3510
2015.0     2546
dtype: int64

The results show that all five years of data were imported into the CAS table, one year per CSV file.

Add file name and path columns to the CAS table

The importOptions parameter has a variety of CSV options you can use to modify how to import the files. Two useful options are showFile and showPath. The showFile option includes a column that shows the CSV file name. The showPath option includes a column that shows the fully-qualified path to the CSV file.

conn.loadTable(path="csv_file_blogs", caslib = 'casuser',  ## Specify the subdirectory name (csv_file_blogs) and the input caslib name
               importOptions = {                           ## Specify the import options
                   'fileType' : 'CSV',
                   'multiFile' : True,                     
                   'showFile' : True,
                   'showPath' : True
               },
               casOut = {                                  ## Specify output CAS table information
                   'name' : 'allCSVFiles_path_info',
                   'caslib' : 'casuser',
                   'replace' : True
               })
 
## and the results
NOTE: The file, '/cas/data/caslibs/casuserlibraries/peter/csv_file_blogs/warranty_claims_2015.csv' was used to create the CAS Table column names.
NOTE: The CSV file table load for table, 'allCSVFiles_path_info' produced 153217 rows from 5 files.
NOTE: Cloud Analytic Services made the file csv_file_blogs available as table ALLCSVFILES_PATH_INFO in caslib CASUSER(Peter).

I'll run the tableInfo action to view available CAS tables.

conn.tableInfo(caslib = 'casuser')


The action shows that two CAS tables are now in memory.

I'll reference and preview the new CAS table.

allcsvfiles_path_infoTbl = conn.CASTable('allcsvfiles_path_info', caslib = 'casuser')
allcsvfiles_path_infoTbl.head()

The preview shows the new CAS table has a column named path and a column named fileName.

Lastly, I'll use the SWAT value_counts method to view how many rows came from each CSV file. I'll specify the CAS table reference, the column name, then the value_counts method.

(allcsvfiles_path_infoTbl
 .fileName
 .value_counts()
)
 
## and the results
warranty_claims_2017.csv    70479
warranty_claims_2018.csv    43975
warranty_claims_2016.csv    32707
warranty_claims_2019.csv     3510
warranty_claims_2015.csv     2546
dtype: int64

The results show the CSV files were concatenated into a single CAS table. We can see how many rows came from each file.

Summary

The SWAT package blends the world of pandas and CAS to process your distributed data. In this example, I focused on using the table.loadTable CAS action to concatenate multiple CSV files into a single distributed CAS table with a single method call.


Getting Started with Python Integration to SAS® Viya® - Part 19 - Load Multiple CSV Files was published on SAS Users.

May 30, 2023
 

Real-world data often exhibits extreme skewness. It is not unusual for data to span many orders of magnitude. Classic examples are the distributions of incomes (impoverished and billionaires) and population sizes (small countries and populous nations). The readership of books and blog posts shows a similar distribution, which is sometimes called the 80-20 rule or the Pareto principle. In these situations, a traditional histogram can be insufficient to visualize the distribution. For these examples, the X axis will be very long, and the tallest bars will be on the left side of the axis, as shown in the histogram to the right. In this histogram, the bars on the left side are so tall that you cannot even see the other bars.

What to do? Well, the usual advice for visualizing data that extends over several orders of magnitude is to use a logarithmic transformation. This article shows three methods to visualize log-transformed values in a histogram in SAS:

  • Use constant bins in X but log-scale the X axis. In this technique, the histogram bins appear to be nonuniform in width. Bins on the left side of the X axis appear wider than bins on the right side of the X axis.
  • Manually log-transform the data, then construct a histogram of the transformed data. This produces a histogram that looks simple but can be hard to interpret. The bins are the same width for the scaled data but are not uniform on the scale of the original data. It can be useful to manually modify the tick values on the X axis if you want the reader to see the original (pre-transformed) data values.
  • If the X axis is acceptable but you want to adjust the heights of the bars, you can log-transform the vertical axis (Frequency or Counts).

Example data that needs to be log-transformed

For this article, we will use simulated data from a mixture of three exponential distributions. This simulation generates data that looks similar to data for "views of articles on my blog" during a one-year period. The following SAS DATA step simulates the data, displays descriptive statistics, and displays a histogram of the data.

/* simulate data from a mixture of three exponential distributions */
data Counts;
call streaminit(123);
do i = 1 to 1100;
   if i < 800 then scale=2000;
   else if i < 1000 then scale = 5000;
   else scale = 14000;
   Views = int( rand("Expo", scale) ); 
   if Views > 200 then
      output;
end;
drop i;
run;
 
proc means data=Counts;
   var Views;
run;
 
title "Pageviews for Top Blog Articles";
proc sgplot data=Counts;
   histogram Views / datalabel binwidth=4000 binstart=2000 scale=count;
   yaxis grid;
run;

The table shows that the least popular post got only 202 views. The most popular post got over 70,000 views. The histogram is shown at the top of this article. It shows the highly skewed distribution of views. The first bar shows that about 70% of the articles got between 0 and 4,000 annual views. There are 21 articles that got more than 20,000 views, but the bars on the right side of the histogram are basically invisible because the first bar is so tall. The following sections show three ways to modify the graph.

Log-scale X axis

One option is to choose histogram bins to be a constant width on the data scale but display the X axis by using a log-scale transformation. On the log scale, the bins on the left side of the X axis appear wider than the bins on the right side.

If you do this, you must make sure that X=0 is to the left of all bars. Otherwise, PROC SGPLOT will issue the error, "NOTE: The log axis cannot support zero or negative values for the axis from plot data.... The axis type will be changed to LINEAR." In the following example, I use the BINWIDTH= and BINSTART= options to ensure that all bars are to the right of 0. (Recall that Views>200 for these data.) With that change, I can use the TYPE=LOG and LOGBASE=10 options to apply a LOG10 transformation to the horizontal values of the graph:

title "Constant Bin Widths on a Log Scale";
proc sgplot data=Counts;
   histogram Views / binwidth=2000 binstart=1100;
   xaxis type=log logbase=10 label="Views (log10 spacing)";
   yaxis grid;
run;
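As a quick sanity check of the options above (assuming BINSTART= specifies the midpoint of the first bin), the leftmost bin edge stays to the right of zero, which is what the log axis requires:

```python
# BINSTART= and BINWIDTH= values from the PROC SGPLOT call above
binstart, binwidth = 1100, 2000

# Left edge of the first bin under the midpoint convention
left_edge = binstart - binwidth / 2
print(left_edge)  # 100.0 -- safely to the right of 0 (recall Views > 200)
```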

I'm not fond of this graph. Even data visualization experts would find it unfamiliar, and trying to explain it to a nonexpert would be a challenge. I do not recommend this approach.

Log-transform the data

Instead of using uniform-width bins in the data coordinates, I prefer to manually log-transform the data and then create a histogram in the transformed coordinates. The following SAS DATA step computes the LOG10 value of the Views variable. Although all the Views in this example are positive, I include an IF-THEN statement that ensures that the program does not transform any nonpositive value. This is a good programming practice. A histogram then visualizes the distribution of the new variable:

/* Log-transform the X values and use non-constant bins */
data Pg / view=Pg;
set Counts;
if Views <= 0 then 
   log10Count = .;
else
   log10Count = log10(Views);
run;
 
title "Log(Pageviews) for Top Blog Articles";
proc sgplot data=Pg;
   histogram log10Count;
   xaxis label="log10(Count)" values=(2 to 5 by 0.5);
   yaxis grid;
run;

To be clear, not everyone would be able to interpret what this histogram reveals about the original data. Furthermore, some audiences might struggle to recall their high school math classes where they learned that log10(Views)=3 implies that Views=1,000, that log10(Views)=4 implies that Views=10,000, and so forth. One way to address the second issue is to create a set of custom tick marks for the X axis that show the scale of the original data, as follows:

/* Create custom tick values: https://blogs.sas.com/content/iml/2014/07/11/create-custom-tick-marks.html */
title "Log(Pageviews) for Top Blog Articles";
proc sgplot data=Pg;
   histogram log10Count;
   xaxis label="Count (log10 spacing)"
         values=(2 to 5 by 0.5)
         valuesdisplay=('100' '316' '1000' '3162' '10,000' '31,623' '100,000' );
   yaxis grid;
run;
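The VALUESDISPLAY= labels are just the original-scale values 10^x for each tick value x, rounded to the nearest integer. A quick check:

```python
# Tick values 2 to 5 by 0.5, as in the VALUES= option
ticks = [x / 2 for x in range(4, 11)]

# Back-transform each tick to the original data scale and round
labels = [round(10 ** t) for t in ticks]
print(labels)  # [100, 316, 1000, 3162, 10000, 31623, 100000]
```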

Log-transform the vertical axis (counts)

For some graphs, it is not the X variable that is the problem, it is the distribution of counts. If one or two bins contain most of the counts, then one option to visualize the data is to log-transform the Y axis. That is, display the log-counts for each bin. For some data, you can do this by using the TYPE=LOG option on the YAXIS statement in PROC SGPLOT.

Unfortunately, this simple modification is not possible for these data. To log-transform an axis, we must ensure that 0 is not included in the range of the axis. This example has bins that have zero counts. For data like these, you cannot transform the Y axis simply by using the TYPE=LOG option.
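The core issue in miniature: the logarithm of a zero count is undefined, so zero-count bins must become missing values instead of being handed to the log transformation (a small Python sketch with hypothetical bin counts):

```python
import math

counts = [120, 0, 7, 0, 2]                      # hypothetical histogram bin counts
log_counts = [math.log10(c) if c > 0 else None  # zero-count bins become missing
              for c in counts]
print(log_counts)
```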

However, you can use a trick to overcome this issue. I have previously shown how you can use PROC UNIVARIATE to output the histogram bins and counts to a SAS data set. (See the section, "A trick to replace the histogram by a high-low plot" in a previous article.) You can then use a high-low plot (same reference) to emulate the histogram. This technique can be used here: for bins that have a nonzero count, apply the log-transformation; for bins that have zero count, set a missing value for the log-count. This is implemented in the following statements:

/* 
use high-low plot instead of histogram: See last section of
https://blogs.sas.com/content/iml/2023/05/01/overlay-curve-histogram-sas.html
*/
proc univariate data=Counts;
   ods select histogram;
   histogram Views / vscale=count outhist=HistOut midpoints=(2000 to 72000 by 4000);
run;
 
data LogYScale;
set HistOut;
Zero = 0; /* explicitly specify the bottom for highlow bars */
if _Count_=0 then 
   LogCount = .;
else
   LogCount = log10(_Count_);
run;
 
title "Use HIGHLOW Plot to Emulate Histogram";
title2 "Midpoints of Bins";
proc sgplot data=LogYScale;
   highlow x=_MidPt_ low=zero high=LogCount / type=bar barwidth=1; /* emulate histogramparm stmt */
   /* use same tick range as for midpoints= option in PROC UNIVARIATE */
   xaxis values=(2000 to 72000 by 4000) valueshint;
   /* do not pad the Y axis at 0 */
   yaxis offsetmin=0 grid label="Log10(Count)";
run;

Notice the difference between this graph and the previous. In this graph, I have not transformed the X axis. The bins are all the same width on the data scale. What is different is the Y axis. This now shows the log(Count). As before, if your audience does not recall that log10(Count)=2 means that Count=100, you can replace the ticks on the Y axis with numbers like 10, 32, and 100.

Summary

When a data distribution is extremely skewed and has a long tail, you might want to use a log-transformation to visualize the distribution. There are two approaches: Use constant-width bins in the data scale or use constant-width bins in the log-transformed scale. The first results in a graph that appears to have non-uniform bins. The second results in a standard histogram for the log-transformed data. I prefer the second method.

A related issue is that sometimes the histogram is dominated by one or two extremely tall bars, which makes it impossible to see the distribution of the rest of the data values. One option is to log-transform the Y axis. If there are bins that have zero counts, you must use a little extra care to handle those bins.

Further reading

Several ideas in this article have been discussed previously, including how to plot data on a log-scale axis.

The post How to use a log-scale on a histogram appeared first on The DO Loop.

May 27, 2023
 

This year, SAS Innovate in Orlando brought executives and other industry luminaries together to inspire attendees to use analytics to outpace tomorrow. Data and analytics enables you to get to the true value of a human being.  Michael Lewis, Author of Moneyball Chief Marketing Officer Jenn Chase, who hosted the event, [...]

A look back at SAS Innovate in Orlando was published on SAS Voices by Caslee Sims

May 25, 2023
 

As AI rapidly advances over the next several years, I’ve been fortunate to have an active role in helping to guide a responsible path forward when it comes to technology’s impact on our daily lives. Currently, this role includes serving as Vice President for the SAS Data Ethics Practice, as an [...]

Fostering AI common sense: The need for critical thinking and healthy skepticism was published on SAS Voices by Reggie Townsend

May 24, 2023
 

Labeling objects in graphs can be difficult. SAS has a long history of providing support for labeling markers in scatter plots and for labeling regions on a map. This article discusses how the SGPLOT procedure decides where to put a label for a polygon. It discusses the advantages and disadvantages of the built-in labeling method. When the built-in method is not sufficient, you can specify the label positions yourself and use the TEXT statement in PROC SGPLOT to add labels wherever you want them, such as at the centroid of a polygon.

How polygons are labeled in PROC SGPLOT

The POLYGON statement in PROC SGPLOT supports the LABEL= option, which enables you to specify a label for each polygon. The documentation does not specify how the position of the label is determined, but the LABELLOC= option supports the values InsideBBox and OutsideBBox, from which we can infer that a bounding box ("BBox") is used to position the labels. By default, the label for a polygon is placed at the center of the bounding box for the polygon. Optionally, you can use the LABELPOS= option to place the label at the left, right, top, or bottom of the bounding box. You can also use the LABELLOC= option to specify whether the label is inside or outside of the bounding box.

The following example shows the default position of a label for a polygon, which is the center of the bounding rectangle. The data defines three triangles: an acute triangle, a right triangle, and an obtuse triangle. The labels are 'A', 'R', and 'O', respectively.

/* explore how triangles are labeled in PROC SGPLOT */
data Triangles;
length ID $1;
input ID n @;
do vertex = 1 to n;
   input x y @;
   output;
end;
/* ID  x1 y1 x2 y2 x3 y3 */
datalines;
R 3  0 0  1 0  0 1
O 3  5 0  6 0  3 1
A 3  0 2  6 2  2 3
;
 
ods graphics / width=640px height=400px;
title "Labels for Polygons";
title2 "Position Are Determined by a Bounding Box";
proc sgplot data=Triangles noautolegend aspect=0.5;
   polygon x=x y=y id=ID / group=ID fill fillattrs=(transparency=0.5) outline
      label=ID LabelLoc=InsideBBox LabelPos=center labelattrs=(size=18); 
   xaxis grid;
   yaxis grid;
run;

In the figure, I have overlaid a semi-transparent bounding rectangle for each triangle so that it is obvious that the label is placed at the center of the bounding rectangle, not in the center of the triangle. You can use the LABELPOS= option to place the label at other positions relative to the bounding rectangle.

Centroids of polygons

The graph shows that the center of a bounding rectangle is inside the acute triangle but is not inside the right and obtuse triangles. If you want to ensure that the label for a triangle is inside the triangle, you can use the centroid of the triangle. In fact, the centroid of any convex polygon is inside the polygon. (For a non-convex polygon, the centroid might not be inside the polygon. The classic example is an L-shaped figure.)
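For reference, the centroid of a simple polygon with vertices (x_1, y_1), ..., (x_N, y_N), where (x_{N+1}, y_{N+1}) = (x_1, y_1), is given by the standard signed-area ("shoelace") formulas, which the IML code later in this article implements:

```latex
A   = \frac{1}{2}\sum_{i=1}^{N} \left(x_i\,y_{i+1} - x_{i+1}\,y_i\right)
C_x = \frac{1}{6A}\sum_{i=1}^{N} \left(x_i + x_{i+1}\right)\left(x_i\,y_{i+1} - x_{i+1}\,y_i\right)
C_y = \frac{1}{6A}\sum_{i=1}^{N} \left(y_i + y_{i+1}\right)\left(x_i\,y_{i+1} - x_{i+1}\,y_i\right)
```

Here A is the signed area of the polygon, which is positive when the vertices are traversed counterclockwise.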

I have previously written about how to compute the centroid of a polygon in SAS, so let's compute the centroids for the three triangles and use that position for the label. This requires the following steps:

  1. Compute the centroids: I will reuse the SAS IML code from my previous article. To reuse the code, I need to convert the character labels into an equivalent set of numeric values. I can do this by computing the rank of the character values.
  2. Write the centroids to a SAS data set and concatenate the polygon data and the centroids.
  3. Call PROC SGPLOT. Use the POLYGON statement to plot the polygons and use the TEXT statement to overlay the labels at the centroid locations.

The following SAS IML program carries out the computation. I first copy/paste the previously written PolyCentroid function, which computes the centroids, and store it by using the STORE statement. The main program uses the LOAD statement to load the PolyCentroid function, which simplifies the main program.

/* "Compute the centroid of a polygon in SAS" (Wicklin, 2016)
   https://blogs.sas.com/content/iml/2016/01/13/compute-centroid-polygon-sas.html */
proc iml;
/* _PolyCentroid: Return the centroid of a simple polygon.
   P  is an N x 2 matrix of (x,y) values for consecutive vertices. N > 2. */
start _PolyCentroid(P);
   lagIdx = 2:nrow(P) || 1;              /* vertices 2, 3, ..., N, 1 */
   xi   = P[ ,1];       yi = P[ ,2];     /* x_i and y_i for i=1..N   */
   xip1 = xi[lagIdx]; yip1 = yi[lagIdx]; /* x_{i+1} and y_{i+1}, x_{N+1}=x_1 */
   d = xi#yip1 - xip1#yi;                /* vector of difference of products */
   A = 0.5 * sum(d);                     /* signed area of polygon */
   cx = sum((xi + xip1) # d) / (6*A);    /* X coord of centroid */
   cy = sum((yi + yip1) # d) / (6*A);    /* Y coord of centroid */
   return( cx || cy );
finish;
/* If the polygon P has two columns, return the centroid. If it has three 
   columns, assume that the third column is an ID variable that identifies
   distinct polygons. Return the centroids of the multiple polygons. */
start PolyCentroid(P);
   if ncol(P)=2 then  return( _PolyCentroid(P) );
   ID = P[,3];
   u = uniqueby(ID);         /* starting index for each group */
   result = j(nrow(u), 2);   /* allocate vector to hold results */
   u = u // (nrow(ID)+1);    /* append (N+1) to end of indices */
   do i = 1 to nrow(u)-1;    /* for each group... */
      idx = u[i]:(u[i+1]-1); /* get rows in group */
      result[i,] = _PolyCentroid( P[idx, 1:2] ); 
   end;
   return( result );
finish;
store module=(_PolyCentroid PolyCentroid);
QUIT;

Now run the main program, which uses the PolyCentroid function. It also uses the rank of the labels to create a numerical vector that has a one-to-one correspondence with the labels:

proc iml;
load module=(PolyCentroid);
 
/* read polygon data */
use Triangles;
   read all var {x  y} into P;
   read all var {ID};
close;
 
/* If the ID variable is character, convert it to a numerical vector. See
   https://blogs.sas.com/content/iml/2023/05/22/rank-character-variables-sas.html */
start Ranktiec_Dense(C);
   D = design(C);     /* an n x k matrix, where k = ncol(unique(C)) */
   rank = D * T(1:ncol(D));
   return rank;   
finish;
 
nID = Ranktiec_Dense(ID);
 
/* use nID as third column in polygon matrix */
P = P || nID;
C = PolyCentroid(P);
/* Get the label for each polygon. For triangles, these are the elements {1, 4, 7, ...} */
firstIdx = do(1, nrow(ID), 3);
cID = ID[firstIdx];
*print C[r=cID];
 
/* write to SAS data set */
create Centroids from cID C[c={'cID' 'cx' 'cy'}];
   append from cID C;
close;
QUIT;
 
/* concatenate polygons and centroids */
data All;
set Triangles Centroids;
run;
 
title "Label Polygons at Centroid";
proc sgplot data=All noautolegend;
   polygon x=x y=y id=ID / group=ID fill fillattrs=(transparency=0.5) outline; 
   text x=cx y=cy text=cID / group=cID vcenter=BBox textAttrs=(Size=18);
   xaxis grid;
   yaxis grid;
run;

Success! The label for each triangle is now displayed at the center of the triangle.

Manually specifying the label positions

For small collections of polygons, you can, of course, manually specify the positions of the labels. Create a data set that is similar to the Centroids data set, in which the labels are placed at whatever (cx, cy) coordinates you choose.
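As a sketch of the manual approach (the coordinates below are hand-picked for the Triangles data, not computed), create a small data set of label positions and concatenate it with the polygon data, as in the centroid example:

```
/* Sketch: manually chosen label positions (coordinates are hand-picked) */
data ManualLabels;
   length cID $1;
   input cID $ cx cy;
datalines;
R  0.25 0.25
O  4.5  0.3
A  2.5  2.3
;
 
data All2;
   set Triangles ManualLabels;
run;
 
title "Manually Positioned Labels";
proc sgplot data=All2 noautolegend;
   polygon x=x y=y id=ID / group=ID fill fillattrs=(transparency=0.5) outline;
   text x=cx y=cy text=cID / group=cID textattrs=(size=18);
   xaxis grid;
   yaxis grid;
run;
```

Because the TEXT statement reads only the (cx, cy, cID) variables, the polygon observations and the label observations can coexist in one data set.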

Summary

This article explains that the POLYGON statement in PROC SGPLOT supports the LABEL= option for labeling a polygon. A label is placed according to a bounding box that surrounds the polygon. You can place the label at the center of the bounding box (the default), or you can use the LABELPOS= option to place the label at other positions relative to the bounding rectangle. If the LABELPOS= option does not provide you with enough flexibility, you can use the TEXT statement to manually label the polygons. This article shows how to compute the centroid for a polygon and to place a label at the centroid.

The post How does PROC SGPLOT position labels for polygons? appeared first on The DO Loop.