This post demonstrates how to rank data and how to place these ranks into roughly equal groups.

There are certain variables, such as annual salary, that are highly skewed. There are many who earn between \$50,000 and \$150,000, but some who earn millions or hundreds of millions of dollars a year. Trying to use variables like annual salary in statistical models typically violates assumptions of many popular statistical techniques. There are several solutions to the types of distribution problems we just described. One solution is to use a transformation like a logarithm of a value to "bring in the tail." Another solution is to substitute ranks for the original values. For example, the lowest salary would be assigned a rank of one, the next highest would be assigned a rank of two, and so forth. Another method is to place all of the values into a number of bins. For example, you could place all the salaries into ranges such that there would be approximately an equal number of values in each range.

You can use SAS Studio tasks to create ranks and, with a tiny bit of editing, create salary ranges.

Let's start with a data set called Salary that was created by a small program using a random number function. Shown below is a histogram and a smooth line representing 1,000 values of salary from this data set.

You see a grouping of values on the left side of the distribution and a few very high salaries in the right tail. For curious readers, here is the program that generated these data values.

The RAND function can generate quite a few distributions, such as uniform and normal. For this program, an exponential distribution was used.
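A minimal sketch of a program of this kind is shown here. It assumes an exponential distribution, as described above; the seed, the base and scale constants, and the variable name Subj are illustrative choices, not the author's values.

```sas
*Sketch: generate 1,000 skewed salary values;
data Salary;
   call streaminit(13579);            /* illustrative seed */
   do Subj = 1 to 1000;
      /* rand('exponential') has a mean of 1, so shift and scale it */
      Salary = 50000 + 60000 * rand('exponential');
      output;
   end;
   format Salary dollar12.;
run;

title "Distribution of Simulated Salaries";
proc sgplot data=Salary;
   histogram Salary;
   density Salary / type=kernel;      /* smooth line over the histogram */
run;
```

Changing the seed or the constants changes the exact values but not the overall shape: a cluster on the left and a long right tail.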

Suppose you plan to use yearly salary in a binary logistic regression model. Using the actual values from the Salary data set would not work well. Let's start out by creating a new variable that represents the rank of salary. In SAS Studio, this is easily done using the Rank Data task as one of the selections under the Data tab. You can see this in the figure below.

You choose the data set and variable to rank on the DATA tab, like this.

The Salary data set was selected, and the variable Salary was chosen as the variable (column) to rank. Finally, Rank_Salary was selected for the output data set name. A histogram of the ranks is, as you would expect, uniform ranging from one to 1,000 (see figure below).
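For readers who prefer code, the task selections above correspond roughly to a PROC RANK step like the following sketch. It assumes the Salary data set already exists in the WORK library, and the name given in the RANKS statement is an assumption; the code that the task actually generates may differ in detail.

```sas
*Sketch: rank the salaries from lowest (1) to highest;
proc rank data=work.Salary out=work.Rank_Salary;
   var Salary;
   ranks Rank_Salary;      /* new variable that holds the ranks */
run;

title "Distribution of Salary Ranks";
proc sgplot data=work.Rank_Salary;
   histogram Rank_Salary;
run;
```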

How can you place these 1,000 values into 10 bins? To do this, you click the CODE tab and then click Edit (circled in the figure below).

All you need to do is add the PROC RANK option Groups=10 to this program as shown next.

This option assigns each value to one of 10 groups of roughly equal size, numbered 0 through 9. Below is a histogram of the variable Rank_Salary with the Groups= option included.
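The edited step might look like the sketch below, again assuming that the Salary data set exists in the WORK library and that the rank variable is named Rank_Salary. With GROUPS=10, PROC RANK writes group numbers 0 through 9 instead of ranks 1 through 1,000.

```sas
*Sketch: place the salaries into 10 roughly equal groups;
proc rank data=work.Salary out=work.Rank_Salary groups=10;
   var Salary;
   ranks Rank_Salary;      /* takes the values 0 through 9 */
run;

*Check that each group holds about the same number of values;
proc freq data=work.Rank_Salary;
   tables Rank_Salary / nocum;
run;
```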

This new variable would work quite well in a logistic regression model or other types of regression.

In SAS Studio, the ordering of rows and columns in the Table Analysis task is, by default, determined by the internal values of the variables used in the table. The table arranges the values alphabetically or numerically by increasing value. For example, traditional coding uses 1 for Yes and 0 for No, so No appears as the first row and the first column because its internal value is 0. There are times when it makes more sense to change the order of the rows and/or columns.

Suppose you have data on risk factors for having a heart attack (including high blood pressure) and outcome data (heart attack). A data set called Risk has data on the status of blood pressure and heart attack (simulated data).

Here are the first 10 observations from that data set:

You can use PROC FREQ to create a 2x2 table or you can use the SAS Studio task called Table Analysis (in the Statistics task list) to create your table. Regardless of whether you decide to write a program or use a SAS Studio task, the resulting table looks like this:

Because we are more interested in what causes heart attacks, we would prefer to have Yes (1) as the first row and column of the table. Here is how to do it with a short SAS program:

You create a format that labels 1 as '1:Yes' and 0 as '2:No' and associate this format with both variables in the PROC FREQ step. You also include the PROC FREQ option ORDER=formatted. This option orders values by their formatted values rather than the default ordering—by the internal values. The original table placed 0 before 1 for that reason. By being tricky and placing the 1: and 2: in the format label, you are forcing the Yes values to come before the No values (otherwise, 'No' would come before 'Yes' – alphabetical order). Here is the output:
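The approach just described can be sketched as follows. The variable names High_BP and Heart_Attack, the format name YesNo, and the tiny Risk_Demo data set are assumptions made so that the example runs on its own; substitute the Risk data set and its actual variable names.

```sas
*Sketch: a tiny stand-in data set with two 0/1 variables;
data Risk_Demo;
   input High_BP Heart_Attack @@;
datalines;
1 1  1 0  0 0  0 1  1 1  0 0  0 0  1 0
;

*The 1: and 2: prefixes force Yes to sort before No;
proc format;
   value YesNo 1 = '1:Yes'
               0 = '2:No';
run;

title "Heart Attack by Blood Pressure Status";
proc freq data=Risk_Demo order=formatted;
   tables High_BP * Heart_Attack;
   format High_BP Heart_Attack YesNo.;
run;
```

Note that ORDER=FORMATTED is an option on the PROC FREQ statement, so it applies to every table requested in the step.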

If you decided to use a SAS Studio task to create the table, you would open the Code window and click the Edit icon. You could then add the PROC FORMAT step and a FORMAT statement, and add the ORDER=formatted option to the PROC FREQ statement.

For the curious readers who would like to see how the Risk data set was created, here is the code:

Here the RAND function is generating a Bernoulli distribution (0 or 1), based on the probability of getting a 1. You specify this probability using the second argument of the function. One more note: The CALL STREAMINIT statement, with a fixed positive seed, makes the program generate the same series of random numbers every time you run it. If you omit this statement, the program generates a different series of random numbers every time you run it.
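A minimal sketch of such a program follows. The seed, the sample size, the probabilities, and the variable names (Subj, High_BP, Heart_Attack) are all illustrative assumptions rather than the values used to build the Risk data set.

```sas
*Sketch: simulate blood pressure status and heart attack outcome;
data Risk;
   call streaminit(12345);                 /* fixed seed: repeatable results */
   do Subj = 1 to 500;
      High_BP = rand('bernoulli', 0.3);    /* 1 with probability 0.3 */
      /* Make a heart attack more likely when blood pressure is high */
      if High_BP = 1 then Heart_Attack = rand('bernoulli', 0.20);
      else Heart_Attack = rand('bernoulli', 0.10);
      output;
   end;
run;

title "First 10 Observations from Risk";
proc print data=Risk(obs=10) noobs;
run;
```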

The more I use SAS Studio in the cloud via SAS OnDemand for Academics, the more I like it. To demonstrate how useful the Files tab is, I'm going to show you what happens when you drag a text file, a SAS data set, and a SAS program into the Editor window.

I previously created a folder called MyBookFiles and uploaded several files from my local computer to that folder.  You can see a partial list of files in the figure below.

Notice that there are text files, SAS data sets, SAS programs, and some Excel workbooks. Look what happens when I drag a text file (Blank_Delimiter.txt) into the Editor window.

No need to open Notepad to view this file—SAS Studio displays it for you. What about a SAS data set? As an example, I dragged a SAS data set called blood_pressure into the Editor.

You see a list of variables and some of the observations in this data set.  There are vertical and horizontal scroll bars (not shown in the figure) to see more rows or columns. If you want to see a listing of the entire data set or the first 'n' observations, you can run the List Data task, located under the Tasks and Utilities tab.
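If you would rather write the listing step yourself, a short PROC PRINT does much the same job as the List Data task. This sketch assumes that the MyBookFiles folder sits directly under your home directory; the libref MyBook is a made-up name.

```sas
*Sketch: list the first 10 observations of a data set in MyBookFiles;
libname MyBook "~/MyBookFiles";

title "First 10 Observations from blood_pressure";
proc print data=MyBook.blood_pressure(obs=10);
run;
```

Remove the OBS= data set option to list the entire data set.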

For the last example, I dragged a SAS program into the editor. It appears exactly the same as if I opened it in my stand-alone version of SAS.

At this point, you can run the program or continue to write more SAS code. By the way, the tilde (~) used in the INFILE statement is a shortcut for your home directory. Follow it with the folder name and the file name.
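As a small illustration of the tilde shortcut, the sketch below copies each line of the text file to the SAS log. It assumes that MyBookFiles is directly under your home directory and makes no assumption about the layout of the file.

```sas
*Sketch: read a text file using the ~ shortcut and echo it to the log;
data _null_;
   infile "~/MyBookFiles/Blank_Delimiter.txt";
   input;             /* read the next line into the input buffer */
   put _infile_;      /* write that line to the log */
run;
```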

You can read more about SAS Studio in the cloud in my latest book, Getting Started with SAS Programming: Using SAS Studio in the Cloud.

As a SAS consultant I have been an avid user of SAS Enterprise Guide for as long as I can remember. It has been not just my go-to tool, but that of many of the SAS customers I have worked with over the years.

It is easy to use, the interface intuitive, a Swiss Army knife when it comes to data analysis. Whether you’re looking to access SAS data or import good old Excel locally, join data together or perform data analysis, a few clicks and ta-dah, you’re there! Alternatively, if you insist on coding or like me, use a bit of both, the ta-dah point still holds.

SAS Enterprise Guide, or EG as it is commonly known, is a mature SAS product with many years of R&D, an established user base, a reliable and trusted product. So why move to SAS Studio? Why should I leave the comfort of what works?

For the last nine months I have been working with one of the UK’s largest supermarkets, answering that exact question as they make the journey from SAS Enterprise Guide to SAS Studio. EG is used widely across several supermarket operations, including:

• supply chain (to look at wastage and stock availability)
• marketing analytics (to look at customer behaviour and build successful campaigns)
• fraud detection (to detect misuse of vouchers).

## What is SAS Studio?

Firstly, let's answer the "what is SAS Studio" question. It is the browser-based interface for SAS programmers to run code or use predefined tasks to automatically generate SAS code. Since there is nothing to install on your desktop, you can access it from almost any machine: Windows or Mac. And SAS Studio is brought to you by the same SAS R&D developers who maintain SAS Enterprise Guide.

SAS Studio with Ignite (dark) theme

### 1. Still does the regular stuff

It allows you to access your data, libraries and existing programs and import a range of data sources including Excel and CSV. You can code or use the tasks to perform analysis. You can build queries to join data, create simple and complex expressions, filter and sort data.

But it does much more than that... So what cool things can you do with SAS Studio?

### 2. Use the processing power of SAS Viya

SAS Studio (v5.2 onwards) works on SAS Viya. Previously SAS 9 had the compute server aka the workspace server as the processing engine. SAS Viya has CAS, the next generation SAS run time environment which makes use of both memory and disk. It is distributed, fault tolerant, elastic and can work on problems larger than the available RAM. It is all centrally managed, secure, auditable and governed.

### 3. Cool new functionality

SAS Studio comes with many enhancements and cool new functionality:

• Custom tasks. You can easily build your own custom tasks (software developer skills not required) so others without SAS coding skills can utilise them. Learn more in this Ask the Expert session.
• Code snippets. It comes with pre-defined code snippets, commonly used bits of code that you can save and reuse. Additionally, you can create your own which you can share with colleagues. Coders love that these code snippets can be used with keystroke abbreviations.
• Background submit.  This allows you to run code in the background whilst you continue to work.
• DATA step debugger. First added into SAS Enterprise Guide, SAS Studio now offers an interactive DATA step debugger as well.
• Flexible layout for your workspace. You can have multiple tabs open for each program, and open multiple datasets and items.
• FEDSQL. The query window

DATA step debugger in SAS Studio

### 4. Seamlessly access the full suite of SAS Viya capabilities

A key benefit of SAS Studio is the ease with which you can move from writing code to doing some data discovery, visualisation and model building. Previously, in the SAS 9 world, you may have used EG to access and join your data and then moved to SAS Enterprise Miner, a different interface that is installed separately, to build a model. Those days are long gone.

To illustrate the point, if I wanted to build a campaign to see who would respond to a supermarket voucher, I could access my customer data and join that to my transaction and products data in SAS Studio. I could then move into SAS Visual Analytics to identify the key variables I would need to build an analytical model and even the best model to build. From there I would move to SAS Visual Data Mining and Machine Learning to build the model. I could very easily use the intuitive point-and-click pipeline interface to build several models, incorporating R or Python, to find the best model. This would all be done within one browser-based interface, with the data loaded only once.

This tutorial from Christa Cody illustrates this coding workflow in action.

## The Road to SAS Studio

SAS Studio clearly has a huge number of benefits. It does the regular stuff you would expect, but it also brings a host of cool new functionality and the processing power of SAS Viya, not to mention allowing you to move seamlessly to the next steps of the analytical and decisioning journey, including model building and creating visualisations.

### Change management + technical enablement = success

Though adoption of modern technology can bring significant benefits to enterprise organisations, as this supermarket is seeing, it is not without its challenges. Change is never easy and the transition from EG to Studio will take time, especially with a mature, well-liked and versatile product like EG.

The cultural challenge that new technology presents should not be underestimated and can be a barrier to adoption. Newer technology requires new approaches and a different way of working across diverse user communities, many of whom have well-established working practices and may, in some cases, resist change. The key is to invest time with these communities, explain how newer technology can support their activities more efficiently and provide them with broader capability.

Visit the Learn and Support center for SAS Studio.

While working at the Rutgers Robert Wood Johnson Medical School, I had access to data on over ten million visits to emergency departments in central New Jersey, including ICD-9 (International Classification of Disease – 9th edition) codes along with some patient demographic data.

I also had the ozone level from several central New Jersey monitoring stations for every hour of the day for ten years. I used PROC REG (and ARIMA) to assess the association between ozone levels and the number of admissions to emergency departments diagnosed as asthma. Some of the predictor variables, besides ozone level, were pollen levels and a dichotomous variable indicating if the date fell on a weekend. (On weekdays, patients were more likely to visit their personal physician than on a weekend.) The study showed a significant association between ozone levels and asthma attacks.

It would have been nice to have the incredible diagnostics that are now produced when you run PROC REG. Imagine if I had SAS Studio back then!

In the program, I used a really interesting trick. (Thank you Paul Grant for showing me this trick so many years ago at a Boston Area SAS User Group meeting.) Here's the problem: there are many possible codes such as 493, 493.9, 493.100, 493.02, and so on that all relate to asthma. The straightforward way to check an ICD-9 code would be to use the SUBSTR function to pick off the first three digits of the code. But why be straightforward when you can be tricky or clever? (Remember Art Carpenter's advice to write clever code that no one can understand so they can't fire you!)

The following program demonstrates the =: operator:

```
*An interesting trick to read ICD codes;
data ICD_9;
   input ICD : $7. @@;
   *Keep only the codes that begin with 493;
   if ICD =: "493" then output;
datalines;
493 770.6 999 493.9 493.90 493.100
;
title "Listing of All Asthma Codes";
proc print data=ICD_9 noobs;
run;
```

Normally, when SAS compares two strings of different length, it pads the shorter string with blanks to match the length of the longer string before making the comparison. The =: operator truncates the longer string to the length of the shorter string before making the comparison.
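To see the two approaches side by side, here is a small sketch that compares the =: operator with the SUBSTR approach mentioned earlier. The variable names Prefix_Match and Substr_Match are made up for this illustration.

```sas
*Sketch: compare the =: operator with SUBSTR for prefix matching;
data _null_;
   length ICD $ 7;
   do ICD = "493", "493.9", "770.6";
      Prefix_Match = (ICD =: "493");                /* 1 if ICD begins with 493 */
      Substr_Match = (substr(ICD, 1, 3) = "493");   /* same test using SUBSTR */
      put ICD= Prefix_Match= Substr_Match=;
   end;
run;
```

Both flags are 1 for 493 and 493.9 and 0 for 770.6, so the two tests agree here; the =: version is simply shorter to write.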

The t-test is a very useful test that compares one variable (perhaps blood pressure) between two groups. T-tests are called t-tests because the test results are all based on t-values. T-values are an example of what statisticians call test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. It is used to determine whether there is a significant difference between the means of two groups. The procedure that calculates the test statistic compares your data to what is expected under the null hypothesis. Parametric tests such as the t-test assume that the dependent variable is normally distributed. When we assume a normal distribution exists, we can identify the probability of a particular outcome. There are several SAS Studio tasks that include options to test this assumption. Let's use the t-test task as an example.

You start by selecting:

Tasks and Utilities -> Tasks -> Statistics -> t Tests

On the DATA tab, select the Cars data set in the SASHELP library. Next request a Two-sample test, with Horsepower as the Analysis variable and Cylinders as the Groups variable. Use a filter to include only 4- or 6-cylinder cars. It should look like this:

On the OPTIONS tab, check the box for Tests for normality as shown below.

All the tests for normality for both 4-cylinder and 6-cylinder cars reject the null hypothesis that the data values come from a population that is normally distributed. (See the figure below.)
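If you prefer to write code, the selections made in the task correspond roughly to the following sketch. It is a hand-written equivalent, not the code that the task generates: PROC UNIVARIATE with the NORMAL option produces the tests for normality, and PROC TTEST runs the two-sample test.

```sas
*Sketch: tests for normality for 4- and 6-cylinder cars;
title "Tests for Normality: Horsepower by Cylinders";
proc univariate data=sashelp.cars normal;
   where Cylinders in (4, 6);
   class Cylinders;
   var Horsepower;
run;

*Two-sample t-test comparing horsepower between the two groups;
title "t Test: Horsepower by Cylinders";
proc ttest data=sashelp.cars;
   where Cylinders in (4, 6);
   class Cylinders;
   var Horsepower;
run;
```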

### Should you abandon the t-test results and run a nonparametric test analysis such as a Wilcoxon Rank Sum test that does not require normal distributions?

This is the point where many people make a mistake. You cannot simply look at the results of the tests for normality to decide if a parametric test is valid or not. Here is the reason: When you have large sample sizes (in this data set, there were 136 4-cylinder cars and 190 6-cylinder cars), the tests for normality have more power to reject the null hypothesis and often result in p-values less than .05. When you have small sample sizes, the tests for normality will not be significant unless there are drastic departures from normality. It is with small sample sizes where departures from normality are important.

The bottom line is that the tests for normality often lead you to make the wrong decision. You need to look at the distributions and decide if they are somewhat symmetrical. The central limit theorem states that the sampling distribution of means will be approximately normally distributed if the sample size is sufficiently large. "Sufficiently large" is a judgment call. If the distribution is symmetrical, you may perform a t-test with sample sizes as small as 10 or 20.

The figure below shows you the distribution of horsepower for 4- and 6-cylinder cars.

With the large sample sizes in this data set, you should feel comfortable in using a t-test. The results, shown below, are highly significant.

If you are in doubt of your decision to use a parametric test, feel free to check the box for a nonparametric test on the OPTIONS tab. Running a Wilcoxon Rank Sum test (a nonparametric alternative to a t-test), you also find a highly significant difference in horsepower between 4- and 6-cylinder cars. (See the figure below.)
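In code, the nonparametric comparison can be requested with PROC NPAR1WAY and the WILCOXON option. This is a sketch of an equivalent step, not the code produced by the task.

```sas
*Sketch: Wilcoxon Rank Sum test for 4- versus 6-cylinder cars;
title "Wilcoxon Rank Sum Test: Horsepower by Cylinders";
proc npar1way data=sashelp.cars wilcoxon;
   where Cylinders in (4, 6);
   class Cylinders;
   var Horsepower;
run;
```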

You can read more about assumptions for parametric tests in my new book, A Gentle Introduction to Statistics Using SAS Studio.

I have been programming SAS for a LONG time and have never seen much in the way of programming standards. For example, most SAS programmers indent the statements within DATA and PROC steps (I like three spaces). Most programmers do not like to see more than one statement on a line and most agree that there should be blank lines between program boundaries (DATA and PROC steps).

I thought I would share some of my thoughts on programming standards, with the hope that others will chime in with their ideas.

• I like to indent all the statements in a DO group or DO loop. If there are nested groups, each one gets indented as well.
• I prefer variable names in proper case.
• I am not a fan of camel-case. For example, I prefer Weight_Kg to WeightKg. The reason that some programmers like camel-case is that SAS will automatically split a variable name at a capital letter in some headings.
• I like my TITLE statements in open code, not inside a PROC. To me, that makes sense because TITLE statements are global.
• There should be no conversion messages (character to numeric or numeric to character) in the SAS log. For example use Num = INPUT(Char_Num,12.); instead of Num = 1*Char_Num;. The latter statement forces an automatic character to numeric conversion and places a message in the log.
• I always use the statement ODS NOPROCTITLE;. This eliminates the default SAS procedure name at the top of the output.
• Although fewer and fewer people are reading raw text data, I like my @ signs to all line up in my INPUT statement.
• I like to use the /* and */ comments to define all macro variables. For example:

Notice that I prefer named parameters in my macros, instead of positional parameters.
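As an illustration of these two preferences, here is a sketch of a small macro with named (keyword) parameters, each documented with a /* */ comment. The macro name Print_First and its parameters Dsn and Nobs are hypothetical.

```sas
*Sketch: a macro with commented, named parameters;
%macro Print_First(
   Dsn=,         /* Name of the data set to list      */
   Nobs=10       /* Number of observations to print   */
   );
   title "First &Nobs Observations from &Dsn";
   proc print data=&Dsn(obs=&Nobs) noobs;
   run;
%mend Print_First;

%Print_First(Dsn=sashelp.class, Nobs=5)
```

Because the parameters are named, the call documents itself and the arguments can be given in any order.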

If this seems like too much work - SAS Studio has an automatic formatting tool that can help standardize your programs. For example, look at the code below:

Really ugly, right? Here is how you can use the automatic formatting tool in SAS Studio.

When you click this icon, the program now looks like this:

That’s pretty much the way I would write it. By the way, if you don't like how Studio formatted your code, enter a control-z to undo it.

It seems that everyone knows about GitHub -- the service that hosts many popular open source code projects. The underpinnings of GitHub are based on Git, which is itself an open-source implementation of a source management system. Git was originally built to help developers collaborate on Linux (yet another famous open source project) -- but now we all use it for all types of projects.

There are other free and for-pay services that use Git, like Bitbucket and GitLab. And there are countless products that embed Git for its versioning and collaboration features. In 2014, SAS developers added built-in Git support for SAS Enterprise Guide.

Since then, Git (and GitHub) have grown to play an even larger role in data science operations and DevOps in general. Automation is a key component for production work -- including check-in, check-out, commit, and rollback. In response, SAS has added Git integration to more SAS products, including:

• the Base SAS programming language, via a collection of SAS functions
• SAS Data Integration Studio, via a new source control plugin
• SAS Studio (experimental in v3.8)

You can use this Git integration with any service that supports Git (GitHub, GitLab, etc.), or with your own private Git servers and even just local Git repositories.

## SAS functions for Git

Git infrastructure and functions were added to SAS 9.4 Maintenance 6. The new SAS functions all have the helpful prefix of "GITFN_" (signifying "Git fun!", I assume). Here's a partial list:

| Function | Description |
| --- | --- |
| GITFN_CLONE | Clones a Git repository (for example, from GitHub) into a directory on the SAS server. |
| GITFN_COMMIT | Commits staged files to the local repository. |
| GITFN_DIFF | Returns the number of diffs between two commits in the local repository and creates a diff record object for the local repository. |
| GITFN_PUSH | Pushes the committed files in the local repository to the remote repository. |
| GITFN_NEW_BRANCH | Creates a Git branch. |

The function names make sense if you're familiar with Git lingo. If you're new to Git, you'll need to learn the terms that go with the commands: clone, repo, commit, stage, blame, and more. This handbook provided by GitHub is friendly and easy to read. (Or you can start with this xkcd comic.)

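You can call these functions from a DATA step. For example, this step reports the Git version and clones a repository: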

```
data _null_;
   version = gitfn_version();
   put version=;
   rc = gitfn_clone("https://github.com/sascommunities/sas-dummy-blog/",
                    "c:\Projects\sas-dummy-blog");
   put rc=;
run;
```

In one line, this function fetches an entire collection of code files from your source control system. Here's a more concrete example that fetches the code to a work space, then runs a program from that repository. (This is safe for you to try -- here's the code that will be pulled/run. It even works from SAS University Edition.)

```
options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/sas-dummy-blog;
libname repo "&repoPath.";
libname repo clear;

/* Fetch latest code from GitHub */
data _null_;
   rc = gitfn_clone("https://github.com/sascommunities/sas-dummy-blog/",
                    "&repoPath.");
   put rc=;
run;

/* run the code in this session */
%include "&repoPath./rng_example_thanos.sas";
```

You could use the other GITFN functions to stage and commit the output from your SAS jobs, including log files, data sets, ODS results -- whatever you need to keep and version.

## Using Git in SAS Data Integration Studio

SAS Data Integration Studio has supported source control integration for many years, but only for CVS and Subversion (still in wide use, but they aren't media darlings like GitHub). By popular request, the latest version of SAS Data Integration Studio adds support for a Git plug-in.

See the documentation for details.

Read more about setup and use as part of our "Custom Tasks Tuesday" series.

## Using Git in SAS Enterprise Guide

This isn't new, but I'll include it for completeness. SAS Enterprise Guide has built-in Git repository support for SAS programs that are stored in your project file. You can use this feature without having to set up any external Git servers or repositories. Also, SAS Enterprise Guide can recognize when you reference programs that are managed in an external Git repository. This integration enables features like program history, compare differences, commit, and more. Read more and see a demo of this in action here.

If you use SAS Enterprise Guide to edit and run SAS programs that are managed in an external Git repository, here's an important tip. Change your project file properties to "Use paths relative to the project for programs and importable files." You'll find this checkbox in File->Project Properties.

With this enabled, you can store the project file (EGP) and any SAS programs together in Git, organized into subfolders if you want. As long as these are cloned into a similar structure on any system you use, the file paths will resolve automatically.

The concept of "current working directory" is important within any SAS program that reads or creates external files. In SAS, when you reference a file location with a relative path (for example, "./projects/mydata.pdf"), that file reference resolves to an absolute path by way of the working directory. You can control the initial working directory by modifying the shell scripts that launch the SAS process. A simple SAS macro allows you to learn the current working directory. The macro uses a trick: it assigns a SAS fileref to the current path ('.') and then grabs the full path of that fileref. Read the article for the full source (it's only about 7 lines). Here's how you would use it:

```56         %put Current path is %curdir;
Current path is C:\WINDOWS\system32
```
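For reference, a macro along these lines could look like the sketch below. It follows the fileref trick described above, using the FILENAME and PATHNAME functions, but it is an assumption about the approach and not the source from the article.

```sas
/* Sketch: return the current working directory as macro text */
%macro curdir;
   %local fr rc curdir;
   /* Assign a system-generated fileref to the current path */
   %let rc = %sysfunc(filename(fr, .));
   /* Resolve that fileref to its full path */
   %let curdir = %sysfunc(pathname(&fr));
   /* Clear the fileref now that the path is known */
   %let rc = %sysfunc(filename(fr));
   &curdir
%mend curdir;

%put Current path is %curdir;
```

The macro ends with the bare reference &curdir, with no semicolon, so that it can be used in the middle of another statement such as %PUT.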

As you might infer from my example here, I'm running this on a managed Windows environment. Most users cannot write to the "C:\WINDOWS\system32" path (and would not want to), so any relative file paths in my SAS code would cause errors. Maybe you've seen something like this:

```25         ods html file="./test.html";
NOTE: Writing HTML Body file: ./test.html
ERROR: Insufficient authorization to access C:\WINDOWS\system32\test.html.
ERROR: No body file. HTML output will not be created.
```

If I want to use a relative path, I need to change the current working directory. Fortunately, there's a simple way to do that.

## Change the current directory in SAS

Use the DLGCDIR function to change the current working directory:

```
/* working path for my projects */
%let rc = %sysfunc(dlgcdir('u:/projects'));

ods html file="./test.html";
proc print data=sashelp.class;
run;
ods html close;
```

I can use my account-specific environment variables to make these paths work for all users. For example, on Windows I can reference the USERPROFILE environment variable. (On Unix, I can use the HOME environment variable instead.)

```
/* working path for my projects */
%let user = %sysget(USERPROFILE);
%let rc = %sysfunc(dlgcdir("&user./Documents"));

/* create an output data folder if needed */
options dlcreatedir;
libname outdata "./data";

ods html file="./test.html";
data outdata.class;
   set sashelp.class;
run;
proc print data=outdata.class;
run;
ods html close;
```

Here's my log output. Notice how the HTML file and the output data folder are both created at locations relative to my home directory.

```25         /* working path for my projects */
26         %let user = %sysget(USERPROFILE);
27         %let rc = %sysfunc(dlgcdir("&user./Documents"));
NOTE: The current working directory is now "C:\Users\sascrh\Documents".
28
29         options dlcreatedir;
30         libname outdata "./data";
NOTE: Library OUTDATA was created.
NOTE: Libref OUTDATA was successfully assigned as follows:
Engine:        V9
Physical Name: C:\Users\sascrh\Documents\data
31
32         ods html file="./test.html";
NOTE: Writing HTML Body file: ./test.html
33         data outdata.class;
34          set sashelp.class;
35         run;
```

If using SAS Enterprise Guide, you can add DLGCDIR function steps to the startup statements that run when you connect to SAS, ensuring that your working directory starts in a valid location for SAS output. You can specify those statements in Tools->Options->SAS Programs, "Submit SAS code when server is connected." A SAS administrator can also add code to the AUTOEXEC file that runs when the SAS session begins, thus helping to manage this for larger groups of SAS users.