I have written several articles about how to work with continuous probability distributions in SAS. I always emphasize that it is important to be able to compute the four essential functions for working with a statistical distribution. Namely, you need to know how to generate random values, how to compute the PDF, how to compute the CDF, and how to compute quantiles. In this article, I describe how to compute each of the four quantities for the geometric distribution, which is a DISCRETE probability distribution. The graphs that visualize a discrete distribution are slightly different than for continuous distributions. Also, the geometric distribution has two different definitions, and I show how to work with either definition in SAS.

### The definition of the geometric distribution

A Bernoulli trial is an experiment that has two results, usually referred to as a "failure" or a "success." The success occurs with probability p and the failure occurs with probability 1-p. "Success" means that a specific event occurred whereas "failure" indicates that the event did not occur. Because the event can be negative (death, recurrence of cancer, ...) sometimes a "success" is not a cause for celebration!

The geometric distribution has two definitions:

1. The number of trials until the first success in a sequence of independent Bernoulli trials. The possible values are 1, 2, 3, ....
2. The number of failures before the first success in a sequence of independent Bernoulli trials. The possible values are 0, 1, 2, ....

Obviously, the two definitions are closely related. If X is a geometric random variable according to the first definition, then Y=X-1 is a geometric random variable according to the second definition. For example, define "heads" as the event that you want to monitor. If you toss a coin and it first shows heads on the third toss, then the number of trials until the first success is 3 and the number of failures is 2. Whenever you work with the geometric distribution (or its generalization, the negative binomial distribution), you should check to see which definition is being used.

### The definition of the geometric distribution in SAS software

It is regrettable that SAS was not consistent in choosing a definition. The first definition is used by the RAND function to generate random variates. The second definition is used by the PDF function, the CDF function, and the QUANTILE function. In this article, I will use the "number of trials," which is the first definition. In my experience, this definition is more useful in applications. I will point out how to adjust the syntax of the SAS functions so that they work for either definition.

### The probability mass function for the geometric distribution

For a discrete probability distribution, the probability mass function (PMF) gives the probability of each possible value of the random variable. So for the geometric distribution, we want to compute and visualize the probabilities for n=1, 2, 3,... trials.

The probabilities depend on the parameter p, which is the probability of success. I will use three different values to illustrate how the geometric distribution depends on the parameter:

• p = 1/2 = 0.5: the probability of "heads" when you toss a coin
• p = 1/6 = 0.166: the probability of rolling a 6 with a six-sided die
• p = 1/13 = 0.077: the probability of drawing an ace from a shuffled deck of 52 cards

The following SAS DATA step uses the PDF function to compute the probabilities for these three cases. The parameters in the PDF function are the number of failures and the probability of success. If you let n be the number of trials until success, then n-1 is the number of failures before success. Thus, the following statements use n-1 as a parameter. Since I will need the cumulative probabilities in the next section, I use the CDF function to compute them in the same DATA step:

```data Geometric; do p = 0.5, 0.166, 0.077; /* We want the probability that the event occurs on the n_th trial. This is the same as the probability of n-1 failure before the event. */ do n = 1 to 16; pdf = pdf("Geometric", n-1, p); /* probability that event occurs on the n_th trial */ cdf = cdf("Geometric", n-1, p); /* probability that event happens on or before n_th Trial*/ output; end; end; run;   title "Probability Mass Function for Geom(p)"; proc sgplot data=Geometric; scatter x=n y=pdf / group=p markerattrs=(symbol=CircleFilled); series x=n y=pdf / group=p lineattrs=(color=Gray); label n="Number of Trials" pdf="Probability of Event"; xaxis grid integer values=(0 to 16 by 2) valueshint; run;```

If you are visualizing the PMF for one value of p, I suggest that you use a bar chart. Because I am showing multiple PMFs in one graph, I decided to use a scatter plot to indicate the probabilities, and I used gray lines to visually connect the probabilities that belong to the same distribution. This is the same technique that is used on the Wikipedia page for the geometric distribution.

For large probabilities, the PMF probability decreases rapidly. When the probability for success is large, the event usually occurs early; it is rare that you would have to wait for many trials before the event occurs. This geometric rate of decrease is why the distribution is named the "geometric" distribution! For small probabilities, the PMF probability decreases slowly. It is common to wait for many trials before the first success occurs.

### The cumulative probability function

If you want to know the cumulative probability that the event occurs on or before the nth trial, you can use the CDF function. The CDF curve was computed in the previous section for all three probabilities. The following call to PROC SGPLOT creates a graph. If I were plotting a single distribution, I would use a bar chart. Because I am plotting several distributions, the call uses the STEP statement to create a step plot for the discrete horizontal axis. However, if you want to show the CDF for many trials (maybe 100 or more), then you can use the SERIES statement because at that scale the curve will look smooth to the eye.

```title "Cumulative Distribution for Geom(p)"; title2 "Probability That Event Occurs on or before n_th Trial"; proc sgplot data=Geometric; step x=n y=cdf / group=p curvelabel markers markerattrs=(symbol=CircleFilled); label n="Number of Trials" cdf="Cumulative Probability"; xaxis grid integer values=(0 to 16 by 2) valueshint; yaxis grid; run;```

The cumulative probability increases quickly when the probability of the event is high. When the probability is low, the cumulative probability is initially almost linear. You can prove this by using a Taylor series expansion of the CDF, as follows. The formula for the CDF is F(n) = 1 – (1 – p)n. When p is much much less than 1 (written p ≪ 1), then (1 – p)n ≈ 1 – n p + .... Consequently, the CDF is nearly linear (with slope p) as a function of n when p ≪ 1.

### Quantiles of the geometric distribution

When the density function (PDF) of a continuous distribution is positive, the CDF is strictly increasing. Consequently, the inverse CDF function is continuous and increasing. For discrete distributions, the CDF function is a step function, and the quantile is the smallest value for which the CDF is greater than or equal to the given probability. Consequently, the quantile function is also a step function. The following DATA step uses the QUANTILE function to compute the quantiles for p=1/6 and for evenly spaced values of the cumulative probability.

```data Quantile; prob = 0.166; do p = 0.01 to 0.99 by 0.005; n = quantile("Geometric", p, prob); * number of failures until event; n = n+1; * number of trials until event; output; end; run;   title "Quantiles for Geom(0.166)"; proc sgplot data=Quantile; scatter x=p y=n / markerattrs=(symbol=CircleFilled); label n="Number of Trials" p="Cumulative Probability"; xaxis grid; yaxis grid; run;```

### Random variates

For a discrete distribution, it is common to use a bar chart to show a random sample. For a large sample, you might choose a histogram. However, I think it is useful to plot the individual values in the sample, especially if some of the random variates are extreme values. You can Imagine 20 students who each roll a six-side die until it shows 6. The following DATA step simulates this experiment.

```%let numSamples = 20; *sample of 20 people; data RandGeom; call streaminit(12345); * random seed for covid-19; p = 0.166; * 1/6 ~ 0.166; do ID = 1 to &numSamples; x = rand("Geometric", p); *number of trials until event occurs; /* if you want the number of failures before the event: x = x-1 */ Zero = 0; output; end; run;```

In the simulated random sample, the values range from 1 (got a 6 on the first roll) to 26 (waited a long time before a 6 appeared). You can create a bar chart that shows these values. You can also add reference lines to the chart to indicate the mean and median of the Geom(1/6) distribution. For a Geom(p) distribution, the expected value is 1/p and the median of the distribution is ceil( -1/log2(1-p) ). When p = 1/6, the expected value is 6 and the median value is 4.

When a bar chart contains a few very long bars, you might want to "clip" the bars at some reference value so that the smaller bars are visible. The HIGHLOW statement in PROC SGPLOT supports the CLIPCAP option, which truncates the bar and places an arrow to indicate that the true length of the bar is longer than indicated in the graph. If you add labels to the end of each bar, the reader can see the true value even though the bar is truncated in length, as follows:

```proc sort data=RandGeom; by x; run; /* compute mean and median of the Geom(1/6) distribution */ data Stats; p = 0.166; PopMean = 1/p; PopMedian = ceil( -1/log2(1-p) ); call symputx("PopMean", PopMean); call symputx("PopMedian", PopMedian); run;   /* use CLIPCAP amd MAX= option to prevent long bars */ title "&numSamples Random Rolls of a Six-side Die"; title2 "Random Sample of Geom(0.166)"; proc sgplot data=RandGeom; highlow y=ID low=Zero high=x / type=bar highlabel=x CLIPCAP; refline &PopMean / axis=x label="Mean" labelpos=min; refline &PopMedian / axis=x label="Population Median" labelpos=min; xaxis grid label="Number of Trials until Event" minor MAX=10; yaxis type=discrete discreteorder=data reverse valueattrs=(size=7); run;```

This chart shows that four students got "lucky" and observed a 6 on their first roll. Half the students observed a 6 is four or fewer rolls, which is the median value of the Geom(1/6) distribution. Some students rolled many times before seeing a 6. The graph is truncated at 10, and students who required more than 10 rolls are indicated by an arrow.

### Conclusions

This article shows how to compute the four essential functions for the geometric distribution. The geometric distribution is a discrete probability distribution. Consequently, some concepts are different than for continuous distributions. SAS provides functions for the PMF, CDF, quantiles, and random variates. However, you need to be a careful because there are two common ways to define the geometric distribution. The SAS statements in this article show how to define the geometric distribution as the number of trials until an event occurs. However, with minor modifications, the same program can handle the second definition, which is the number of failures before the event.

The post The geometric distribution in SAS appeared first on The DO Loop.

At SAS Press, we agree with the saying “The best things in life are free.” And one of the best things in life is knowledge. That’s why we offer free e-books to help you learn SAS or improve your skills. In this blog post, we will introduce you to one of our amazing titles that is absolutely free.

### SAS Programming for R Users

Many data scientists today need to know multiple programming languages including SAS, R, and Python. If you already know basic statistical concepts and how to program in R but want to learn SAS, then SAS Programming for R Users by Jordan Bakerman was designed specifically for you! This free e-book explains how to write programs in SAS that replicate familiar functions and capabilities in R. This book covers a wide range of topics including the basics of the SAS programming language, how to import data, how to create new variables, random number generation, linear modeling, Interactive Matrix Language (IML), and many other SAS procedures. This book also explains how to write R code directly in the SAS code editor for seamless integration between the two tools.

The book is based on the free, 14-hour course of the same name offered by SAS Education available here. Keep reading to learn more about the differences between SAS and R.

### SAS versus R

R is an object-oriented programming language. Results of a function are stored in an object and desired results are pulled from the object as needed. SAS revolves around the data table and uses procedures to create and print output. Results can be saved to a new data table.

Let’s briefly compare SAS and R in a general way. Look at the following table, which outlines some of the major differences between SAS and R.

Here are a few other things about SAS to note:

• SAS has the flexibility to interact with objects. (However, the book focuses on procedural methods.)
• SAS does not have a command line. Code must be run in order to return results.

### SAS Programs

A SAS program is a sequence of one or more steps. A step is a sequence of SAS statements. There are only two types of steps in SAS: DATA and PROC steps.

• DATA steps read from an input source and create a SAS data set.
• PROC steps read and process a SAS data set, often generating an output report. Procedures can be called an umbrella term. They are what carry out the global analysis. Think of a PROC step as a function in R.

Every step has a beginning and ending boundary. SAS steps begin with either of the following statements:

• a DATA statement
• a PROC statement

After a DATA or PROC statement, there can be additional SAS statements that contain keywords that request SAS perform an operation or they can give information to the system. Think of them as additional arguments to a procedure. Statements always end with a semicolon!

SAS options are additional arguments and they are specific to SAS statements. Unfortunately, there is no rule to say what is a statement versus what is an option. Understanding the difference comes with a little bit of experience. Options can be used to do the following:

• generate additional output like results and plots
• save output to a SAS data table
• alter the analytical method

SAS detects the end of a step when it encounters one of the following statements:

• a RUN statement (for most steps)
• a QUIT statement (for some procedures)

Most SAS steps end with a RUN statement. Think of the RUN statement as the right parentheses of an R function. The following table shows an example of a SAS program that has a DATA step and a PROC step. You can see that both SAS statements end with RUN statements, while the R functions begin and end with parentheses.

Free e-book: SAS Programming for R Users was published on SAS Users.

Let’s be honest, there is a lot of SAS content available on the web. Sometimes it gets difficult to navigate through everything to find what you need, especially if you are looking for complimentary resources.

Training budgets can be limited or already used for the year, but you’re still interested in learning a new SAS product or diving deeper into a specific subject to facilitate any current projects you are working on. Or you’re a real over-achiever (go, you!) and you’re looking to expand your personal SAS skills outside of your day-to-day work.

You start asking, “How do I find what I need?”

Don’t worry, SAS has you covered!

### SAS learn & support

Let’s start with a favorite resource (in a Customer Success Manager’s opinion) – SAS’ learn and support pages. SAS recently released updated learn and support pages for SAS products. These pages provide a great overview of SAS’ product offerings, and they provide resources for those who are new to SAS or those looking to expand their knowledge. The learn and support pages cover the most current product release, information on getting started, tutorials, training courses, books, and documentation for current and past releases.

Not sure how to locate the learn and support page for the SAS product you are using? Search the SAS Product Support A to Z page and select the product of your choice.

### SAS documentation

Browsing the web for resources is a great way to find answers to your SAS questions. But as mentioned previously, it can sometimes get tricky to find what you are looking for.

A great place to start your search is on the SAS documentation site. You can use the search bar to enter what you are looking for, or browse by products, titles or system requirements.

### What’s new in SAS

You may have heard the saying, “There are three ways to do anything in SAS.” (Or four, or five or six!) Which raises the question, “How do I know what I’m doing is the most efficient?”

One way to stay on top of the most efficient way to do things is to stay current with your SAS knowledge. Knowing what’s new in SAS helps users know and understand what new features and enhancements are available. When a SAS product release occurs, SAS provides documentation on what’s new.

To know what’s new in the SAS release you’re using, check out the What’s New documentation. The documentation is broken into two parts: SAS 9.4 and SAS Viya 3.5. You can use the ‘Version’ tab on the left-hand side of the page to select the version currently installed at your organization.

If you are not sure what version you are running, you can run PROC PRODUCT_STATUS. This PROC will return what version numbers are running for the SAS products installed.

`proc product_status;`
`run;`

Another great resource to stay on top of what’s new from SAS is to check out SAS webinars. SAS offers live and on-demand webinars hosted by SAS experts. There are topics for every level of SAS user and every level of an organization, from SAS programmers to executives.

To attend a live webinar, select the webinar of your choice, register to attend, and you will be sent an email with the calendar invite.

If you’re interested in checking out an on-demand webinar, you can search by topic or industry to find a topic that fits what you’re looking for.

Looking for a webinar that focuses on a SAS tool? Check out the SAS Ask the Expert webinars. These are one-hour live and on-demand webinars for SAS users and administrators. The sessions cover a wide range of topics from what’s new in new releases of SAS products, to overviews on getting started, to tips and tricks that help take your SAS knowledge to the next level.

With SAS’ extensive catalog of webinars to choose from you will be a SAS pro in no time!

### SAS training and education

Did you know that SAS offers free e-learning for some of our training courses? These courses are self-paced and cover a wide range of topics. With 180 days of access to these courses, it allows you to work through them at your own speed. It’s also very easy to get started!

Step 1: Select a course from the course library

Step 4: Work through the course lessons

### Leverage expertise worldwide

SAS recently released SAS Analytics Explorer. This is an interactive way to connect with other SAS professionals, expand your SAS knowledge, and access private SAS events and resource all while earning points that can be exchanged for rewards.

Are you up for the challenge? No really, are you? The SAS Analytics Explorer has fun and educational challenges that allow you to showcase your SAS skills to climb the ranks in the network. Show off your SAS talent and get some cool rewards while you’re at it!

Interested in joining? Fill out the form on the bottom of the SAS Analytics Explorer page to request an invitation.

Don’t forget about the SAS Communities! Connect with other SAS professionals and experts to ask questions, assist other SAS professionals with their questions, connect with users, and see what’s going on at SAS.

You can also connect with SAS on our website using the chat feature. We love SAS users, and we are here to help you!

As a long-time SAS 9 programmer, I typically accomplish my data preparation tasks through some combination of the DATA Step, Proc SQL, Proc Transpose and some housekeeping procs like Proc Contents and Proc Datasets. With the introduction of SAS Viya, SAS released a new scripting language called CASL – a language that interacts with SAS Cloud Analytics Services (CAS).

CASL statements include actions, logically organized into action sets based on common functionality. For example, the Table action set allows you to load a table in CAS, view table metadata, change table metadata such as drop or rename a column, fetch (print) sample rows, save or drop a table from CAS, among other things. Steven Sober provides a great overview of CASL in his 2019 SAS Global Forum paper.

Learning CASL is a good idea assuming you want to leverage the power of CAS, because CASL is the language of CAS. While you can continue to use Viya-enabled procs for many of your data processing needs, certain new functionality is only available through CASL. CAS actions also provide a more granular access to options which otherwise may not be available as procedure options. But old habits die hard, and for a while I found myself bouncing between SAS 9.4 and CASL. I'd pull the data down from CAS just to get it to process in the SAS Programming Runtime Environment (SPRE) because it took less effort than figuring out how to get it done properly in CAS.

Then I started a project with a seriously large data set and quickly hit the limit on how much data I could pull down to process in SPRE. And although I could adjust the DATALIMIT option to retrieve more data than the default limit, I was wasting time and server resources unnecessarily moving the data between CAS and SPRE. All this, just so I could process the data “old school.”

I decided to challenge myself to do ALL my data preparation in CASL. I love a good challenge! I started collecting various useful CASL code snippets. In this post, I am sharing the tidbits I’ve accumulated, along with some commentary. Note, you can execute CAS actions from multiple clients, including SAS, Python, R, Lua and Java. Since my objective was to transition from traditional SAS code to CASL, I’ll focus solely on CAS actions from the SAS client perspective. While I used SAS Viya 3.5 for this work, most of the code snippets should work on prior versions as well.

The sections below cover: how to submit CASL code; loading, saving, dropping and deleting data; exploring data; table metadata management; and data transformation. Feel free to jump ahead to any section of interest.

## How do you submit CASL code?

You use PROC CAS to submit CASL code from a SAS client. For example:

```proc cas; <cas action 1>; <cas action 2>; …; quit;```

Similarly to other interactive procs that use run-group processing, separate CAS actions by run; statements. For example:

```proc cas; <cas action 1>; run; <cas action 2>; run; quit;```

In fact, you can have the entire data preparation and analysis pipeline wrapped inside a single PROC CAS, passing data and results in the form of CASL variables from one action to the next. It can really be quite elegant.

## Moving Data Using PROC CAS

```proc cas; table.loadtable / path="TABLE_NAME.sashdat" casOut="TABLE_NAME"; table.promote /name="TABLE_NAME" drop=true; quit;```

The table.promote action elevates your newly loaded CAS table to global scope, making it available to other CAS sessions, including any additional sessions you start, or to other users assuming they have the right privileges. I can’t tell you how many times I forgot to promote my data, only to find that my hard-earned output table disappeared because I took a longer coffee break than expected! Don’t forget to promote or save off your data (or both, to be safe).

If you are loading from a directory other than the one associated with your active caslib, modify the path= statement to include the relative path to the source directory – relative to your active caslib. If you are looking to load to a different caslib, modify the casOut= statement by placing the output table name and library in curly brackets. For example:

```proc cas; table.loadtable / path="TABLE_NAME.sashdat" casOut={name="TABLE_NAME" caslib="CASLIB2"}; table.promote /name="TABLE_NAME" drop=true; quit;```

You can also place a promote=true option inside the casOut= curly brackets instead of calling the table.promote action, like so:

```proc cas; table.loadtable / path="TABLE_NAME.sashdat" casOut={name="TABLE_NAME" caslib="CASLIB2" promote=true}; quit;```

Curly brackets are ubiquitous in CASL (and quite unusual for SAS 9.4). If you take away one thing from this post, make it “watch your curly brackets.”

If you have a SAS7BDAT file already on disk, load it in CAS with this code:

```proc cas; table.loadtable /path="TABLE_NAME.sas7bdat" casout="TABLE_NAME" importoptions={filetype="basesas"}; quit;```

Other file formats load similarly – just use the corresponding filetype= option to indicate the type of data you are loading, such as CSV, Excel, Document (.docx, .pdf, etc.), Image, Video, etc. The impressive list of supported file types is available here.

```proc cas; table.loadtable / path="TABLE_NAME.csv" casout="TABLE_NAME" importoptions={filetype="csv"}; run; quit;```

You can include additional parameters inside the importOptions= curly brackets, which differ by the file type. If you don’t need any additional parameters, use the filetype=”auto” and let CAS determine the best way to load the file.

Imagine you are building a script to load data conditionally – only if it’s not already loaded. This is handy if you have a reason to believe the data might already be in CAS. To check if the data exists in CAS and load conditionally, you can leverage the table.tableExists action in combination with if-then-else logic. For example:

```proc cas; table.tableExists result =r / name="TABLE_NAME"; if r=0  then do; table.loadtable / path="TABLE_NAME.sashdat" casOut={name="TABLE_NAME"}; table.promote /name="YOUR_TABLE_NAME" drop=true; end; else print("Table already loaded"); quit;```

Notice that the result=r syntax captures the result code from the tableExists action, which is evaluated before the loadtable and promote actions are executed. If the table is already loaded in CAS, “Table already loaded” is printed to the log. Otherwise, the loadtable and promote actions are executed.

The ability to output CAS action results to a CASL variable (such as result=r in this example) is an extremely powerful feature of CASL. I include another example of this further down, but you can learn more about this functionality from documentation or this handy blog post.

To save off CAS data, naturally, you use the table.save action. For example:

```proc cas; table.save / table="TABLE_NAME" name="TABLE_NAME.sashdat" replace=true; quit;```

In this example, you save off the CAS table to disk as a SASHDAT file, defaulting to the location associated with your active caslib. You can modify the table.save parameters to save or export the data to an alternative data storage solution with full control over the file format (including but not limited to such popular options as HDFS, Oracle, SQL Server, Salesforce, Snowflake and Teradata), compression, partitioning and other options.

### Dropping and deleting data

To drop a table from CAS, execute a table.droptable action. For example:

```proc cas; table.droptable / name="TABLE_NAME" quiet=true; quit;```

The quiet=true option prevents CAS from generating an error if the table does not exist in CAS. Dropping a table deletes it from memory. It’s a good practice to drop tables you no longer need, particularly the one you have promoted. Local-scope tables disappear on their own when the session expires, whereas global tables will stay in memory until they are unloaded.

Dropping a table does not delete the underlying source data. To delete the source of a CAS table, use the table.deleteSource action. For example:

```proc cas; table.deletesource / source="TABLE_NAME.sashdat" quiet=true; quit;```

## Exploring Data Using PROC CAS

After taking a close look at moving the data using PROC CAS, let’s look at some useful ways to start exploring and manipulating CAS data.

### Fetching sample data

When preparing data, I find it useful to look at sample data. The table.fetch action is conceptually similar to PROC PRINT and, by default, outputs the first 20 rows of a CAS table:

```proc cas; table.fetch / table="Table_Name"; quit;```

You can modify the table.fetch options to control which observations and variables to display and how to display them. For example:

```proc cas; table.fetch / table={name="TABLE_NAME" where="VAR1 in ('value1','value2')"}, /*1*/ orderby={{name="VAR1"}, /*2*/ {name="VAR2", order="descending"} }, fetchvars={{name="VAR1", label="Variable 1"}, /*3*/ {name="VAR2", label="Variable 2"}, {name="VAR3", label="Variable 3", format=comma12.1} }, to=50, /*4*/ index=false; /*5*/ quit;```

In the code snippet above:

• #1 – where= statement limits the records to those meeting the where criteria.
• #2 – orderby= option defines the sort order. Ascending is the default and is not required. If sorting by more than one variable, put them in a list inside curly brackets, as shown in this example. If a list item has a subparameter (such as order= here), encase each item in curly brackets.
• #3 – fetchvars= option defines the variables to print as well as their display labels and formats. If you select more than one variable, put them in a list inside curly brackets, as shown here. And again, if a list item includes a subparmeter, then enclose each list item in curly brackets.
• #4 – to= option defines the number of rows to print.
• #5 – index= false option deactivates the index column in the output (the default is index=true). This is similar to the noobs option in PROC PRINT.

As mentioned earlier, make sure to watch your curly brackets!

### Descriptive statistics and variable distributions

The next step in data exploration is looking at descriptive statistics and variable distributions. I would need a separate blog post to cover this in detail, so I only touch upon a few of the many useful CAS actions.

To look at statistics for numeric variables, use the simple.summary action, which computes standard descriptive statistics, such as minimum, maximum, mean, standard deviation, number missing, and so on. For example:

```proc cas; simple.summary / table="TABLE_NAME"; quit;```

Among its other features, the simple.summary action allows analysis by one or more group-by variables, as well as define the list of desired descriptive statistics. For example:

```proc cas; simple.summary / table={name="TABLE_NAME", groupBy="VAR1", vars={"NUMVAR1","NUMVAR2”}}, subSet={"MAX", "MIN", "MEAN", "NMISS"}; quit;```

Another useful action is simple.topK, which selects the top K and bottom K values for variables in a data set, based on a user-specified ranking order. The example below returns the top 5 and bottom 5 values for two variables based on their frequency:

```proc cas; simple.topk / table="TABLE_NAME" aggregator="N", inputs={"VAR1","VAR2"}, topk=5, bottomk=5; quit;```

Simple is a rich action set with heaps of useful options covered in the documentation.

You may be wondering – what about crosstabs and frequency tables? The simple action set includes freq and crosstab actions. In addition, the action closely imitating the functionality of the beloved PROC FREQ is freqTab.freqTab. For example, the code snippet below creates frequency tables for VAR1, VAR2 and a crosstab of the two.

```proc cas; freqtab.freqtab / table="TABLE_NAME" tabulate={"VAR1","VAR2", {vars={"VAR1","VAR2"}} }; quit;```

## Managing CAS Table Variables

One of the basic tasks after exploring your data is changing table metadata, such as dropping unnecessary variables, renaming tables and columns, and changing variable formats and labels. The table.altertable action helps you with these housekeeping tasks. For example, the code snippet below renames the table, drops two variables and renames and changes labels for two variables:

```proc cas; table.altertable / table="TABLE_NAME" rename="ANALYTIC_TABLE" drop={"VAR1",”VAR2”} columns={{name="VAR3" rename="ROW_ID" label="Row ID"}, {name="VAR4" rename="TARGET" label="Outcome Variable"} } ; quit;```

### Outputting variable list to a data set

Another useful trick I frequently use is extracting table columns as a SAS data set. Having a list of variables as values in a data set makes it easy to build data-driven scripts leveraging macro programming. The code snippet below provides an example. Here we encounter another example of capturing action result as a CASL variable and using it in further processing – I can’t stress enough how helpful this is!

```proc cas; table.columninfo r=collinfo / table={name="TABLE_NAME"};       /*1*/ collist=collinfo["ColumnInfo"];                                /*2*/ saveresult collist casout="collist";                           /*3*/ quit;```

In the snippet above:

• #1 - the columninfo action collects column information. The action result is passed to a CASL variable collinfo. Notice, instead of writing out result=, I am using an alias r =.
• #2 - the portion of the a CASL variable collinfo containing column data is extracted into another CASL variable collist.
• #3 - the saveresult statement sends the data to a CAS table collist. If you want to send the results to a SAS7BDAT data set, replace casout= with dataout=, and provide the library.table_name information.

## Transforming the Data

Lastly, let’s look at some ways to use CAS actions to transform your data. Proc SQL and DATA step are the two swiss-army knives in SAS 9 developers’ toolkit that take care of 90% of the data prep. The good news is you can execute both DATA Step and SQL directly from PROC CAS. In addition, call the transpose action to transpose your data.

### Executing DATA Step code

The dataStep.runCode action enables you to run DATA step code directly inside PROC CAS. You must enclose your DATA step code in quotation marks after the code= statement. For example:

```proc cas; dataStep.runCode / code=" data table_name; set table_name; run; "; quit;```

Running DATA step code in CAS allows access to sophisticated group-by processing and the use of such popular programming techniques as first- and last-dot. Refer to the documentation for important nuances related to processing in a distributed, multi-threaded environment of CAS.

### Executing FedSQL

To run SQL in CASL, use the fedSQL.execDirect action. Enclose the SQL query in quotation marks following the query= statement. Optionally, you can use the casout= statement to save the results to a CAS table. For example:

```proc cas; fedsql.execDirect/ query= " select  * from TABLE1 a  inner join TABLE2 b on a.VAR1 = b.VAR1 " casout={name="TABLE3", replace=True}; quit;```

Similarly to DATA step, be aware of the many nuances when executing SQL in CAS via FedSQL. Brian Kinnebrew provides an excellent overview of FedSQL in his SAS Communities article, and the documentation has up-to-date details on the supported functionality.

### Transposing data

Transposing data in PROC CAS is a breeze. The example below uses transpose.transpose action to restructure rows into columns.

```proc cas; transpose.transpose / table={name="TABLE_NAME", groupby={"VAR1"}} transpose={"VAR2"} id={"VAR3"} prefix="Prefix" casout={name="TRANSPOSED" replace=true}; run;```

You can transpose multiple variables in the same transpose action. Simply place additional variables inside the curly brackets following transpose=, in quotes, separated by a comma.

## Conclusion

PROC CAS is a wrapper procedure enabling you to leverage SAS’ new programming language - CASL. CASL enables you to submit CAS actions directly to SAS Cloud Analytic Services engine from a SAS client. This post provided examples of loading, managing, exploring and transforming your data through CAS actions. Certain new functionality in CAS is only available through CAS actions, so getting comfortable with CASL makes sense. Fear not, and let the curly brackets guide the way .

### Acknowledgement

I would like to thank Brian Kinnebrew for his thoughtful review and generous help on my journey learning CASL.

Challenge accepted: Learning data prep in CASL was published on SAS Users.

You can represent every number as a nearby integer plus a decimal. For example, 1.3 = 1 + 0.3. The integer is called the integer part of x, whereas the decimal is called the fractional part of x (or sometimes the decimal part of x). This representation is not unique. For example, you can also write 1.3 = 2 + (-0.7). There are several ways to produce the integer part of a number, depending on whether you want to round up, round down, round towards zero, or use some alternative rounding method.

Just as each rounding method defines the integer part of a number, so, too, does it define the fractional part. If [x]denotes the integer part of x (by whatever rounding method you choose), then the fractional part is defined by frac(x) = x - [x]. For some choices of a rounding function (for example, FLOOR), the fractional part is positive for all x. For other choices, the fractional part might vary according to the value of x.

In applications, two common representations are as follows:

• Round x towards negative infinity. The fractional part of x is always positive. You can round towards negative infinity by using the FLOOR function in a computer language. For example, if x = -1.3, then FLOOR(x) is -2 and the fractional part is 0.7.
• Round x towards zero. The fractional part of x always has the same sign as x. You can round towards zero by using the INT function. For example, if x = -1.3, then INT(x) is -1 and the fractional part is -0.3.

Here is an interesting fact: for the second method (INT), you can compute the fractional part directly by using the MOD function in SAS. In SAS, the expression MOD(x,1) returns the signed fractional part of a number because the MOD function in SAS returns a result that has the same sign as x. This can be useful when you are interested only in the fraction portion of a number.

The following DATA step implements both common methods for representing a number as an integer and a fractional part. Notice the use of the MOD function for the second method:

```data Fractional; input x @@; /* Case 1: x = floor(x) + frac1(x) where frac1(x) >= 0 */ Floor = floor(x); Frac1 = x - Floor; /* always positive */   /* Case 2: x = int(x) + frac2(x) where frac1(x) has the same sign as x */ Int = int(x); Frac2 = mod(x,1); /* always same sign as x */ label Floor = 'Floor(x)' Int='Int(x)' Frac2='Mod(x,1)'; datalines; -2 -1.8 -1.3 -0.7 0 0.6 1.2 1.5 2 ;   proc print data=Fractional noobs label; run;```

The table shows values for a few positive and negative values of x. The second and third columns represent the number as x = Floor(x) + Frac1. The fourth and fifth columns represent the number as x = Int(x) + Mod(x,1). For non-negative values of x, the two methods are equivalent. When x can be either positive or negative, I often find that the second representation is easier to work with.

In statistics, the fractional part of a number is used in the definition of sample estimates for percentiles and quantiles. SAS supports five different definitions of quantiles, some of which look at the fractional part of a number to decide how to estimate a percentile.

The post Find the fractional part of a number appeared first on The DO Loop.

Way too often, SAS programmers run into a task when for a given date (e.g. event date) there is a need to shift (add or subtract) it by a specified number of days excluding weekends and holidays — in other words to move a date by a given number of workdays. It does not matter how many days off are in our date span as long as it contains exactly the required number of workdays.

For the purpose of this blog post, we will use the following words as synonyms: workdays, work days, working days, business days; as opposed to their antonym: days off.

## In the ideal world

If not for gifts from governments and employers called holidays, shifting (incrementing or decrementing) a date by a number of workdays using SAS would be a piece of cake. It’s literally a single line of code using INTNX function with the WEEKDAY date interval. For example, the following DATA Step code:

```data _null_; event_date = '02JAN2020'd; shift_date = intnx('weekday', event_date, -10); put shift_date= date9.; run;```

produces in the SAS log:

shift_date=19DEC2019

Here:

• 'weekday' is date interval covering Monday through Friday;
• event_date is starting date point;
• -10 is number of workdays to shift by (positive number means increment; negative number means decrement).

Note, that the WEEKDAY date interval can be modified to accommodate different weekend days. For example:

• WEEKDAY17W - five-day work week with a Sunday (1) and Saturday (7) weekend (same as WEEKDAY);
• WEEKDAY1W - six-day week with Sunday (1) as the only weekend day;
• WEEKDAY67W - five-day week with Friday (6) and Saturday (7) as weekend days, etc.

## Holidays schedule

In the real world, however, weekends defined by the WEEKDAY interval are not the only days off, as they do not account for holidays. In the example above, when we shifted our starting date (2 January 2020) by -10 we arrived at 19 December 2019 which means we miscounted several holidays as workdays.

Which holidays (and how many) we miscounted depends on jurisdiction (country, state, business), as their holidays schedules vary. For example, for US federal agencies we would miss (1. New Year – 1Jan2020, 2. Christmas Day – 25Dec2019, and 3. Christmas Eve Day – 24Dec2019 – although this is not an official US federal holiday, most federal employees were given that day off by presidential executive order).

For SAS Institute (USA), we would miscue 6 non-weekend holiday days (Winter Holiday 25Dec2019 – 27Dec2019 and 30Dec2019 - 1Jan2020).

In other countries or businesses, this holidays schedule might be quite different, and that is why this date-shifting task that would account for holidays schedule causes so much confusion. Let’s straighten it out with the help of our old friend – SAS user-defined format. But first, let’s create a workday calendar – a data table listing all OUR work days as well as days off.

## Workday calendar

Practically every organization has (or must have) a workday calendar that defines the valid working days and consists of a repeating pattern of days on and days off, as well as exceptions to that pattern. While such a calendar may span multiple years, for our purposes, we can use a subset of that calendar, which reliably covers the date range of our interest.

Let’s create an example of the workday calendar as a SAS data table:

```data DAYS_OF_WEEK; format DATE date9.; do DATE='01JAN2019'd to '31JAN2020'd; WEEK_DAY = weekday(DATE); DAY_NAME = put(DATE,downame.); WORK_DAY = 1<WEEK_DAY<7; output; end; run;   data DAYS_HOLIDAY; format DATE date9.; input DATE date9.; WORK_DAY = 0; datalines; 01JAN2019 21JAN2019 18FEB2019 27MAY2019 04JUL2019 02SEP2019 11NOV2019 28NOV2019 24DEC2019 25DEC2019 01JAN2020 20JAN2020 ;   /* Overlay holidays onto weekdays */ data DAYS_WEEKENDS_AND_HOLIDAYS; merge DAYS_OF_WEEK DAYS_HOLIDAY; by DATE; run;```

Here is a fragment of the resulting workday calendar table:

If date shifting is needed on an individual-level, then workday calendars should be created for every person and must include working days, weekends, holidays as well as personal days off such as vacations, sick days etc.

## SAS format to distinguish workdays from days off

Now, for the dates range of our interest, we want to create a SAS user-defined format that lists all the days off while workdays are lumped into the other category. It’s just more efficient that way, as the number of days off is usually smaller than the number of work days so our explicit list of dates will be shorter. For example:

```proc format; value dayoff '01DEC2019'd = 'Y' '07DEC2019'd = 'Y' '08DEC2019'd = 'Y' . . . '24DEC2019'd = 'Y' '25DeC2019'd = 'Y' '01JAN2020'd = 'Y' '20JAN2020'd = 'Y' other = 'N' ; run;```

In this user-defined SAS format values labeled 'Y' mean day off, and values labeled 'N' mean workday. That includes and takes care of both weekends and holidays.

The proc format above serves only for illustrational purposes of what kind of format we are going to create. However, by no means do I suggest implementing it this hard-coded way. Quite the contrary, we are going to create format dynamically and 100% data-driven. Here is how we do it:

```data WORK.DAYSOFF (rename=(DATE=START)); set DAYS_WEEKENDS_AND_HOLIDAYS(where=(WORK_DAY=0)) end=last; retain FMTNAME 'dayoff' TYPE 'n' LABEL 'Y'; output; if last then do; HLO = 'O'; LABEL = 'N'; output; end; run;   proc format cntlin=WORK.DAYSOFF; run;```

In the above code, HLO='O' and LABEL='N' are responsible for generating the OTHER category for the dayoff format.

## Shifting dates by a number of workdays using dayoff format

With the dayoff user-defined format at hands, we can easily increment or decrement dates by a number of workdays. Here is how:

```/* data table of some dates */ data EVENTS; do EVENT_DATE='01DEC2019'd to '31DEC2019'd; output; end; format EVENT_DATE date9.; run;   /* Calculating new dates shifted by a number of workdays */ data EVENTS_WITH_SHIFTS; set EVENTS;   /* Decrement EVENT_DATE by 10 workdays */ d = EVENT_DATE; do i=1 to 10; d = d - 1; if put(d, dayoff.)='Y' then i = i - 1; end; BEFORE_DATE = d;   /* Increment EVENT_DATE by 12 workdays */ d = EVENT_DATE; do i=1 to 12; d = d + 1; if put(d, dayoff.)='Y' then i = i - 1; end; AFTER_DATE = d;   format BEFORE_DATE AFTER_DATE date9.; drop d i; run;```

In this code, we decrement (d=d-1) or increment (d=d+1) our event date every time the do-loop iterates. It will iterate while counter i does not exceed the number of workdays. However, within the do-loop we modify counter i to i-1 every time we come across a day off as determined by condition put(d,dayoff.)='Y'. This will effectively exclude days off from counting towards the number of workdays. The do-loop will iterate the number of workdays plus the number of days off thus moving date d by the number of days that includes exactly the given number of workdays (plus some number of days off which we don’t care about). Just pause for a second and think to absorb this.

This simple technique can be modularized by implementing it as a SAS user-defined function or a SAS data-step macro.

## User-defined function to shift a date by a number of workdays

Here is the user-defined function shiftwd() that shifts a beginning date specified in the first argument from_date by a number of workdays specified in the second argument shift_by. The second argument can be either positive or negative. Positive second argument means advancing the first argument (incrementing); negative second argument means subtracting workdays from the first argument (decrementing). Both arguments can be either variable names or numerals representing whole numbers.

```libname funclib 'c:\projects\shift\functions'; proc fcmp outlib=funclib.funcs.dates; function shiftwd(from_date, shift_by); d = from_date; do i=1 to abs(shift_by); d = d + sign(shift_by); if put(d,dayoff.)='Y' then i = i - 1; end; return(d); endfunc; run;```

Function usage example:

```libname funclib 'c:\projects\shift\functions'; options cmplib= funclib.funcs;   data EVENTS_WITH_SHIFTS; set EVENTS; BEFORE_DATE = shiftwd(EVENT_DATE,-10); /* Decrement EVENT_DATE by 10 workdays */ AFTER_DATE = shiftwd(EVENT_DATE, 12); /* Increment EVENT_DATE by 12 workdays */ format BEFORE_DATE AFTER_DATE date9.; run;```

## SAS macro to shift a date by a number of workdays

Similarly, the same can be implemented as a data-step macro:

```%macro shiftwd (fromvar=,endvar=,wdays=,sign=); &endvar = &fromvar; do i=1 to &wdays; &endvar = &endvar &sign 1; if put(&endvar, dayoff.)='Y' then i = i - 1; end; drop i; %mend;```

This macro has 4 required parameters:

• fromvar - variable name of the beginning date;
• endvar - variable name of the ending date;
• wdays - variable name or numeral representing number of workdays to move from the beginning date;
• sign - operation sign defining direction of the date move (+ for incrementing, - for decrementing).

Macro usage example:

```data EVENTS_WITH_SHIFTS; set EVENTS; %shiftwd(fromvar=EVENT_DATE,endvar=BEFORE_DATE,wdays=10,sign=-); /* Decrement EVENT_DATE by 10 workdays */ %shiftwd(fromvar=EVENT_DATE,endvar=AFTER_DATE, wdays=12,sign=+); /* Increment EVENT_DATE by 12 workdays */ format BEFORE_DATE AFTER_DATE date9.; run;```

## Related materials

Custom Time Intervals (SAS Documentation)

Do you find this material useful? How do you handle the task of adding or subtracting workdays from a date? Please share in the comments section below.

Shifting a date by a given number of workdays was published on SAS Users.

In SAS 9.3 and earlier, the default value of the YEARCUTOFF= option is 1920. This default setting could trigger data integrity issues because any 2-digit years of "20" in dates will be assumed to occur in 1920 instead of 2020. If the intended year in the date is 2020, you must set the YEARCUTOFF= option to a value larger than 1920. Of course, the best alternative is to always specify date values with 4-digit years.

Luckily, SAS makes it easy to change the YEARCUTOFF= option so that it works best for your data. The default value for the YEARCUTOFF= option changed in SAS 9.4 to 1926. This change makes it easier for customers who are still using 2-digit years of "20" to make sure that the date is assigned to 2020. Let's review some of the frequently asked questions that customers ask about how SAS works with 2-digit years.

## What is the YEARCUTOFF= option?

The YEARCUTOFF= option lets you specify which century SAS software should assign to dates with 2-digit years.

## How do I specify the YEARCUTOFF= option in my SAS programs?

The option is specified in an OPTIONS statement. Here is an example:

`options yearcutoff=1930;`

You can also specify the option in an autoexec file or a config file. If you don't specify the YEARCUTOFF= option, the SAS system default is used. Remember that 1920 is the default for SAS 9.3 and earlier releases and 1926 is the default for SAS 9.4. (For reference, SAS 9.3 was released in 2011. The first release of SAS 9.4 was released in 2013.)

## How does the YEARCUTOFF= option work?

The YEARCUTOFF= option specifies the first year of a 100-year window in which all 2-digit years are assumed to occur. For example, if the YEARCUTOFF= option is set to 1920, all 2-digit years are assumed to occur between 1920 and 2019. This means that two-digit years from 20 - 99 are assigned a century prefix of "19" and all 2-digit years from 00 - 19 have a century prefix of "20."

## Which types of date values are affected by the YEARCUTOFF= option?

The YEARCUTOFF= option affects the interpretation of 2-digit years in the following cases:

• Reading date values from external files
• Specifying dates or year values in SAS functions
• Specifying SAS date literals

The YEARCUTOFF= option does not influence the following cases:

• Processing dates with 4-digit years
• Processing dates already stored as SAS date values (the number of days since January 1, 1960)
• Displaying dates with SAS date formats

## Which value should I set the YEARCUTOFF= option to?

The optimal value depends on the range of dates in your data. The YEARCUTOFF= option should be set so that the 100-year range encompasses the range of your data values. In general, SAS recommends setting the YEARCUTOFF= option to a value equal to or slightly less than the first year in your data. For example, if the range of dates that you are processing is from 1930 - 2010, a YEARCUTOFF value of 1925 or 1930 would be appropriate. If you set YEARCUTOFF=1925, then all 2-digit years are assumed to be in the 100-year period from 1925 to 2024. If all the dates in your data fall within that range, they will be interpreted correctly.

## What do I do if my dates with 2-digit years span more than 100 years?

The YEARCUTOFF= option cannot reliably assign centuries to 2-digit years if the range of dates for a variable is greater than 100 years. If the date ranges for a variable span more than 100 years, you must either specify the dates with 4-digit years or use DATA step logic to assign a century to each year (perhaps based on the value of another variable).

But why does the YEARCUTOFF= option allow only a 100-year span? If the YEARCUTOFF= option allowed for more than a 100-year span, there would be no way to determine which century a 2-digit year should have. For example, let’s assume that YEARCUTOFF=1950 with a 150-year span and your external data file had 2-digit years. In this scenario, your 150-year span would be from 1950 to 2100. Since you have 2-digit years, there would be no way to determine if the year 00 was meant to occur during 2000 or 2100.

## How do I change the default setting for all the SAS users at my site?

Setting system default option values is usually done by a site SAS Installation Representative. The recommended method for setting a system default YEARCUTOFF= value is to specify the desired value in the system SAS configuration file. Note that even if you set a default value for all the users at your site, they can override the default value in their SAS programs, in personal autoexec files, in config files, by setting an environment variable, or when invoking SAS software.

## How do I change the default setting for my own programs if I want a default that is different from the rest of the users at my site?

You can specify personal default values either in a personal configuration file or in an autoexec file. If you specify the value in a personal configuration file, the syntax depends on your operating system and is the same as that for setting the value in the system-wide configuration file on each system. If you use an autoexec file, you can specify the YEARCUTOFF= option in an OPTIONS statement.  Here is an example:

`options yearcutoff=1930;`

• YEARCUTOFF= System Option section in SAS® 9.3 System Options: Reference, Second Edition
• YEARCUTOFF= System Option section in SAS® 9.4 System Options: Reference, Fifth Edition
• SAS Note 46368, "The default value for the YEARCUTOFF= system option has changed in SAS® 9.4 and beyond"
• SAS Note 65307, "You might encounter an issue in which 2-digit year dates have the wrong century in SAS® 9.3 and earlier releases"

Why does my SAS date have the wrong century? was published on SAS Users.

Last year, I wrote more than 100 posts for The DO Loop blog. The most popular articles were about SAS programming tips for data analysis, statistical analysis, and data visualization. Here are the most popular articles from 2019 in each category.

### SAS programming tips

The Essential Guide to Binning

• Create training, testing, and validation data sets:: This post shows how to create training, validation, and test data sets in SAS. This technique is popular in data science and machine learning because you typically want to fit the model on one set of data and then evaluate the goodness of fit by using a different set of data.
• 5 reasons to use PROC FORMAT to recode variables in SAS: Often SAS programmers use PROC SQL or the SAS DATA step to create new data variables to recode raw data. This is not always necessary. It is more efficient to use PROC FORMAT to recode the raw data. Learn five reasons why you should use PROC FORMAT to recode variables.
• Conditionally append observations to a data set: In preparing data for graphing. you might want to add additional data to the end of a data set. (For example, to plot reference lines or text.) You can do this by using one DATA step to create the new observations and another DATA step to merge the new observations to the end of the original data. However, you can also use the END= option and end-of-file processing to append new observations to data. Read about how to use the END= option to append data. The comments at the end of the article show how to perform similar computations by using a hash table, PROC SQL, and a DOW loop.
• Use PROC HPBIN to bin numerical variables: In machine learning, it is common to bin numerical variables into a set of discrete values. You can use PROC HPBIN to bin multiple variables in a single pass through the data. PROC HPBIN can bin data into equal-length bins (called bucket binning) or by using quantiles of the data. This article and the PROC FORMAT article are both referenced in my Essential Guide to Binning in SAS.

### Statistical analyses

Geometric Meaning of Kolmogorov's D

### Data visualization

Visualize an Interaction Effect

I always enjoy learning new programming methods, new statistical ideas, and new data visualization techniques. If you like to learn new things, too, read (or re-read!) these popular articles from 2019. Then share this page with a friend. I hope we both have many opportunities to learn and share together in the new year.

The post Top posts from <em>The DO Loop</em> in 2019 appeared first on The DO Loop.

Parts 1 and 2 of this blog post discussed exploring and preparing your data using SASPy. To recap, Part 1 discussed how to explore data using the SASPy interface with Python. Part 2 continued with an explanation of how to prepare your data to use it with a machine-learning model. This final installment continues the discussion about preparation by explaining techniques for normalizing your numerical data and one-hot encoding categorical variables.

Some of the numerical features in the examples from parts 1 and 2 have different ranges. The variable Age spans from 0-100 and Hours_per_week from 0-80. These ranges affect the calculation of each feature when you apply them to a supervised learner. To ensure the equal treatement of each feature, you need to scale the numerical features.

The following example uses the SAS STDIZE procedure to scale the numerical features. PROC STDIZE is not a standard procedure available in the SASPy library.  However, the good news is you can add any SAS procedure to SASPy! This feature enables Python users to become a part of the SAS community by contributing to the SASPy library and giving them the chance to use the vast number of powerful SAS procedures. To add PROC STDIZE to SASPy, see the instructions in the blog post Adding SAS procedures to the SASPy interface to Python.

After you add the STDIZE to SASPy, run the following code to scale the numerical features. The resulting values will be between 0 and 1.

```# Creating a SASPy stat objcect stat = sas.sasstat()```
```# Use the stdize function that was added to SASPy to scale our features stat_result = stat.stdize(data=cen_data_logTransform, procopts = 'method=range out=Sasuser.afternorm', var = 'age education_num capital_gain capital_loss hours_per_week')```

To use the STDIZE procedure in SASPy we need to specify the method and the output data set in the statement options. For this we use the "procopts" option and we specify range as our method and our "out" option to a new SAS data set, afternorm.

After running the STDIZE procedure we assign the new data set into a SAS data object.

`norm_data = sas.sasdata('afternorm', libref='SASuser')`

Now let's verify if we were successful in transforming our numerical features

`norm_data.head(obs=5)`

The output looks great! You have normalized the numerical features. So, it's time to tackle the last data-preparation step.

## One-Hot Encoding

Now that you have adjusted the numerical features, what do you do with the categorical features? The categories Relationship, Race, Sex, and so on are in string format. Statistical models cannot interpret these values, so you need to transform the values from strings into a numerical representation. The one-hot encoding process provides the transformation you need for your data.

To use one-hot encoding, use the LOGISTIC procedure from the SASPy Stat class. SASPy natively includes the LOGISTIC procedure, so you can go straight to coding. To generate the syntax for the code below, I followed the instructions from Usage Note 23217: Saving the coded design matrix of a model to a data set.

```stat_proc_log = stat.logistic(data=norm_data, procopts='outdesign=SASuser.test1 outdesignonly', cls = "workclass education_level marital_status occupation relationship race sex native_country / param=glm", model = "age = workclass education_level marital_status occupation relationship race sex native_country / noint")```

To view the results from this code, create a SAS data object from the newly created data set, as shown in this example:

```one_hot_data = sas.sasdata('test1', libref='SASuser') display(one_hot_data.head(obs=5))```

The output:

Our data was successfully one-hot encoded! For future reference, due to SAS’ analytical power, this step is not required. When including a categorical feature in a class statement the procedure automatically generates a design matrix with the one-hot encoded feature. For more information, I recommend reading this post about different ways to create a design matrix in SAS.

## Finally

You made it to the end of the journey! I hope everyone who reads these blogs can see the value that SASPy brings to the machine-learning community. Give SASPy  a try, and you'll see the power it can bring to your SAS solutions.

Stay curious, keep learning, and (most of all) continue innovating.

The t-test is a very useful test that compares one variable (perhaps blood pressure) between two groups. T-tests are called t-tests because the test results are all based on t-values. T-values are an example of what statisticians call test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. It is used to determine whether there is a significant difference between the means of two groups. With all inferential statistics, we assume the dependent variable fits a normal distribution. When we assume a normal distribution exists, we can identify the probability of a particular outcome. The procedure that calculates the test statistic compares your data to what is expected under the null hypothesis. There are several SAS Studio tasks that include options to test this assumption. Let's use the t-test task as an example.

You start by selecting:

On the DATA tab, select the Cars data set in the SASHELP library. Next request a Two-sample test, with Horsepower as the Analysis variable and Cylinders as the Groups variable. Use a filter to include only 4- or 6-cylinder cars. It should look like this:

On the OPTIONS tab, check the box for Tests for normality as shown below.

All the tests for normality for both 4-cylinder and 6-cylinder cars reject the null hypothesis that the data values come from a population that is normally distributed. (See the figure below.)

### Should you abandon the t-test results and run a nonparametric test analysis such as a Wilcoxon Rank Sum test that does not require normal distributions?

This is the point where many people make a mistake. You cannot simply look at the results of the tests for normality to decide if a parametric test is valid or not. Here is the reason: When you have large sample sizes (in this data set, there were 136 4-cylinder cars and 190 6-cylinder cars), the tests for normality have more power to reject the null hypothesis and often result in p-values less than .05. When you have small sample sizes, the tests for normality will not be significant unless there are drastic departures from normality. It is with small sample sizes where departures from normality are important.

The bottom line is that the tests for normality often lead you to make the wrong decision. You need to look at the distributions and decide if they are somewhat symmetrical. The central limit theory states that the sampling distribution of means will be normally distributed if the sample size is sufficiently large. "Sufficiently large" is a judgment call. If the distribution is symmetrical, you may perform a t-test with sample sizes as small as 10 or 20.

The figure below shows you the distribution of horsepower for 4- and 6-cylinder cars.

With the large sample sizes in this data set, you should feel comfortable in using a t-test. The results, shown below, are highly significant.

If you are in doubt of your decision to use a parametric test, feel free to check the box for a nonparametric test on the OPTIONS tab. Running a Wilcoxon Rank Sum test (a nonparametric alternative to a t-test), you also find a highly significant difference in horsepower between 4- and 6-cylinder cars. (See the figure below.)

You can read more about assumptions for parametric tests in my new book, A Gentle Introduction to Statistics Using SAS Studio.

For a sneak preview check out the free book excerpt. You can also learn more about SAS Press, check out the up-and-coming titles, and receive exclusive discounts. To do so, make sure to subscribe to the newsletter.

Testing the Assumption of Normality for Parametric Tests was published on SAS Users.