October 21, 2019
 

Computing rates and proportions is a common task in data analysis. When you are computing several proportions, it is helpful to visualize how the rates vary among subgroups of the population. Examples of proportions that depend on subgroups include:

  • Mortality rates for various types of cancers
  • Incarceration rates by race
  • Four-year graduation rates by academic major

The first two examples are somewhat depressing, so I will use graduation rates for this article.

Uncertainty in estimates

An important fact to remember is that the uncertainty in an estimate depends on the sample size. If a small college has 8 physics majors and 5 of them graduate in four years, the graduation rate in physics is 5/8 = 0.625. However, because of the small sample size, the uncertainty in that estimate is much greater than for a larger group, such as an English department that graduates 50 out of 80 students (also a rate of 0.625). Specifically, if the estimate of a binomial proportion is p, the standard error of the estimate is sqrt(p(1-p)/n), where n is the sample size. Thus for the physics students, the standard error is sqrt(0.625*0.375/8) = 0.17, whereas for the English majors, the standard error is sqrt(0.625*0.375/80) = 0.05.

Therefore, it is a good idea to incorporate some visual aspect of the uncertainty into any graph of proportions and rates. For analyses that involve dozens or hundreds of groups, you can use a funnel plot of proportions, which I have used to analyze adoption rates for children and immunization rates for kindergartens in North Carolina. When you have a smaller number of groups, a simple alternative is a dot plot with error bars that indicate either the standard error or a 95% confidence interval for the estimate. As I've explained, I prefer to display the confidence interval.

Sample data: Graduation rates

The Chronicle of Higher Education web site enables you to find the graduation rates for US colleges and universities. You can find the average graduation rate by state (50 groups) or by college (hundreds of groups). You can also find the graduation rate by race (five groups) for any individual college. Because most colleges have relatively few Hispanic, Asian, and Native American students, it is important to indicate the sample size or the uncertainty in the empirical estimates.

I don't want to embarrass any small college, so the following data are fake but are typical of the group sizes that you might see in real data. Suppose a college has six majors, labeled as A, B, C, D, E, and F. The following SAS DATA step defines the number of students who graduated in four years (Grads) and the number of students in each cohort (Total).

data Grads;
input Major $ Grads Total @@;
datalines;
A 10 22  B 10 32  C 17 25
D  4  7  E  8 14  F 16 28
;

Manual computations of confidence intervals for proportions

If you use a simple Wald confidence interval, it is easy to write a short DATA step to compute the empirical proportions and a 95% confidence interval for each major:

data GradRate;
set Grads;
p = Grads / Total;               /* empirical proportion */
StdErr = sqrt(p*(1-p)/Total);    /* standard error */
/* use Wald 95% CIs */
z = quantile("normal", 1-0.05/2);
LCL = max(0,  p - z*StdErr);     /* LCL can't be less than 0 */
UCL = min(1,  p + z*StdErr);     /* UCL can't be greater than 1 */
label p = "Proportion" LCL="Lower 95% CL" UCL="Upper 95% CL";
run;
 
proc print data=GradRate noobs label;
   var Major Grads Total p LCL UCL;
run;

The output shows that although majors D, E, and F have the same four-year graduation rate (57%), the estimate for the D group, which has only seven students, has twice as much variability as the estimate for the F group, which has four times as many students.

Automating the computations by using PROC FREQ

Although it is easy enough to write a DATA step for the Wald CI, other types of confidence intervals are more complicated. The BINOMIAL option in the TABLES statement of PROC FREQ enables you to compute many different confidence intervals (CIs), including the Wald CI. In order to use these data in PROC FREQ, you need to convert the data from Event-Trials format to Event-Nonevent format. For each major, let Graduated="Yes" indicate the count of students who graduated in four years and let Graduated="No" indicate the count of the remaining students. The following DATA step converts the data, and then PROC FREQ estimates the binomial proportion for each group:

/* convert data to Event/Nonevent format */
data GradFreq;
set Grads;
Graduated = "Yes"; Count = Grads;       output;
Graduated = "No "; Count = Total-Grads; output;
run;
 
/* Use PROC FREQ to analyze each group separately and compute the binomial CIs */
proc freq data=GradFreq noprint;
   by notsorted Major; 
   tables Graduated / binomial(level='Yes' CL=wald); /* choose from among many confidence intervals */
   weight Count;
   output out=FreqOut binomial;
run;
 
proc print data=FreqOut noobs label;
   var Major N _BIN_ L_BIN U_BIN ;
   label _BIN_ = "Proportion" L_BIN="Lower 95% CL" U_BIN="Upper 95% CL";
run;

The output is not shown because the estimates and CIs from PROC FREQ are identical to the estimates from the "manual" calculations in the previous section. However, by using PROC FREQ you can easily compute more sophisticated confidence intervals.

Visualizing binomial proportions

As indicated earlier, it is useful to plot the proportions and confidence intervals. When you plot several proportions on the same graph, I recommend that you sort the data in some way, such as by the estimated proportions. If there are two groups that have the same proportion, you can use the size of the group to break the tie.

It can also be helpful to draw a reference line for the overall rate, regardless of group membership. You can get the overall proportion by repeating the previous call to PROC FREQ but without using the BY statement, as sketched below. For these data, the overall proportion of students who graduate in four years is 65/128 = 0.5078.
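A minimal sketch of that computation (it reuses the GradFreq data from the previous section):

proc freq data=GradFreq noprint;
   tables Graduated / binomial(level='Yes' CL=wald);
   weight Count;
   output out=FreqAll binomial;   /* _BIN_ contains the overall proportion */
run;

Lastly, I think it is a good idea to add a table that shows the number of students in each major. You can use the YAXISTABLE statement to add that information to the graph, as follows: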

/* sort by estimated proportion; break ties by using sample size */
proc sort data=FreqOut;
   by _BIN_ N;
run;
 
title "Graduation Rates by College Major";
proc sgplot data=FreqOut;
   label _BIN_ = "Proportion" N="Number of Students";
   scatter y=Major x=_BIN_ / xerrorlower=L_BIN xerrorupper=U_BIN;
   yaxistable N / y=Major location=inside position=left valueattrs=(size=9);  
   refline 0.5078 / axis=x labelloc=inside label="Overall";
   yaxis discreteorder=data offsetmax=0.1;       /* preserve order of categories */
   xaxis grid values=(0 to 1 by 0.1) valueshint;
run;

The graph shows the range of graduation rates. The "error bars" are 95% CIs, which show that majors that have few students have larger uncertainty than majors that have more students. If there are 10 or more categories, I recommend that you use alternating color bands to make it easier for the reader to associate intervals with the majors.

In summary, this article shows how to use PROC FREQ to estimate proportions and confidence intervals for groups of binary data. A great way to convey the proportions to others is to graph the proportions and CIs. By including the sample size on the graph, readers can connect the uncertainty in the estimates to the sample size.

The post Compute and visualize binomial proportions in SAS appeared first on The DO Loop.

October 19, 2019
 

Newcomers to SAS Viya Administration may appreciate these tried-and-tested patterns for securing folders, and the content within them (reports, data plans, models etc.). Nothing too fancy today; if you are new to security model design in SAS Viya, this post is for you.

It presents five patterns for permissions on folders, reports, data plans and the like, and shows how you can combine them and apply them to design a simple but effective security model (or authorization model) for a folder structure.

SAS Viya has two authorization systems

If you are just starting work with permissions in SAS Viya, then you should be aware that there are TWO authorization systems in SAS Viya:

  • The general authorization system is used to secure (manage access to) folders, reports, data plans, models and other content stored in SAS Viya's database (the SAS Infrastructure Data Platform, which uses PostgreSQL behind the scenes). It is also used to manage access to SAS Viya applications and some of their features.
  • The CAS authorization system is used to secure (manage access to) data, held in or accessed through SAS Cloud Analytics Services (CAS).

There is no overlap between the two systems. If you are interested in the permissions on a CAS library or table, you use the CAS authorization system. For everything else stored inside SAS Viya, it’s the general authorization system. Click the links above for the documentation. It’s worth mentioning that SAS Viya does make use of the host operating system’s filesystem, and can also access data in other databases or in SAS 9, each of which has its own authorization system too. These are therefore important too, but they usually take up less of a SAS Administrator’s attention.

SAS Viya has a built-in custom group called SAS Administrators, which is granted broad administrator access to much of the SAS Viya system out of the box. It is not a good idea to make users members of SAS Administrators unless they will bear at least joint responsibility for the health and stability of your SAS Viya deployment. Departmental ‘power users’ who just need read and write access to all your users’ content should not be made members of your SAS Administrators group! It’s better to make a content administrators group (or equivalent) and grant that group more access than you grant to most users.

Editing folder permissions in SAS Environment Manager

In the Content page in SAS Environment Manager (and some other places too), you can edit the authorization settings for folders, reports and other objects stored by SAS Viya. For most objects (e.g. reports, data plans, models, etc.), the Edit Authorization dialog looks like this. I've overlaid a blue shape, intended to emphasize that all four permissions apply to the object itself:

The Edit Authorization dialog for a folder has two more permissions (add and remove), and also has a second set of permissions. Those on the left apply to the folder itself, while those on the right, with ‘(convey)’ in their names, apply to the folder’s contents (technically, they apply to the folder’s ‘container’ and are conveyed to its contents). Again I’ve overlaid blue shapes to emphasize this:

This allows you to easily set different permissions on the folder itself from those you set on the folder’s contents: you can prevent users from renaming, deleting or moving a folder, but allow them to edit everything inside it. That’s nice!

When securing folders and the content inside them in SAS Viya, three basic guidelines to follow are:

  • Secure folders, not individual objects (reports etc.) wherever possible. It’s simpler to manage, and easier to understand later. If two reports in the same folder should have different permissions, they should probably be in different folders.
  • Grant permissions to groups, not individual users. It’s simpler to manage, and easier to understand later. If there isn’t a group that contains the set of users you need, either make a Custom Group, ask your Active Directory or LDAP administrator to make a new group for you, or consider using a larger existing group if that will work without granting too much access to someone who should not have it.
  • Avoid prohibits, especially ‘(convey)’-ed prohibits. Try to design your authorization scheme or security model so that you grant permissions wherever you need them, to give just enough access to content, and not too much. If a group of users should have access to a folder, and some of its contents, don’t grant Read (convey) permission on the parent folder, and try to prohibit Read permission on the subfolder(s) that the group of users should not be able to see. That would initially work, but it’s poor design and will likely become an inconvenience later. Instead, grant the group Read access, not Read (convey), on the folder, and then selectively grant that group Read and Read (convey) on the contents that they should be able to see.

These guidelines give rise to some recurring patterns of access. Note that you won't find these patterns anywhere in the software: there is no screen, or button, or command for defining and applying them. They are design concepts only. This post's five simple general authorization patterns are as follows:

No Access

The default. SAS Viya does not give users access to anything unless access is directly or indirectly granted.

This pattern is of course only effective if the user or group does not have access conveyed from a parent folder. So long as it does not, your masterly inactivity on this folder’s permissions ensures it too is not accessible to this user or group. If the folder does have access conveyed for the user or group in question from higher up the folder hierarchy, remove those conveyed permissions, and grant the permissions you require (whether conveyed or not) more selectively to subfolders or to a smaller group, so that the group in question does not get access they should not get.

<Group> Read

Grants a group permission to see this object only.

When a group should have read access to an object (ideally a folder), but not to the object's contents, apply this pattern for that group. To apply it, grant a specific group Read permission on this folder, on the left-hand side of the authorization grid (or, if you are using a programmatic method to set permissions, on the object URI), and grant the group nothing on the right-hand side of the grid (or on the container URI).

The 'Group' in angle brackets is written that way because you might use variants of this pattern for several different groups: you may use an HR Read pattern, a Sales Read pattern, or a North American Finance Modelers Read pattern, as demanded by your requirements. Substitute the name of an actual group to turn the generalised, abstract pattern into a specific concrete pattern.

<Group> Read Convey

Grants a group permission to see this object and everything inside it.

To apply this pattern, grant a specific group Read permission on this folder, on the left-hand side of the authorization grid (or if you are using a programmatic method to set permissions, on the object URI), and also grant it the Read permission on the right-hand side of the grid (or on the container URI).
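If you prefer a programmatic method, the same pattern might be applied with the authorization plug-in of the sas-admin CLI, roughly as in the sketch below (the group name and folder ID are placeholders, and the exact command and option names can vary by release, so check the CLI help before relying on this):

# Grant <Group> Read on the folder itself (the object URI)
sas-admin authorization grant --group "Sales" \
    --object-uri "/folders/folders/<folderId>" --permissions read

# Grant Read (convey) on the folder's contents (the container URI)
sas-admin authorization grant --group "Sales" \
    --container-uri "/folders/folders/<folderId>" --permissions read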

<Group> Edit Contents

Grants a group permission to make changes to all of the content of this folder. When you grant permissions in the Group Edit Contents pattern, you will usually also grant the permissions in the Group Read Convey pattern too: users who edit things need to see them.

For this pattern, grant the group Add and Remove on the object itself, and grant it Update, Delete, Add and Remove on the object’s container. These permissions allow the group to add and remove objects from the folder, to modify the objects inside the folder, and to add and remove objects from any subfolders.

Usually, you wouldn’t apply this pattern on its own. If a group is allowed to edit the contents of a folder, they usually need to be able to see the folder too. So, apply this Group Edit Contents pattern plus the Group Read Convey pattern together, for the same group on the same object:

<Group> Secure

Grants a group permission to add, remove, or change permissions on this object.

To apply this pattern, grant a specific group Secure on this folder, and Secure (convey) on the container.

When you grant permissions in the Group Secure pattern, you will usually also grant the permissions in the Group Read Convey and Group Edit Contents patterns too: users who manage objects’ permissions are generally super-users, content administrators or similar, and need to be able to see and edit the things they are allowed to secure:

Using the patterns

We have seen five simple regularly-occurring general authorization system permissions patterns, and a couple of ways in which some of them are usually combined. You may have instances of each of these patterns for each of several different groups in a security model design. Let’s see how they can be applied to a fictitious folder structure for the made-up organization we use in some of our GEL Administration workshops, GELCorp:

You may notice that the symbols in this table’s cells correspond to the symbols for the patterns presented above. The white symbols show patterns of permissions that should be granted to the group in the column header, on the folder in the row header. The grey symbols show inherited permissions, to remind us that they are in effect on a folder, but do not need to be directly granted on it again.

I'd encourage you to attend one of our GEL SAS Viya Administration workshops, or the Securing SAS Viya Deployments workshop, both available in the VLE or as face-to-face workshops, for a fuller explanation of why we've applied the permissions patterns in this way. This security model design is certainly not the only way you could choose to secure GELCorp's folders. But I expect you can see, at a glance, how the folders are intended to be secured, even if it is not always obvious from this diagram why. Plus, if you were to follow this design for each group and each folder where a symbol appears, you would get exactly the same set of permissions we apply in our workshop.

Hopefully you will agree that this is quite an easy way to represent the basic permissions design for a set of folders and multiple groups!

Simple general authorization patterns was published on SAS Users.

October 17, 2019
 

CAS Table

SAS Viya's in-memory tables are referred to as CAS tables and are accessed using a CAS engine. In this post, we will explore how one can parallel load and compress a CAS table in one pass of the data.

Note: when you do not use this technique (that is, when you load with PROC CASUTIL and a COMPRESS option), loading a CAS table is a single-threaded process (slower). The code in Figure 1 can act as a template for you; we will review the SAS log of this code in Figure 2.

proc cas;
  file log;
  table.dropCaslib /
   caslib='sas7bdat' quiet = true;
  addcaslib /
    datasource={srctype="path"}
    name="sas7bdat"
    path="/viyafiles/sasss1/data"
  ; run;
 
  index /
    table={caslib="sas7bdat" name="cars.sas7bdat" singlepass=true}
    casout={caslib="sas7bdat" name="cars" compress=true replication=0}
  ; run;
  print _status; run;
 
  tabledetails /
    caslib="sas7bdat"
    name="cars"
  ; run;
quit;

Figure 1. Template of SAS Code to Parallel Load and Compress a CAS Table in One Pass of the CAS Table
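For comparison, a conventional serial load of the same table with PROC CASUTIL might look like the following sketch (single-threaded; the exact option support can vary by release, so treat this as an assumption to verify against your documentation):

/* serial (single-threaded) load and compress with PROC CASUTIL */
proc casutil incaslib="sas7bdat" outcaslib="sas7bdat";
   load casdata="cars.sas7bdat" casout="cars_serial" compress replace;
quit;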

To accomplish the parallel load and compression of a CAS table, we will leverage PROC CAS. Let’s review the SAS log in Figure 2:

  • Line 85 utilizes the FILE LOG statement to redirect the contents we would normally see in the results window directly to the SAS Log i.e. the information between lines 92 and 93 as well as lines 103 and 104. Note: using the FILE LOG statement is optional.
  • On lines 86-92 we are dropping and creating our CASLIB to point to the file system path that contains the SAS7BDAT data set that we want to load and compress, i.e. CARS.
  • On line 87 we added the option QUIET = TRUE to our statement. This is a very handy trick to avoid the ERROR message shown in Figure 3. If you omit this option, an ERROR message is produced when you have a brand-new CAS session and the SAS7BDAT CASLIB has not been defined to that session.
  • Lines 88-91 created our CASLIB to our SAS7BDAT data sets.
  • Lines 94-96 are the statements that accomplish the parallel load and compression of our CAS table.
  • On line 94 we use the INDEX action, which is always executed in parallel. Notice we are not creating any indexes in our example but are simply using the INDEX action to activate the parallel load.
  • On line 95 we identify the CASLIB pointing to our source data set CARS.SAS7BDAT. We are also using the option SINGLEPASS = TRUE which means our CAS table will be loaded and compressed as each thread adds a row to our CAS table.
  • On line 96 we are saving our CAS table to our CASLIB SAS7BDAT and naming it CARS. The COMPRESS = TRUE option ensures our CAS table is compressed. REPLICATION = 0 creates no redundant copies of the table; to ensure high availability of the CAS table, specify REPLICATION = 1 instead, which keeps 2 copies of the table.
  • On line 98 we print to the SAS log information telling us that the table was loaded and compressed successfully, i.e. {severity=0,reason=0,,statusCode=0}.
  • Lines 101-103 provide information on our compressed CAS table. Reviewing this information, we can see our CAS table has a compression ratio of 5.

Figure 2. SAS Log to Parallel Load and Compress a CAS Table in One Pass of the CAS Table

 

Figure 3. To Avoid this ERROR Message, We Will Add to Line 76 the option QUIET = TRUE, See Line 87 in Figure 2.

Conclusion

When loading multiple CAS tables, it is a common practice to compress them to help avoid paging to the CAS_DISK_CACHE, because paging to the CAS_DISK_CACHE impacts performance. In addition, one can parallel load and compress source tables from various formats, i.e. CSV files and Hadoop tables, as well as many other relational databases, such as Oracle, Teradata and so on.

How to Parallel Load and Compress a SAS® Cloud Analytic Services (CAS) Table was published on SAS Users.

October 17, 2019
 

It has recently become fashionable to talk about data literacy. This is an important idea so I’m glad to see people discussing it.

To me, data literacy means understanding that data are not dry, dusty, abstract squiggles on a computer screen, but represent living things: people, plants, animals. Having a deep understanding of data enables people to engage with data, to explore data, to interpret data, and to use data to impact their lives and work.

Data literacy necessarily comes with a degree of skepticism, recognizing that data can be not only used, but also misused. In this age of “alternative facts,” it is important to recognize when assertions are supported by data, and when they are not.

Everyone knows that technology is becoming more and more a part of everyday life. Without data literacy, people become passive recipients; with data literacy, you can actively engage with technology.

You know you are fluent in a foreign language when you are comfortable speaking it and can communicate what you want to say. The same is true for data literacy; it is about reaching a level of comfort, about seeing the meaning behind the data, and about being able to communicate what you want to say.

Here is how Wikipedia defines data literacy.

October 16, 2019
 

Introduction

Generating a word cloud (also known as a tag cloud) is a good way to mine internet text. Word (or tag) clouds visually represent the occurrence of keywords found in internet data such as Twitter feeds. In the visual representation, the importance of each keyword is denoted by the font size or font color.

You can easily generate word clouds by using the Python language. Now that Python has been integrated into the SAS® System (via the SASPy package), you can take advantage of the capabilities of both languages. That is, you create the word cloud with Python. Then you use SAS to analyze the data and create reports. You must have SAS® 9.4 and Python 3 or later in order to connect to SAS from Python with SASPy. Developed by SAS, SASPy is a Python package that contains methods that enable you to connect to SAS from Python and to generate analysis in SAS.

Configuring SASPy

The first step is to configure SASPy. To do so, see the instructions in the SASPy Installation and configuration document. For additional details, see also the SASPy Getting started document and the API Reference document.

Generating a word cloud with Python

The example discussed in this blog uses Python to generate a word cloud by reading an open table from the data.world website that is stored as a CSV file. This file is from a simple Twitter analysis job in which contributors commented via tweets as to how they feel about self-driving cars. (For this example, we're using data that are already scored for sentiment. SAS does offer text analytics tools that can score text for sentiment too -- see this example about rating conference presentations.) The sentiments were classified as very positive, slightly positive, neutral, slightly negative, very negative, and not relevant. (In the frequency results that are shown later, these sentiments are specified, respectively, as 1, 2, 3, 4, 5, and not_relevant.) This information is important to automakers as they begin to design more self-driving vehicles, and as transportation companies such as Uber and Lyft are already adding self-driving cars to the road. Along with understanding the sentiments that people expressed, we are also interested in exactly what the contributors said. The word cloud gives you a quick visual representation of both. If you do not have the wordcloud package installed, install it by submitting the following command:

pip install wordcloud

After you install the wordcloud package, you can obtain a list of required and optional parameters by submitting this command:

?wordcloud

Then, follow these steps:

  1. First, you import the packages that you need in Python that enable you to import the CSV file and to create and save the word-cloud image, as shown below.

  2. Create a Python Pandas dataframe from the twitter sentiment data that is stored as CSV data in the data file. (The data in this example is a cleaned-up subset of the original CSV file on the data.world website.)

  3. Use the following code, containing the HEAD() method, to display the first five records of the Sentiment and Text columns. This step enables you to verify that the data was imported correctly.

  4. Create a variable that holds all of the text in a single row of data that can be used in the generation of the word cloud.

  5. Generate the word cloud from the TEXTVAR variable that you create in step 4. Include any parameters that you want. For example, you might want to change the background color from black to white (as shown below) to enable you to see the values better. This step includes the STOPWORDS= parameter, which enables you to supply a list of words that you want to eliminate. If you do not specify a list of words, the parameter uses the built-in default list.

  6. Create the word-cloud image and modify it, as necessary.
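The Python code for these steps appears as screenshots in the original post. Here is a minimal consolidated sketch of steps 1-6 (the file name and the Sentiment and Text column names are assumptions based on the description above):

# Step 1: import the packages for reading the CSV and building the word cloud
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Step 2: create a Pandas dataframe from a (hypothetical) local copy of the CSV file
df = pd.read_csv("self_driving_car_sentiment.csv")

# Step 3: verify the import by displaying the first five records
print(df[["Sentiment", "Text"]].head())

# Step 4: collapse all of the tweet text into a single string
textvar = " ".join(df["Text"].astype(str))

# Steps 5-6: generate the word cloud with a white background and the
# default STOPWORDS list, then save the image
wc = WordCloud(background_color="white", stopwords=STOPWORDS).generate(textvar)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.savefig("wordcloud.png")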

Analyzing the data with SAS®

After you create the word cloud, you can further analyze the data in Python. However, you can actually connect to SAS from Python (using the SASPy API package), which enables you to take advantage of SAS software's powerful analytics and reporting capabilities. To see a list of all available APIs, see the API Reference.

The following steps explain how to use SASPy to connect to SAS.

  1. Import the SASPy package (API) . Then create and generate a SAS session, as shown below. The code below creates a SAS session object.

  2. Create a SAS data set from the Python dataframe by using the DATAFRAME2SASDATA method. In the code below, that method is shown by its alias, DF2SD.

  3. Use the SUBMIT() method to include SAS code that analyzes the data with the FREQ procedure. The code also uses the GSLIDE procedure to add the word cloud to an Adobe PDF file.

    When you submit the code, SAS generates the PDF file that contains the word-cloud image and a frequency analysis, as shown in the following output:
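The SASPy code itself appears as screenshots in the original post. A minimal sketch of steps 1-3 (without the PDF step; the table name is an assumption, and df is the dataframe created earlier) might look like this:

# Step 1: import SASPy and start a SAS session (uses your saspy configuration)
import saspy
sas = saspy.SASsession()

# Step 2: transfer the Python dataframe to the SAS data set WORK.TWEETS
tweets = sas.df2sd(df, table="tweets")

# Step 3: submit SAS code that analyzes the data with PROC FREQ
result = sas.submit("""
   proc freq data=work.tweets;
      tables Sentiment;
   run;
""")
print(result["LST"])   # the rendered SAS output (result["LOG"] holds the log)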

Summary

As you can see from the responses in the word cloud, it seems that the contributors are quite familiar with Google driverless cars. Some contributors are also familiar with the work that Audi has done in this area. However, you can see that after further analysis (based on a subset of the data), most users are still unsure about this technology. That is, 74 percent of the users responded with a sentiment frequency of 3, which indicates a neutral view about driverless cars. This information should alert automakers that more education and marketing is required before they can bring self-driving cars to market. This analysis should also signal companies such as Uber Technologies Inc. and Lyft, Inc. that perhaps consumers need more information in order to feel secure with such technology.

Creating a word cloud using Python and SAS® software was published on SAS Users.

October 16, 2019
 

The EFFECT statement is supported by more than a dozen SAS/STAT regression procedures. Among other things, it enables you to generate spline effects that you can use to fit nonlinear relationships in data. Recently there was a discussion on the SAS Support Communities about how to interpret the parameter estimates of spline effects. This article answers that question by visualizing the spline effects.

An overview of generated effects

Spline effects are powerful because they enable you to use parametric models to fit nonlinear relationships between an independent variable and a response. Using spline effects is not much different than using polynomial effects to fit nonlinear relationships. Suppose that a response variable, Y, appears to depend on an explanatory variable, X, in a complicated nonlinear fashion. If the relationship looks quadratic or cubic, you might try to capture the relationship by introducing polynomial effects. Instead of trying to model Y by X alone, you might use X, X², and X³.

Strictly speaking, polynomial effects do not need to be centered at the origin. You could translate the polynomial by some amount, k, and use shifted polynomial effects such as (X-k), (X-k)², and (X-k)³. Or you could combine these shifted polynomials with polynomials at the origin. Or use polynomials that are shifted by different amounts, such as by the constants k1, k2, and k3.

Spline effects are similar to (shifted) polynomial effects. The constants (such as k1, k2, k3) that are used to shift the polynomials are called knots. Knots that are within the range of the data are called interior knots. Knots that are outside the range of the data are called exterior knots or boundary knots. You can read about the various kinds of spline effects that are supported by the EFFECT statement in SAS. Rather than rehash the mathematics, this article shows how you can use SAS to visualize a regression that uses splines. The visualization clarifies the meaning of the parameter estimates for the spline effects.

Output and visualize spline effects

This section shows how to output the spline effects into a SAS data set and plot the spline effects. Suppose you want to predict the MPG_City variable (miles per gallon in the city) based on the engine size. Because we will be plotting curves, the following statements sort the data by the EngineSize variable. Then the OUTDESIGN= option on the PROC GLMSELECT statement writes the spline effects to the Splines data set. For this example, I am using restricted cubic splines and four evenly spaced internal knots, but the same ideas apply to any choice of spline effects.

/* Sort by the independent variable because we will be plotting curves */
proc sort data=Sashelp.Cars out=cars;
   by EngineSize;
run;
 
/* Fit data by using restricted cubic splines.
   The EFFECT statement is supported by many procedures: GLIMMIX, GLMSELECT, LOGISTIC, PHREG, ... */
title "Restricted TPF Splines";
title2 "Four Internal Knots";
proc glmselect data=cars outdesign(addinputvars fullmodel)=Splines; /* data set contains spline effects */
   effect spl = spline(EngineSize / details       /* define spline effects */
                naturalcubic basis=tpf(noint)     /* natural cubic splines, omit constant effect */
                knotmethod=equal(4));             /* 4 evenly spaced interior knots */
   model mpg_city = spl / selection=none;         /* fit model by using spline effects */
   ods select ParameterEstimates SplineKnots;
   ods output ParameterEstimates=PE;
quit;

The SplineKnots table shows the locations of the internal knots. There are four equally spaced knots because the procedure used the KNOTMETHOD=EQUAL(4) option. The ParameterEstimates table shows estimates for the regression coefficients for the spline effects, which are named "Spl 1", "Spl 2", and so forth. In the Splines data set, the corresponding variables are named Spl_1, Spl_2, and so forth.

But what do these spline effects look like? The following statements plot the spline effects versus the EngineSize variable, which is the variable from which the effects are generated:

proc sgplot data=Splines;
   series x=EngineSize y=Intercept / curvelabel;
   series x=EngineSize y=spl_1 / curvelabel;
   series x=EngineSize y=spl_2 / curvelabel;
   series x=EngineSize y=spl_3 / curvelabel;
   refline 2.7 4.1 5.5 / axis=x lineattrs=(color="lightgray");
   refline 6.9 / axis=x label="upper knot" labelloc=inside lineattrs=(color="lightgray");
   yaxis label="Spline Effect";
run;

As stated in the documentation for the NATURALCUBIC option, these spline effects include "an intercept, the polynomial X, and n – 2 functions that are all linear beyond the largest knot," where n is the number of knots. This example uses n=4 knots, so Spl_2 and Spl_3 are the cubic splines. You will also see different spline effects if you change to one of the other supported spline methods, such as B-splines or the truncated power functions. Try it!

The graph shows that the natural cubic splines are reminiscent of polynomial effects, but there are a few differences:

  • The spline effects (spl_2 and spl_3) are shifted away from the origin. The spl_2 effect is shifted by 2.7 units, which is the location of the first internal knot. The spl_3 effect is shifted by 4.1 units, which is the location of the second internal knot.
  • The spline effects are 0 when EngineSize is less than the first knot position (2.7). Not all splines look like this, but these effects are based on truncated power functions (the TPF option).
  • The spline effects are linear when EngineSize is greater than the last knot position (6.9). Not all splines look like this, but these effects are restricted splines.

Predicted values are linear combinations of the spline effects

Visualizing the shapes of the spline effects enable you to make sense of the ParameterEstimates table. As in all linear regression, the predicted value is a linear combination of the design variables. In this case, the predicted values are formed by
Pred = 34.96 - 5*Spl_1 + 2.2*Spl_2 - 3.9*Spl_3
You can use the SAS DATA step or PROC IML to compute that linear combination of the spline effects. In the original post, a graph shows the resulting predicted curve with the locations of the knots overlaid.
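A minimal DATA step sketch of that computation (the coefficients are the rounded values from the ParameterEstimates table) is:

/* form predicted values as a linear combination of the spline effects */
data Pred;
   set Splines;
   Pred = 34.96 - 5*Spl_1 + 2.2*Spl_2 - 3.9*Spl_3;
run;
 
proc sgplot data=Pred;
   series x=EngineSize y=Pred;
   refline 2.7 4.1 5.5 6.9 / axis=x lineattrs=(color="lightgray");
run;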

The coefficient for Spl_1 is the largest effect (after the intercept). In the graph, you can see the general trend has an approximate slope of -5. The coefficient for the Spl_2 effect is 2.2, and you can see that the predicted values change slope between the second and third knots due to adding the Spl_2 effect. Without the Spl_3 effect, the predicted values would continue to rise after the third knot, but by adding in a negative multiple of Spl_3, the predicted values turn down again after the third knot.

Notice that the prediction function for the restricted cubic spline regression is linear before the first knot and after the last knot. The prediction function models nonlinear relationships between the interior knots.

Summary

In summary, the EFFECT statement in SAS regression procedures can generate spline effects for a continuous explanatory variable. The EFFECT statement supports many different types of splines. This article gives an example of using natural cubic splines (also called restricted cubic splines), which are based on the truncated power function (TPF) splines of degree 3. By outputting the spline effects to a data set and graphing them, you can get a better understanding of the meaning of the estimates of the regression coefficients. The predicted values are a linear combination of the spline effects, so the magnitude and sign of the regression coefficients indicate how the spline effects combine to predict the response.

The post Visualize a regression with splines appeared first on The DO Loop.

October 15, 2019
 

In a previous post, I discussed using logs to troubleshoot problems in your Viya environment. In this post, I will look at some additional ways to troubleshoot using some of the tools provided by the Viya Operations Infrastructure. With applications, servers and numerous micro-services all working together and generating their own logs in Viya, it can be difficult to find relevant logs. In order to manage the large number of logs and to enable you to locate messages of interest, the operations infrastructure provides components to collect and store log messages.

The collection process is illustrated in the diagram below.

Co-ordinated by the operations infrastructure:

  • sas-watch log continuously collects and sends log messages to the RabbitMQ exchange
  • sas-stream pulls the messages from RabbitMQ and writes them to disk as a tab-separated value (TSV) file
  • Every five minutes, the sas-ops-agentsrv runs the DatamartEtl task to extract log messages from the TSV file and load them into the VIYALOGS CAS-indexed search table

SAS Environment Manager uses the information in the VIYALOGS and VIYALOGS_SOURCES tables to display log messages and graphs of the frequency and trends of messages. SAS Environment Manager's Logs interface makes it easy to search and analyze log messages. Using the interface, you can view, subset, and search logs. The interface has the filtering capabilities on the left-hand side and displays the messages on the right. By default, the filter is set to display all messages from all applications and services from the last 30 minutes.

You can modify the filter to extend or shorten the timeframe, subset the level of messages displayed or the source (service/application) that the messages are coming from. You can also search for any text within a message.

Many administrators would prefer a command-line interface, and the good news is there is one.

sas-ops is a command-line interface for monitoring the operational infrastructure in a SAS Viya deployment environment.

I have found the sas-ops logs command very useful for troubleshooting problems. The sas-ops logs command can be used to stream log messages that are generated by SAS Viya applications and services. The messages can be streamed to a terminal window or piped to a file. The sas-ops logs command is located at /opt/sas/viya/home/bin and can be run from any machine in a Viya environment that is included in the CommandLine host group.

When would you use sas-ops logs to stream log messages? Some potential scenarios are to:

  • troubleshoot a poorly performing report or analysis
  • debug problems in the environment such as logon issues
  • monitor access to resources

In these cases, using sas-ops logs you can stream the log messages from all services to a single file or terminal.

In its simplest form, the command live streams all log messages from a Viya environment to the terminal. Pressing CTRL+C stops the streaming.

./sas-ops logs

Partial output from the stream is shown below.

If you want to save the output, you can redirect the stream to a file.

./sas-ops logs > /tmp/mylog.log

You can get more creative and achieve more complex tasks. You can change the format of the message output by using the --format option. For example, to create a JSON file that could be read by another process, use:

./sas-ops logs --format pretty > mylogs.json

You can also:

  • stream messages for just a specific Viya service
  • filter logs messages by text in a regular expression
  • stream for a specific duration

The duration is specified using the format 0h0m0s0ms, but you can also use individual parts of the specification, for example 30s for 30 seconds or 5m for 5 minutes.

Consider the situation where we want to monitor access to a particular CAS table over a specific period of time. The command below will output to a file all messages that contain the table name HR_SUMMARY for a period of 5 minutes.

./sas-ops logs --match HR_SUMMARY --timeout 5m > /tmp/hr_summary_access.log

The output shows all the CAS actions that were performed on the table during the time period.

You can subset the stream to one service.

Consider a case where a user is having an issue logging in and you suspect a problem with the LDAP setup. To check, first enable DEBUG logging on com.sas.identities. Then stream the log messages from the identities service.

./sas-ops logs --format pretty --source identities > logonerrors.json

Viewing the output shows that there is something wrong with the LDAP query.

I think you will agree that sas-ops logs is a very useful tool for monitoring and troubleshooting issues in a Viya environment. For more information, check out the following resources:

I would like to thank Bryan Ellington for his helpful input with this post.

Capturing log messages from Viya deployments was published on SAS Users.

October 14, 2019
 

I recently wrote about how to use PROC TTEST in SAS/STAT software to compute the geometric mean and related statistics. This prompted a SAS programmer to ask a related question. Suppose you have dozens (or hundreds) of variables and you want to compute the geometric mean of each. What is the best way to obtain these geometric means?

As I mentioned in the previous post, the SAS/IML language supports the GEOMEAN function, so you can compute the geometric means by iterating over the columns of a data matrix. If you do not have SAS/IML software, you can use PROC UNIVARIATE in Base SAS. The UNIVARIATE procedure supports the OUTTABLE= option, which creates a SAS data set that contains many univariate statistics, including the geometric mean.
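For example, a minimal SAS/IML sketch of that approach (assuming the variables contain positive, nonmissing values) might look like this:

proc iml;
use Sashelp.Cars;
read all var _NUM_ into X[colname=varNames];  /* read all numeric variables */
close Sashelp.Cars;
 
gm = j(1, ncol(X), .);              /* allocate a row vector for the results */
do i = 1 to ncol(X);
   gm[i] = geomean(X[,i]);          /* geometric mean of the i_th column */
end;
print gm[colname=varNames];
quit;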

For example, suppose you want to compute the geometric means for all numeric variables in the Sashelp.Cars data set. You can use the OUTTABLE= option to write the output statistics to a data set and then print only the column that contains the geometric mean, as follows:

proc univariate data=Sashelp.Cars outtable=DescStats noprint;
   var _NUMERIC_;
run;
 
proc print data=DescStats noobs;
   var _var_ _GEOMEAN_;
run;

This method also works if your data contain a classification variable and you want to compute the geometric mean for each level of the classification variable. For example, the following statements compute the geometric means for two variables for each level of the Origin variable, which has the values "Asia", "Europe", and "USA":

proc univariate data=Sashelp.Cars outtable=DescStatsClass noprint;
   class Origin;
   var MPG_City Horsepower;
run;
 
proc print data=DescStatsClass noobs;
   var _var_ Origin _GEOMEAN_;
run;

In summary, if you want to use Base SAS to compute the geometric mean (or any of almost 50 other descriptive statistics) for many variables, use the OUTTABLE= option of PROC UNIVARIATE.

The post Compute the geometric mean for many variables in SAS appeared first on The DO Loop.

October 10, 2019
 

DATA Step BY Statements

The DATA step is a very powerful language that SAS and open-source programmers leverage to stage data for the analytical life cycle. A popular technique is to use the DESCENDING option on BY variables to identify the largest value. Let's review the example in Figure 1 (a sketch of its code appears after this list):

  • On line 74 we are using the descending option on the BY statement for the numeric variable MSRP. The reason we are doing this is so we can identify the most expensive car for each make of car in our data set.
  • On line 79 we group our data by MAKE of car.
  • On line 80 we leverage the FIRST. statement on the subsetting IF statement to output the first record for each MAKE. In Figure 2 we can review the results.
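Figure 1 appears as a screen capture in the original post. A minimal sketch of the code that the bullets describe (the line numbers are omitted, and the data set and output names are assumptions) is:

proc sort data=sashelp.cars out=cars;
   by make descending msrp;   /* DESCENDING puts the largest MSRP first */
run;
 
data most_expensive;
   set cars;
   by make;                   /* group the sorted data by MAKE */
   if first.make;             /* output the most expensive car per make */
run;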


Figure 1. Descending BY Statement


Figure 2. Listing of Most Expensive Cars by MAKE

What is CAS?

CAS is SAS Viya's in-memory engine that processes data and logic in a distributed computing paradigm. When working with CAS tables, we can simulate the DESCENDING BY statement by creating a CAS View, which then becomes the source table for our DATA Step. Let's review Figure 3:

  • On line 79 we will leverage the CASL (SAS® Cloud Analytic Services Language) action set TABLE with the action VIEW to create the CAS View that will be used as our source table in the DATA Step.
  • On lines 80 and 81 we will store our CAS View in the CASUSER CASLIB with the name of DESCENDING.
  • On line 82 and 83 we use the TABLES statement to specify the input CAS table to our CAS View.
  • On line 84 we use the VARLIST statement to identify the columns from the input table we want in our CAS View.
  • On line 85 we create a new variable for our CAS View by using the computedVars statement.
  • On line 86 we provide the math for our new variable N_MSRP. N_MSRP is the negated value of the input CAS table variable MSRP. Note: This simulation only works for numeric variables. For character data I suggest using LAST. processing which you can review in this blog post.


Figure 3. Simulating DESCENDING BY Statement for Numeric Variables

Now that we have our CAS View with its new variable N_MSRP, we can move on to the DATA Step code in Figure 3.

  • On line 92 the SET statement specifies the source for our DATA Step: the CAS View CASUSER.DESCENDING.
  • On line 93 we leverage the BY statement to group our data in ascending order of the CAS View variables MAKE and N_MSRP. Because N_MSRP is in ascending order, our original variable MSRP is in DESCENDING order.
  • On line 94 we use a subsetting IF statement to output the first occurrence of each MAKE.

Figure 4 is a listing of our new CAS table CASUSER.DESCENDING2 and displays the most expensive car for each make of car.


Figure 4. Listing of Most Expensive Cars by MAKE

Template for Creating a CAS View

/* Create a CAS view */
/* For each DESCENDING numeric create a new variable(s) */
/* The value of the new variable(s) is the negated value */
/* of the original DESCENDING BY numeric variable(s) */
proc cas;
   table.view / replace = true
   caslib='casuser'
   name='descending'
   tables={{
      name='cars'
      varlist={'msrp' 'make'},
      computedVars={{name='n_msrp'}},
      computedVarsProgram='n_msrp = -(msrp)'
   }};
run;
quit;
 
data casuser.descending2;
   set casuser.descending;
   by make n_msrp ;
   if first.make ;
run;
 
proc print data=casuser.descending2;
title "Most Expensive Cars";
run;

Conclusion

It is a very common coding technique to process data with a DESCENDING BY statement in the DATA step. With Viya 3.5 the DESCENDING BY statement is supported for numeric and character data in the DATA step, with this caveat: DESCENDING works on all but the first BY variable on the BY statement. For earlier versions of SAS Viya, this simulation is the best practice for numeric data that you want in DESCENDING order.

How to Simulate DATA Step DESCENDING BY Statements in SAS® Cloud Analytic Services (CAS) was published on SAS Users.