1月 282019
 

This article shows how to use SAS to simulate data that fits a linear regression model that has categorical regressors (also called explanatory or CLASS variables). Simulating data is a useful skill for both researchers and statistical programmers. You can use simulation for answering research questions, but you can also use it generate "fake" (or "synthetic") data for a presentation or blog post when the real data are confidential. I have previously shown how to use SAS to simulate data for a linear model and for a logistic model, but both articles used only continuous regressors in the model.

This discussion and SAS program are based on Chapter 11 in Simulating Data with SAS (Wicklin, 2013, p. 208-209). The main steps in the simulation are as follows. The comments in the SAS DATA program indicate each step:

  1. Macro variables are used to define the number of continuous and categorical regressors. Another macro variable is used to specify the number of levels for each categorical variable. This program simulates eight continuous regressors (x1-x8) and four categorical regressors (c1-c4). Each categorical regressor in this simulation has three levels.
  2. The continuous regressors are simulated from a normal distribution, but you can use any distribution you want. The categorical levels are 1, 2, and 3, which are generated uniformly at random by using the "Integer" distribution. The discrete "Integer" distribution was introduced in SAS 9.4M5; for older versions of SAS, use the %RandBetween macro as shown in the article "How to generate random integers in SAS." You can also generate the levels non-uniformly by using the "Table" distribution.
  3. The response variable, Y, is defined as a linear combination of some explanatory variables. In this simulation, the response depends linearly on the x1 and x8 continuous variables and on the levels of the C1 and C4 categorical variables. Noise is added to the model by using a normally distributed error term.
/* Program based on Simulating Data with SAS, Chapter 11 (Wicklin, 2013, p. 208-209) */
%let N = 10000;                  /* 1a. number of observations in simulated data */
%let numCont =  8;               /* number of continuous explanatory variables */
%let numClass = 4;               /* number of categorical explanatory variables */
%let numLevels = 3;              /* (optional) number of levels for each categorical variable */
 
data SimGLM; 
  call streaminit(12345); 
  /* 1b. Use macros to define arrays of variables */
  array x[&numCont]  x1-x&numCont;   /* continuous variables named x1, x2, x3, ... */
  array c[&numClass] c1-c&numClass;  /* CLASS variables named c1, c2, ... */
  /* the following statement initializes an array that contains the number of levels
     for each CLASS variable. You can hard-code different values such as (2, 3, 3, 2, 5) */
  array numLevels[&numClass] _temporary_ (&numClass * &numLevels);  
 
  do k = 1 to &N;                 /* for each observation ... */
    /* 2. Simulate value for each explanatory variable */ 
    do i = 1 to &numCont;         /* simulate independent continuous variables */
       x[i] = round(rand("Normal"), 0.001);
    end; 
    do i = 1 to &numClass;        /* simulate values 1, 2, ..., &numLevels with equal prob */
       c[i] = rand("Integer", numLevels[i]);        /* the "Integer" distribution requires SAS 9.4M5 */
    end; 
 
    /* 3. Simulate response as a function of certain explanatory variables */
    y = 4 - 3*x[1] - 2*x[&numCont] +                /* define coefficients for continuous effects */
       -3*(c[1]=1) - 4*(c[1]=2) + 5*c[&numClass]    /* define coefficients for categorical effects */
        + rand("Normal", 0, 3);                     /* normal error term */
    output;  
  end;  
  drop i k;
run;     
 
proc glm data=SimGLM;
   class c1-c&numClass;
   model y = x1-x&numCont c1-c&numClass / SS3 solution;
   ods select ModelANOVA ParameterEstimates;
quit;
Parameter estimates for synthetic (simulated) data that follows a regression model.

The ModelANOVA table from PROC GLM (not shown) displays the Type 3 sums of squares and indicates that the significant terms in the model are x1, x8, c1, and c4.

The parameter estimates from PROC GLM are shown to the right. You can see that each categorical variable has three levels, and you can use PROC FREQ to verify that the levels are uniformly distributed. I have highlighted parameter estimates for the significant effects in the model:

  • The estimates for the significant continuous effects are highlighted in red. The estimates for the coefficients of x1 and x8 are about -3 and -2, respectively, which are the parameter values that are specified in the simulation.
  • The estimates for the levels of the C1 CLASS variable are highlighted in green. They are close to (-3, -4, 0), which are the parameter values that are specified in the simulation.
  • The estimates for the Intercept and the C4 CLASS variable are highlighted in magenta. Notice that they seem to differ from the parameters in the simulation. As discussed previously, the simulation and PROC GLM use different parameterizations of the C4 effect. The simulation assigns Intercept = 4. The contribution of the first level of C4 is 5*1, the contribution for the second level is 5*2, and the contribution for the third level is 5*3. As explained in the previous article, the GLM parameterization reparameterizes the C4 effect as (4 + 15) + (5 - 15)*(C4=1) + (10 - 15)*(C4=2). The estimates are very close to these parameter values.

Although this program simulates a linear regression model, you can modify the program and simulate from a generalized linear model such as the logistic model. You just need to compute the linear predictor, eta (η), and then use the link function and the RAND function to generate the response variable, as shown in a previous article about how to simulate data from a logistic model.

In summary, this article shows how to simulate data for a linear regression model in the SAS DATA step when the model includes both categorical and continuous regressors. The program simulates arbitrarily many continuous and categorical variables. You can define a response variable in terms of the explanatory variables and their interactions.

The post Simulate data for a regression model with categorical and continuous variables appeared first on The DO Loop.

1月 252019
 

Need to authenticate on REST API calls

In my blog series regarding SAS REST APIs (article 1, article 2, article 3) I outlined how to integrate SAS analytical capabilities into applications. I detailed how to construct REST calls, build body parameters and interpret the responses. I've not yet covered authentication for the operations, purposefully putting the cart before the horse. If you're not authenticated, you can't do much, so this post will help to get the horse and cart in the right order.

While researching authentication I ran into multiple, informative articles and papers on SAS and OAuth. A non-exhaustive list includes Stuart Rogers' article on SAS Viya authentication options, one of which is OAuth. Also, I found several resources on connecting to external applications from SAS with explanations of OAuth. For example, Joseph Henry provides an overview of OAuth and using it with PROC HTTP and Chris Hemedinger explains securing REST API credentials in SAS programs in this article. Finally, the SAS Viya REST API documentation covers details on application registration and access token generation.

Consider this post a quick guide to summarize these resources and shed light on authenticating via authorization code and passwords.

What OAuth grant type should I use?

Choosing the grant method to get an access token with OAuth depends entirely on your application. You can get more information on which grant type to choose here. This post covers two grant methods: authorization code and password. Authorization code grants are generally used with web applications and considered the safest choice. Password grants are most often used by mobile apps and applied in more trusted environments.

The process, briefly

Getting an external application connected to the SAS Viya platform requires the following steps:

  1. Use the SAS Viya configuration server's Consul token to obtain an ID Token to register a new Client ID
  2. Use the ID Token to register the new client ID and secret
  3. Obtain the authorization code
  4. Acquire the access OAuth token of the Client ID using the authorization code
  5. Call the SAS Viya API using the access token for the authentication.

Registering the client (steps 1 and 2) is a one-time process. You will need a new authorization code (step 3) if the access token is revoked. The access and refresh tokens (step 4) are created once and only need to be refreshed if/when the token expires. Once you have the access token, you can call any API (step 5) if your access token is valid.

Get an access token using an authorization code

Step 1: Get the SAS Viya Consul token to register a new client

The first step to register the client is to get the consul token from the SAS server. As a SAS administrator (sudo user), access the consul token using the following command:

$ export CONSUL_TOKEN=`cat /opt/sas/viya/config/etc/SASSecurityCertificateFramework/tokens/consul/default/client.token`
64e01b03-7dab-41be-a104-2231f99d7dd8

The Consul token returns and is used to obtain an access token used to register the new application. Use the following cURL command to obtain the token:

$ curl -k -X POST "https://sasserver.demo.sas.com/SASLogon/oauth/clients/consul?callback=false&serviceId=app" \
     -H "X-Consul-Token: 64e01b03-7dab-41be-a104-2231f99d7dd8"
 {"access_token":"eyJhbGciOiJSUzI1NiIsIm...","token_type":"bearer","expires_in":35999,"scope":"uaa.admin","jti":"de81c7f3cca645ac807f18dc0d186331"}

The returned token can be lengthy. To assist in later use, create an environment variable from the returned token:

$ export IDTOKEN="eyJhbGciOiJSUzI1NiIsIm..."

Step 2: Register the new client

Change the client_id, client_secret, and scopes in the code below. Scopes should always include "openid" along with any other groups this client needs to get in the access tokens. You can specify "*" but then the user gets prompted for all their groups, which is tedious. The example below just uses one group named "group1".

$ curl -k -X POST "https://sasserver.demo.sas.com/SASLogon/oauth/clients" \
       -H "Content-Type: application/json" \
       -H "Authorization: Bearer $IDTOKEN" \
       -d '{
        "client_id": "myclientid", 
        "client_secret": "myclientsecret",
        "scope": ["openid", "group1"],
        "authorized_grant_types": ["authorization_code","refresh_token"],
        "redirect_uri": "urn:ietf:wg:oauth:2.0:oob"
       }'
{"scope":["openid","group1"],"client_id":"app","resource_ids":["none"],"authorized_grant_types":["refresh_token","authorization_code"],"redirect_uri":["urn:ietf:wg:oauth:2.0:oob"],"autoapprove":[],"authorities":["uaa.none"],"lastModified":1547138692523,"required_user_groups":[]}

Step 3: Approve access to get authentication code

Place the following URL in a browser. Change the hostname and myclientid in the URL as needed.

https://sasserver.demo.sas.com/SASLogon/oauth/authorize?client_id=myclientid&response_type=code

The browser redirects to the SAS login screen. Log in with your SAS user credentials.

SAS Login Screen

On the Authorize Access screen, select the openid checkbox (and any other required groups) and click the Authorize Access button.

Authorize Access form

After submitting the form, you'll see an authorization code. For example, "lB1sxkaCfg". You will use this code in the next step.

Authorization Code displays

Step 4: Get an access token using the authorization code

Now we have the authorization code and we'll use it in the following cURL command to get the access token to SAS.

$ curl -k https://sasserver.demo.sas.com/SASLogon/oauth/token -H "Accept: application/json" -H "Content-Type: application/x-www-form-urlencoded" \
     -u "app:appclientsecret" -d "grant_type=authorization_code&code=YZuKQUg10Z"
{"access_token":"eyJhbGciOiJSUzI1NiIsImtpZ...","token_type":"bearer","refresh_token":"eyJhbGciOiJSUzI1NiIsImtpZC...","expires_in":35999,"scope":"openid","jti":"b35f26197fa849b6a1856eea1c722933"}

We use the returned token to authenticate and authorize the calls made between the client and SAS. We also get a refresh token we use to issue a new token when the current one expires. This way we can avoid repeating all the previous steps. I explain the refresh process further down.

We will again create environment variables for the tokens.

$ export ACCESS_TOKEN="eyJhbGciOiJSUzI1NiIsImtpZCI6ImxlZ..."
$ export REFRESH_TOKEN="eyJhbGciOiJSUzI1NiIsImtpZC..."

Step 5: Use the access token to call SAS Viya APIs

The prep work is complete. We can now send requests to SAS Viya and get some work done. Below is an example REST call that returns user preferences.

$ curl -k https://sasserver.demo.sas.com/preferences/ -H "Authorization: Bearer $ACCESS_TOKEN"
{"version":1,"links":[{"method":"GET","rel":"preferences","href":"/preferences/preferences/stpweb1","uri":"/preferences/preferences/stpweb1","type":"application/vnd.sas.collection","itemType":"application/vnd.sas.preference"},{"method":"PUT","rel":"createPreferences","href":"/preferences/preferences/stpweb1","uri":"/preferences/preferences/stpweb1","type":"application/vnd.sas.preference","responseType":"application/vnd.sas.collection","responseItemType":"application/vnd.sas.preference"},{"method":"POST","rel":"newPreferences","href":"/preferences/preferences/stpweb1","uri":"/preferences/preferences/stpweb1","type":"application/vnd.sas.collection","responseType":"application/vnd.sas.collection","itemType":"application/vnd.sas.preference","responseItemType":"application/vnd.sas.preference"},{"method":"DELETE","rel":"deletePreferences","href":"/preferences/preferences/stpweb1","uri":"/preferences/preferences/stpweb1","type":"application/vnd.sas.collection","itemType":"application/vnd.sas.preference"},{"method":"PUT","rel":"createPreference","href":"/preferences/preferences/stpweb1/{preferenceId}","uri":"/preferences/preferences/stpweb1/{preferenceId}","type":"application/vnd.sas.preference"}]}

Use the refresh token to get a new access token

To use the refresh token to get a new access token, simply send a cURL command like the following:

$ curl -k https://sasserver.demo.sas.com/SASLogon/oauth/token -H "Accept: application/json" \
     -H "Content-Type: application/x-www-form-urlencoded" -u "app:appclientsecret" -d "grant_type=refresh_token&refresh_token=$REFRESH_TOKEN"
{"access_token":"eyJhbGciOiJSUzI1NiIsImtpZCI6ImxlZ...","token_type":"bearer","refresh_token":"eyJhbGciOiJSUzI1NiIsImtpZCSjYxrrNRCF7h0oLhd0Y","expires_in":35999,"scope":"openid","jti":"a5c4456b5beb4493918c389cd5186f02"}

Note the access token is new, and the refresh token remains static. Use the new token for future REST calls. Make sure to replace the ACCESS_TOKEN variable with the new token. Also, the access token has a default life of ten hours before it expires. Most applications deal with expiring and refreshing tokens programmatically. If you wish to change the default expiry of an access token in SAS, make a configuration change in the JWT properties in SAS.

Get an access token using a password

The steps to obtain an access token with a password are the same as with the authorization code. I highlight the differences below, without repeating all the steps.
The process for accessing the ID Token and using it to get an access token for registering the client is the same as described earlier. The first difference when using password authentication is when registering the client. In the code below, notice the key authorized_grant_types has a value of password, not authorization code.

$ curl -k -X POST https://sasserver.demo.sas.com/SASLogon/oauth/clients -H "Content-Type: application/json" \
       -H "Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZ..." \
       -d '{
        "client_id": "myclientid", 
        "client_secret": "myclientsecret",
        "scope": ["openid", "group1"],
        "authorized_grant_types": ["password","refresh_token"],
        "redirect_uri": "urn:ietf:wg:oauth:2.0:oob"
        }'
{"scope":["openid","group1"],"client_id":"myclientid","resource_ids":["none"],"authorized_grant_types":["refresh_token","authorization_code"],"redirect_uri":["urn:ietf:wg:oauth:2.0:oob"],"autoapprove":[],"authorities":["uaa.none"],"lastModified":1547801596527,"required_user_groups":[]}

The client is now registered on the SAS Viya server. To get the access token, we send a command like we did when using the authorization code, just using the username and password.

curl -k https://sasserver.demo.sas.com/SASLogon/oauth/token \
     -H "Content-Type: application/x-www-form-urlencoded" -u "sas.cli:" -d "grant_type=password&username=sasdemo&password=mypassword"
{"access_token":"eyJhbGciOiJSUzI1NiIsImtpZCI6Imx...","token_type":"bearer","refresh_token":"eyJhbGciOiJSUzI1NiIsImtpZ...","expires_in":43199,"scope":"DataBuilders ApplicationAdministrators SASScoreUsers clients.read clients.secret uaa.resource openid PlanningAdministrators uaa.admin clients.admin EsriUsers scim.read SASAdministrators PlanningUsers clients.write scim.write","jti":"073bdcbc6dc94384bcf9b47dc8b7e479"}

From here, sending requests and refreshing the token steps are identical to the method explained in the authorization code example.

Final thoughts

At first, OAuth seems a little intimidating; however, after registering the client and creating the access and refresh tokens, the application will handle all authentication components . This process runs smoothly if you plan and make decisions up front. I hope this guide clears up any question you may have on securing your application with SAS. Please leave questions or comments below.

Authentication to SAS Viya: a couple of approaches was published on SAS Users.

1月 242019
 

As we're approaching the anniversary of Hans Rosling's passing, I fondly remember his spectacular graphical presentations comparing the wealth and health of nations around the world. He certainly raised the bar for data visualization, and his animated charts inspired me to work even harder to create similar visualizations! What better way [...]

The post Re-creating a Hans Rosling graph animation, with SAS! appeared first on SAS Learning Post.

1月 232019
 

Recently I was asked to explain the result of an ANOVA analysis that I posted to a statistical discussion forum. My program included some simulated data for an ANOVA model and a call to the GLM procedure to estimate the parameters. I was asked why the parameter estimates from PROC GLM did not match the parameter values that were specified in the simulation. The answer is that there are many ways to parameterize the categorical effects in a regression model. SAS regression procedures support many different parameterizations, and each parameterization leads to a different set of parameter estimates for the categorical effects. The GLM procedure uses the so-called GLM-parameterization of classification effects, which sets to zero the coefficient of the last level of a categorical variable. If your simulation specifies a non-zero value for that coefficient, the parameters that PROC GLM estimates are different from the parameters in the simulation.

An example makes this clearer. The following SAS DATA step simulates 300 observations for a categorical variable C with levels 'C1', 'C2', and 'C3' in equal proportions. The simulation creates a response variable, Y, based on the levels of the variable C. The GLM procedure estimates the parameters from the simulated data:

data Have;
call streaminit(1);
do i = 1 to 100;
   do C = 'C1', 'C2', 'C3';
      eps = rand("Normal", 0, 0.2);
      /* In simulation, parameters are Intercept=10, C1=8, C2=6, and C3=1 
         This is NOT the GLM parameterization. */
      Y = 10 + 8*(C='C1') + 6*(C='C2') + 1*(C='C3') + eps;  /* C='C1' is a 0/1 binary variable */
      output;
   end;
end;
keep C Y;
run;
 
proc glm data=Have plots=none;
   class C;
   model Y = C / SS3 solution;
   ods select ParameterEstimates;
quit;

The output from PROC GLM shows that the parameter estimates are very close to the following values: Intercept=11, C1=7, C2=5, and C3=0. Although these are not the parameter values that were specified in the simulation, these estimates make sense if you remember the following:

In other words, you can use the parameter values in the simulation to convert to the corresponding parameters for the GLM parameterization. In the following DATA step, the Y and Y2 variables contain exactly the same values, even though the formulas look different. The Y2 variable is simulated by using a GLM parameterization of the C variable:

data Have;
call streaminit(1);
refEffect = 1;
do i = 1 to 100;
   do C = 'C1', 'C2', 'C3';
      eps = rand("Normal", 0, 0.2);
      /* In simulation, parameters are Intercept=10, C1=8, C2=6, and C3=1 */
      Y  = 10 + 8*(C='C1') + 6*(C='C2') + 1*(C='C3') + eps;
      /* GLM parameterization for the same response: Intercept=11, C1=7, C2=5, C3=0 */
      Y2 = (10 + refEffect) + (8-refEffect)*(C='C1') + (6-refEffect)*(C='C2') + eps;
      diff = Y - Y2;         /* Diff = 0 when Y=Y2 */
      output;
   end;
end;
keep C Y Y2 diff;
run;
 
proc means data=Have;
   var Y Y2 diff;
run;

The output from PROC MEANS shows that the Y and Y2 variables are exactly equal. The coefficients for the Y2 variable are 11 (the intercept), 7, 5, and 0, which are the parameter values that are estimated by PROC GLM.

Of course, other parameterizations are possible. For example, you can create the simulation by using other parameterizations such as the EFFECT coding. (The EFFECT coding is the default coding in PROC LOGISTIC.) For the effect coding, parameter estimates of main effects indicate the difference of each level as compared to the average effect over all levels. The following statements show the effect coding for the variable Y3. The values of the Y3 variable are exactly the same as Y and Y2:

avgEffect = 5;   /* average effect for C is (8 + 6 + 1)/3 = 15/3 = 5 */
...
      /* EFFECT parameterization: Intercept=15, C1=3, C2=1, C3=0 */
      Y3 = 10 + avgEffect + (8-avgEffect)*(C='C1') + (6-avgEffect)*(C='C2') + eps;

In summary, when you write a simulation that includes categorical data, there are many equivalent ways to parameterize the categorical effects. When you use a regression procedure to analyze the simulated data, the procedure and simulation might use different parameterizations. If so, the estimates from the procedure might be quite different from the parameters in your simulation. This article demonstrates this fact by using the GLM parameterization and the EFFECT parameterization, which are two commonly used parameterizations in SAS. See the SAS/STAT documentation for additional details about the different parameterizations of classification variables in SAS.

The post Coding and simulating categorical variables in regression models appeared first on The DO Loop.

1月 232019
 

You’re probably already familiar with Leonid Batkhan from his popular blog right here on The Learning Post. In fact, he’s one of our most engaging authors, with thousands of views and hundreds of comments. Leonid is a true SAS Sensei. He has been at SAS for nearly 25 years and [...]

The post Secrets from a SAS Expert: An Interview with Leonid Batkhan appeared first on SAS Learning Post.