January 23, 2019
 

Recently I was asked to explain the result of an ANOVA analysis that I posted to a statistical discussion forum. My program included some simulated data for an ANOVA model and a call to the GLM procedure to estimate the parameters. I was asked why the parameter estimates from PROC GLM did not match the parameter values that were specified in the simulation. The answer is that there are many ways to parameterize the categorical effects in a regression model. SAS regression procedures support many different parameterizations, and each parameterization leads to a different set of parameter estimates for the categorical effects. The GLM procedure uses the so-called GLM-parameterization of classification effects, which sets to zero the coefficient of the last level of a categorical variable. If your simulation specifies a non-zero value for that coefficient, the parameters that PROC GLM estimates are different from the parameters in the simulation.

An example makes this clearer. The following SAS DATA step simulates 300 observations for a categorical variable C with levels 'C1', 'C2', and 'C3' in equal proportions. The simulation creates a response variable, Y, based on the levels of the variable C. The GLM procedure estimates the parameters from the simulated data:

data Have;
call streaminit(1);
do i = 1 to 100;
   do C = 'C1', 'C2', 'C3';
      eps = rand("Normal", 0, 0.2);
      /* In simulation, parameters are Intercept=10, C1=8, C2=6, and C3=1 
         This is NOT the GLM parameterization. */
      Y = 10 + 8*(C='C1') + 6*(C='C2') + 1*(C='C3') + eps;  /* C='C1' is a 0/1 binary variable */
      output;
   end;
end;
keep C Y;
run;
 
proc glm data=Have plots=none;
   class C;
   model Y = C / SS3 solution;
   ods select ParameterEstimates;
quit;

The output from PROC GLM shows that the parameter estimates are very close to the following values: Intercept=11, C1=7, C2=5, and C3=0. Although these are not the parameter values that were specified in the simulation, the estimates make sense if you remember that the GLM parameterization sets the coefficient of the last level (C3) to zero. The intercept therefore absorbs the effect of the reference level (10 + 1 = 11), and the coefficient of every other level becomes the difference from the reference level (8 - 1 = 7 and 6 - 1 = 5).

In other words, you can use the parameter values in the simulation to convert to the corresponding parameters for the GLM parameterization. In the following DATA step, the Y and Y2 variables contain exactly the same values, even though the formulas look different. The Y2 variable is simulated by using a GLM parameterization of the C variable:

data Have;
call streaminit(1);
refEffect = 1;
do i = 1 to 100;
   do C = 'C1', 'C2', 'C3';
      eps = rand("Normal", 0, 0.2);
      /* In simulation, parameters are Intercept=10, C1=8, C2=6, and C3=1 */
      Y  = 10 + 8*(C='C1') + 6*(C='C2') + 1*(C='C3') + eps;
      /* GLM parameterization for the same response: Intercept=11, C1=7, C2=5, C3=0 */
      Y2 = (10 + refEffect) + (8-refEffect)*(C='C1') + (6-refEffect)*(C='C2') + eps;
      diff = Y - Y2;         /* Diff = 0 when Y=Y2 */
      output;
   end;
end;
keep C Y Y2 diff;
run;
 
proc means data=Have;
   var Y Y2 diff;
run;

The output from PROC MEANS shows that the Y and Y2 variables are exactly equal. The coefficients for the Y2 variable are 11 (the intercept), 7, 5, and 0, which are the parameter values that are estimated by PROC GLM.

Of course, other parameterizations are possible. For example, you can create the simulation by using other parameterizations such as the EFFECT coding. (The EFFECT coding is the default coding in PROC LOGISTIC.) For the effect coding, parameter estimates of main effects indicate the difference of each level as compared to the average effect over all levels. The following statements show the effect coding for the variable Y3. The values of the Y3 variable are exactly the same as Y and Y2:

avgEffect = 5;   /* average effect for C is (8 + 6 + 1)/3 = 15/3 = 5 */
...
      /* EFFECT parameterization: Intercept=15, C1=3, C2=1 (the implied C3 effect is -(3+1) = -4) */
      Y3 = (10 + avgEffect) + (8-avgEffect)*((C='C1') - (C='C3'))
                            + (6-avgEffect)*((C='C2') - (C='C3')) + eps;
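
If you want a procedure to estimate the parameters under the EFFECT coding directly, you can use a procedure whose CLASS statement supports the PARAM= option. The following sketch is not part of the original analysis; PROC GLMSELECT is used here only because its CLASS statement accepts PARAM=EFFECT and it can fit the full model without performing any selection:

proc glmselect data=Have;
   class C / param=effect;          /* request the EFFECT parameterization */
   model Y = C / selection=none;    /* fit the full model; perform no selection */
run;

The ParameterEstimates table should show values close to Intercept=15, C1=3, and C2=1, in agreement with the hand computation above.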

In summary, when you write a simulation that includes categorical data, there are many equivalent ways to parameterize the categorical effects. When you use a regression procedure to analyze the simulated data, the procedure and simulation might use different parameterizations. If so, the estimates from the procedure might be quite different from the parameters in your simulation. This article demonstrates this fact by using the GLM parameterization and the EFFECT parameterization, which are two commonly used parameterizations in SAS. See the SAS/STAT documentation for additional details about the different parameterizations of classification variables in SAS.

The post Coding and simulating categorical variables in regression models appeared first on The DO Loop.

January 23, 2019
 

You’re probably already familiar with Leonid Batkhan from his popular blog right here on The Learning Post. In fact, he’s one of our most engaging authors, with thousands of views and hundreds of comments. Leonid is a true SAS Sensei. He has been at SAS for nearly 25 years and [...]

The post Secrets from a SAS Expert: An Interview with Leonid Batkhan appeared first on SAS Learning Post.

January 22, 2019
 

You'll notice several changes in SAS Grid Manager with the release of SAS 9.4M6.
 
For the first time, you can get a grid solution entirely written by SAS, with no dependence on any external or third-party grid provider.
 
This post gives a brief architectural description of the new SAS grid provider, including all major components and their role. The “traditional” SAS Grid Manager for Platform has seen some architectural changes too; they are detailed at the bottom.

A new kid in town

SAS Grid Manager is a complex offering, composed of different layers of software. The following picture shows a very simple, high-level view. SAS Infrastructure here represents the SAS platform, for example the SAS Metadata Server, the SAS middle tier, and so on. These components service the execution of computing tasks, whether a batch process, a SAS Workspace Server session, or something else. In a grid environment, these computing tasks are distributed across multiple hosts and are orchestrated, managed, and coordinated by a layer of software that we can generically call grid infrastructure or grid middleware. That’s basically a set of lower-level components that sit between the computing processes and the operating system.

Since its initial design more than a decade ago, the SAS Grid Manager offering has always been able to leverage different grid infrastructure providers, thanks to an abstraction layer that makes them transparent to end-user client software.
 
Our strategic grid middleware has been, since the beginning, Platform Suite for SAS, provided by Platform Computing (now part of IBM).
 
A few years ago, with the release of SAS 9.4M3, SAS started delivering an additional grid provider, SAS Grid Manager for Hadoop, tailored to grid environments co-located with Hadoop.
 
The latest version, SAS 9.4M6, opens up choices with the introduction of a new, totally SAS-developed grid provider. What’s its name? Well, since it’s SAS’s grid provider, we use the simplest one: SAS Grid Manager. To avoid confusion, what we used to call SAS Grid Manager has been renamed SAS Grid Manager for Platform.

The reasons for a choice

The SAS-developed provider for SAS Grid Manager:
 
• Is streamlined specifically for SAS workloads.
 
• Is easier to install (simply use the SAS Deployment Wizard (SDW)) and easier to administer.
 
• Extends workload management and scheduling capabilities into other technologies, such as
 
    o Third-party compute workloads, such as open-source code.
 
    o SAS Viya (in a future release).
 
• Reduces dependence of SAS Grid Manager on third party technologies.

So what are the new components?

The SAS-developed provider for SAS Grid Manager includes:
 
• SAS Workload Orchestrator
 
• SAS Job Flow Scheduler
 
• SAS Workload Orchestrator Web Interface
 
• SAS Workload Orchestrator Administration Utility

 
These are the new components, delivered together with others also available in previous releases and with other providers, such as the Grid Manager Thin Client Utility (a.k.a. SASGSUB), the SAS Grid Manager Agent Plug-in, etc. Let’s see these new components in more detail.

SAS Workload Orchestrator

The SAS Workload Orchestrator is your grid controller. Just as Platform's LSF is for SAS Grid Manager for Platform, it:
 
• Dispatches jobs.
 
• Monitors hosts and spreads the load.
 
• Is installed and runs on all machines in the cluster (but is not required on dedicated Metadata Server or Middle-tier hosts).
 
A notable difference, when compared to LSF, is that the SAS Workload Orchestrator is a single daemon, with its configuration stored in a single text file in JSON format.
 
Redeveloped for modern workloads, the new grid provider can schedule more types of jobs, beyond just SAS jobs. In fact, you can use it to schedule ANY job, including open source code running in Python, R or any other language.

SAS Job Flow Scheduler

SAS Job Flow Scheduler is the flow scheduler for the grid (just as Platform Process Manager is with SAS Grid Manager for Platform):
 
• It passes commands to the SAS Workload Orchestrator at certain times or events.
 
• Flows can be used to run many tasks in parallel on the grid.
 
• Flows can also be used to determine the sequence of events for multiple related jobs.
 
• It determines only when jobs are submitted to the grid; they might not run immediately if the right conditions are not met (hosts too busy, closed, etc.).
 
The SAS Job Flow Scheduler provides flow orchestration of batch jobs. It uses operating system services to trigger the flow to handle impersonation of the user when it is time for the flow to start execution.
 
A flow can be built using the SAS Management Console or other SAS products such as SAS Data Integration Studio.
 
SAS Job Flow Scheduler includes the ability to run a flow immediately (a.k.a. “Run Now”), or to schedule the flow for some future time/recurrence.
 
SAS Job Flow Scheduler consists of different components that cooperate to execute flows:
 
SASJFS service is the main running service that handles the requests to schedule a flow. It runs on the middle tier as a dedicated thread in the Web Infrastructure Platform, deployed inside sasserver1. It uses services provided by the data store (SAS Content Server) and Metadata Server to read/write the configuration options of the scheduler, the content of the scheduled flows and the history records of executed flows.
 
Launcher acts as a gateway between SASJFS and OS Trigger. It is a daemon that accepts HTTP connections using basic authentication (username/password) to start the OS Trigger program as the scheduled user. This avoids the requirement to save end-users’ passwords in the grid provider, for both Windows and Unix.
 
OS Trigger is a stand-alone Java program that uses the services of the operating system to handle the triggering of the scheduled flow by providing a call to the Job Flow Orchestrator. On Windows, it uses the Windows Task Scheduler; on UNIX, it uses cron or crontab.
 
Job Flow Orchestrator is a stand-alone program that manages the flow orchestration. It is invoked by the OS scheduler (as configured by the OS Trigger) with the id of the flow to execute, then it connects to the SASJFS service to read the flow information, the job execution configuration and the credentials to connect to the grid. With that information, it sends jobs for execution to the SAS Workload Orchestrator. Finally, it is responsible for providing the history record for the flow back to SASJFS service.

Additional components

SAS Grid Manager provides additional components to administer the SAS Workload Orchestrator:
 
• SAS Workload Orchestrator Web Interface
 
• SAS Workload Orchestrator Administration Utility
 
Both can monitor jobs, queues, hosts, services, and logs, and configure hosts, queues, services, user groups, and user resources.
 
The SAS Workload Orchestrator Web Interface is a web application hosted by the SAS Workload Orchestrator process on the grid master host; it can be proxied by the SAS Web Server to always point to the current master in case of failover.

The SAS Workload Orchestrator Administration Utility is an administration command-line interface; it has a similar syntax to SAS Viya CLIs and is located in the directory /Lev1/Applications/GridAdminUtility. A sample invocation to list all running jobs is:
sas-grid-cli show-jobs --state RUNNING

What has not changed

Describing what has not changed with the new grid provider is an easy task: everything else.
 
Obviously, this is a very generic statement, so let’s call out a few noteworthy items that have not changed:
 
• User experience is unchanged. SAS programming interfaces to the grid have not changed, apart from the lower-level libraries that connect to the new provider. As such, you still have the traditional SAS grid control server, SAS grid nodes, the SAS thin client (a.k.a. SASGSUB), and the full SAS client (SAS Display Manager). Users can submit jobs or start grid-launched sessions from SAS Enterprise Guide, SAS Studio, SAS Enterprise Miner, etc.; a minimal SASGSUB sketch appears after this list.
 
• A directory shared among all grid hosts is still required to share the grid configuration files.
 
• A high-performance, clustered file system for the SASWORK area and for data libraries is mandatory to guarantee satisfactory performance.
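
To make the first point concrete, submitting a program in batch looks the same as with the previous provider. Here is a minimal SASGSUB sketch (the program name is a hypothetical placeholder, and SASGSUB must already be configured for your grid environment):

sasgsub -gridsubmitpgm my_analysis.sas -gridwait

The -GRIDWAIT option simply makes the utility wait until the grid job completes before returning.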

What about SAS Grid Manager for Platform?

The traditional grid provider, now rebranded as SAS Grid Manager for Platform, has seen some changes as well with SAS 9.4M6:
 
• The existing management interface, SAS Grid Manager for Platform Module for SAS Environment Manager, has been completely re-designed. The user interface has completely changed, although the functions provided remain the same.
 
• Grid Management Services (GMS) has not been updated to work with the latest release of LSF. Therefore, the SAS Grid Manager plug-in for SAS Management Console is no longer supported. However, the plug-in is still included with SAS 9.4M6 in case you want to upgrade to SAS 9.4M6 without also upgrading Platform Suite for SAS.

You can find more comprehensive information in these doc pages:
 
• What’s New in SAS Grid Manager 9.4
 
• Grid Computing for SAS Using SAS Grid Manager (Part 2) section of Grid Computing in SAS 9.4

Native scheduler, new types of workloads, and more: introducing the new SAS Grid Manager was published on SAS Users.

January 22, 2019
 

The first person to live to 150 has already been born. When I first heard this controversial idea, popularized by biomedical gerontologist Aubrey de Grey of the SENS Research Foundation, it gave me chills. It’s amazing to think we could have 50 more years to spend with loved ones and [...]

Analytics and life beyond age 100 was published on SAS Voices by Cameron McLauchlin

January 21, 2019
 

In machine learning and other model building techniques, it is common to partition a large data set into three segments: training, validation, and testing.

  • Training data is used to fit each model.
  • Validation data is a random sample that is used for model selection. These data are used to select a model from among candidates by balancing the tradeoff between model complexity and generality: a complex model fits the training data well, but it might not fit the validation data. These data are potentially used several times to build the final model.
  • Test data is a hold-out sample that is used to assess the final model and estimate its prediction error. It is used only at the end of the model-building process.

I've seen many questions about how to use SAS to split data into training, validation, and testing data. (A common variation uses only training and validation.) There are basically two approaches to partitioning data:

  • Specify the proportion of observations that you want in each role. For each observation, randomly assign it to one of the three roles. The number of observations assigned to each role is then a multinomial random variable with expected value N p_k, where N is the number of observations and p_k (k = 1, 2, 3) is the probability of assigning an observation to the kth role. For this method, if you change the random number seed, you will usually get a different number of observations in each role because the group sizes are random variables.
  • Specify the number of observations that you want in each role and randomly allocate that many observations.

This article uses the SAS DATA step to accomplish the first task and uses PROC SURVEYSELECT to accomplish the second. I also discuss how to split data into only two roles: training and validation.

It is worth mentioning that many model-selection routines in SAS enable you to split data by using the PARTITION statement. Examples include the "SELECT" procedures (GLMSELECT, QUANTSELECT, HPGENSELECT...) and the ADAPTIVEREG procedure. However, be aware that these procedures might ignore observations that have missing values for the variables in the model.
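
For example, the following sketch shows the PARTITION statement in PROC GLMSELECT. The model and variables are illustrative only; they are not part of this article's analysis:

proc glmselect data=Sashelp.Heart plots=none seed=123;
   partition fraction(validate=0.3 test=0.1);   /* the remaining 60% is used for training */
   model Cholesterol = Weight Height Diastolic Systolic / selection=stepwise;
run;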

Random partition into training, validation, and testing data

When you partition data into various roles, you can choose to add an indicator variable, or you can physically create three separate data sets. The following DATA step creates an indicator variable with values "Train", "Validate", and "Test". The specified proportions are 60% training, 30% validation, and 10% testing. You can change the values of the SAS macro variables to use your own proportions. The RAND("Table") function is an efficient way to generate the indicator variable.

data Have;             /* the data to partition  */
   set Sashelp.Heart;  /* for example, use Heart data */
run;
 
/* If propTrain + propValid = 1, then no observation is assigned to testing */
%let propTrain = 0.6;         /* proportion of training data */
%let propValid = 0.3;         /* proportion of validation data */
%let propTest = %sysevalf(1 - &propTrain - &propValid); /* remaining are used for testing */
 
/* Randomly assign each observation to a role; _ROLE_ is indicator variable */
data RandOut;
   array p[2] _temporary_ (&propTrain, &propValid);
   array labels[3] $ _temporary_ ("Train", "Validate", "Test");
   set Have;
   call streaminit(123);         /* set random number seed */
   /* RAND("table") returns 1, 2, or 3 with specified probabilities */
   _k = rand("Table", of p[*]); 
   _ROLE_ = labels[_k];          /* use _ROLE_ = _k if you prefer numerical categories */
   drop _k;
run;
 
proc freq data=RandOut order=freq;
   tables _ROLE_ / nocum;
run;

As shown by the output of PROC FREQ, the proportion of observations in each role is approximately the same as the specified proportions. For this random number seed, the proportions are 59.1%, 30.4%, and 10.6%. If you change the random number seed, you will get a different assignment of observations to roles and also a different proportion of data in each role.

The observant reader will notice that there are only two elements in the array of probabilities (p) that is used in the RAND("Table") call. This is intentional. The documentation for the RAND("Table") function states that if the sum of the specified probabilities is less than 1, then the quantity 1 - Σ p_i is used as the probability of the last event. By specifying only two values in the p array, the same program works for partitioning the data into two pieces (training and validation) or three pieces (and testing).
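
For example, here is a minimal sketch of a two-way split that reuses the structure of the previous DATA step. Because the two probabilities sum to 1, the "Test" label is assigned with probability 0:

%let propTrain = 0.7;         /* proportion of training data */
%let propValid = 0.3;         /* proportion of validation data */
 
data RandOut2;
   array p[2] _temporary_ (&propTrain, &propValid);
   array labels[3] $ _temporary_ ("Train", "Validate", "Test");
   set Have;
   call streaminit(123);         /* set random number seed */
   _k = rand("Table", of p[*]);  /* returns 3 with probability 0 */
   _ROLE_ = labels[_k];
   drop _k;
run;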

Create random training, validation, and testing data sets

Some practitioners choose to create three separate data sets instead of adding an indicator variable to the existing data. The computation is exactly the same, but you can use the OUTPUT statement to direct each observation to one of three output data sets, as follows:

/* create a separate data set for each role */
data Train Validate Test;
   array p[2] _temporary_ (&propTrain, &propValid);
   set Have;
   call streaminit(123);         /* set random number seed */
   /* RAND("table") returns 1, 2, or 3 with specified probabilities */
   _k = rand("Table", of p[*]);
   if      _k = 1 then output Train;
   else if _k = 2 then output Validate;
   else                output Test;
   drop _k;
run;
NOTE: The data set WORK.TRAIN has 3078 observations and 17 variables.
NOTE: The data set WORK.VALIDATE has 1581 observations and 17 variables.
NOTE: The data set WORK.TEST has 550 observations and 17 variables.

This example uses the same random number seed as the previous example. Consequently, the three output data sets have the same observations as are indicated by the partition variable (_ROLE_) in the previous example.

Specify the number of observations in each role

Instead of specifying a proportion, you might want to specify the exact number of observations that are randomly assigned to each role. The advantage of this technique is that changing the random number seed does not change the number of observations in each role (although it does change which observations are assigned to each role). The SURVEYSELECT procedure supports the GROUPS= option, which you can use to specify the number of observations.

The GROUPS= option requires that you specify integer values. For this example, the original data contains 5209 observations but 60% of 5209 is not an integer. Therefore, the following DATA step computes the number of observations as ROUND(N p) for the training and validation sets. These integer values are put into macro variables and used in the GROUPS= option on the PROC SURVEYSELECT statement. You can, of course, skip the DATA step and specify your own values such as groups=(3200, 1600, 409).

/* Specify the sizes of the train/validation/test data from proportions */
data _null_;
   if 0 then set sashelp.heart nobs=N;  /* N = total number of obs */
   nTrain = round(N * &propTrain);      /* size of training data */
   nValid = round(N * &propValid);      /* size of validation data */
   call symputx("nTrain", nTrain);      /* put integer into macro variable */
   call symputx("nValid", nValid);
   call symputx("nTest", N - nTrain - nValid);
run;
 
/* randomly assign observations to three groups */
proc surveyselect data=Have seed=12345 out=SSOut
     groups=(&nTrain, &nValid, &nTest); /* if no Test data, use  GROUPS=(&nTrain, &nValid) */
run;
 
proc freq data=SSOut order=freq;
   tables GroupID / nocum;           /* GroupID is name of indicator variable */
run;

The training, validation, and testing groups contain 3125, 1563, and 521 observations, respectively. These numbers are the closest integer approximations to 60%, 30% and 10% of the 5209 observations. Notice that the output from the SURVEYSELECT procedure uses the values 1, 2, and 3 for the GroupID indicator variable. You can use PROC FORMAT to associate those numbers with labels such as "Train", "Validate", and "Test".
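
For example, here is a minimal sketch of such a format (the format name is arbitrary):

proc format;
   value RoleFmt 1 = "Train"  2 = "Validate"  3 = "Test";
run;
 
proc freq data=SSOut order=freq;
   tables GroupID / nocum;
   format GroupID RoleFmt.;      /* display labels instead of 1, 2, 3 */
run;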

In summary, there are two basic programming techniques for randomly partitioning data into training, validation, and testing roles. One way uses the SAS DATA step to randomly assign each observation to a role according to proportions that you specify. If you use this technique, the size of each group is random. The other way is to use PROC SURVEYSELECT to randomly assign observations to roles. If you use this technique, you must specify the number of observations in each group.

The post Create training, validation, and test data sets in SAS appeared first on The DO Loop.

January 18, 2019
 
Interested in learning about what's new in a software release? What if you want to know whether anything has changed in a SAS product? Or whether there are steps that you need to take before you upgrade to a new software release?
 
The online SAS product documentation for SAS® 9.4 and SAS® Viya® contains new sections that provide these answers. The following sections are usually listed on the left-hand side of the table of contents for the online Help document:
 
“What’s New in Base SAS: Details”
“What’s New in SAS 9.4 and SAS Viya”
“SAS Guide to Software Updates and Product Changes”
Note: To make the product-change information easier to find, this section was retitled for SAS® 9.4M6 and SAS® Viya® 3.4. For documentation about previous releases of SAS 9.4, this section is called “SAS Guide to Software Updates.” The information about product changes is included in a subsection called “Product Details and Requirements.” Although the title is different in newer documentation, the content remains the same.

What's available in each section?

• “What’s New” contains information about new features. For example, in SAS 9.4M6, “What’s New” discusses a new ODS destination for Word and a new procedure called PROC SGPIE.
 
• The “SAS Guide to Software Updates and Product Changes” includes the following subsections:
      o A section on software updates, which is for customers who are upgrading to a new release. (FYI: A software update is any modification that SAS provides for existing software. An upgrade is a new release of SAS. A maintenance release is a collection of updates that are applied to the currently installed software.)
      o Another subsection discusses product details and requirements. In it, you will find information about values or settings that have changed from one release to the next. For example, for SAS 9.4M6, the default style for HTML5 output changed from HTMLBlue to HTMLEncore. Another example is for SAS® 9.4M0, when the default value of the YEARCUTOFF= system option changed from 1920 to 1926. (A quick way to check the value in your current session is shown after this list.)
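
As an aside, you can verify the value that your own session uses with PROC OPTIONS:

proc options option=yearcutoff;
run;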

Other links to these resources

In the “What’s New” documentation, there is a link in each section to the corresponding product topic in “SAS Guide to Software Updates and Product Changes.”

For example, when you scroll through this SAS Studio page, you see references both to various “What’s New” pages for different versions of SAS Studio and to “SAS Guide to Software Updates and Product Changes.”

In “What's New in Base SAS: Details,” you can search by the software and maintenance release to find new features. Beginning with SAS 9.4, new features for maintenance releases are introduced using the SAS 9.4Mx notation. For example, in the search box on the page, you can enter 9.4M6.

Final thoughts

With these new online Help sections, you can find information quickly about new features of the current SAS release, as well as what has changed from the previous release. As always, we welcome your feedback and suggestions for improving the documentation.

Special thanks

Elizabeth Downes and Marie Dexter in SAS Documentation Development were very willing to make the requested wording changes in the documentation. They also contributed to the content of this article. Thanks to both for their time and effort!

Navigating SAS documentation to find out about new, modified, and updated features was published on SAS Users.

January 18, 2019
 

They say "The Sun never sets on the SAS Empire" ... and it's true! There are SAS users all over the world, and SAS output & results could be in any language. Therefore, if you're a SAS programmer, you might need to know how to create SAS graphs with international [...]

The post Using International characters in a SAS graph appeared first on SAS Learning Post.

January 18, 2019
 

It seems that everyone knows about GitHub -- the service that hosts many popular open source code projects. The underpinnings of GitHub are based on Git, which is itself an open-source implementation of a source management system. Git was originally built to help developers collaborate on Linux (yet another famous open source project) -- but now we all use it for all types of projects.

There are other free and for-pay services that use Git, like Bitbucket and GitLab. And there are countless products that embed Git for its versioning and collaboration features. In 2014, SAS developers added built-in Git support for SAS Enterprise Guide.

Since then, Git (and GitHub) have grown to play an even larger role in data science operations and DevOps in general. Automation is a key component for production work -- including check-in, check-out, commit, and rollback. In response, SAS has added Git integration to more SAS products, including:

  • the Base SAS programming language, via a collection of SAS functions.
  • SAS Data Integration Studio, via a new source control plugin
  • SAS Studio (experimental in v3.8)

You can use this Git integration with any service that supports Git (GitHub, GitLab, etc.), or with your own private Git servers and even just local Git repositories.

SAS functions for Git

Git infrastructure and functions were added to SAS 9.4 Maintenance 6. The new SAS functions all have the helpful prefix of "GITFN_" (signifying "Git fun!", I assume). Here's a partial list:

GITFN_CLONE: Clones a Git repository (for example, from GitHub) into a directory on the SAS server.
GITFN_COMMIT: Commits staged files to the local repository.
GITFN_DIFF: Returns the number of diffs between two commits in the local repository and creates a diff record object for the local repository.
GITFN_PUSH: Pushes the committed files in the local repository to the remote repository.
GITFN_NEW_BRANCH: Creates a Git branch.

 

The function names make sense if you're familiar with Git lingo. If you're new to Git, you'll need to learn the terms that go with the commands: clone, repo, commit, stage, blame, and more. This handbook provided by GitHub is friendly and easy to read. (Or you can start with this xkcd comic.)

You can call these functions from a DATA step. The following example prints the version reported by GITFN_VERSION and then clones a repository from GitHub into a local directory:

data _null_;
 version = gitfn_version();
 put version=;             
 
 rc = gitfn_clone("https://github.com/sascommunities/sas-dummy-blog/",
   "c:\Projects\sas-dummy-blog");
 put rc=;
run;

In one line, this function fetches an entire collection of code files from your source control system. Here's a more concrete example that fetches the code to a work space, then runs a program from that repository. (This is safe for you to try -- here's the code that will be pulled/run. It even works from SAS University Edition.)

options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/sas-dummy-blog;
libname repo "&repoPath.";
libname repo clear;
 
/* Fetch latest code from GitHub */
data _null_;
 rc = gitfn_clone("https://github.com/sascommunities/sas-dummy-blog/",
   "&repoPath.");
 put rc=;
run;
 
/* run the code in this session */
%include "&repoPath./rng_example_thanos.sas";

You could use the other GITFN functions to stage and commit the output from your SAS jobs, including log files, data sets, ODS results -- whatever you need to keep and version.
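
As a loose sketch of that idea, the DATA step below stages, commits, and pushes a file. Caution: the staging function name and all argument lists shown here are assumptions for illustration, not verified syntax; consult the DATA step function documentation for your release before relying on them:

data _null_;
   repo = "&repoPath.";   /* assumes the repository cloned in the earlier example */
   /* ASSUMPTION: the staging function name and its arguments are placeholders */
   rc = gitfn_idx_add(repo, "results.log", "results.log");
   /* ASSUMPTION: illustrative argument lists, not documented signatures */
   rc = gitfn_commit(repo, "HEAD", "user name", "user@example.com", "Save SAS job output");
   rc = gitfn_push(repo, "gituser", "gitpassword");
run;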

Using Git in SAS Data Integration Studio

SAS Data Integration Studio has supported source control integration for many years, but only for CVS and Subversion (still in wide use, but they aren't media darlings like GitHub). By popular request, the latest version of SAS Data Integration Studio adds support for a Git plug-in.

Example of Git in SAS DI Studio

See the documentation for details:

Read more about setup and use in the article available here, part of our "Custom Tasks Tuesday" series.

Using Git in SAS Enterprise Guide

This isn't new, but I'll include it for completeness. SAS Enterprise Guide supports built-in Git repository support for SAS programs that are stored in your project file. You can use this feature without having to set up any external Git servers or repositories. Also, SAS Enterprise Guide can recognize when you reference programs that are managed in an external Git repository. This integration enables features like program history, compare differences, commit, and more. Read more and see a demo of this in action here.

program history

If you use SAS Enterprise Guide to edit and run SAS programs that are managed in an external Git repository, here's an important tip. Change your project file properties to "Use paths relative to the project for programs and importable files." You'll find this checkbox in File->Project Properties.

With this enabled, you can store the project file (EGP) and any SAS programs together in Git, organized into subfolders if you want. As long as these are cloned into a similar structure on any system you use, the file paths will resolve automatically.

The post Using built-in Git operations in SAS appeared first on The SAS Dummy.