Jim Harris discusses a key role of the data engineer – protecting sensitive personal data.
The post Data engineers: Key role in data privacy and protection appeared first on The Data Roundtable.
Recently I was asked to explain the result of an ANOVA analysis that I posted to a statistical discussion forum. My program included some simulated data for an ANOVA model and a call to the GLM procedure to estimate the parameters. I was asked why the parameter estimates from PROC GLM did not match the parameter values that were specified in the simulation. The answer is that there are many ways to parameterize the categorical effects in a regression model. SAS regression procedures support many different parameterizations, and each parameterization leads to a different set of parameter estimates for the categorical effects. The GLM procedure uses the so-called GLM-parameterization of classification effects, which sets to zero the coefficient of the last level of a categorical variable. If your simulation specifies a non-zero value for that coefficient, the parameters that PROC GLM estimates are different from the parameters in the simulation.
An example makes this clearer. The following SAS DATA step simulates 300 observations for a categorical variable C with levels 'C1', 'C2', and 'C3' in equal proportions. The simulation creates a response variable, Y, based on the levels of the variable C. The GLM procedure estimates the parameters from the simulated data:
data Have;
call streaminit(1);
do i = 1 to 100;
   do C = 'C1', 'C2', 'C3';
      eps = rand("Normal", 0, 0.2);
      /* In the simulation, the parameters are Intercept=10, C1=8, C2=6, and C3=1.
         This is NOT the GLM parameterization. */
      Y = 10 + 8*(C='C1') + 6*(C='C2') + 1*(C='C3') + eps;  /* C='C1' is a 0/1 binary variable */
      output;
   end;
end;
keep C Y;
run;

proc glm data=Have plots=none;
   class C;
   model Y = C / SS3 solution;
   ods select ParameterEstimates;
quit;
The output from PROC GLM shows that the parameter estimates are very close to the following values: Intercept=11, C1=7, C2=5, and C3=0. Although these are not the parameter values that were specified in the simulation, these estimates make sense if you remember that the GLM parameterization sets the coefficient of the last level ('C3') to zero. The C3 effect (1) is absorbed into the intercept (10 + 1 = 11), and the remaining coefficients are measured relative to the C3 level (8 - 1 = 7 and 6 - 1 = 5).
In other words, you can use the parameter values in the simulation to convert to the corresponding parameters for the GLM parameterization. In the following DATA step, the Y and Y2 variables contain exactly the same values, even though the formulas look different. The Y2 variable is simulated by using a GLM parameterization of the C variable:
data Have;
call streaminit(1);
refEffect = 1;
do i = 1 to 100;
   do C = 'C1', 'C2', 'C3';
      eps = rand("Normal", 0, 0.2);
      /* In the simulation, the parameters are Intercept=10, C1=8, C2=6, and C3=1 */
      Y = 10 + 8*(C='C1') + 6*(C='C2') + 1*(C='C3') + eps;
      /* GLM parameterization for the same response: Intercept=11, C1=7, C2=5, C3=0 */
      Y2 = (10 + refEffect) + (8-refEffect)*(C='C1') + (6-refEffect)*(C='C2') + eps;
      diff = Y - Y2;   /* diff = 0 when Y=Y2 */
      output;
   end;
end;
keep C Y Y2 diff;
run;

proc means data=Have;
   var Y Y2 diff;
run;
The output from PROC MEANS shows that the Y and Y2 variables are exactly equal. The coefficients for the Y2 variable are 11 (the intercept), 7, 5, and 0, which are the parameter values that are estimated by PROC GLM.
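You can check this identity outside of SAS as well. The following Python sketch (an illustration, not part of the original program) evaluates the mean response under both parameterizations for each level of C and confirms that they agree:

```python
# Simulation parameterization: Intercept=10, C1=8, C2=6, C3=1
# GLM parameterization:        Intercept=11, C1=7, C2=5, C3=0
ref = 1  # coefficient of the last level, C3, in the simulation

def mean_y(level):
    """Mean response under the simulation parameterization."""
    return 10 + 8*(level == "C1") + 6*(level == "C2") + 1*(level == "C3")

def mean_y2(level):
    """Mean response under the GLM parameterization:
       the C3 effect is absorbed into the intercept."""
    return (10 + ref) + (8 - ref)*(level == "C1") + (6 - ref)*(level == "C2")

for level in ("C1", "C2", "C3"):
    assert mean_y(level) == mean_y2(level)   # identical for every level
```

The identity holds for every level because subtracting the reference effect from each coefficient and adding it to the intercept leaves the predicted value unchanged.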
Of course, other parameterizations are possible. For example, you can create the simulation by using other parameterizations such as the EFFECT coding. (The EFFECT coding is the default coding in PROC LOGISTIC.) For the effect coding, parameter estimates of main effects indicate the difference of each level as compared to the average effect over all levels. The following statements show the effect coding for the variable Y3. The values of the Y3 variable are exactly the same as Y and Y2:
avgEffect = 5;   /* average effect for C is (8 + 6 + 1)/3 = 15/3 = 5 */
...
/* EFFECT parameterization: Intercept=15, C1=3, C2=1, C3=-4 */
Y3 = 10 + avgEffect + (8-avgEffect)*(C='C1') + (6-avgEffect)*(C='C2')
                    + (1-avgEffect)*(C='C3') + eps;
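As with the GLM parameterization, you can verify the effect-coding arithmetic outside of SAS. This Python sketch (illustrative only) checks that the effect-coded formula, with the C3 coefficient equal to 1 - 5 = -4, reproduces the original mean response and that the level coefficients sum to zero, as effect coding requires:

```python
# EFFECT parameterization: Intercept = 10 + 5 = 15,
# C1 = 8-5 = 3, C2 = 6-5 = 1, C3 = 1-5 = -4 (coefficients sum to zero)
avg = (8 + 6 + 1) / 3   # average effect = 5

def mean_y(level):
    """Mean response under the simulation parameterization."""
    return 10 + 8*(level == "C1") + 6*(level == "C2") + 1*(level == "C3")

def mean_y3(level):
    """Mean response under the EFFECT parameterization."""
    return (10 + avg) + (8 - avg)*(level == "C1") \
                      + (6 - avg)*(level == "C2") \
                      + (1 - avg)*(level == "C3")

for level in ("C1", "C2", "C3"):
    assert mean_y(level) == mean_y3(level)   # identical for every level

# The hallmark of effect coding: level effects sum to zero
assert (8 - avg) + (6 - avg) + (1 - avg) == 0
```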
In summary, when you write a simulation that includes categorical data, there are many equivalent ways to parameterize the categorical effects. When you use a regression procedure to analyze the simulated data, the procedure and simulation might use different parameterizations. If so, the estimates from the procedure might be quite different from the parameters in your simulation. This article demonstrates this fact by using the GLM parameterization and the EFFECT parameterization, which are two commonly used parameterizations in SAS. See the SAS/STAT documentation for additional details about the different parameterizations of classification variables in SAS.
The post Coding and simulating categorical variables in regression models appeared first on The DO Loop.
You’re probably already familiar with Leonid Batkhan from his popular blog right here on The Learning Post. In fact, he’s one of our most engaging authors, with thousands of views and hundreds of comments. Leonid is a true SAS Sensei. He has been at SAS for nearly 25 years and [...]
The post Secrets from a SAS Expert: An Interview with Leonid Batkhan appeared first on SAS Learning Post.
You'll notice several changes in SAS Grid Manager with the release of SAS 9.4M6.
For the first time, you can get a grid solution entirely written by SAS, with no dependence on any external or third-party grid provider.
This post gives a brief architectural description of the new SAS grid provider, including all major components and their role. The “traditional” SAS Grid Manager for Platform has seen some architectural changes too; they are detailed at the bottom.
SAS Grid Manager is a complex offering, composed of different layers of software. The following picture shows a very simple, high-level view. SAS Infrastructure here represents the SAS Platform, for example the SAS Metadata Server, SAS Middle Tier, etc. These components service the execution of computing tasks, whether a batch process, a SAS Workspace Server session, or something else. In a grid environment these computing tasks are distributed across multiple hosts and are orchestrated, managed, and coordinated by a layer of software that we can generically call Grid Infrastructure or Grid Middleware. That's basically a set of lower-level components that sit between the computing processes and the operating system.
Since its initial design more than a decade ago, the SAS Grid Manager offering has always been able to leverage different grid infrastructure providers, thanks to an abstraction layer that makes them transparent to end-user client software.
Our strategic grid middleware has been, since the beginning, Platform Suite for SAS, provided by Platform Computing (now part of IBM).
A few years ago, with the release of SAS 9.4M3, SAS started delivering an additional grid provider, SAS Grid Manager for Hadoop, tailored to grid environments co-located with Hadoop.
The latest version, SAS 9.4M6, opens up choices with the introduction of a new, totally SAS-developed grid provider. What’s its name? Well, since it’s SAS’s grid provider, we use the simplest one: SAS Grid Manager. To avoid confusion, what we used to call SAS Grid Manager has been renamed SAS Grid Manager for Platform.
The SAS-developed provider for SAS Grid Manager:
• Is streamlined specifically for SAS workloads.
• Is easier to install (simply use the SAS Deployment Wizard (SDW)) and administer.
• Extends workload management and scheduling capabilities into other technologies, such as
o Third-party compute workloads like open source.
o SAS Viya (in a future release).
• Reduces dependence of SAS Grid Manager on third party technologies.
The SAS-developed provider for SAS Grid Manager includes:
• SAS Workload Orchestrator
• SAS Job Flow Scheduler
• SAS Workload Orchestrator Web Interface
• SAS Workload Orchestrator Administration Utility
These are the new components, delivered together with others also available in previous releases and with other providers, such as the Grid Manager Thin Client Utility (a.k.a. SASGSUB), the SAS Grid Manager Agent Plug-in, etc. Let’s see these new components in more detail.
The SAS Workload Orchestrator is your grid controller – just like Platform's LSF is with SAS Grid Manager for Platform, it:
• Dispatches jobs.
• Monitors hosts and spreads the load.
• Is installed and runs on all machines in the cluster (but is not required on dedicated Metadata Server or Middle-tier hosts).
A notable difference, when compared to LSF, is that the SAS Workload Orchestrator is a single daemon, with its configuration stored in a single text file in JSON format.
Redeveloped for modern workloads, the new grid provider can schedule more types of jobs, beyond just SAS jobs. In fact, you can use it to schedule ANY job, including open source code running in Python, R or any other language.
SAS Job Flow Scheduler is the flow scheduler for the grid (just as Platform Process Manager is with SAS Grid Manager for Platform):
• It passes commands to the SAS Workload Orchestrator at certain times or events.
• Flows can be used to run many tasks in parallel on the grid.
• Flows can also be used to determine the sequence of events for multiple related jobs.
• It only determines when jobs are submitted to the grid; the jobs may not run immediately if the right conditions are not met (for example, hosts are too busy or closed).
The SAS Job Flow Scheduler provides flow orchestration of batch jobs. It uses operating system services to trigger the flow and to impersonate the user when it is time for the flow to start execution.
A flow can be built using the SAS Management Console or other SAS products such as SAS Data Integration Studio.
SAS Job Flow Scheduler includes the ability to run a flow immediately (a.k.a. “Run Now”), or to schedule the flow for some future time/recurrence.
SAS Job Flow Scheduler consists of different components that cooperate to execute flows:
• SASJFS service is the main running service that handles the requests to schedule a flow. It runs on the middle tier as a dedicated thread in the Web Infrastructure Platform, deployed inside sasserver1. It uses services provided by the data store (SAS Content Server) and Metadata Server to read/write the configuration options of the scheduler, the content of the scheduled flows and the history records of executed flows.
• Launcher acts as a gateway between SASJFS and OS Trigger. It is a daemon that accepts HTTP connections using basic authentication (username/password) to start the OS Trigger program as the scheduled user. This avoids the requirement to save end-users’ passwords in the grid provider, for both Windows and Unix.
• OS Trigger is a stand-alone Java program that uses the services of the operating system to handle the triggering of the scheduled flow by providing a call to the Job Flow Orchestrator. On Windows, it uses the Windows Task Scheduler; on UNIX, it uses cron or crontab.
• Job Flow Orchestrator is a stand-alone program that manages the flow orchestration. It is invoked by the OS scheduler (as configured by the OS Trigger) with the id of the flow to execute, then it connects to the SASJFS service to read the flow information, the job execution configuration and the credentials to connect to the grid. With that information, it sends jobs for execution to the SAS Workload Orchestrator. Finally, it is responsible for providing the history record for the flow back to SASJFS service.
SAS Grid Manager provides additional components to administer the SAS Workload Orchestrator:
• SAS Workload Orchestrator Web Interface
• SAS Workload Orchestrator Administration Utility
Both can monitor jobs, queues, hosts, services, and logs, and configure hosts, queues, services, user groups, and user resources.
The SAS Workload Orchestrator Web Interface is a web application hosted by the SAS Workload Orchestrator process on the grid master host; it can be proxied by the SAS Web Server to always point to the current master in case of failover.
The SAS Workload Orchestrator Administration Utility is an administration command-line interface; it has a similar syntax to SAS Viya CLIs and is located in the directory /Lev1/Applications/GridAdminUtility. A sample invocation to list all running jobs is:
sas-grid-cli show-jobs --state RUNNING
Describing what has not changed with the new grid provider is an easy task: everything else.
Obviously, this is a very generic statement, so let’s call out a few noteworthy items that have not changed:
• User experience is unchanged. SAS programming interfaces to grid have not changed, apart from the lower-level libraries to connect to the new provider. As such, you still have the traditional SAS grid control server, SAS grid nodes, SAS thin client (aka SASGSUB) and the full SAS client (SAS Display Manager). Users can submit jobs or start grid-launched sessions from SAS Enterprise Guide, SAS Studio, SAS Enterprise Miner, etc.
• A directory shared among all grid hosts is still required to share the grid configuration files.
• A high-performance, clustered file system for the SASWORK area and for data libraries is mandatory to guarantee satisfactory performance.
The traditional grid provider, now rebranded as SAS Grid Manager for Platform, has seen some changes as well with SAS 9.4M6:
• The existing management interface, SAS Grid Manager for Platform Module for SAS Environment Manager, has been completely redesigned: the user interface is new, although the functions provided remain the same.
• Grid Management Services (GMS) is not updated to work with the latest release of LSF. Therefore, the SAS Grid Manager plug-in for SAS Management Console is no longer supported. However, the plug-in is included with SAS 9.4M6 if you want to upgrade to SAS 9.4M6 without also upgrading Platform Suite for SAS.
The first person to live to 150 has already been born. When I first heard this controversial idea, popularized by biomedical gerontologist Aubrey de Grey of the SENS Research Foundation, it gave me chills. It’s amazing to think we could have 50 more years to spend with loved ones and [...]
In machine learning and other model building techniques, it is common to partition a large data set into three segments: training, validation, and testing.
I've seen many questions about how to use SAS to split data into training, validation, and testing data. (A common variation uses only training and validation.) There are basically two approaches to partitioning data:
• Randomly assign each observation to a role according to proportions that you specify. The size of each group is then random.
• Randomly assign a specified number of observations to each role, so that the group sizes are fixed.
This article uses the SAS DATA step to accomplish the first task and uses PROC SURVEYSELECT to accomplish the second. I also discuss how to split data into only two roles: training and validation.
It is worth mentioning that many model-selection routines in SAS enable you to split data by using the PARTITION statement. Examples include the "SELECT" procedures (GLMSELECT, QUANTSELECT, HPGENSELECT...) and the ADAPTIVEREG procedure. However, be aware that these procedures might ignore observations that have missing values for the variables in the model.
When you partition data into various roles, you can choose to add an indicator variable, or you can physically create three separate data sets. The following DATA step creates an indicator variable with values "Train", "Validate", and "Test". The specified proportions are 60% training, 30% validation, and 10% testing. You can change the values of the SAS macro variables to use your own proportions. The RAND("Table") function is an efficient way to generate the indicator variable.
data Have;             /* the data to partition */
   set Sashelp.Heart;  /* for example, use Heart data */
run;

/* If propTrain + propValid = 1, then no observation is assigned to testing */
%let propTrain = 0.6;  /* proportion of training data */
%let propValid = 0.3;  /* proportion of validation data */
%let propTest = %sysevalf(1 - &propTrain - &propValid); /* remaining are used for testing */

/* Randomly assign each observation to a role; _ROLE_ is indicator variable */
data RandOut;
   array p _temporary_ (&propTrain, &propValid);
   array labels $ _temporary_ ("Train", "Validate", "Test");
   set Have;
   call streaminit(123);   /* set random number seed */
   /* RAND("Table") returns 1, 2, or 3 with specified probabilities */
   _k = rand("Table", of p[*]);
   _ROLE_ = labels[_k];    /* use _ROLE_ = _k if you prefer numerical categories */
   drop _k;
run;

proc freq data=RandOut order=freq;
   tables _ROLE_ / nocum;
run;
As shown by the output of PROC FREQ, the proportion of observations in each role is approximately the same as the specified proportions. For this random number seed, the proportions are 59.1%, 30.4%, and 10.6%. If you change the random number seed, you will get a different assignment of observations to roles and also a different proportion of data in each role.
The observant reader will notice that there are only two elements in the array of probabilities (p) that is used in the RAND("Table") call. This is intentional. The documentation for the RAND("Table") function states that if the sum of the specified probabilities is less than 1, then the remaining quantity (1 minus the sum of the probabilities) is used as the probability of the last event. By specifying only two values in the p array, the same program works for partitioning the data into two pieces (training and validation) or three pieces (training, validation, and testing).
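The "leftover probability" convention is easy to emulate in any language. The following Python sketch (an illustration of the idea, not SAS code) mimics a table-distribution sampler that takes only two probabilities and assigns the remaining probability to a third, implicit category:

```python
import random

def rand_table(probs, rng):
    """Mimic the behavior of RAND("Table", ...): return a 1-based
    category index. If sum(probs) < 1, the leftover 1 - sum(probs)
    is the probability of one extra, final category."""
    u = rng.random()
    cum = 0.0
    for i, p in enumerate(probs, start=1):
        cum += p
        if u < cum:
            return i
    return len(probs) + 1   # the implicit last category

# Assign 10000 observations with p = (0.6, 0.3) and implicit 0.1
rng = random.Random(123)
labels = ["Train", "Validate", "Test"]
counts = {lab: 0 for lab in labels}
for _ in range(10000):
    counts[labels[rand_table((0.6, 0.3), rng) - 1]] += 1

# Observed proportions should be close to the targets (0.6, 0.3, 0.1)
assert abs(counts["Train"]/10000 - 0.6) < 0.05
assert abs(counts["Validate"]/10000 - 0.3) < 0.05
assert abs(counts["Test"]/10000 - 0.1) < 0.05
```

Because only the first two probabilities are specified, the same sampler partitions data into two groups (when the probabilities sum to 1) or three groups (when they sum to less than 1).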
Some practitioners choose to create three separate data sets instead of adding an indicator variable to the existing data. The computation is exactly the same, but you can use the OUTPUT statement to direct each observation to one of three output data sets, as follows:
/* create a separate data set for each role */
data Train Validate Test;
   array p _temporary_ (&propTrain, &propValid);
   set Have;
   call streaminit(123);   /* set random number seed */
   /* RAND("Table") returns 1, 2, or 3 with specified probabilities */
   _k = rand("Table", of p[*]);
   if      _k = 1 then output Train;
   else if _k = 2 then output Validate;
   else                output Test;
   drop _k;
run;
NOTE: The data set WORK.TRAIN has 3078 observations and 17 variables.
NOTE: The data set WORK.VALIDATE has 1581 observations and 17 variables.
NOTE: The data set WORK.TEST has 550 observations and 17 variables.
This example uses the same random number seed as the previous example. Consequently, the three output data sets have the same observations as are indicated by the partition variable (_ROLE_) in the previous example.
Instead of specifying a proportion, you might want to specify the exact number of observations that are randomly assigned to each role. The advantage of this technique is that changing the random number seed does not change the number of observations in each role (although it does change which observations are assigned to each role). The SURVEYSELECT procedure supports the GROUPS= option, which you can use to specify the number of observations.
The GROUPS= option requires that you specify integer values. For this example, the original data contains 5209 observations but 60% of 5209 is not an integer. Therefore, the following DATA step computes the number of observations as ROUND(N p) for the training and validation sets. These integer values are put into macro variables and used in the GROUPS= option on the PROC SURVEYSELECT statement. You can, of course, skip the DATA step and specify your own values such as groups=(3200, 1600, 409).
/* Specify the sizes of the train/validation/test data from proportions */
data _null_;
   if 0 then set sashelp.heart nobs=N;   /* N = total number of obs */
   nTrain = round(N * &propTrain);       /* size of training data   */
   nValid = round(N * &propValid);       /* size of validation data */
   call symputx("nTrain", nTrain);       /* put integer into macro variable */
   call symputx("nValid", nValid);
   call symputx("nTest", N - nTrain - nValid);
run;

/* randomly assign observations to three groups */
proc surveyselect data=Have seed=12345 out=SSOut
     groups=(&nTrain, &nValid, &nTest);  /* if no Test data, use GROUPS=(&nTrain, &nValid) */
run;

proc freq data=SSOut order=freq;
   tables GroupID / nocum;   /* GroupID is name of indicator variable */
run;
The training, validation, and testing groups contain 3125, 1563, and 521 observations, respectively. These numbers are the closest integer approximations to 60%, 30% and 10% of the 5209 observations. Notice that the output from the SURVEYSELECT procedure uses the values 1, 2, and 3 for the GroupID indicator variable. You can use PROC FORMAT to associate those numbers with labels such as "Train", "Validate", and "Test".
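The integer arithmetic and the fixed-size assignment are easy to reproduce. This Python sketch (illustrative only, not the SURVEYSELECT algorithm itself) computes the group sizes from the proportions and then shuffles row indices and slices them into fixed-size groups, analogous to the GROUPS= option:

```python
import random

N = 5209                            # observations in Sashelp.Heart
propTrain, propValid = 0.6, 0.3

nTrain = round(N * propTrain)       # 3125
nValid = round(N * propValid)       # 1563
nTest  = N - nTrain - nValid        # 521 (the remainder)

# Fixed-size random assignment: shuffle the indices, then slice
idx = list(range(N))
random.Random(12345).shuffle(idx)
train = idx[:nTrain]
valid = idx[nTrain:nTrain + nValid]
test  = idx[nTrain + nValid:]

assert (len(train), len(valid), len(test)) == (3125, 1563, 521)
```

With this approach, changing the seed changes which observations land in each group but never the group sizes, which is exactly the property described above.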
In summary, there are two basic programming techniques for randomly partitioning data into training, validation, and testing roles. One way uses the SAS DATA step to randomly assign each observation to a role according to proportions that you specify. If you use this technique, the size of each group is random. The other way is to use PROC SURVEYSELECT to randomly assign observations to roles. If you use this technique, you must specify the number of observations in each group.
The post Create training, validation, and test data sets in SAS appeared first on The DO Loop.
• “What’s New” contains information about new features. For example, in SAS 9.4M6, “What’s New” discusses a new ODS destination for Word and a new procedure called PROC SGPIE.
• The “SAS Guide to Software Updates and Product Changes” includes the following subsections:
o A section on software updates, which is for customers who are upgrading to a new release. (FYI: A software update is any modification that SAS provides for existing software. An upgrade is a new release of SAS. A maintenance release is a collection of updates that are applied to the currently installed software.)
o Another subsection discusses product details and requirements. In it, you will find information about values or settings that have changed from one release to the next. For example, for SAS 9.4M6, the default style for HTML5 output changed from HTMLBlue to HTMLEncore. Another example is for SAS® 9.4M0, when the YEARCUTOFF= system option changed from 1920 to 1926.
In the “What’s New” documentation, there is a link in each section to the corresponding product topic in “SAS Guide to Software Updates and Product Changes.”
For example, when you scroll through this SAS Studio page, you see references both to various “What’s New” pages for different versions of SAS Studio and to “SAS Guide to Software Updates and Product Changes.”
In “What's New in Base SAS: Details,” you can search by the software and maintenance release to find new features. Beginning with SAS 9.4, new features for maintenance releases are introduced using the SAS 9.4Mx notation. For example, in the search box on the page, you can enter 9.4M6.
With these new online Help sections, you can find information quickly about new features of the current SAS release, as well as what has changed from the previous release. As always, we welcome your feedback and suggestions for improving the documentation.
Elizabeth Downes and Marie Dexter in SAS Documentation Development were very willing to make the requested wording changes in the documentation. They also contributed to the content of this article. Thanks to both for their time and effort!
They say "The Sun never sets on the SAS Empire" ... and it's true! There are SAS users all over the world, and SAS output & results could be in any language. Therefore, if you're a SAS programmer, you might need to know how to create SAS graphs with international [...]
It seems that everyone knows about GitHub -- the service that hosts many popular open source code projects. The underpinnings of GitHub are based on Git, which is itself an open-source implementation of a source management system. Git was originally built to help developers collaborate on Linux (yet another famous open source project) -- but now we all use it for all types of projects.
There are other free and for-pay services that use Git, like Bitbucket and GitLab. And there are countless products that embed Git for its versioning and collaboration features. In 2014, SAS developers added built-in Git support for SAS Enterprise Guide.
Since then, Git (and GitHub) have grown to play an even larger role in data science operations and DevOps in general. Automation is a key component for production work -- including check-in, check-out, commit, and rollback. In response, SAS has added Git integration to more SAS products, described below.
You can use this Git integration with any service that supports Git (GitHub, GitLab, etc.), or with your own private Git servers and even just local Git repositories.
Git infrastructure and functions were added to SAS 9.4 Maintenance 6. The new SAS functions all have the helpful prefix of "GITFN_" (signifying "Git fun!", I assume). Here's a partial list:
• GITFN_CLONE – Clones a Git repository (for example, from GitHub) into a directory on the SAS server.
• GITFN_COMMIT – Commits staged files to the local repository.
• GITFN_DIFF – Returns the number of diffs between two commits in the local repository and creates a diff record object for the local repository.
• GITFN_PUSH – Pushes the committed files in the local repository to the remote repository.
• GITFN_NEW_BRANCH – Creates a Git branch.
The function names make sense if you're familiar with Git lingo. If you're new to Git, you'll need to learn the terms that go with the commands: clone, repo, commit, stage, blame, and more. This handbook provided by GitHub is friendly and easy to read. (Or you can start with this xkcd comic.)
data _null_;
   version = gitfn_version();
   put version=;
   rc = gitfn_clone("https://github.com/sascommunities/sas-dummy-blog/",
                    "c:\Projects\sas-dummy-blog");
   put rc=;
run;
In one line, this function fetches an entire collection of code files from your source control system. Here's a more concrete example that fetches the code to a work space, then runs a program from that repository. (This is safe for you to try -- here's the code that will be pulled/run. It even works from SAS University Edition.)
options dlcreatedir;
%let repoPath = %sysfunc(getoption(WORK))/sas-dummy-blog;
libname repo "&repoPath.";
libname repo clear;

/* Fetch latest code from GitHub */
data _null_;
   rc = gitfn_clone("https://github.com/sascommunities/sas-dummy-blog/",
                    "&repoPath.");
   put rc=;
run;

/* run the code in this session */
%include "&repoPath./rng_example_thanos.sas";
You could use the other GITFN functions to stage and commit the output from your SAS jobs, including log files, data sets, ODS results -- whatever you need to keep and version.
SAS Data Integration Studio has supported source control integration for many years, but only for CVS and Subversion (still in wide use, but they aren't media darlings like GitHub). By popular request, the latest version of SAS Data Integration Studio adds support for a Git plug-in.
See the documentation for details:
Using Git in SAS Enterprise Guide
Read more about setup and use in the article available here, as part of our "Custom Tasks Tuesday" series.
This isn't new, but I'll include it for completeness. SAS Enterprise Guide supports built-in Git repository support for SAS programs that are stored in your project file. You can use this feature without having to set up any external Git servers or repositories. Also, SAS Enterprise Guide can recognize when you reference programs that are managed in an external Git repository. This integration enables features like program history, compare differences, commit, and more. Read more and see a demo of this in action here.
If you use SAS Enterprise Guide to edit and run SAS programs that are managed in an external Git repository, here's an important tip. Change your project file properties to "Use paths relative to the project for programs and importable files." You'll find this checkbox in File->Project Properties.
With this enabled, you can store the project file (EGP) and any SAS programs together in Git, organized into subfolders if you want. As long as these are cloned into a similar structure on any system you use, the file paths will resolve automatically.
Guest blogger Khari Villela says data lakes are not a cure-all – they're just one part of a comprehensive, strategic architecture.
The post Are data lakes stuck in the black and white world of the big data infomercial? appeared first on The Data Roundtable.