October 12, 2017
 

With SAS Data Management, you can set up SAS Data Remediation to manage and correct data issues. SAS Data Remediation allows user- or role-based access to data exceptions.

When a data issue is discovered, it can be sent automatically or manually to a remediation queue, where designated users can correct it.

Let’s look at how to set up a remediation service and how to send issue records to Data Remediation.

Registering the remediation service.

To register a remediation service in SAS Data Remediation, we go to the Data Remediation Administrator and select “Add New Client Application.”

Under Properties we supply an ID, which can be the name of the remediation service as long as it is unique, and a Display name, which is the name shown in the Remediation UI.

Under the Subject Area tab, we can register different subject categories for this remediation service. When calling the remediation service, we can categorize different remediation issues by setting different subject areas. We can, for example, use the Subject Area to point to different data quality dimensions like Completeness, Uniqueness, Validity, Accuracy, and Consistency.

Under the Issue Types tab, we can register issue categories. This enables us to categorize the different remediation issues. For example, we can point to the affected part of the record, such as Name, Address, or Phone Number.

At Task Templates/Select Templates, we can set a workflow to be used for each issue type. You can design your own workflow using SAS Workflow Studio, or you can use a prepared workflow that comes with Data Remediation. Make sure that the desired workflow is loaded onto the Workflow Server so that it can be linked to the Data Remediation service. Workflows are not mandatory in SAS Data Remediation, but they improve the efficiency of the remediation process.

Saving the remediation service makes it available to be called.

Sending issues to Data Remediation.

When you process data and have identified issues that you want to send to Data Remediation, you can either call Data Remediation directly from the job where you process the data, or store the issue records in a table first and then, in a second step, create remediation records via a Data Management job.

To send records to Data Remediation, you call the Remediation REST API from the HTTP Request node in a Data Management job.

Remediation REST API

The REST API expects a JSON structure supplying all required information:

{
	"application": "mandatory",
	"subjectArea": "mandatory",
	"name": "mandatory",
	"description": "",
	"userDefinedFieldLabels": {
		"1": "",
		"2": "",
		"3": ""
	},
	"topics": [{
		"url": "",
		"name": "",
		"userDefinedFields": {
			"1": "",
			"2": "",
			"3": ""
		},
		"key": "",
		"issues": [{
			"name": "mandatory",
			"importance": "",
			"note": "",
			"assignee": {
				"name": ""
			},
			"workflowName": "",
			"dueDate": "",
			"status": ""
		}]
	}]
}

 

JSON structure description: The application element is the ID of the registered remediation service, subjectArea is one of the subject areas registered for that service, and name is a name for the package of issues being created. Each entry in topics identifies an affected record via its key and an optional url that points back to the source system, and each entry in issues describes one data problem for that record, optionally with an assignee, a workflow, a due date, and a status. Elements marked "mandatory" above must be supplied; the remaining elements are optional.

In a Data Management job, you can create the JSON structure in an Expression node and use field substitution to pass in the necessary values from the issue records. The expression code could look like this:

REM_APPLICATION= "Customer Record"
REM_SUBJECT_AREA= "Completeness"
REM_PACKAGE_NAME= "Data Correction"
REM_PACKAGE_DESCRIPTION= "Mon-Result: " &formatdate(today(),"DD MM YY") 
REM_URL= "http://myserver/Sourcesys/#ID=" &record_id
REM_ITEM_NAME= "Mobile phone number missing"
REM_FIELDLABEL_1= "Source System"
REM_FIELD_1= "CRM"
REM_FIELDLABEL_2= "Record ID"
REM_FIELD_2= record_id
REM_FIELDLABEL_3= "-"
REM_FIELD_3= ""
REM_KEY= record_id
REM_ISSUE_NAME= "Phone Number"
REM_IMPORTANCE= "high"
REM_ISSUE_NOTE= "Violated data quality rule phone: 4711"
REM_ASSIGNEE= "Ben"
REM_WORKFLOW= "Customer Tag"
REM_DUE_DATE= "2018-11-01"
REM_STATUS= "open"
 
JSON_REQUEST= '
{
  "application":"' &REM_APPLICATION &'",
  "subjectArea":"' &REM_SUBJECT_AREA &'",
  "name":"' &REM_PACKAGE_NAME &'",
  "description":"' &REM_PACKAGE_DESCRIPTION &'",
  "userDefinedFieldLabels": {
    "1":"' &REM_FIELDLABEL_1 &'",
    "2":"' &REM_FIELDLABEL_2 &'",
    "3":"' &REM_FIELDLABEL_3 &'"
  },
  "topics": [{
    "url":"' &REM_URL &'",
    "name":"' &REM_ITEM_NAME &'",
    "userDefinedFields": {
      "1":"' &REM_FIELD_1 &'",
      "2":"' &REM_FIELD_2 &'",
      "3":"' &REM_FIELD_3 &'"
    },
    "key":"' &REM_KEY &'",
    "issues": [{
      "name":"' &REM_ISSUE_NAME &'",
      "importance":"' &REM_IMPORTANCE &'",
      "note":"' &REM_ISSUE_NOTE &'",
      "assignee": {
        "name":"' &REM_ASSIGNEE &'"
      },
      "workflowName":"' &REM_WORKFLOW &'",
      "dueDate":"' &REM_DUE_DATE &'",
      "status":"' &REM_STATUS &'"
    }]
  }]
}'

 

Tip: You could also write a global function to generate the JSON structure.

After creating the JSON structure, you can invoke the web service to create remediation records. In the HTTP Request node, you call the web service as follows:

Address:  http://[server]:[port]/SASDataRemediation/rest/groups
Method: POST
Input field: The variable containing the JSON structure, i.e. JSON_REQUEST.
Output field: A field to take the output from the web service. You can use the New button to create a field and set its size to 1000.
Under Security… you can set a defined user and password to access Data Remediation.
In the HTTP Request node’s advanced settings, set the WSCP_HTTP_CONTENT_TYPE option to application/json.
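
Outside of a Data Management job, you can exercise the same REST endpoint from Base SAS, which is handy for testing the service. The following is only a sketch: the server name, port, and credentials are placeholders, and the JSON body is a minimal example that contains just the mandatory elements.

/* Sketch only: post a minimal JSON request to the Data Remediation REST API.
   Server, port, user name, and password are hypothetical placeholders. */
filename req temp;
filename resp temp;

data _null_;                       /* write a minimal JSON request body */
   file req;
   put '{"application":"Customer Record","subjectArea":"Completeness",';
   put ' "name":"Data Correction",';
   put ' "topics":[{"key":"4711","name":"Mobile phone number missing",';
   put '            "issues":[{"name":"Phone Number"}]}]}';
run;

proc http url="http://myserver:8080/SASDataRemediation/rest/groups"
          method="POST"
          in=req out=resp
          ct="application/json"    /* content type, as noted above */
          webusername="remuser"    /* hypothetical credentials */
          webpassword="rempass";
run;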


You can now execute the Data Management job to create the remediation records in SAS Data Remediation.

Improving data quality through SAS Data Remediation was published on SAS Users.

October 11, 2017
 

The article "Fisher's transformation of the correlation coefficient" featured a Monte Carlo simulation that generated sample correlations from bivariate normal data. The simulation used three steps:

  1. Simulate B samples of size N from a bivariate normal distribution with correlation ρ.
  2. Use PROC CORR to compute the sample correlation matrix for each of the B samples.
  3. Use the DATA step to extract the off-diagonal elements from the correlation matrices.

After the three steps, you obtain a distribution of B sample correlation coefficients that approximates the sampling distribution of the Pearson correlation coefficient for bivariate normal data.
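
For reference, here is one way those three steps could be coded. This is a sketch rather than the original program: the data set names and the random-number seed are arbitrary, and the parameters match the Wishart program that follows (rho = 0.8, N = 20, B = 2500).

/* Step 1: simulate B samples of size N from bivariate normal data */
data Bivariate(keep=SampleID x y);
   call streaminit(54321);            /* arbitrary seed */
   rho = 0.8;
   do SampleID = 1 to 2500;           /* B samples */
      do i = 1 to 20;                 /* N obs per sample */
         x = rand("Normal");
         y = rho*x + sqrt(1 - rho**2)*rand("Normal");
         output;
      end;
   end;
run;

/* Step 2: compute the sample correlation matrix for each sample */
proc corr data=Bivariate noprint outp=CorrOut;
   by SampleID;
   var x y;
run;

/* Step 3: extract the off-diagonal element of each correlation matrix */
data CorrThreeStep(keep=Corr);
   set CorrOut(where=(_TYPE_="CORR" & _NAME_="x"));
   Corr = y;
run;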

There is a simpler way to simulate the correlation estimates: You can directly simulate from the Wishart distribution. Each draw from the Wishart distribution is a sample covariance matrix for a multivariate normal sample of size N. If you convert that covariance matrix to a correlation matrix, you can immediately extract the off-diagonal elements, as shown in the following SAS/IML statements:

%let rho = 0.8;           /* correlation for bivariate normal distribution */
%let N = 20;              /* sample size */
%let NumSamples = 2500;   /* number of simulated samples */
 
/* generate sample correlation coefficients by using Wishart distribution */
proc iml;
call randseed(12345);
NumSamples = &NumSamples;
DF = &N - 1;              /* X ~ N obs from MVN(0, Sigma) */
Sigma = {1     &rho,      /* covariance for MVN samples */
         &rho   1  };
S = RandWishart(NumSamples, DF, Sigma); /* each row is 2x2 matrix */
Corr = j(NumSamples, 1);  /* allocate vector for correlation estimates */
do i = 1 to nrow(S);      /* convert to correlation; extract off-diagonal */
   Corr[i] = cov2corr( shape(S[i,], 2, 2) )[1,2];
end;

You can create a comparative histogram of the sample correlation coefficients. The histogram at the top of the panel displays the distribution of the simulated correlation coefficients from the three-step method; the bottom histogram displays the distribution of correlation coefficients that are generated from the Wishart distribution.
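
If you want to reproduce the comparative histogram, the following sketch shows one way to do it. It continues the PROC IML session above by writing the Wishart-based estimates to a data set, and it assumes that the three-step estimates are in a data set named CorrThreeStep, as in the earlier sketch.

/* continue the PROC IML session: save the Wishart-based estimates */
create WishartCorr var {"Corr"};
append;
close WishartCorr;
quit;

data AllCorr;                     /* stack the two sets of estimates */
   set CorrThreeStep(in=a) WishartCorr;
   length Method $ 10;
   Method = ifc(a, "Three-step", "Wishart");
run;

proc sgpanel data=AllCorr;        /* comparative histogram */
   panelby Method / rows=2;
   histogram Corr;
run;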

Visually, the histograms appear to be similar. You can use PROC NPAR1WAY to run various hypothesis tests that compare the distributions; none of the tests reject the hypothesis that the two distributions are equivalent.

If you'd like to see the complete analysis, you can download the SAS program that runs both simulations and compares the resulting distributions.

Although the Wishart distribution is more efficient for this simulation, recall that it assumes that the underlying data are multivariate normal. In contrast, the three-step simulation is more general: it can be used to generate correlation coefficients for any data distribution. So although the three-step simulation is not necessary for multivariate normal data, it is still an important technique to keep in your simulation toolbox.

The post Simulate correlations by using the Wishart distribution appeared first on The DO Loop.

October 10, 2017
 

Remember when 100MB was large?

SAS 9.4 Maintenance 5 includes new support for reading and writing GZIP files directly. GZIP files, usually found with a .gz file extension, are a different format than ZIP files. Although both are forms of compressed files, a GZIP file is usually a compressed copy of a single file, whereas a ZIP file is an "archive" -- a collection of files in a compressed virtual folder. GZIP tools are built into Unix/Linux platforms and are commonly used to save space when storing large text-based files that you're not ready to part with: log files, csv files, and more. The algorithm used to compress GZIP files performs especially well with text files, although you can technically GZIP any file that you want.

I've written extensively about using FILENAME ZIP to read and write ZIP archives with SAS. The latest version of SAS extends the FILENAME ZIP method with a GZIP option, so you can now assign a fileref to a GZIP file like this:

filename my_gz ZIP "path-to-file/compressedfile.txt.gz" GZIP;

Here's an example that creates a compressed version of a log file:

filename source "C:\Logs\SEGuide_log.10168.txt";
filename tozip ZIP "C:\Logs\SEGuide_log.10168.txt.gz" GZIP;
 
data _null_;   
   infile source;
   file tozip ;
   input;
   put _infile_ ;
run;

In my test here, the result represents a significant size difference, with the compressed file occupying just 14% of the space.


To "re-inflate" the compressed file, we can perform the opposite operation. (I added the ENCODING option here because I know my log file was UTF-8 encoded.)

filename target "C:\LogsExpanded\SEGuide_log.10168.txt" encoding='utf-8';
filename fromzip ZIP "C:\Logs\SEGuide_log.10168.txt.gz" GZIP;
 
data _null_;   
   infile fromzip;
   file target ;
   input;
   put _infile_ ;
run;

You don't have to explicitly expand a compressed text file in order to read it with SAS. You can use the GZIP method to read and parse a .gz file directly, similar to the zcat command that you might be familiar with from the Unix shell:

filename fromzip ZIP "C:\Logs\SEGuide_log.10168.txt.gz" GZIP;
data logdata;   
   infile fromzip; /* read directly from compressed file */
   input  date : yymmdd10. time : anydttme. ;
   format date date9. time timeampm.;
run;

If your file is in a binary format such as a SAS data set (sas7bdat) or Excel (XLS or XLSX), you probably will need to expand the file completely before reading it as data. These files are read using special drivers that don't process the bytes sequentially, so you need the entire file available on disk.
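
If you do need to inflate a binary file before reading it, a byte-by-byte copy in binary mode (RECFM=N) works. Here is a sketch; the gzipped SAS data set in this example is a hypothetical file name.

filename fromgz ZIP "C:\Data\class.sas7bdat.gz" GZIP;
filename target "C:\Data\class.sas7bdat";
 
data _null_;
   infile fromgz recfm=n;   /* binary mode: no line boundaries */
   file target recfm=n;
   input filebyte $char1. @@;
   put filebyte $char1. @@;
run;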

Note: Because each GZIP file represents just one compressed file, the MEMBER= option doesn't apply. When dealing with ZIP file archives that contain multiple files, you could use the MEMBER= option on FILENAME ZIP to address a specific file that you want. My recent example about FINFO and file details relies heavily on that approach. However, the GZIP option and MEMBER= options are mutually exclusive. In that way, it's much simpler...just like its Unix shell equivalent.


* ZIP drive image by © Raimond Spekking / CC BY-SA 4.0 (via Wikimedia Commons)

The post Reading and writing GZIP files with SAS appeared first on The SAS Dummy.

October 10, 2017
 

Industry-University relationships can be very complex, yet at Regent’s University London (RUL), we are very proud to report on the success of our first joint SAS-RUL project. In June 2016, as part of SAS Academy activities at Regent’s, I started a research collaboration with Mike Turner and Neil Griffin from [...]

The post Inspiring students to re-think marketing for the digital economy appeared first on SAS Analytics U Blog.

October 9, 2017
 

My new SAS Press book “An Introduction to SAS Visual Analytics” (written in collaboration with Tricia Aanderud and Rob Collum) covers all of the different aspects of SAS® Visual Analytics, including how to develop reports, load data, and handle administration. Below is an example of the types of tips that you can find [...]

The post An introduction to SAS Visual Analytics: the Parallel Period function of the Derived Item calculations appeared first on SAS Learning Post.

October 9, 2017
 

Every day 91 Americans die from opioid abuse, every nine seconds a student drops out of high school and every 47 seconds a child is confirmed to be abused or neglected. These are sobering statistics that show the challenge our government leaders face to help those in need. While there [...]

Inspirational, emotional government leadership event focuses on challenges, solutions was published on SAS Voices by Paula Henderson

October 9, 2017
 

Correlations between variables are typically displayed in a matrix. Because the rows and columns of the correlation matrix follow the order of the variables, it is difficult to find the largest and smallest correlations, which is why analysts sometimes use colors to visualize the correlation matrix. Another visualization option is the pairwise correlation plot, which orders pairs of variables by their correlations.

Neither graph addresses a related problem: for each variable, which other variables are strongly correlated with it? Which are weakly correlated? In SAS, the CORR procedure supports a little-known option that answers that question. You can use the RANK option in the PROC CORR statement to order the correlations for each variable (independently) according to the magnitude of the correlations.

Notice that the RANK option does not compute "rank correlation." You can compute rank correlation by using the SPEARMAN option.
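
For example, a minimal call that computes Spearman rank correlations for a few variables from the Sashelp.Heart data (which is analyzed below) looks like this:

proc corr data=sashelp.Heart SPEARMAN noprob;
   var Weight MRW Smoking;
run;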

Ordering correlations by size

Consider the Sashelp.Heart data set, which contains data for 5209 patients who enrolled in the Framingham Heart Study. You might want to know which variables are highly correlated with weight, or smoking, or blood pressure, to name a few examples. The following call to PROC CORR uses the RANK option to order each row of correlations in the output:

/* Note: The MRW variable is similar to the body-mass index (BMI). */
proc corr data=sashelp.Heart RANK noprob;
   var Height Weight MRW Smoking Diastolic Systolic Cholesterol;
run;
Correlations ordered by magnitude

The output orders each row according to the magnitude of the correlations. For example, look at the row for the Weight variable. Scanning across the row, you can see that the variables that are the most strongly correlated with Weight are MRW (which measures whether a patient is overweight) and Height. At the end of the row are the variables that are essentially uncorrelated with Weight, namely Smoking and Cholesterol. The numbers at the bottom of each cell indicate the number of nonmissing pairwise observations.

In a similar way, look at the row for the Smoking variable. That variable is most strongly correlated with Height and MRW. Notice that the correlation with MRW is negative, which shows that the correlations are ordered by absolute value (magnitude). The highly correlated variables, whether positively or negatively correlated, appear first, and the uncorrelated variables (correlations near zero) appear last in each row.

Ordering correlations for groups of variables

It is common to want to examine the correlations between groups of variables. For example, a clinician might want to look at the correlations between clinical measurements (blood pressure, cholesterol,...) and genetic or lifestyle choices (weight, smoking habits,...). The following call to PROC CORR uses the VAR and WITH statements to compare groups of variables, and uses the RANK option to order the correlations along each row:

proc corr data=sashelp.Heart RANK noprob;
   var Height Weight MRW Smoking;       /* genetic and lifestyle factors */
   with Diastolic Systolic Cholesterol; /* clinical measurements */
run;

Notice that the variables in the VAR statement determine the columns and the variables in the WITH statement determine the rows. Within each row, the variables in the VAR statement are ordered by the magnitude of the correlation. For these data, the Diastolic and Systolic variables are similar with respect to how they correlate with the column variables: the order of the column variables is the same for the first two rows. In contrast, the correlations between the Cholesterol variable and the column variables are in a different order.

In summary, you can use the RANK option in the PROC CORR statement to order the rows of a correlation matrix according to the magnitude (absolute value) of the correlations between each variable and the others. This makes it easy to find pairs of variables that are strongly correlated and pairs that are weakly correlated.

The post Order correlations by magnitude appeared first on The DO Loop.

October 6, 2017
 

After a few tough years in the trenches, analytics leaders in utilities are emerging and making a difference as their utilities vie to stay relevant in the ever-changing energy landscape. At the core of this emergence are leaders that are embracing open analytics platforms and pushing analytics to the edge. [...]

For utilities, the horizon is wide open was published on SAS Voices by Mike F. Smith