Tech

5月 082018
 

The European Union’s General Data Protection Regulation (GDPR), taking effect on 25 May 2018, pertains not only to organizations located within the EU; it applies to all companies processing and holding the personal data of data subjects residing in the European Union, regardless of the company’s location.

If the GDPR acronym does not mean much to you, think of the one that does – HIPAA, FERPA, COPPA, CIPSEA, or any other that is relevant to your jurisdiction – this blog post is equally applicable to all of them.

The GDPR prohibits personal data processing revealing such individual characteristics as race or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, as well as the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health, and data concerning a natural person’s sex life or sexual orientation. It also has special rules for data relating to criminal convictions or offenses and the processing of children’s personal data.

Whenever SAS users produce reports on demographic data, there is always a risk of inadvertently revealing personal data protected by law, especially when reports are generated automatically or interactively via dynamic data queries. Even for aggregate reports there is a high potential for such exposure.

Suppose you produce an aggregate cross-tabulation report on a small demographic group, representing a count distribution by students’ grade and race. It is highly probable that some cells in the report will have a count of 1, which unequivocally identifies a person and thus discloses their education record (grade) by race. Even if a count is not 1 but some other small number, there is still a risk that Personally Identifiable Information (PII) can be deduced, or disaggregated, from surrounding data (other cells, row and column totals) or from related reports on that small demographic group.
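To make the risk and its mitigation concrete, below is a minimal sketch of primary cell suppression using a numeric format. This is an illustration only, not code from the tools that follow: the threshold of 5 and the STUDENTS table with GRADE and RACE variables are assumptions, and it performs no complementary (secondary) suppression.

   /* Mask any cell count below the assumed threshold of 5 */
   proc format;
      value supcnt
         1-4   = '     *'     /* suppress small counts        */
         other = [best6.];    /* print all other counts as-is */
   run;

   /* Hypothetical STUDENTS table with GRADE and RACE class variables */
   proc tabulate data=students;
      class grade race;
      table grade, race*n*f=supcnt.;   /* apply the mask to the N statistic */
   run;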

The following four SAS tools help you protect personal data in SAS reports by suppressing counts in small demographic group reports.

1. Automatic data suppression in SAS reports

This blog post explains the fundamental concepts of data suppression algorithms. It takes you behind the scenes of the iterative process of complementary data suppression and walks you through SAS code implementing a primary and secondary complementary suppression algorithm. The suppression code uses Base SAS – DATA steps, SAS macros, PROC FORMAT, PROC MEANS, and PROC REPORT.

2. Implementing Privacy Protection-Compliant SAS® Aggregate Reports

This SAS Global Forum 2018 paper solidifies and expands on the above blog post. It walks you through the intricate logic of an enhanced complementary suppression process, and demonstrates SAS coding techniques to implement and automatically generate aggregate tabular reports compliant with privacy protection law. The result is a set of SAS macros ready for use in any reporting organization responsible for compliance with privacy protection.

3. In SAS Visual Analytics you can create derived data items that are aggregated measures. SAS Visual Analytics 8.2 on SAS Viya introduces a new type of aggregated measure derived data item called Data Suppression. Here is an excerpt from the documentation on the Data Suppression type:

“Obscures aggregated data if individual values could easily be inferred. Data suppression replaces all values for the measure on which it is based with asterisk characters (*) unless a value represents the aggregation of a specified minimum number of values. You specify the minimum in the Suppress data if count less than parameter. The values are hidden from view, but they are still present in the data query. The calculation of totals and subtotals is not affected.

Some additional values might be suppressed when a single value would be suppressed from a subgroup. In this case, an additional value is suppressed so that the suppressed value cannot be inferred from totals or subtotals.

A common use of suppressed data is to protect the identity of individuals in aggregated data when some crossings are sparse. For example, if your data contains testing scores for a school district by demographics, but one of the demographic categories is represented only by a single student, then data suppression hides the test score for that demographic category.

When you use suppressed data, be sure to follow these best practices:

  • Never use the unsuppressed version of the data item in your report, even in filters and ranks. Consider hiding the unsuppressed version in the Data pane.
  • Avoid using suppressed data in any object that is the source or target of a filter action. Filter actions can sometimes make it possible to infer the values of suppressed data.
  • Avoid assigning hierarchies to objects that contain suppressed data. Expanding or drilling down on a hierarchy can make it possible to infer the values of suppressed data.”

This Data Suppression functionality is significant: it is the first such capability embedded directly into a SAS product.

4. Is it sensitive? Mask it with data suppression

This blog post provides an example of using the Data Suppression aggregated measure derived data items, described above, in SAS Visual Analytics.

We need your feedback!

We want to hear from you. Is this blog post useful? How do you comply with GDPR (or the privacy law of your jurisdiction) in your organization? What SAS privacy protection features would you like to see in future SAS releases?

SAS tools for GDPR privacy compliant reporting was published on SAS Users.

4月 212018
 

Have you ever been working in the macro facility and needed a macro function, but you could not locate one that would achieve your task? With the %SYSFUNC macro function, you can access most SAS® functions. In this blog post, I demonstrate how %SYSFUNC can help in your programming needs when a macro function might not exist. I also illustrate the formatting feature that is built in to %SYSFUNC. %SYSFUNC also has a counterpart called %QSYSFUNC that masks the returned value, in case special characters are returned.
%SYSFUNC enables the execution of SAS functions and user-written functions, such as those created with the FCMP procedure. Within the DATA step, arguments to the functions require quotation marks, but because %SYSFUNC is a macro function, you do not enclose the arguments in quotation marks. The examples here demonstrate this.
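As a minimal illustration of this difference, here is the PROPCASE function called in both contexts (my example, not taken from the original post):

   /* DATA step: the character argument must be quoted */
   data _null_;
      name = propcase("john smith");
      put name=;                          /* name=John Smith */
   run;

   %put %sysfunc(propcase(john smith));   /* macro: no quotation marks; writes John Smith */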

%SYSFUNC has two possible arguments. The first argument is the SAS function, and the second argument (which is optional) is the format to be applied to the value returned from the function. Suppose you had a report and within the title you wanted to display today’s date in word format:

   title "Today is %sysfunc(today(),worddate20.)";

The title appears like this:

   "Today is               July 4, 2018"

Because the date is right-justified, there are leading blanks before the date. In this case, you need to introduce another function to remove the blank spaces. Luckily %SYSFUNC enables the nesting of functions, but each function that you use must have its own associated %SYSFUNC. You can rewrite the above example by adding the STRIP function to remove any leading or trailing blanks in the value:

   title "Today is %sysfunc(strip(%sysfunc(today(),worddate20.)))";

The title now appears like this:

    "Today is July 4, 2018"

The important thing to notice is the use of two separate functions. Each function is contained within its own %SYSFUNC.

Suppose you had a macro variable that contained blank spaces and you wanted to remove them. There is no macro COMPRESS function that removes all blanks. However, with %SYSFUNC, you have access to one. Here is an example:

   %let list=a    b    c; 
   %put %sysfunc(compress(&list));

The value that is written to the log is as follows:

   abc

In this last example, I use %SYSFUNC to work with SAS functions where macro functions do not exist.

The example checks to see whether an external file is empty. It uses the following SAS functions: FILEEXIST, FILENAME, FOPEN, FREAD, FGET, and FCLOSE. There are other ways to accomplish this task, but this example illustrates the use of SAS functions within %SYSFUNC.

   %macro test(outf);
   %let filrf=myfile;
 
   /* The FILEEXIST function returns a 1 if the file exists; else, a 0
   is returned. The macro variable &OUTF resolves to the filename
   that is passed into the macro. This function is used to determine
   whether the file exists. In this case you want to find the file
   that is contained within &OUTF. Notice that there are no quotation
   marks around the argument, as you will see in all cases below. If
   the condition is false, the %ELSE portion is executed, and a
   message is written to the log stating that the file does not
   exist.*/
 
   %if %sysfunc(fileexist(&outf)) %then %do;
 
   /* The FILENAME function returns 0 if the operation was successful; 
   else, a nonzero is returned. This function can assign a fileref
   for the external file that is located in the &OUTF macro 
   variable. */
 
   %let rc=%sysfunc(filename(filrf,&outf));
 
   /* The FOPEN function returns 0 if the file could not be opened; 
   else, a nonzero is returned. This function is used to open the
   external file that is associated with the fileref from &FILRF. */
 
   %let fid=%sysfunc(fopen(&filrf));
 
   /* The %IF macro checks to see whether &FID has a value greater
   than zero, which means that the file opened successfully. If the
   condition is true, we begin to read the data in the file. */
 
   %if &fid > 0 %then %do;
 
   /* The FREAD function returns 0 if the read was successful; else, a
   nonzero is returned. This function is used to read a record from
   the file that is contained within &FID. */
 
   %let rc=%sysfunc(fread(&fid));
 
   /* The FGET function returns a 0 if the operation was successful. A
   returned value of -1 is issued if there are no more records
   available. This function is used to copy data from the file data 
   buffer and place it into the macro variable, specified as the
   second argument in the function. In this case, the macro variable
   is MYSTRING. */   
 
   %let rc=%sysfunc(fget(&fid,mystring));
 
   /* If the read was successful, the log will write out the value
   that is contained within &MYSTRING. If nothing is returned, the
   %ELSE portion is executed. */
 
   %if &rc = 0 %then %put &mystring;
   %else %put file is empty;
 
   /* The FCLOSE function returns a 0 if the operation was successful;
   else, a nonzero value is returned. This function is used to close
   the file that was referenced in the FOPEN function. */
 
   %let rc=%sysfunc(fclose(&fid));
   %end;
 
   /* The FILENAME function is used here to deassign the fileref 
   FILRF. */
 
   %let rc=%sysfunc(filename(filrf));
   %end;
   %else %put file does not exist;
   %mend test;
   %test(c:\testfile.txt)

There are times when the value that is returned from the function used with %SYSFUNC contains special characters. Those characters then need to be masked. This can be done easily by using %SYSFUNC’s counterpart, %QSYSFUNC. Suppose we run the following example:

   %macro test(dte);
   %put &dte;
   %mend test;
 
   %test(%sysfunc(today(), worddate20.))

The above code would generate an error in the log, similar to the following:

   1  %macro test(dte);
   2  %put &dte;
   3  %mend test;
   4
   5  %test(%sysfunc(today(), worddate20.))
   MLOGIC(TEST):  Beginning execution.
   MLOGIC(TEST):  Parameter DTE has value July 20
   ERROR: More positional parameters found than defined.
   MLOGIC(TEST):  Ending execution.

The WORDDATE format returns the value like this: July 20, 2017. To a parameter list, the comma represents a delimiter, so this macro call passes two positional parameters. However, the definition contains only one positional parameter, so an error is generated. To correct this problem, you can rewrite the macro invocation in the following way:

   %test(%qsysfunc(today(), worddate20.))

The %QSYSFUNC macro function masks the comma in the returned value so that it is seen as text rather than as a delimiter.

For a list of the functions that are not available with %SYSFUNC, see the %SYSFUNC documentation in SAS Macro Language: Reference.

How to expand the number of available SAS functions within the macro language was published on SAS Users.

4月 192018
 

In SAS Visual Analytics 7.4 on 9.4M5 and SAS Visual Analytics 8.2 on SAS Viya, the periodic operators have a new additional parameter that controls how filtering on the date data item used in the calculation affects the calculated values.

The new parameter values, which appear throughout this post, are _ApplyAllFilters_ (the default), _IgnoreAllTimeFrameFilters_, and _IgnoreInteractiveTimeFrameFilters_.

These parameter values enable you to improve the appearance of reports based on calculations that use periodic operators. You can have periods that produce missing values for periodic calculations removed from the report, but still available for use in the calculations for later periods. These parameter settings also enable you to provide users with a prompt for choosing the data to display in a report, without having any effect on the calculations themselves.

The following will illustrate the points above, using periodic Revenue calculations based on monthly data from the MEGA_CORP table. New aggregated measures representing Previous Month Revenue (RelativePeriod) and Same Month Last Year (ParallelPeriod) will be displayed as measures in a crosstab. The default _ApplyAllFilters_ is in effect for both, as shown below, but there are no current filters on report or objects.

The Change from Previous Month and Change From Same Month Last Year calculations, respectively, are below:

The resulting report is a crosstab with Date by Month and Product Line in the Row roles, and Revenue, along with the 4 aggregations, in the Column roles.  All calculations are accurate, but of course, the calculations result in missing values for the first month (Jan2009) and for the first year (2009).

An improvement to the appearance of the report might be to show only Date by Month values beginning with Jan2010, where there are no missing values. Why not apply a filter to the crosstab (shown below), so that the interval shown is Jan2010 to the most recent date?

With the above filter applied to the crosstab, the result is shown below—same problem, different year!

This is where the new parameter on our periodic operators is useful. We would like to have all months used in the calculations, but only the months with non-missing values for both of the periodic calculations shown in the crosstab. So, edit both periodic calculations to change the default _ApplyAllFilters_ to _IgnoreAllTimeFrameFilters_, so that the filters will filter the data in the crosstab, but not for the calculations. When the report is refreshed, only the months with non-missing values are shown:

This periodic operator parameter is also useful if you want to enable users to select a specific month, for viewing only a subset of the crosstab results.

For a selection prompt, add a Drop-Down list to select a MONYY value and define a filter action from the Drop-Down list to the Crosstab. To prevent selection of a month value with missing calculation values, you will also want to apply a filter to the Drop-Down list as you did for the crosstab, displaying months Jan2010 and after in the list.

Now the user can select a month, with all calculations relative to that month displayed, shown in the examples below:

Note that, at this point, since you’ve added the action from the drop-down list to the crosstab, you actually no longer need the filter on the crosstab itself. In addition, if you remove the crosstab filter, then all of your filters will come from prompts or actions, so you could use the _IgnoreInteractiveTimeFrameFilters_ parameter on your periodic calculations instead of the _IgnoreAllTimeFrameFilters_ parameter.

You will also notice that, in release 8.2 of SAS Visual Analytics, the performance of the periodic calculations has been greatly improved, with more of the work done in CAS.

Be sure to check out all of the periodic operators, documented for both SAS Visual Analytics 7.4 and SAS Visual Analytics 8.2.

SAS Visual Analytics filters on periodic calculations: Apply them or ignore them! was published on SAS Users.

4月 192018
 

A very common coding technique SAS programmers use is identifying the largest value for a given column using the DATA step BY statement with the DESCENDING option. In this example I wanted to find the largest value of the number of runs (nRuns) for each team in the SASHELP.BASEBALL dataset. Using a SAS workspace server, one would write:
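The code in Figure 1 appears as an image in the original post; a minimal sketch of such single-threaded code (the intermediate and output table names are my assumptions) might look like this:

   /* Sort so the largest nRuns is the first row within each team */
   proc sort data=sashelp.baseball out=work.baseball;
      by Team descending nRuns;
   run;

   /* First row per team = that team's largest nRuns */
   data work.maxruns;
      set work.baseball;
      by Team descending nRuns;
      if first.Team then output;
   run;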


Figure 1. Single Threaded DATA Step in SAS Workspace Server

Figure 2 shows the results of the code we ran in Figure 1:

Figure 2. Result from SAS Code Displayed in Figure 1

To run this DATA step distributed, we will leverage SAS® Cloud Analytic Services in SAS® Viya™. Notice in Figure 3 that there is no need for the PROC SORT step, which is required when running the DATA step single-threaded in a SAS workspace server. This is because SAS® Cloud Analytic Services groups the BY variables automatically, and the BY statement there does not yet support the DESCENDING option. Instead we will simulate DESCENDING, as in the sketch below.
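The code in Figure 3 is likewise an image; a sketch of the distributed equivalent, assuming the table has been loaded into a caslib named casuser (session, caslib, and table names are my assumptions), might look like this:

   cas mysession;
   libname casuser cas caslib="casuser";

   /* Load the data into CAS */
   data casuser.baseball;
      set sashelp.baseball;
   run;

   /* Runs distributed in CAS: no PROC SORT, no DESCENDING.
      CAS groups and orders the BY variables automatically; taking the
      LAST. row per team in ascending nRuns order simulates DESCENDING. */
   data casuser.maxruns;
      set casuser.baseball;
      by Team nRuns;
      if last.Team then output;   /* last row per team = largest nRuns */
   run;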

Figure 3. Distributed DATA Step in SAS® Cloud Analytic Services in SAS® Viya™

Figure 4 shows the results when running distributed DATA Step in SAS® Cloud Analytic Services in SAS® Viya™.

Figure 4. Results of Distributed DATA Step in SAS® Cloud Analytic Services in SAS® Viya™

Conclusion

Until the BY statement running in SAS® Cloud Analytic Services in SAS® Viya™ supports DESCENDING, use this technique to ensure your DATA step runs distributed.

Read more SAS Viya posts.

Read our SAS 9 to SAS Viya whitepaper.

How to Simulate DESCENDING BY Variables in DATA Step Code that Runs Distributed in SAS® Viya™ was published on SAS Users.

4月 172018
 

“Customer experiences are defined by peak moments,” said Dan Heath, in his typical charismatic fashion during the keynote on the Executive Track of SAS Global Forum 2018. Customer experience is not an alien term by any means, but “peak moments”? What was that?

The best way to get a detailed understanding is to simply get your copy of The Power of Moments, authored by the Heath Brothers. Heath went on to use examples from a bank in Canada whose ATM would strike up a human dialog with a surprised consumer who had walked up to withdraw a few dollars, and would pop out a free ticket to Disney for a mother of two or a round-trip ticket for a grandmother to be with her beautiful grandchildren. Peak moments. OK, now I get it.

Heath also challenged everyone in the audience – including yours truly – to submit our own interpretation of peak moments within the next 24 hours to get a free copy of his book. But, as luck would have it, I did not get around to submitting it because the peak moment I experienced was a few hours later during the SAS Grid Dinner. And it was no surprise that this conference was in Denver with a beautiful view of the glorious mountain peaks -- the perfect backdrop for customers to experience their peak moments!


Fast forward to the SAS Grid Dinner. A panel of customers with years of experience and insight using SAS Grid Manager had a great conversation with the audience, a discussion eloquently moderated by Cheryl Doninger, VP, Business Intelligence, Research & Development, SAS. It was clear that this was a panel that had done many things right from the get-go and had taken up the challenges of technology, culture, and change management in different ways. It was really nice to see them openly share their experiences and findings for the benefit of the hundreds of members in the audience listening with rapt attention.

And, then it hit me! We were looking at SAS customers on the panel. Going back to Heath’s assertion, I started wondering what could have been the peak moment for every one of these customers as they continued to evangelize the enterprise-wide adoption of SAS Grid Manager. Here is a synopsis of various peak moments shared by the customers in response to my question. Heath! Are you reading this?

Peak Moment One
SAS Grid Manager users are happy with the uptime and availability of Grid Manager. One customer called it a Layer of Happiness.

Peak Moment Two
Users who had initially issued a stern warning – “You ain’t taking my PC SAS away” – were happy to see 25 percent to 50 percent performance and throughput improvements with SAS Grid Manager.

Peak Moment Three
The loudest and most aggressive naysayer who was opposed to the migration to SAS Grid Manager became an advocate after experiencing tangible proof points and measurable outcomes.

Peak Moment Four
200 users were seamlessly migrated without any glitch. Worked like a charm! Nodes were decommissioned unbeknownst to the user.

Peak Moment Five
Taking a moment to reflect on the overall experience – what could have been and how smooth the overall experience was.

Peak Moment Six
Models had to be certified when moving from the previous platform to the SAS Grid, but no code changes were needed to move them.

Peak Moment Seven
SAS Grid Manager being fully utilized at 80 percent to 90 percent capacity just like the mainframes. Maximizing the utilization is always a strong indicator of systemic adoption across the enterprise.

There you have it.  These were peak moments that the customers shared in response to my question.

But, here is what I was pleasantly surprised by! It took them seconds to come out with their respective peak moments. They did not have to reflect. They did not have to think hard. It came to them naturally. And that is what made it a peak moment for me! Just seeing these real-life customers share openly what worked for them and their users and why. These seven peak moments took me to seventh heaven!

So, Dan Heath, I could not meet your deadline of submitting my peak moment because I was waiting to experience one myself – albeit a tad bit late! Guess what could be another peak moment for me? You reading this blog and sending me a copy of your book!

What are your peak moments as a customer?  What are peak moments your customers have experienced?  Please share them here.  And we will wait for the Heath brothers to synthesize that into another wonderful book for the rest of us to read!

Peak moments define SAS Global Forum 2018 was published on SAS Users.

4月 142018
 

SAS Visual Forecasting 8.2 effectively models and forecasts time series at scale. It is built on SAS Viya and powered by SAS Cloud Analytic Services (CAS).

In this blog post, I will build a Visual Forecasting (VF) pipeline, which is a process flow diagram whose nodes represent tasks in the VF process. The objective is to show how to perform the full analytics life cycle with large volumes of data: from accessing data and assigning variable roles accurately, to building forecasting models, to selecting a champion model and overriding the system-generated forecast. In this blog post I will use 1,337 time series related to a chemical company and will illustrate the main steps you would use for your own applications and datasets.

In future posts, I will work in the Programming application, a collection of SAS procedures and CAS actions for direct coding or access through tasks in SAS Studio, and will develop and assess VF models via Python code.

In a VF pipeline, teams can easily save forecast components to the Toolbox to later share in this collaborative environment.

Forecasting Node in Visual Analytics

This section briefly describes what is available in SAS Visual Analytics; the rest of the blog discusses SAS Visual Forecasting 8.2.

In SAS Visual Analytics the Forecast object will select the model that best fits the data out of these models: ARIMA, Damped trend exponential smoothing, Linear exponential smoothing, Seasonal exponential smoothing, Simple exponential smoothing, Winters method (additive), or Winters method (multiplicative). Currently there are no diagnostic statistics (MAPE, RMSE) for the model selected.

You can do “what-if analysis” using the Scenario Analysis and Goal Seeking functionalities. Scenario Analysis enables you to forecast hypothetical scenarios by specifying the future values for one or more underlying factors that contribute to the forecast. For example, if you forecast the profit of a company, and material cost is an underlying factor, then you might use scenario analysis to determine how the forecasted profit would change if the material cost increased by 10%. Goal Seeking enables you to specify a target value for your forecast measure, and then determine the values of underlying factors that would be required to achieve the target value. For example, if you forecast the profit of a company, and material cost is an underlying factor, then you might use Goal Seeking to determine what value for material cost would be required to achieve a 10% increase in profit.

Another neat feature in SAS Visual Analytics is that one can apply different filters to the final forecast. Filters are underlying factors or different levels of the hierarchy, and the resulting plot incorporates those filters.

Data Requirements for Forecasting

There are specific data requirements when working on a forecasting project: a time series dataset that contains at least two variables: 1) the variable that you want to forecast, known as the target of your analysis (for example, Revenue), and 2) a time ID variable that contains the time stamps of the target variable. The intervals of this variable are regularly spaced. Your time series table can contain other time-varying variables. When your time series table contains more than one individual series, you might have classification variables, as shown in the photo below. Distribution Center and Product Key are classification variables. Optionally, you can designate numerical variables (for example, Discount) as independent variables in your models.
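For illustration, here is a hypothetical slice of a table meeting these requirements; the values are invented, and the variable names mirror the SkinProduct data described below:

   data work.skinproduct_sample;
      input DistributionCenter $ ProductKey Date :date9. Revenue Discount;
      format Date date9.;
      datalines;
   Lyon  1001 07JAN2017 1260 0.10
   Lyon  1001 14JAN2017 1315 0.00
   Paris 1001 07JAN2017  980 0.05
   Paris 1001 14JAN2017 1042 0.05
   ;
   run;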

You also have the option of adding a table of attributes to the time series table. Attributes are categorical variables that define qualities of the time series. Attribute variables are similar to BY variables, but are not used to identify the series that you want to forecast. In this post, the data I am using includes Distribution Center, Supplier Name, Product Type, Venue and Product Category. Notice that the attributes are time invariant, and that the attribute table is much smaller than the time series table.

The two data sets (SkinProduct and SkinProductAttributes) used in this blog contain 1,337 time series related to a chemical company. This picture shows a few rows of the two data sets used in this post; note that the DATE intervals are regularly spaced weeks. The SkinProduct dataset is referred to as the Time Series table, and the SkinProductAttributes dataset as the Attribute Data.


Developing Models in SAS Visual Forecasting 8.2

Step One: Create a Forecasting Project and Assign Variables

From the SAS Home menu, select the action Build Models, which takes you to SAS Model Studio. There, select New Project and enter 1) the name of your project, 2) the type of project (make sure you enter “Forecasting”), and 3) the data source.


In the Data tab, assign the variable roles by using the icon in the upper right corner.

The roles assigned in this post are typical of role assignments in forecasting projects. Notice these variables are in the Time Series table. Also, notice that the classification variables are ordered from highest to lowest hierarchy:

Time Variable: Date
Dependent Variable: Revenue
Classification Variables: Distribution Center and Product Key

In the Time Series table, you might have additional variables you’d like to assign the role “independent” so that they are considered for model generation. Independent variables are the explanatory, input, predictor, or causal variables that can be used to model and forecast the dependent variable. In this post, the variable “Discount” is assigned the role “independent”. To do this assignment, right-click on the variable and select Edit Variables.

To bring in the second dataset with the Attribute variables, follow the steps in this photo:

 

Step two: Automated Modeling with Pipelines

The objective in this step is to select a champion model. Working in the Pipelines tab, one explores the time series plots, uses the code editor to modify the default model, adds a second model to the pipeline, compares the models, and selects a champion model.

After step one is completed, select the Pipelines tab.

The first node in the VF pipelines is the Data node. After right-clicking and running this node, one can see the time series by selecting Explore Time Series. Notice that one can filter by the attribute variables, and that the table shows the exact historical data values.

Auto-forecasting is the next node in the default pipeline. Remember that we are modeling 1,337 time series. For each time series, the Auto-forecasting node automatically diagnoses the statistical characteristics of the time series, generates a list of appropriate time series models, automatically selects the model, and generates forecasts. This node evaluates ARIMAX and exponential smoothing models for each time series and selects the best one.

One can customize and modify the forecasting models (except the Hierarchical Forecasting model) by editing the model’s code. For example, to add the class of models for intermittent demand to the auto-forecasting node, one could open the code editor for that node and replace these lines:

   rc = diagspec.setESM();
   rc = diagspec.setARIMAX();

with:

   rc = diagspec.setIDM();

To open the code editor, see the photo below. After making changes, save the code and close the editor.
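For context, diagspec is an object provided by the ATSM (automated time series modeling) package that the generated code uses. The fragment below is only a hypothetical, heavily simplified sketch of where such a specification is built; the PROC TSMODEL wrapper, table name, and BY/ID/VAR variables are my assumptions, not the node’s actual generated program:

   proc tsmodel data=casuser.skinproduct;
      by DistributionCenter ProductKey;
      id Date interval=week;
      var Revenue;
      require atsm;                       /* load the ATSM package */
      submit;
         declare object diagspec(diagspec);
         rc = diagspec.open();
         rc = diagspec.setIDM();          /* diagnose intermittent demand models */
         rc = diagspec.close();
      endsubmit;
   run;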

At this point, you can run the Auto Forecasting node, and after looking at its results, save it to the toolbox, so the editing changes are saved and later reused or shared with team members.

By expanding the Nodes pane and the Forecasting Modeling pane on the left, you can select from several models and add a second modeling node to the pipeline.

The next photo shows a pipeline with the Naïve Forecast as the second model. It was added to the pipeline by dropping its node into the parent node (data node). This is the resulting pipeline:

After running the Model Comparison node, compare the WMAE (Weighted Mean Absolute Error) and WMAPE (Weighted Mean Absolute Percent Error) and select a champion model.

You can build several pipelines using different model strategies. To select a champion model from among all the models developed in the pipelines, use the Pipeline Comparison tab.

Before you work on any overrides for your forecasting project, you need to make sure that you are working with the best pipeline and modeling node for your data. SAS Visual Forecasting selects the best fit model in each pipeline. After each pipeline is run, the champion pipeline is selected based on the statistics of fit that you chose for the selection criteria. If necessary, you can change the selected champion pipeline.

Step Three: Overrides

The Overrides tab is used to manually adjust the forecasts in the future. For example, if you want to account for some promotions that your company and its competitors are running and that are not captured by the models.

The Overrides module allows users to select subsets of time series at the aggregate level by selecting attribute values in the attribute table that you defined in the Data tab. The filters based on the attributes are highly customizable and do not restrict you to the hierarchy that was used for the modeling. This selection of filters using attributes is often referred to as faceted search. Whenever you create a new filter based on a selection of attribute values (also known as facets), the aggregate for all series that match the facets is displayed on your main panel.

There is a wealth of information in the Overrides overview:

  • a list of the BY variables, as well as attribute variables, available to use as filters;
  • a plot of the Time Series Aggregation and Overrides displaying historical and forecast data; and
  • a Forecast and Overrides table, which can be used to create, edit, and submit override values for a time series based on external factors that are not included in the forecast models.

Conclusion

Using SAS Visual Forecasting 8.2, you can effectively model and forecast time series at scale. The Visual Forecasting pipeline greatly facilitates the automatic forecasting of large volumes of data, and provides a structured and robust method with efficient and flexible processes.

An introduction to SAS Visual Forecasting 8.2 was published on SAS Users.

4月 132018
 

The release of SAS Viya 3.3 has brought some nice data quality features. In addition to the visual applications like Data Studio or Data Explorer that are part of the Data Preparation offering, one can leverage data quality capabilities from a programming perspective.

For the time being, SAS Viya provides two ways to programmatically perform data quality processing on CAS data:

  • The Data Step Data Quality functions.
  • The profile CAS action.

To use Data Quality programming capabilities in CAS, a Data Quality license is required (or a Data Preparation license which includes Data Quality).

Data Step Data Quality functions

The Data Quality functions currently supported in CAS cover casing, parsing, field extraction, gender analysis, identification analysis, match codes, and standardization capabilities.

For now, they are available only in the CAS DATA step. You can’t use them in DS2 or in FedSQL.

To run in CAS, certain conditions must be met. These include:

  • Both the input and output data must be CAS tables.
  • All language elements must be supported in the CAS Data Step.
  • Others.

Let’s look at an example:

cas mysession sessopts=(caslib="casuser") ;
 
libname casuser cas caslib="casuser" ;
 
data casuser.baseball2 ;
   length gender $1 mcName parsedValue tokenNames lastName firstName varchar(100) ;
   set casuser.baseball ;
   gender=dqGender(name,'NAME','ENUSA') ;
   mcName=dqMatch(name,'NAME',95,'ENUSA') ;   
   parsedValue=dqParse(name,'NAME','ENUSA') ;
   tokenNames=dqParseInfoGet('NAME','ENUSA') ;
   if _n_=1 then put tokenNames= ;
   lastName=dqParseTokenGet(parsedValue,'Family Name','NAME','ENUSA') ;
   firstName=dqParseTokenGet(parsedValue,'Given Name','NAME','ENUSA') ;
run ;

Here, my input and output tables are CAS tables, and I’m using CAS-enabled statements and functions. So, this will run in CAS, in multiple threads, in massively parallel mode across all my CAS workers on in-memory data. You can confirm this by looking for the following message in the log:

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.

I’m doing simple data quality processing here:

  • Determine the gender of an individual based on his(her) name, with the dqGender function.
  • Create a match code for the name for a later deduplication, with the dqMatch function.
  • Parse the name using the dqParse function.
  • Identify the name of the tokens produced by the parsing function, with the dqParseInfoGet function.
  • Write the token names in the log; the tokens for this definition are:
    Prefix, Given Name, Middle Name, Family Name, Suffix, Title/Additional Info
  • Extract the “Family Name” token from the parsed value, using dqParseTokenGet.
  • Extract the “Given Name” token from the parsed value, again using dqParseTokenGet.

I get the following table as a result:

Performing this kind of data quality processing on huge tables in memory and in parallel is simply awesome!

The dataDiscovery.profile CAS action

This CAS action enables you to profile a CAS table:

  • It offers two algorithms; one is faster but uses more memory.
  • It offers multiple options to control your profiling job:
    • Columns to be profiled.
    • Number of distinct values to be profiled (high-cardinality columns).
    • Number of distinct values/outliers to report.
  • It provides identity analysis using RegEx expressions.
  • It outputs the results to another CAS table.

The resulting table is a transposed table of all the metrics for all the columns. This table requires some post-processing to be analyzed properly.

Example:

proc cas; 
   dataDiscovery.profile /
      algorithm="PRIMARY"
      table={caslib="casuser" name="product_dim"}
      columns={"ProductBrand","ProductLine","Product","ProductDescription","ProductQuality"}
      cutoff=20
      frequencies=10
      outliers=5
      casOut={caslib="casuser" name="product_dim_profiled" replace=true}
   ;
quit ;

In this example, you can see:

  • How to specify the profiling algorithm (quite simple: PRIMARY=best performance, SECONDARY=less memory).
  • How to specify the input table and the columns you want to profile.
  • How to reduce the number of distinct values to process using the cutoff option (it prevents excessive memory use for high-cardinality columns, but might show incomplete results).
  • How to reduce the number of distinct values reported using the frequencies option.
  • How to specify where to store the results (casout).

So, the result is not a report but a table.

The RowId column needs to be matched with the corresponding metric codes, such as those discussed below, to interpret the results.
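As a sketch of such post-processing, here is one way to isolate a single metric from the output table. It assumes that the RowId column carries the metric code (verify this, and the column names, against your own output); metric 1001, whose legend appears below, identifies the column data type:

/* Keep only the data type metric (code 1001; legend below) */
proc print data=casuser.product_dim_profiled;
   where RowId = 1001;
run;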

A few comments/cautions on this results table:

  • DoubleValue, DecSextValue, or IntegerValue fields can appear on the output table if numeric fields have been profiled.
  • DecSextValue can contain the mean (metric #1008), median (#1009), standard deviation (#1022) and standard error (#1023) if a numeric column was profiled.
  • It can also contain frequency distributions, maximum, minimum, and mode if the source column is of the DecSext data type, which is not possible yet.
  • DecSext is a 192-bit fixed-decimal data type that is not supported yet in CAS, and is consequently converted into a double most of the time. Also, SAS Studio cannot correctly render the new CAS data types. As of today, those metrics might not be very reliable.
  • Also, some percentage calculations might be rounded due to the use of integers in the Count field.
  • The legend for metric 1001 is not documented. Here it is:

1: CHAR
2: VARCHAR
3: DATE
4: DATETIME
5: DECQUAD
6: DECSEXT
7: DOUBLE
8: INT32
9: INT64
10: TIME

A last word on the profile CAS action. It can help you to perform some identity analysis using patterns defined as RegEx expressions (this does not use the QKB).

Here is an example:

proc cas; 
   dataDiscovery.profile /
      table={caslib="casuser" name="customers"}
      identities={
         {pattern="PAT=</span>?999[<span style=" />-]? ?999[- ]9999",type="USPHONE"}, 
         {pattern= "PAT=^99999[- ]9999$",type="ZIP4"}, 
         {pattern= "PAT=^99999$",type="ZIP"}, 
         {pattern= "[^ @]+@[^ @]+\.[A-Z]{2,4}",type="EMAIL"}, 
         {pattern= "^(?i:A[LKZR]|C[AOT]|DE|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])$",type="STATE"}
      }
      casOut={caslib="casuser" name="customers_profiled" replace="true"}
   ;
quit ;

In this example, the patterns identify US phone numbers, ZIP and ZIP+4 codes, email addresses, and US state codes.

I hope this post has been helpful.

Thanks for reading.

An overview of SAS Data Quality 3.3 programming capabilities was published on SAS Users.

4月 132018
 

As a follow-on from my previous blog post, where we looked at the different use cases for using Kerberos in SAS Viya 3.3, in this post we will delve into more detail on the requirements for use case 4, where we use Kerberos authentication throughout both the SAS 9.4 and SAS Viya 3.3 environments. We won’t cover the configuration of this setup, as that is a topic too broad for a single blog post.

As a reminder, the use case we are considering is shown here:

SAS Viya 3.3 Kerberos Delegation

Here the SAS 9.4 Workspace Server is launched with Kerberos credentials; the Service Principal for the SAS 9.4 Object Spawner will need to be trusted for delegation. This means that a Kerberos credential for the end-user is available to the SAS 9.4 Workspace Server. The SAS 9.4 Workspace Server can use this end-user Kerberos credential to request a Service Ticket for the connection to SAS Cloud Analytic Services, while SAS Cloud Analytic Services is provided with a Kerberos keytab and principal it can use to validate this Service Ticket. Validating the Service Ticket authenticates the SAS 9.4 end-user to SAS Cloud Analytic Services. The principal for SAS Cloud Analytic Services must also be trusted for delegation, since we need the SAS Cloud Analytic Services session to have access to the Kerberos credentials of the SAS 9.4 end-user.

These Kerberos credentials made available to SAS Cloud Analytic Services are used for two purposes. First, they are used to make a Kerberized connection to the SAS Viya Logon Manager to obtain the SAS Viya internal OAuth token. As a result, the SAS Viya Logon Manager must be configured to accept Kerberos connections. Second, the Kerberos credentials of the SAS 9.4 end-user are used to connect to the Secure Hadoop environment.

In this case, since all the various principals are trusted for delegation, our SAS 9.4 end-user can perform multiple authentication hops using Kerberos with each component. This means that through the use of Kerberos authentication the SAS 9.4 end-user is authenticated into SAS Cloud Analytic Services and out to the Secure Hadoop environment.

Reasons for doing it…

To start with, why would we use this setup? Of all the use cases we considered in the previous blog post, this provides the strongest authentication between SAS 9.4 Maintenance 5 and SAS Viya 3.3. At no point does a username/password combination pass between the SAS 9.4 environment and the SAS Viya 3.3 environment. In fact, the only credential set (username/password) sent over the network in the whole environment is the one used by the Identities microservice to fetch user and group information for SAS Viya 3.3. Even that we could eliminate if the LDAP provider supported anonymous binds for fetching user details.

Also, this use case provides true single sign-on from SAS 9.4 Maintenance 5 to SAS Viya 3.3 and all the way out to the Secured Hadoop environment. Each operating system run-time process will be launched as the end-user and no cached or stored username/password combination is required.

High-Level Requirements

At a high level, we need to have both sides configured for Kerberos delegated authentication. This means both the SAS 9.4 Maintenance 5 and the SAS Viya 3.3 environments must be configured for Kerberos authentication.

The following SAS components and tiers need to be configured:

  • SAS 9.4 Middle-Tier
  • SAS 9.4 Metadata Tier
  • SAS 9.4 Compute Tier
  • SAS Viya 3.3 SAS Logon Manager
  • SAS Viya 3.3 SAS Cloud Analytic Services

Detailed Requirements

First let’s talk about Service Principal Names. We need to have a Service Principal Name (SPN) registered for each of the components/tiers in our list above. Specifically, we need a SPN registered for:

  • HTTP/<HOSTNAME> for the SAS 9.4 Middle-Tier
  • SAS/<HOSTNAME> for the SAS 9.4 Metadata Tier
  • SAS/<HOSTNAME> for the SAS 9.4 Compute Tier
  • HTTP/<HOSTNAME> for the SAS Viya 3.3 SAS Logon Manager
  • sascas/<HOSTNAME> for the SAS Viya 3.3 SAS Cloud Analytic Services

Where the <HOSTNAME> part should be the fully qualified hostname of the machines where the component is running. This means that some of these might be combined, for example if the SAS 9.4 Metadata Tier and Compute Tier are running on the same host we will only have one SPN for both. Conversely, we might require more SPNs, if for example, we are running a SAS 9.4 Metadata Cluster.

The SPN needs to be registered against something. Since our aim is to support single sign-on from the end-user’s desktop, we’ll probably be registering the SPNs in Active Directory. In Active Directory we can register against either a user or computer object. For both the SAS 9.4 Metadata and Compute Tier the registration can be performed automatically if the processes run as the local system account on a Microsoft Windows host, and will be against the computer object. Otherwise, and for the other tiers and components, the SPN must be registered manually. We recommend that, while you can register multiple SPNs against a single object, you register each SPN against a separate object.

Since the entire aim of this configuration is to delegate the Kerberos authentication from one tier/component onto the next, we need to ensure the objects, namely users or computer objects, are trusted for delegation. The SAS 9.4 Middle-Tier supports only unconstrained delegation, whereas the other tiers and components support Microsoft’s constrained delegation. If you choose to go down the path of constrained delegation, you need to specify each and every Kerberos service the object is trusted to delegate authentication to.

Finally, we need to provide a Kerberos keytab for the majority of the tiers/components. The Kerberos keytab will contain the long-term keys for the object the SPN is registered against. The only exceptions are the SAS 9.4 Metadata and Compute Tiers, if these are running on Windows hosts.

Conclusion

You can now enable Kerberos delegation across the SAS Platform, using a single strong authentication mechanism across that platform. As always with configuring Kerberos authentication, the prerequisites, in terms of Service Principal Names, service accounts, delegation settings, and keytabs, are important for success.

SAS Viya 3.3 Kerberos Delegation from SAS 9.4M5 was published on SAS Users.

4月 062018
 

SAS Global Forum 2018 is just days away, and if you're one of the 6,000 SAS users heading to Denver for the event, get ready for four days of learning from SAS peers, exchanging ideas, discovering new techniques for using SAS, and a bit of fun as well.

But what if you can't make the trip to Denver this year? Is there another way to experience some of the great content that will be shared there? I'm happy to say the answer is yes!

SAS Global Forum 2018 Virtual Event

Once again, SAS will provide dozens of hours of live video streaming from the event. Register for an All-Access Pass and watch select sessions including the Opening Session, keynote talks, select breakouts, Tech Talks, updates from The Quad, interviews with SAS executives and developers, and more. You can find livestream presentations and add them to your calendar here. Coverage will be available for on-demand viewing on the SAS Global Forum Video Portal after the conference as well.

Get involved via Social Media

Video not your thing? No worries. SAS will provide several other ways to stay up to date. For starters, you can read any of a number of blog posts from the event. Posts will come from many different SAS blogs, but all posts from SAS Global Forum will be aggregated here.

If you're on LinkedIn, Twitter or Facebook, you can stay connected with what's happening and engage with attendees on SAS’ social media channels. Join the conversation, comment on some of the cool presentations you attended or viewed, discuss the exciting news coming out of the event, or simply follow along. The channels sure to have the most activity are the SAS Users LinkedIn group, the SAS Twitter account, the SAS Users Group Facebook page, and Instagram. The hashtag for SAS Global Forum is #SASGF; be sure to use the hashtag in all your posts.

SAS Global Forum 2018 Proceedings

If you want a deeper dive into a particular talk, or if you're interested in presentations that may not be part of the live stream, the proceedings for SAS Global Forum 2018 are already available. Here you'll find abstracts, PDFs of papers, e-posters, slides, data files, and more.

With all the opportunities to follow along, connect and contribute, you can be a part of SAS Global Forum 2018, whether you're attending in person or not.

Enjoy the event...whether you're going or not!

Participate in SAS Global Forum 2018...even if you're not going was published on SAS Users.