Tech

12月 22 2017
 

Another report requirement came my way and I wanted to share how to use our Visual Analytics’ out-of-the-box relative period calculations to solve it.

Essentially, we had a customer who wanted to see a metric for every month, the previous month’s value next to it, and lastly the difference between the two.

Relative Period Report in SAS Visual Analytics

To do this in SAS Visual Analytics (available in versions 7.3 and above), use the relative period operators. I am going to use the Mega_Corp data, which has a date data item called Date by Month using the format MMMYYYY. SAS Visual Analytics supports relative period calculations for month, quarter, and year.
The first two columns, circled in red, are straight from the data. The metric we are interested in for this report is Profit.

Next, we will create the last column, Profit (Difference from Previous Period), which is an aggregated measure that uses the periodic operators.

From the Data pane, select the metric used in the list table, Profit. Then right-click on Profit and navigate the menus: Create / Difference from Previous Period / Using: Date by Month.

A new aggregated measure will be created for you:

If you right-click on the aggregated measure and select Edit Aggregated Measure…, you will see this relative period calculation, where it is taking the current period (notice the 0) minus the value for the previous period (notice the -1).

Okay – that’s it. This out-of-the-box relative period calculation is ready to be added to the list table. Notice the other Period Operators available in the list. These support SAS Visual Analytics’ additional out-of-the-box aggregated measure calculations such as the Difference between Parallel Periods, Year to Date cumulative calculations, etc.

Now we have to create the final column to meet our report requirement: the Previous Period column.

To do this we are going to leverage the out-of-the-box functionality of the relative period calculation. Since this aggregated measure calculates the previous period for the subtraction – let’s use this to our advantage.

Duplicate the out-of-the-box relative period calculation by right-clicking on Profit (Difference from Previous Period) and select Duplicate Data Item.

Then right-click on the new data item, and select Edit Aggregated Measure….

Now delete everything highlighted in yellow below, remember to also delete the minus sign. And give the data item a new name. Click OK. This will create an aggregated measure that will calculate the previous period.

The final result should look like this from either the Visual tab or Text tab:

Now we have all the columns to meet our report requirement:

Now that I’ve piqued your interest, I’m sure you are wondering whether you could use this technique to create aggregated data items for period offsets of -1, -2, -3, and so on. YES! This is absolutely possible.
Also, I went ahead and plotted the Difference from Previous Period on a line chart. This is an extremely useful visualization for gauging whether the variance between periods is acceptable. You can easily assign display rules to this visualization to flag any periods that may need further investigation.

Relative Period Report in SAS Visual Analytics was published on SAS Users.

12月 21 2017
 

With SAS Data Management, you can set up SAS Data Remediation to manage and correct data issues. SAS Data Remediation allows user- or role-based access to data exceptions.

When a data issue is discovered, it can be sent automatically or manually to a remediation queue where it can be corrected by designated users. The issue can be fixed from within SAS Data Remediation without needing to go to the affected source system. For more efficiency, the remediation process can also be linked to a purpose-designed workflow.

Setting up a remediation process that allows you to correct data issues from within SAS Data Remediation involves a few steps:

  • Set up Data Management jobs to retrieve and correct data in remediation.
  • Set up a workflow to control the remediation process.
  • Register the remediation service.

Set up Data Management jobs to retrieve and correct data in remediation

To correct data issues from within SAS Data Remediation, we need two real-time data management jobs: a retrieve job that reads the record in question to populate its data in the remediation UI, and a send job that writes the corrected data back to the data source (or to a staging area first).

Retrieve and send job

If the following remediation fields are available in the retrieve or send job’s External Data Provider node, data will be passed to these fields. The field values can be used to identify and work with the correct record:

  • REM_KEY (internal field that stores the issue record ID)
  • REM_USERNAME (the current remediation user)
  • REM_ITEM_NAME
  • REM_ISSUE_NAME
  • REM_APPLICATION
  • REM_SUBJECT_AREA
  • REM_PACKAGE_NAME
  • REM_STATUS
  • REM_ASSIGNEE

Retrieve Action

The “retrieve” action occurs when the issue is opened in SAS Data Remediation. Data Remediation will only pass REM_ values to the data management job if the fields are present in the External Data Provider node. Although the REM_ values are the only way the data management job can communicate with SAS Data Remediation, they are not all required; you can include just the fields you need in the External Data Provider node.

The job’s output fields will be displayed in the Remediation UI as edit fields for correcting the issue record. It’s best to use a Field Layout node as the last job node to pass out the wanted fields with the desired labels.

Note: The retrieve job should only return one record.

A simple example of a retrieve job would be to have the issue record id coming from REM_KEY into the data management job to select the record from the source system.
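To make that concrete, here is a minimal sketch of the kind of lookup such a retrieve job performs. The table and column names are hypothetical, and in a real data management job the REM_KEY value arrives through the External Data Provider node rather than as a macro variable:

proc sql;
   /* illustrative only: look up the issue record identified by REM_KEY */
   select customer_id, name, address
     from source.customers
    where customer_id = "&REM_KEY";
quit;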

Send Action

The “send” action occurs when pressing the “Commit Changes” button in the Data Remediation UI. All REM_ values, in addition to the output fields of the retrieve job (the issue record edit fields), are passed to the send job. The job will receive values for those fields present in its External Data Provider node.

The send job can now work with the remediation record and save it to a staging area or submit it to the data source directly.

Note: Only one row will be sent as an input to the send job. Any data values returned by the send job will be ignored by Data Remediation.

Move jobs to the Data Management Server

When both jobs are written and tested, you need to move them to the Data Management Server, into a Real-Time Data Services sub-directory, for Data Remediation to call them.

When Data Remediation calls the jobs, it uses the credentials of the user logged on to Data Remediation. Therefore, you need to make sure that the jobs on the Data Management Server have been granted the right permissions.

Set up a Workflow to control the remediation process

Although you don’t need to involve a workflow in SAS Data Remediation, using one can improve efficiency.

You can design your own workflow using SAS Workflow Studio, or you can use a prepared workflow that comes with Data Remediation. You need to make sure the desired workflow is loaded onto the Workflow Server so you can link it to the Data Remediation service.

Using SAS Workflow will help you to better control Data Remediation issues.

Register the remediation service

We can now register our remediation service in SAS Data Remediation. To do this, we go to the Data Remediation Administrator and select “Add New Client Application.”

Under Properties we supply an ID, which can be the name of the remediation service as long as it is unique, and a Display name, which is the name showing in the Remediation UI.

Next, we set up the edit UI for the issue record. Under Issue User Interface we go to: Use default remediation UI…. Using Data Management Server:

The Server address is the fully qualified address for Data Management Server including the port it is listening on. For example: http://dmserver.company.com:21036.

The Real-time service to retrieve item attributes and Real-time service to send item attributes fields need to point to the retrieve and send jobs, respectively, on the Data Management Server, including the job suffix .ddf as well as any directories under Real-Time Data Services where the jobs are stored.

Under the Subject Area tab, we can register different subject categories for this remediation service. When calling the remediation service, we can categorize different remediation issues by setting different subject areas.

Under the Issues Types tab, we can register issue categories. This enables us to categorize the different remediation issues.

At Task Templates/Select Templates you can set the workflow to be used for a particular issue type.

After saving the remediation service, you are ready to use it. You can now assign data issues to the remediation service to efficiently correct the data and improve your data quality from within SAS Data Remediation.

Manage remediation issues using SAS Data Management was published on SAS Users.

12月 20 2017
 

While SAS program development is usually done in an interactive SAS environment (SAS Enterprise Guide, SAS Display Manager, SAS Studio, etc.), when it comes to running SAS programs in a production or operations environment, it is routinely done in batch mode.

Why run SAS programs in batch mode?

First and foremost, this is done for automation, as the batch process does not require human participation at run time. It can be scheduled to run (using an operating system scheduler or other scheduling software) while we sleep, at any time of the day, or at any time interval between two consecutive runs.

Running SAS programs in batch mode streamlines SAS processing by eliminating the possibility of human error and by submitting multiple SAS jobs (programs) all at once or in a sequence that secures program and/or data dependencies.

SAS batch processing is also self-documenting: it automatically generates and stores SAS logs and outputs.

Imagine the following scenario. Every night, a SAS batch process “wakes up” at 3 a.m. and runs an ETL process on a SAS Application server that extracts multiple tables from a database, transforms, combines, and loads them into a SAS datamart; then moves some data tables across the network and loads them into SAS LASR server, so when you are back to work in the morning your SAS Visual Analytics application has all its data refreshed and ready to roll. Of course, the process schedule can be custom-tailored to your particular needs; your batch jobs may run every 15 minutes, once a week, every first Friday of the month – you name it.

What is a batch script file?

To submit a single SAS program in batch mode manually, you could submit an OS command that looks something like the following:

Unix/Linux

sas /sas/code/proj1/job1.sas -log /sas/code/proj1/job1.log

DOS/Windows

"C:\Program Files\SASHome\SASFoundation\9.4\Sas.exe" -SYSIN c:\proj1\job1.sas -NOSPLASH -ICON -LOG c:\proj1\job1.log

However, submitting an OS command manually has too many drawbacks: it’s too much typing, it only submits one SAS program at a time, and most importantly – it is manual, which means it is prone to human error.

Usually, these OS commands are packaged into so-called batch files (shell scripts in Unix) that allow for sequential, parallel, and conditional execution of multiple OS commands. They can be run either manually or automatically (on schedule), or called by other batch scripts.

In a Windows/DOS Operating System, these script files are called batch files and have .bat filename extensions. In Unix-like operating systems, such as Linux, these script files are called shell scripts and have .sh filename extensions.

Since Windows batch files are similar to, but slightly different from, Unix shell scripts (and those of its open-source cousin Linux), the examples below use Unix/Linux shell scripts only, in order to avoid any confusion. We will use the terms Unix and Linux interchangeably.

Here is the typical content of a Linux shell script file to run a single SAS program:

#!/usr/bin/sh
# build a date/time stamp for a unique log file name
dtstamp=$(date +%Y.%m.%d_%H.%M.%S)
pgmname="/sas/code/project1/program1.sas"
logname="/sas/code/project1/program1_$dtstamp.log"
# run SAS in batch mode, writing the log to the time-stamped file
/sas/SASHome/SASFoundation/9.4/sas $pgmname -log $logname

Note that the shell script syntax allows for some basic programming features like a current datetime function, formatting, and variables. It also provides conditional processing similar to “if-then-else” logic. For detailed information on the shell scripting language, you may refer to the following BASH shell script tutorial or any other source covering the many dialects or flavors of shell scripting (C shell, Korn shell, etc.).

Let’s save the above shell script as the following file:
/sas/code/project1/program1.sh

How to submit a SAS program via Unix script

In order to run this shell script we would submit the following Linux command:
/sas/code/project1/program1.sh

Or, if we navigate to the directory first:
cd /sas/code/project1

then we can submit an abbreviated Linux command
./program1.sh

When run, this shell script not only executes a SAS program (program1.sas), but for every run it also creates and saves a uniquely named SAS log file. You may create the SAS log file in the same directory where the SAS code is stored, as specified in the shell script above, or specify another directory of your choice.

For example, it creates the following SAS log file:
/sas/code/project1/program1_2017.12.06_09.15.20.log

The file name uniqueness is achieved by adding a date/time stamp suffix between the SAS program name and .log file name extension, in this particular case indicating that this SAS log file was created on December 6, 2017, at 09:15:20 (hours:minutes:seconds).
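You can preview the stamp format at a shell prompt; the output below corresponds to the log file name above:

$ date +%Y.%m.%d_%H.%M.%S
2017.12.06_09.15.20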

Unix script for submitting multiple SAS programs

Unix scripts may contain not only OS commands, but also calls to other Unix scripts. You can mix and match OS commands and script calls.

When scripts are created for each individual SAS program that you intend to run in a batch, you can easily combine them into a program flow by creating a flow script containing those single program scripts. For example, let’s create a script file /sas/code/project1/flow1.sh with the following contents:

/sas/code/project1/program1.sh
/sas/code/project1/program2.sh
/sas/code/project1/program3.sh

When submitted as

/sas/code/project1/flow1.sh

it will sequentially execute three scripts - program1.sh, program2.sh, and program3.sh, each of which will execute the corresponding SAS program - program1.sas, program2.sas, and program3.sas, and produce three SAS logs - program1.log, program2.log, and program3.log.
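As mentioned earlier, scripts also allow for parallel execution. If, say, program1 and program2 were independent of each other, a minimal variation of flow1.sh could run them concurrently and make program3 wait for both to finish:

#!/usr/bin/sh
# run the two independent programs in parallel (& sends each to the background)
/sas/code/project1/program1.sh &
/sas/code/project1/program2.sh &
# wait for both background jobs to complete before the dependent step
wait
/sas/code/project1/program3.sh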

Unix script file permissions

In order to be executable, UNIX script files must have certain permissions. If you create the script file and want to execute it yourself only, the file permissions can be as follows:

-rwxr-----, or 740 in octal representation.

This means that you (the Owner of the script file) have Read (r), Write (w), and Execute (x) permissions; the Group owning the script file has only Read (r) permission; and Others have no permissions on the script file at all.

If you want to give yourself (Owner) and Group execution permissions, then your script file permissions can be:

-rwxr-x---, or 750 in octal representation.

In this case, your group has Read (r) and Execute (x) permissions.

In Unix, file permissions are assigned using the chmod Unix command.
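For example, to set the 750 permissions described above on the flow script:

chmod 750 /sas/code/project1/flow1.sh
# or equivalently, using symbolic notation
chmod u=rwx,g=rx,o= /sas/code/project1/flow1.sh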

Note that in both examples above we do not give Others any permissions at all. Remember that file permissions are a security feature, and you should assign them at the minimum level necessary.

Conditional execution of scripts and SAS programs

Here is an example of a Unix script file that allows running multiple SAS programs and OS commands at different times.

#!/bin/sh

#1 extract data from a database
/sas/code/etl/etl.sh

#2 copy data to the Visual Analytics autoload directory
scp -B userid@sasAPPservername:/sas/data/*.sas7bdat userid@sasVAservername:/sas/config/.../AutoLoad

#3 run weekly, every Monday
dow=$(date +%w)
if [ $dow -eq 1 ]
then
   /sas/code/alerts_generation.sh
fi

#4 run monthly, first Friday of every month
dom=$(date +%d)
if [ $dow -eq 5 -a $dom -le 7 ]
then
   /sas/code/update_history.sh
   /sas/code/update_transactions.sh
fi

In this script, the following logical operators are used: -eq (equal), -le (less or equal), -a (logical and).

As you can see, the script logic takes care of branching to execute different SAS programs when certain timing conditions are met. With such an approach, you would need to schedule only this single script to run at a specified time/interval, say daily at 3 a.m.

In this case, the script will “wake up” every morning at 3 a.m. and execute its component scripts either unconditionally, or conditionally.

If one of the included programs needs to run at a different, lesser frequency (e.g. every Monday, or monthly on first Friday of every month) the script logic will trigger those executions at the appropriate times.

In the above script example, steps #1 and #2 will execute unconditionally every time the script runs (daily). Step #1 runs the ETL program to extract data from a database; step #2 copies the extracted data across the network from the SAS Application server to the SAS LASR Analytic server’s drop zone, from where it is automatically loaded (autoloaded) into the LASR.

Step #3 will run conditionally every Monday ($dow -eq 1). Step #4 will run conditionally every first Friday of a month ($dow -eq 5 -a $dom -le 7).

For more information on how to format date for use in shell scripts please refer to this post.

Do you run your SAS programs in batch?

Please share your batch experiences in the comment section below. I am sure the rest of us will really appreciate it!

Running SAS programs in batch under Unix/Linux was published on SAS Users.

12月 19 2017
 

In my last article, Managing SAS Configuration Directory Security, we stepped through the process for granting specific users more access without opening up access to everyone. One example addressed how to modify security for autoload. There are several other aspects of SAS Visual Analytics that can benefit from a similar security model.

You can maintain a secure environment while still providing one or more select users the ability to:

  • start and stop a SAS LASR Analytic Server.
  • load data to a SAS LASR Analytic Server.
  • import data to a SAS LASR Analytic Server.

Requirements for these types of users fall into two areas: metadata and operating system.

The metadata requirements are very well documented and include:

  • an individual metadata identity.
  • membership in appropriate groups (for example: Visual Analytics Data Administrators for SAS Visual Analytics suite level administration; Visual Data Builder Administrators for data preparation tasks; SAS Administrators for platform level administration).
  • access to certain metadata (refer to the SAS Visual Analytics 7.3: Administration Guide for metadata permission requirements).

Operating System Requirements

Users who need to import data, load data, or start a SAS LASR Analytic Server need the ability to authenticate to the SAS LASR Analytic Server host and write access to some specific locations.

If the SAS LASR Analytic Server is distributed, users need passwordless SSH access to all machines in the cluster.

If the compute tier (the machine where the SAS Workspace Server runs) is on Windows, users need the Log on as a batch job user right on the compute machine.

In addition, users need write access to the signature files directory, the path for the last action logs for the SAS LASR Analytic Server, and the PIDs directory in the monitoring path for the SAS LASR Analytic Server.

Signature Files

There are two types of signature files: server signature files and table signature files. Server signature files are created when a SAS LASR Analytic Server is started. Table signature files are created when a table is loaded into memory. The location of the signature files for a specific SAS LASR Analytic Server can be found on the Advanced properties of the SAS LASR Analytic Server in SAS Management Console.

SAS Configuration Directory Security for SAS Visual Analytics

On Linux, if your signature files are in /tmp you may want to consider relocating them to a different location.

Last Action Logs and the Monitoring Path

In the SAS Visual Analytics Administrator application, logs of interactive actions for a SAS LASR Analytic Server are written to the designated last action log path. The standard location is on the middle tier host in <SAS_CONFIG_ROOT>/Lev1/Applications/SASVisualAnalytics/VisualAnalyticsAdministrator/Monitoring/Logs. The va.lastActionLogPath property is specified in the SAS Visual Analytics suite-level properties. You can access the SAS Visual Analytics suite-level properties in SAS Management Console under the Configuration Manager: expand SAS Application Infrastructure, right-click on Visual Analytics 7.3 to open the properties, and select the Advanced tab.

The va.monitoringPath property specifies the location of certain monitoring process ID files and logs. The standard location is on the compute tier in <SAS_CONFIG_ROOT>/Lev1/Applications/SASVisualAnalytics/VisualAnalyticsAdministrator/Monitoring/. This location includes two subdirectories: Logs and PIDs. You can override the default monitoring path by adding the va.monitoringPath extended attribute to the SAS LASR Analytic Server properties.

Host Account and Group

For activities like starting the SAS LASR Analytic Server you might want to use a dedicated account such as lasradm or assign the access to existing users. If you opt to create the lasradm account, you will need to also create the related metadata identity.

For group level security on Linux, it is recommended that you create a new group, for example sasusers, to reserve the broader access provided by the sas group to only platform level administrators. Be sure to include in the membership of this sasusers group any users who need to start the SAS LASR Analytic Server or that need to load or import data to the SAS LASR Analytic Server.

Since the last action log path, the monitoring path, and the autoload scripts location all fall under <SAS_CONFIG_ROOT>/Lev1/Applications/SASVisualAnalytics/VisualAnalyticsAdministrator, you can modify the ownership of this folder to get the right access pattern.

A similar pattern can also be applied to the back-end store location for the data provider library that supports reload-on-start.

Don’t forget to change the ownership of your signature files location too!

SAS Admin Notebook: Managing SAS Configuration Directory Security for SAS Visual Analytics was published on SAS Users.

12月 19 2017
 

Compressing a data set is a process that reduces the number of bytes that are required to represent each observation in a file. You might choose to enable compression to reduce the storage requirements for the file and to lessen the number of I/O operations that are needed to read from or write to the data during processing.

Compression is enabled by the COMPRESS= system option, the COMPRESS= option in the LIBNAME statement, and the COMPRESS= data set option. The COMPRESS= system option compresses all data sets that are created during a SAS session, and the COMPRESS= option in the LIBNAME statement compresses all data sets for a particular SAS® library. The COMPRESS= data set option is the most popular of these methods because you compress data sets individually as they are created.
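For reference, here is a minimal sketch of the first two methods (the library path is hypothetical; the data set option is shown in the example that follows):

options compress=yes;                       /* session-wide: all data sets created in this session */
libname mylib '/data/mylib' compress=yes;   /* library-wide: all data sets written to this library */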

The COMPRESS= data set option can be set to CHAR (or YES), NO, or BINARY. The following example illustrates using COMPRESS=YES:

data new(compress=yes);
set old;
run;

 

While compression is a useful tool in your programming toolbox, it isn't a tool that you should use on every data set. When you request compression by using the COMPRESS= option, SAS considers the following information:

  • the header information of the data set, to determine how many variables are in the program data vector
  • whether the variables are character or numeric
  • the lengths of the variables

SAS doesn't consider data values at all. The compression overhead for Microsoft 32-bit Windows and 64-bit Windows is 12 bytes, whereas 64-bit UNIX hosts require 24 bytes of overhead. When SAS determines that it is possible to recoup the 12 or 24 bytes of overhead per observation that compression requires, then SAS attempts to compress the data. If that 12 or 24 bytes per observation can't be recouped, the data set size is increased when the compression is completed. So, you should determine ahead of time whether your data set is a good candidate for compression.

In the following example, a data set is created in the Windows operating environment with two variables having lengths, respectively, of 3 and 5 bytes. Because it is impossible to recoup the 12 bytes that are needed per observation for compression overhead, SAS automatically disables compression and a note is written to the SAS log that indicates the same.

571  data new(compress=char);
572     x='abc';
573     y='defgh';
574  run;
 
NOTE: Compression was disabled for data set WORK.NEW because compression overhead would increase
      the size of the data set.
NOTE: The data set WORK.NEW has 1 observations and 2 variables.

 

The compression process doesn’t recognize individual variables within an observation. Instead, the process sees each observation as a large collection of bytes that run together end to end. In the COMPRESS= data set option, you enable compression by specifying either CHAR (YES) or BINARY. These values differ slightly in the types of data values that they target for compression.

Using the COMPRESS=CHAR|YES option

Specifying COMPRESS=CHAR (or YES) targets data with repeating single characters and variables with stored lengths that are longer than most of the values. As a result, blank spaces pad the end of values that are shorter than the number of bytes of storage.
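As a quick sketch, the following data set is a good candidate for CHAR compression because most of each 50-byte value consists of trailing blanks (the variable name and values are made up for illustration):

data work.padded(compress=char);
   length name $50;               /* stored length far longer than most values */
   do i=1 to 100000;
      name='abc';                 /* 3 characters plus 47 trailing blanks */
      output;
   end;
run;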

In thinking about conserving space, customers often shorten the storage lengths of variables by using a LENGTH statement. When you shorten the lengths of your variables, you remove the best opportunity for SAS to compress. For example, if a numeric variable can be stored accurately in 4 bytes, the remaining 4 bytes (in an 8-byte variable) will all be zeros. This situation is perfect for compression. However, when you shorten the length to 4 bytes, the layout of the value is no longer suitable for compression. The only reason to truncate the storage length by using the LENGTH statement is to save disk space. All values are expanded to the full size of 8 bytes in the program data vector to perform computations in DATA and PROC steps. You'll use extra CPU resources to uncompress the data set as well as to expand variables back to 8 bytes.

Using the COMPRESS=BINARY option

When you use COMPRESS=BINARY, patterns of multiple characters across the entire observation are compressed. Binary compression uses two techniques at the same time. This option searches for the following:

  1. Repeating byte sequences (for example, 10 blank spaces or 10 zero bytes in a row)
  2. Repeating byte patterns (for example, the repeated pattern in the hexadecimal value 0102030405FAF10102030405FBF20102030405FCF3)

With that in mind, you can see that the bytes in a numeric variable are just as likely to be compressed as those in a character variable because the compression process does not consider those bytes to be numeric or character. They are just viewed as bytes. Consider a missing value that is represented in hexadecimal notation as FFFF000000000001. In the middle of that value is a string of five zero bytes (0x00) that can be replaced by two compression code-bytes. So, what starts as a sequence of 8 bytes ends up as a sequence of 5 bytes.

Keep in mind

As mentioned earlier, although compression saves space and is a great tool to keep handy in your SAS toolbox, it’s not meant for all your data sets. Some data sets are not going to compress well and the data set will grow larger, so know your data. Also, you’ll want to consider the extra CPU resources that are required to read a compressed file due to the overhead of uncompressing each observation.

What can compression do for you? was published on SAS Users.

12月 16 2017
 

“Alexa, how many ounces are in a 750-ml bottle of wine?”
“Waze, how many miles between Cary and Kannapolis, NC?”
“Google, which NFL teams are favored to win this weekend?”

Every day many of us turn to some sort of search vehicle to help us solve a work issue or win a “discussion” with a friend. In fact, the word “Google”  has now become a verb in our lexicon. Don’t know what PROC BCHOICE does? Google it!

Here at SAS, there are over 30,000 pages on the support.sas.com site, so we recognize how important an efficient search can be for our users. As part of our continuing effort to evolve the site and provide you with the best possible user experience, we’re pleased to announce that this week we launched a beta version of a new search engine. We believe the new search delivers an improved experience, but we’ll let you be the judge.

We want you to give the new search a try and provide your feedback. Type a few terms into the search box and let us know what you think. You’ll find a “Feedback” link in the right-hand column of each page. Click it, complete the short form, and give us your opinions. We want to know what works well for you and what doesn’t, and which filters you see yourself using and which you think aren’t as useful. Is there a feature you would like to see and don’t? Please use the search preview as often as you like, and feel free to give us feedback every time you do. But hurry, because the search preview will only run until the end of December.

We’ll use your feedback to put the finishing touches on our search functionality. Look for our first production release in early 2018! We want the support site search enhancements to be as important in helping you use SAS as Google is in helping you become the champion of your fantasy football team!

Thanks for your help and happy holidays!

Click the image to launch; type in the "Search" box; click "Feedback" to share your comments

 

Test-Drive SAS' new Support site search was published on SAS Users.

12月 13 2017
 

SAS Data Preparation 2.1 is now available and it includes the ability to perform data quality transformations on your data using the definitions from the SAS Quality Knowledge Base (QKB).

The SAS Quality Knowledge Base is a collection of files which store data and logic that define data quality operations such as parsing, standardization, and generating match codes to facilitate fuzzy matching based on geographic locales. SAS software products reference the QKB when performing data quality transformations on your data.  These products include: SAS Data Integration Studio, SAS DataFlux Data Management Studio/Server, SAS code via dqprocs, SAS MDM, SAS Data Loader for Hadoop, SAS Event Stream Processing, and now SAS Data Preparation which is powered by SAS Viya.

Out-of-the-box QKB definitions include the ability to perform data quality operations on items such as Name, Address, Phone, and Email.

SAS Data Preparation 2.1

SAS Data Preparation – Data Quality Transformations

The following are the data quality transformations available in SAS Data Preparation:

  • Casing – case a text string in upper, lower, or proper case. Example using the Proper (Organization) case definition – input: sas institute   output: SAS Institute.
  • Parsing – break up a text string into its tokens. Example using the Name parse definition – input: James Michael Smith   output: James (Given Name token), Michael (Middle Name token), and Smith (Family Name token).
  • Field extraction – extract relevant tokens from a text string. Example using a custom created extraction definition for Clothing information – input: The items purchased were a small red dress and a blue shirt, large   output: dress; shirt (Item token), red; blue (Color token), and small; large (Size token).
  • Gender analysis – guess the gender of a text string. Example using the Name gender analysis definition – input: James Michael Smith   output: M (abbreviation for Male).
  • Identification analysis – guess the type of data for a text string. Example using the Contact Info identification analysis definition:
  • Match codes – generate a code to fuzzy match text strings. Example using the Name match definition at a sensitivity of 85:
    For more information on match codes, view this YouTube video on The Power of the SAS® Matchcode.
  • Standardize – put a text string into a common format. Example using the Phone standardization definition – input: 9196778000   output: (919) 677 8000.

Note:  While all the examples above are using definitions from the English (United States) locale in the SAS Quality Knowledge Base for Contact Information, QKBs are available for dozens of locales.
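If you use the “SAS code via dqprocs” route mentioned earlier, a minimal sketch in SAS 9 data quality code might look like the following. It assumes the ENUSA locale of the QKB for Contact Information has been made available at your site (the setup path is site-specific):

/* load the English (United States) locale; the QKB path is site-specific */
%dqload(dqlocale=(ENUSA), dqsetuploc='<path to QKB>');

data _null_;
   length std $20 gender $1;
   std    = dqStandardize('9196778000', 'Phone');     /* (919) 677 8000 */
   gender = dqGender('James Michael Smith', 'Name');  /* M */
   put std= gender=;
run;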

You can also customize the definitions in the QKB using SAS DataFlux Data Management Studio. This allows you to update the out-of-the-box QKB definitions or create your own data types and definitions to suit your project needs. For example, you may need to create a definition to extract clothing information from a free-form text field, as shown in the Field extraction example above. These customized definitions can then be used in SAS Data Preparation as part of your data quality transformations. For more information on customizing the QKB, you can view this YouTube video.

For more information on the SAS Quality Knowledge Base (QKB), you can view its documentation.

SAS Data Preparation 2.1: Data quality transformations was published on SAS Users.

12月 13 2017
 

PROC FREQ is one of the most popular procedures in the SAS language. It is mostly used to describe the frequency distribution of a variable or combination of variables in contingency tables. However, PROC FREQ has much more functionality than that. For an overview of all that it can do, see the introduction in the SAS documentation. SAS Viya does not have PROC FREQ, but that doesn’t mean you can’t take advantage of this procedure when working with BIG DATA. SAS 9.4m5 integration with SAS Viya allows you to summarize the data appropriately in CAS, and then pass the resulting summaries to PROC FREQ to get any of the statistics that it is designed to produce, from your favorite SAS 9.4 coding client. In this article, you will see how easy it is to work with PROC FREQ in this manner.

These steps are necessary to accomplish this:

  1. Define the CAS environment you will be using from a SAS 9.4m5 interface.
  2. Identify the variables and/or combinations of variables that define the dimensionality of the contingency tables.
  3. Load the table that needs to be summarized into memory.
  4. Summarize the data with a CAS-enabled procedure; use PROC FEDSQL for high-cardinality cases.
  5. Write the appropriate PROC FREQ syntax, using the WEIGHT statement to input cell count data.

Step 1: Define the CAS environment

Before you start writing any code, make sure that there is an _authinfo file in the appropriate user directory in your system (for example, c:\users\<userid>) with the following information:

host <cas server name> port <number> user "<cas user id>" password "<cas user password>"
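For example, with hypothetical values filled in:

host cas01.example.com port 5570 user "casuser1" password "MySecretPwd"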

This information can be found by running the following statement in the SAS Viya environment from SAS Studio:

cas;  
caslib _all_ assign;

 

Then, in your SAS 9.4m5 interface, run the following program to define the CAS environment that you will be working on:

options cashost="<cas server name>" casport=<port number>;
cas mycas user=<cas user id>;
libname mycas cas;
/** Set the in-memory shared library if you will be using any tables already promoted in CAS **/
libname public cas caslib=public;

 

PROC FREQ FOR BIG DATA

Figure 1

Figure 1 shows the log and libraries after connecting to the CAS environment that will be used to deal with Big Data summarizations.

Step 2: Identify variables

The variables here are those that will be used in the TABLE statement in PROC FREQ. Also, any numeric variable that is not part of the TABLE statement can be used to determine the input cell count.

Step 3: Load table to memory

There are two options for doing this. The first is to load the table directly into the PUBLIC library in the SAS Viya environment. The second is to load it from your SAS 9.4m5 environment into CAS. Loading data into CAS can be as easy as writing a DATA step or using other, more efficient methods depending on where the data resides and its size. This is out of the scope of this article.

Step 4: Summarize the data with a CAS-enabled procedure

Summarizing data for many cross tabulations can become very computationally expensive, especially with Big Data. SAS Viya offers several ways to accomplish this (for example, PROC MEANS, PROC MDSUMMARY, PROC FEDSQL). There are ways to optimize performance for high-cardinality summarization when working with CAS. PROC FEDSQL is my personal favorite since it has shown good performance.

Step 5: Write the PROC FREQ

When writing the PROC FREQ syntax, make sure to use the WEIGHT statement to instruct the procedure to use the appropriate cell counts in the contingency tables. All features and functionality of PROC FREQ are available here, so use it to the max! The DATA= option can point to the CAS library where your summarized table is, so there will be some overhead for data transfer to the SAS 9 work server. But most of the time the summarized table is small enough that the transfer will not take long. An alternative to letting PROC FREQ do the data transfer is running a DATA step to bring the data from CAS into a SAS base library in SAS 9 and then running PROC FREQ with that table as the input.
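That alternative is a simple DATA step followed by PROC FREQ against the local copy; a minimal sketch, reusing the library and table names from the example below:

data work.sql_sum;
   set mycas.sql_sum;   /* copy the CAS summary table into the SAS 9 WORK library */
run;

proc freq data=work.sql_sum;
   weight freq;         /* use the summarized cell counts */
   table date*target;
run;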

Example

Figure 2 below shows a sample program on how to take advantage of CAS from SAS 9.4m5 to produce different analyses using PROC FREQ.

PROC FREQ for big data

Figure 2

 

Figure 3 shows the log of the summarization in CAS for a very large table: 1:05.97 minutes for over 571 million records with a dimensionality of 163,964 (all possible combinations of the values of feature, date, and target). The PROC FREQ step shows three different ways to use the TABLE statement to produce the desired output from the summarized table, which took only 1.22 seconds in the SAS 9.4m5 environment.

PROC FREQ for big data

Figure 3

Program

 

/** Connect to Viya Environment **/
options cashost="xxxxxxxxxxx.xxx.xxx.com" casport=5570;
cas mycas user=calara;
libname mycas cas datalimit=all;
 
/** Set the in memory shared library **/
libname public cas caslib=public datalimit=all;
 
/** Define macro variables **/
/* CAS Session where the FEDSQL will be performed */
%let casref=mycas;
 
/* Name of the output table of the frequency counts */
%let outsql=sql_sum;
 
/* Variables for cross classification cell counts */
%let dimvars=date, feature, target;
 
/* Variable for which frequencies are needed */
%let cntvar=visits;
 
/* Source table */
%let intble=public.clicks_summary;
 
proc FEDSQL sessref=&casref.;
	create table &outsql. as select &dimvars., count(&cntvar.) as freq 
		from &intble. group by &dimvars.;
	quit;
run;
 
proc freq data=&casref..&outsql.;
	weight freq;
	table date;
	table date*target;
	table feature*date / expected cellchi2 norow nocol chisq noprint;
	output out=ChiSqData n nmiss pchi lrchi;
run;

PROC FREQ for Big Data was published on SAS Users.

12月 07 2017
 

Authorization determines what a user can see and do in an application. An authorization system is used to define access control policies, and those policies are later enforced so that access requests are granted or denied. To secure resources in SAS Viya there are three authorization systems of which you need to be aware.

The General Authorization system secures folders within the SAS Viya environment and the content of folders, for example, reports or data plans. It can also secure access to functionality.

The CAS Authorization system secures CAS libraries, tables (including rows and columns) and CAS actions.

The File system authorization system secures files and directories on the OS, for example code or source tables for CAS libraries.

In this post, I will provide a high-level overview of the General and CAS authorization systems.  If you want to dig into more detail please see the SAS Viya Administration Guide Authorization section.

An important factor in authorization is the identity of the user making the request. In a visual deployment of SAS Viya the General and CAS authorization systems use the same identity provider, an LDAP server. The other common feature between the two authorization systems is that they are deny-based. In other words, access to resources is implicitly disallowed by default. This is important as it marks a change for those familiar with metadata authorization in SAS 9.4.

You can administer both general and CAS authorization from SAS Environment Manager. CAS authorization may also be administered from CAS Server Monitor and from the programming interfaces via the accessControl action set. In SAS Viya 3.3, command-line interfaces are available to set authorization for both systems.

General Authorization System

The general authorization system is used to administer authorization for folders, content and functionality. The general authorization system is an attribute based authorization system which determines authorization based on the attributes of the:

  • Subject: attributes that describe the user attempting the access (e.g., user, group, department, role, job title).
  • Action: attributes that describe the action being attempted (e.g., read, delete, view, update).
  • Resource (or object): attributes that describe the object being accessed (e.g., the object type, location).
  • Context (environment): attributes that deal with time, location, etc.

This attribute model supports Boolean logic, in which rules contain “IF, THEN” statements about who is making the request, the resource, the context and the action.

In the general authorization system, information about the requesting user, the target resource, and the environment can influence access. Each access request has a context that includes environmental data such as time and device type. Environmental constraints can be incorporated using conditions.

Permission inheritance in the general authorization system flows through a hierarchy of containers. Each container conveys settings to its child members. Each child member inherits settings from its parent container. Containers can be folders or REST endpoints that contain functionality. For example, a folder will pass on its permissions to any children, which can be additional sub-folders or content such as reports or data plans.

In the general authorization system, precedence (the way in which permission conflicts are resolved) is very simple: the only factor that affects precedence is the type of rule (grant or prohibit). Prohibit rules have absolute precedence. If there is a relevant prohibit rule, effective access is always prohibited.

So, a deny anywhere in the object or identity hierarchy for the principal (user making the request) means that access is denied.

CAS Authorization

The CAS authorization system mostly focuses on data. It makes sense, therefore, that it is implemented in the style of a database management system (DBMS). DBMS-style authorization systems focus on securing access to data. The permissions relate to data (for example, read, write, update, select) and include some CAS-specific ones like promote and limited promote.

Permission inheritance in the CAS authorization system flows through a hierarchy of objects. The hierarchy is CASLIB > table > rows/columns.

In the CAS authorization system, precedence (the way in which permission conflicts are resolved) is determined by where an access control is set and who the access control is assigned to. The precedence rules are:

  • Direct access controls have precedence over inherited settings.
  • Identity precedence from highest to lowest is user > groups > authenticated users.

To put this another way: if a direct access control is found on an object, it determines access. If multiple direct access controls are found, a control for a user is used over a control for a group, which in turn is used over a control for all authenticated users. If no direct access control is found on, for example, a table, settings are inherited from the CASLIB in a similar manner.

A final note on CASLIBs: there may be additional authorization that needs to be considered to provide access to data. For example, for a path-based CASLIB, if host-layer access requirements are not met, grants in the CAS authorization layer do not provide access.

This has been a brief look at the two authorization systems in a SAS Viya environment. The table below summarizes some of the information in this blog.

authorization in SAS Viya

As I noted at the start, you can find much more detail in the SAS Viya Administration Guide.

An introduction to authorization in SAS Viya was published on SAS Users.

12月 07 2017
 

In SAS Viya deployments, identities are managed by the environment’s configured identity provider. In visual SAS Viya deployments, the identity provider must be an LDAP (Lightweight Directory Access Protocol) server. Initial setup of a SAS Viya deployment requires configuration to support reading the identity information (users and groups) from LDAP. SAS Viya 3.3 adds support for multi-tenancy, which has implications for the way users and groups must be stored in LDAP. For the SAS administrator, this means at least a rudimentary understanding of LDAP is required. In this blog post, I will review some key LDAP basics for the SAS Viya administrator.

A basic understanding of LDAP ensures SAS Viya administrators can speak the same language as the LDAP administrator.

What is LDAP?

LDAP is a lightweight protocol for accessing directory servers. A directory server is a hierarchical, object-oriented database. An LDAP server can be used to organize anything. One of the most common uses is as an identity management system, to organize user and group membership.

LDAP directories are organized in a tree manner:

  • A directory is a tree of directory entries.
  • An entry contains a set of attributes.
  • An attribute has a name, and one or more values.

LDAP basics for the SAS Viya administrator

Below is an entry for a user, Henrik. It has common attributes like:

  • uid (user ID)
  • cn (common name)
  • l (location)
  • displayName (name to display)

The attribute value pairs are the details of the entry.

The objectclass attribute is worth a special mention. Every entry has at least one objectclass attribute, and often more than one. The objectclass provides the rules for the object, including required and allowed attributes. For example, the inetOrgPerson object class specifies attributes about people who are part of an organization, including items such as uid, name, employeenumber, etc.
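In LDIF form, such an entry might look like the following sketch (the DN and uid match the example used later in this post; the other attribute values are invented for illustration):

dn: uid=Henrik,ou=users,ou=gelcorp,dc=viyademo,dc=com
objectClass: inetOrgPerson
uid: Henrik
cn: Henrik
sn: Henrik
displayName: Henrik
l: Cary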

LDAP Tree

Let’s now look at the organization of the tree. DC is the “domain component.” You will often see examples of LDAP structures that use DNS names for the domain component, such as: dc=sas,dc=com. This is not required, but since DNS itself often implies organizational boundaries, it usually makes sense to use the existing naming structure. In this example the domain component is “dc=viyademo,dc=com”. The next level down is the organizational unit (ou). An organizational unit is a grouping or collection of entries. Organizational units can contain additional organizational units.

But how do we find the objects we want in the directory tree? Every entry in a directory has a unique identifier, called the Distinguished Name (DN). The distinguished name is the full path to the object in the directory tree. For example, the distinguished name of Henrik is uid=Henrik,ou=users,ou=gelcorp,dc=viyademo,dc=com. The distinguished name is the path to the object from lowest to highest (yes, it seems backward to me too).

LDAP Queries and Filters

Like any other database, LDAP can be queried, and it has its own particular syntax for defining queries. LDAP queries are Boolean expressions in the format:

attribute operator value

For example:

uid=sasgnn

 

Attribute can be any valid LDAP attribute (e.g., name, uid, city) and value is the value that you wish to search for. The usual comparison operators are available, including:

  • = (equal to)
  • ~= (approximately equal to)
  • >= (greater than or equal to)
  • <= (less than or equal to)

Using LDAP filters, you can link two or more Boolean expressions together with the “filter choices” & (and) and | (or). Unusually, the LDAP filter choices are always placed in front of the expressions. The search criteria must be put in parentheses, and then the whole term has to be bracketed one more time. Here are some examples of LDAP queries that may make the syntax easier to follow:

  • (sn=Jones): return all entries with a surname equal to Jones.
  • (objectclass=inetorgperson): return all entries that use the inetorgperson object class.
  • (mail=*): return all entries that have the mail attribute.
  • (&(objectclass=inetorgperson)(o=Orion)): return all entries that use the inetorgperson object class and that have the organization attribute equal to Orion (people in the Orion organization).
  • (&(objectclass=groupOfNames)(|(o=Orion)(o=Executive))): return all entries that use the groupOfNames object class and that have the organization attribute equal to Orion OR the organization attribute equal to Executive (groups in the Orion or Executive organizations).

Why do I need to know this?

How will you apply this LDAP knowledge in SAS Viya? To enable SAS Viya to access your identity provider, you must update the SAS Identities service configuration. As an administrator, the most common items to change are:

  • BaseDN: the entry in the tree from which the LDAP server starts its search.
  • ObjectFilter: the filter used to identify and limit the users and groups returned.

There is a separate BaseDN and ObjectFilter for users and for groups.

To return users and groups to SAS Viya from our example LDAP server, we would set:

sas.identities.providers.ldap.group.baseDN=ou=gelcorp,ou=groups,dc=viyademo,dc=com

sas.identities.providers.ldap.users.baseDN=ou=gelcorp,ou=users,dc=viyademo,dc=com

 

This tells SAS Viya to begin its search for users and groups at those locations in the tree.

The object filter will then determine what entries are returned for users and groups from a search of the LDAP tree starting at the BaseDN. For example:

sas.identities.providers.ldap.group.objectFilter:
(&(objectClass=GroupOfNames)(o=GELCorp LTD))

sas.identities.providers.ldap.users.objectFilter:
(&(objectClass=inetOrgPerson)(o=GELCorp LTD))

 

There are a lot of LDAP clients available that allow you to connect to an LDAP server and view, query, edit, and update LDAP trees and their entries. In addition, the LDIF file format is a text format with data and commands that provide a simple way to communicate with a directory, so as to read, write, rename, and delete entries.
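For example, you could test a BaseDN and object filter from the command line with the standard ldapsearch client (host name hypothetical, simple authentication assumed):

ldapsearch -H ldap://ldapserver.example.com -x \
  -b "ou=gelcorp,ou=users,dc=viyademo,dc=com" \
  "(&(objectClass=inetOrgPerson)(o=GELCorp LTD))" cn mail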

This has been a high-level overview of LDAP. Here are some additional sources of information that may help.

Basic LDAP concepts

LDAP Query Basics

Quick Introduction to LDAP

How To Use LDIF Files to Make Changes to an OpenLDAP System

LDAP basics for the SAS Viya administrator was published on SAS Users.