December 21, 2017
 

With SAS Data Management, you can set up SAS Data Remediation to manage and correct data issues. SAS Data Remediation allows user- or role-based access to data exceptions.

When a data issue is discovered, it can be sent automatically or manually to a remediation queue, where designated users can correct it. The issue can be fixed from within SAS Data Remediation without going to the affected source system. For greater efficiency, the remediation process can also be linked to a purpose-designed workflow.

Setting up a remediation process that allows you to correct data issues from within SAS Data Remediation involves a few steps:

  • Set up Data Management jobs to retrieve and correct data in remediation.
  • Set up a Workflow to control the remediation process.
  • Register the remediation service.

Set up Data Management jobs to retrieve and correct data in remediation

To correct data issues from within Data Remediation, we need two real-time data management jobs: one to retrieve data and one to send it. The retrieve job reads the record in question and populates its data in the remediation UI; the send job writes the corrected data back to the data source, or to a staging area first.

Retrieve and send jobs

If the following remediation fields are available in the retrieve or send job’s External Data Provider node, data will be passed to the fields. The field values can be used to identify and work with the correct record:

  • REM_KEY (internal field that stores the issue record ID)
  • REM_USERNAME (the current remediation user)
  • REM_ITEM_NAME
  • REM_ISSUE_NAME
  • REM_APPLICATION
  • REM_SUBJECT_AREA
  • REM_PACKAGE_NAME
  • REM_STATUS
  • REM_ASSIGNEE

Retrieve Action

The "retrieve” action occurs when the issue is opened in SAS Data Remediation. Data Remediation will only pass REM_ values to the data management job if the fields are present in the External Data Provider node. Although the REM_ values are the only way the data management job can communicate with SAS Data Remediation but they are not all required, meaning you can just call the fields in the External Data Provider node you need.

The job’s output fields are displayed in the Remediation UI as edit fields for correcting the issue record. It’s best to use a Field Layout node as the last job node to output the desired fields with the desired labels.

Note: The retrieve job should only return one record.

A simple example of a retrieve job would be to have the issue record id coming from REM_KEY into the data management job to select the record from the source system.
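To make this concrete, here is a minimal sketch of that retrieve logic expressed as SAS code. The library SOURCE, the table CUSTOMERS, its columns, and the macro variable &REM_KEY standing in for the REM_KEY input field are all hypothetical; in a real deployment this logic is built as nodes in a data management job rather than as a SAS program.

proc sql;
   /* &REM_KEY is a hypothetical stand-in for the REM_KEY value that */
   /* Data Remediation passes in via the External Data Provider node */
   create table work.issue_record as
   select customer_id, name, address, phone
      from source.customers
      where customer_id = "&REM_KEY";   /* should match exactly one record */
quit;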

Send Action

The “send” action occurs when you press the “Commit Changes” button in the Data Remediation UI. All REM_ values, in addition to the output fields of the retrieve job (the issue record edit fields), are passed to the send job. The job receives values for those fields present in the External Data Provider node.

The send job can now work with the remediation record and save it to a staging area or submit it to the data source directly.
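As a matching sketch of the send logic, again in SAS terms (the same hypothetical table and columns as above; the macro variables stand in for the edit-field values passed back from the UI):

proc sql;
   update source.customers
      set name    = "&NAME",        /* corrected values from the UI edit fields */
          address = "&ADDRESS",
          phone   = "&PHONE"
      where customer_id = "&REM_KEY";
quit;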

Note: Only one row will be sent as an input to the send job. Any data values returned by the send job will be ignored by Data Remediation.

Move the jobs to the Data Management Server

When both jobs are written and tested, you need to move them to the Data Management Server, into a Real-Time Data Services subdirectory, so that Data Remediation can call them.

When Data Remediation calls the jobs, it uses the credentials of the person logged on to Data Remediation. Therefore, you need to make sure that the jobs on the Data Management Server have been granted the right permissions.

Set up a Workflow to control the remediation process

Although you don’t need to involve a workflow in SAS Data Remediation, using one can improve efficiency.

You can design your own workflow using SAS Workflow Studio, or you can use a prepared workflow that ships with Data Remediation. Make sure that the desired workflow is loaded onto the Workflow Server so that it can be linked to the Data Remediation service.

Using SAS Workflow will help you to better control Data Remediation issues.

Register the remediation service

We can now register our remediation service in SAS Data Remediation. To do this, we go to the Data Remediation Administrator and choose “Add New Client Application.”

Under Properties, we supply an ID, which can be the name of the remediation service as long as it is unique, and a Display name, which is the name shown in the Remediation UI.

Next, we set up the edit UI for the issue record. Under Issue User Interface, we select “Use default remediation UI… Using Data Management Server”:

The Server address is the fully qualified address for Data Management Server including the port it is listening on. For example: http://dmserver.company.com:21036.

The Real-time service to retrieve item attributes and Real-time service to send item attributes fields need to point to the retrieve and send jobs, respectively, on the Data Management Server, including the .ddf job suffix as well as any directories under Real-Time Data Services where the jobs are stored.
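For example, hypothetical values for these three settings might look like this (the directory and job names are made up):

Server address: http://dmserver.company.com:21036
Real-time service to retrieve item attributes: remediation/customer_retrieve.ddf
Real-time service to send item attributes: remediation/customer_send.ddf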

Under the Subject Area tab, we can register different subject categories for this remediation service. When calling the remediation service, we can categorize remediation issues by setting different subject areas.

Under the Issue Types tab, we can register issue categories. This enables us to categorize the different remediation issues.

At Task Templates/Select Templates you can set the workflow to be used for a particular issue type.

After saving the remediation service, you can start using it. You can now assign data issues to it to efficiently correct the data and improve your data quality from within SAS Data Remediation.

Manage remediation issues using SAS Data Management was published on SAS Users.

December 20, 2017
 

US health care costs are the highest in the developed world: per-capita healthcare spending is roughly double that of comparable countries, yet outcomes are equal to or worse than theirs. A 2014 Commonwealth Fund study ranked the United States 11th out of 11 major industrial nations in overall healthcare. An aging population is [...]

Using data to discover what’s driving hospital readmissions was published on SAS Voices by Leo Sadovy.

December 20, 2017
 

I previously showed an easy way to visualize a regression model that has several continuous explanatory variables: use the SLICEFIT option in the EFFECTPLOT statement in SAS to create a sliced fit plot. The EFFECTPLOT statement is directly supported by the syntax of the GENMOD, LOGISTIC, and ORTHOREG procedures in SAS/STAT. If you are using another SAS regression procedure, you can still visualize multivariate regression models:

  • If a procedure supports the STORE statement, you can save the model to an item store and then use the EFFECTPLOT statement in PROC PLM to create a sliced fit plot.
  • If a procedure does not support the STORE statement, you can manually create the "slice" of observations and score the model on the slice.

Use PROC PLM to score regression models

Most parametric regression procedures in SAS (GLM, GLIMMIX, MIXED, ...) support the STORE statement, which enables you to save a representation of the model in a SAS item store. The following program creates sample data for 500 patients in a medical study. The call to PROC GLM fits a linear regression model that predicts the level of cholesterol from five explanatory variables. The STORE statement saves the model to an item store named 'GLMModel'. The call to PROC PLM creates a sliced fit plot that shows the predicted values versus the systolic blood pressure for males and females in the study. The explanatory variables that are not shown in the plot are set to reference values by using the AT option in the EFFECTPLOT statement:

data Heart;    /* create example data */
set sashelp.heart(obs=500);
where cholesterol < 400;
run;
 
proc glm data=Heart;
   class Sex Smoking_Status BP_Status;
   model Cholesterol = Sex      Smoking_Status BP_Status  /* class vars  */
                       Systolic Weight;                   /* contin vars */
   store GLMModel;                    /* save the model to an item store */
run;
 
proc plm restore=GLMModel;                       /* load the saved model */
   effectplot slicefit / at(Smoking_Status='Non-smoker' BP_Status='Normal'
                            Weight=150);   /* create the sliced fit plot */
run;
Figure: sliced fit plot created by using PROC PLM in SAS

The graph shows a sliced fit plot. The footnote states that the lines are obtained by slicing through two response surfaces that correspond to (Smoking_Status, BP_Status) = ('Non-smoker', 'Normal') at the value Weight = 150. As shown in the previous article, you can specify multiple values within the AT option to obtain a panel of sliced fit plots.
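For example, here is a minimal sketch of the panel version, reusing the GLMModel item store from the code above (the three Weight values are arbitrary):

proc plm restore=GLMModel;
   effectplot slicefit / at(Smoking_Status='Non-smoker' BP_Status='Normal'
                            Weight=100 150 200);  /* one sliced fit plot per Weight value */
run;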

Create a sliced fit plot manually by using the SCORE statement

The nonparametric regression procedures in SAS (ADAPTIVEREG, GAMPL, LOESS, ...) do not support the STORE statement. Nevertheless, you can create a sliced fit plot using a traditional scoring technique: use the DATA step to create observations in the plane of the slice and score the model on those observations.

There are two ways to score regression models in SAS. The easiest way is to use PROC SCORE, the SCORE statement, or the CODE statement. The following DATA step creates the same "slice" through the space of explanatory variables as was created by using the EFFECTPLOT statement in the previous example. The SCORE statement in the ADAPTIVEREG procedure then fits the model and scores it on the slice. (Technical note: By default, PROC ADAPTIVEREG uses variable selection techniques. For easier comparison with the model from PROC GLM, I used the KEEP= option on the MODEL statement to force the procedure to keep all variables in the model.)

/* create the scoring observations that define the slice */
data Score;
length Sex $6 Smoking_Status $17 BP_Status $7; /* same as for data */
Cholesterol = .;             /* set response variable to missing   */
Smoking_Status='Non-smoker'; /* set reference levels ("slices")    */
BP_Status='Normal';          /*     for class vars                 */
Weight=150;                  /*     and continuous covariates      */
do Sex = "Female", "Male";       /* primary class var */
   do Systolic = 98 to 272 by 2; /* evenly spaced points for X variable */
      output;
   end;
end;
run;
 
proc adaptivereg data=Heart;
   class Sex Smoking_Status BP_Status;
   model Cholesterol = Sex      Smoking_Status BP_Status
                       Systolic Weight / nomiss 
   /* for comparison with other models, FORCE all variables to be selected */
                       keep=(Sex Smoking_Status BP_Status Systolic Weight);
   score data=Score out=ScoreOut Pred;   /* score the model on the slice */
run;
 
proc sgplot data=ScoreOut;             
   series x=Systolic y=Pred / group=Sex;  /* create sliced fit plot */
   xaxis grid; yaxis grid;
run;

The output, which is not shown, is very similar to the graph in the previous section.

Create a sliced fit plot manually by using the missing value trick

If your regression procedure does not support a SCORE statement, an alternative way to score a model is to use "the missing value trick," which requires appending the scoring data set to the end of the original data. I like to add an indicator variable to make it easier to know which observations are data and which are for scoring. The following statements concatenate the original data and the observations in the slice. It then calls the GAMPL procedure to fit a generalized additive model (GAM) by using penalized likelihood (PL) estimation.

/* missing value trick: append score data to original data */
data All;
set Heart         /* data to fit the model */
    Score(in=s);  /* grid of values on which to score model */
ScoreData=s;      /* ScoreData=0 for orig data; =1 for scoring observations */
run;
 
proc gampl data=All;
   class Sex Smoking_Status BP_Status;
   model Cholesterol = Param(Sex Smoking_Status BP_Status)
                       Spline(Systolic Weight);
   output out=GamOut pred;
   id ScoreData Sex Systolic; /* include these vars in output data set */
run;
 
proc sgplot data=GamOut(where=(ScoreData=1)); /* plot only the scoring obs */
   series x=Systolic y=Pred / group=Sex;  /* create sliced fit plot */
   xaxis grid; yaxis grid;
run;
Figure: sliced fit plot created in SAS by using the missing value trick

The GAMPL procedure does not automatically include all input variables in the output data set; the ID statement specifies the variables that you want to output. The OUTPUT statement produces predicted values for all observations in the ALL data set, but the call to PROC SGPLOT creates the sliced plot by using only the observations for which ScoreData = 1. The output shows the nonparametric regression model from PROC GAMPL.

You can also use the ALL data set to overlay the original data and the sliced fit plot. The details are left as an exercise for the reader.
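If you want a head start on that exercise, here is one possible sketch. It assumes that Cholesterol is added to the ID statement in PROC GAMPL so that the raw values appear in the GamOut data set:

data Overlay;
   set GamOut;
   if ScoreData=1 then Fit  = Pred;         /* fit curve: scoring observations only */
   else                Chol = Cholesterol;  /* raw points: original observations only */
run;
 
proc sgplot data=Overlay;
   scatter x=Systolic y=Chol / group=Sex transparency=0.7;  /* original data */
   series  x=Systolic y=Fit  / group=Sex;                   /* sliced fit plot */
run;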

Summary

The EFFECTPLOT statement provides an easy way to create a sliced fit plot. You can use the EFFECTPLOT statement directly in some regression procedures (such as LOGISTIC and GENMOD) or by using the STORE statement to save the model and PROC PLM to display the graph. For procedures that do not support the STORE statement, you can use the DATA step to create "the slice" (as a scoring data set) and use traditional scoring techniques to evaluate the model on the slice.

The post How to create a sliced fit plot in SAS appeared first on The DO Loop.

December 20, 2017
 

While SAS program development is usually done in an interactive SAS environment (SAS Enterprise Guide, SAS Display Manager, SAS Studio, etc.), when it comes to running SAS programs in a production or operations environment, it is routinely done in batch mode.

Why run SAS programs in batch mode?

First and foremost, this is done for automation, as the batch process does not require human participation at run time. It can be scheduled to run (using an Operating System scheduler or other scheduling software) while we sleep, at any time of the day or at any interval between two consecutive runs.

Running SAS programs in batch mode streamlines SAS processing by eliminating the possibility of human error and by submitting multiple SAS jobs (programs) all at once or in a sequence that honors program and/or data dependencies.

SAS batch processing is also self-documenting, as it automatically generates and stores SAS logs and outputs.

Imagine the following scenario. Every night, a SAS batch process “wakes up” at 3 a.m. and runs an ETL process on a SAS Application server that extracts multiple tables from a database, transforms, combines, and loads them into a SAS datamart; then moves some data tables across the network and loads them into SAS LASR server, so when you are back to work in the morning your SAS Visual Analytics application has all its data refreshed and ready to roll. Of course, the process schedule can be custom-tailored to your particular needs; your batch jobs may run every 15 minutes, once a week, every first Friday of the month – you name it.

What is a batch script file?

To submit a single SAS program in batch mode manually, you could submit an OS command that looks something like the following:

Unix/Linux

sas /sas/code/proj1/job1.sas -log /sas/code/proj1/job1.log

DOS/Windows

"C:\Program Files\SASHome\SASFoundation\9.4\Sas.exe" -SYSIN c:\proj1\job1.sas -NOSPLASH -ICON -LOG c:\proj1\job1.log

However, submitting an OS command manually has too many drawbacks: it’s too much typing, it only submits one SAS program at a time, and most importantly – it is manual, which means it is prone to human error.

Usually, these OS commands are packaged into so called batch files (shell scripts in Unix) that allow for sequential, parallel, as well as conditional execution of multiple OS line commands. They can be run either manually, or automatically – on schedule, or called by other batch scripts.

In a Windows/DOS Operating System, these script files are called batch files and have .bat filename extensions. In Unix-like operating systems, such as Linux, these script files are called shell scripts and have .sh filename extensions.

Since Windows batch files are similar to, but slightly different from, Unix shell scripts (and those of Unix's open-source cousin, Linux), in the examples below we are going to use Unix/Linux shell scripts only, in order to avoid confusion. We will use the terms Unix and Linux interchangeably.

Here is the typical content of a Linux shell script file to run a single SAS program:

#!/usr/bin/sh
dtstamp=$(date +%Y.%m.%d_%H.%M.%S)                  # current date/time stamp
pgmname="/sas/code/project1/program1.sas"           # SAS program to run
logname="/sas/code/project1/program1_$dtstamp.log"  # unique SAS log file name
/sas/SASHome/SASFoundation/9.4/sas $pgmname -log $logname

Note that the shell script syntax allows for some basic programming features, like a current datetime function, formatting, and variables. It also provides some conditional processing similar to “if-then-else” logic. For detailed information on the shell scripting language, you may refer to the following BASH shell script tutorial or any other source covering one of the many dialects or flavors of shell scripting (C Shell, Korn Shell, etc.).

Let’s save the above shell script as the following file:
/sas/code/project1/program1.sh

How to submit a SAS program via Unix script

In order to run this shell script we would submit the following Linux command:
/sas/code/project1/program1.sh

Or, if we navigate to the directory first:
cd /sas/code/project1

then we can submit an abbreviated Linux command
./program1.sh

When run, this shell script not only executes a SAS program (program1.sas), but for every run it also creates and saves a uniquely named SAS log file. You may create the SAS log file in the same directory where the SAS code is stored, as specified in the shell script above, or specify another directory of your choice.

For example, it creates the following SAS log file:
/sas/code/project1/program1_2017.12.06_09.15.20.log

The file name uniqueness is achieved by adding a date/time stamp suffix between the SAS program name and .log file name extension, in this particular case indicating that this SAS log file was created on December 6, 2017, at 09:15:20 (hours:minutes:seconds).

Unix script for submitting multiple SAS programs

Unix scripts may contain not only OS commands, but also other Unix script calls. You can mix-and-match OS commands and other script calls.

When scripts are created for each individual SAS program that you intend to run in a batch, you can easily combine them into a program flow by creating a flow script containing those single program scripts. For example, let’s create a script file /sas/code/project1/flow1.sh with the following contents:

/sas/code/project1/program1.sh
/sas/code/project1/program2.sh
/sas/code/project1/program3.sh

When submitted as

/sas/code/project1/flow1.sh

it will sequentially execute three scripts - program1.sh, program2.sh, and program3.sh, each of which will execute the corresponding SAS program - program1.sas, program2.sas, and program3.sas, and produce three SAS logs - program1.log, program2.log, and program3.log.
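The flow script above runs its component scripts one after another. If the programs have no mutual dependencies, you can also run them in parallel. Here is a minimal sketch using the shell's background operator (&) and the wait command, with the same paths as above:

#!/usr/bin/sh
/sas/code/project1/program1.sh &    # start in the background
/sas/code/project1/program2.sh &    # runs concurrently with program1
wait                                # block until all background jobs finish
/sas/code/project1/program3.sh      # runs only after program1 and program2 complete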

Unix script file permissions

In order to be executable, UNIX script files must have certain permissions. If you create the script file and want to execute it yourself only, the file permissions can be as follows:

-rwxr-----, or 740 in octal representation.

This means that you (the Owner of the script file) have Read (r), Write (w), and Execute (x) permissions; the Group owning the script file has only Read (r) permission; Others have no permissions on the script file at all.

If you want to give yourself (Owner) and Group execution permissions then your script file permissions can be as:

-rwxr-x---, or 750 in octal representation.

In this case, your group has Read (r) and Execute (x) permissions.

In Unix, file permissions are assigned using the chmod Unix command.
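For example, the two permission patterns discussed above would be applied to our earlier script file like this:

chmod 740 /sas/code/project1/program1.sh    # owner: rwx; group: r; others: none
chmod 750 /sas/code/project1/program1.sh    # owner: rwx; group: r-x; others: none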

Note that in both examples above we do not give Others any permissions at all. Remember that file permissions are a security feature, and you should assign them at the minimum level necessary.

Conditional execution of scripts and SAS programs

Here is an example of a Unix script file that allows running multiple SAS programs and OS commands at different times.

#!/bin/sh

#1 extract data from a database
/sas/code/etl/etl.sh

#2 copy data to the Visual Analytics autoload directory
scp -B userid@sasAPPservername:/sas/data/*.sas7bdat userid@sasVAservername:/sas/config/.../AutoLoad

#3 run weekly, every Monday
dow=$(date +%w)
if [ $dow -eq 1 ]
then
   /sas/code/alerts_generation.sh
fi

#4 run monthly, first Friday of every month
dom=$(date +%d)
if [ $dow -eq 5 -a $dom -le 7 ]
then
   /sas/code/update_history.sh
   /sas/code/update_transactions.sh
fi

In this script, the following logical operators are used: -eq (equal), -le (less or equal), -a (logical and).

As you can see, the script logic takes care of branching to execute different SAS programs when certain timing conditions are met. With such an approach, you would need to schedule only this single script to run at a specified time/interval, say daily at 3 a.m.

In this case, the script will “wake up” every morning at 3 a.m. and execute its component scripts either unconditionally, or conditionally.

If one of the included programs needs to run at a different, lesser frequency (e.g. every Monday, or monthly on first Friday of every month) the script logic will trigger those executions at the appropriate times.

In the above script example, steps #1 and #2 execute every time the script runs (daily), unconditionally. Step #1 runs the ETL program to extract data from a database; step #2 copies the extracted data across the network from the SAS Application server to the SAS LASR Analytic server’s drop zone, from where it is automatically loaded (autoloaded) into LASR.

Step #3 will run conditionally every Monday ( $dow -eq 1). Step #4 will run conditionally every first Friday of a month ($dow -eq 5 -a $dom -le 7).

For more information on how to format date for use in shell scripts please refer to this post.

Do you run your SAS programs in batch?

Please share your batch experiences in the comment section below. I am sure the rest of us will really appreciate it!

Running SAS programs in batch under Unix/Linux was published on SAS Users.

December 19, 2017
 

SAS' tag line is The Power to Know©. But what makes SAS so powerful? Ask our users and they'll tell you -- it's because SAS allows them to answer questions which previously could not be answered. How does SAS do this? SAS built a 4th generation programming language which makes [...]

What makes SAS so powerful was published on SAS Voices by David Pope.

December 19, 2017
 

In my last article, Managing SAS Configuration Directory Security, we stepped through the process for granting specific users more access without opening up access to everyone. One example addressed how to modify security for autoload. There are several other aspects of SAS Visual Analytics that can benefit from a similar security model.

You can maintain a secure environment while still providing one or more select users the ability to:

  • start and stop a SAS LASR Analytic Server.
  • load data to a SAS LASR Analytic Server.
  • import data to a SAS LASR Analytic Server.

Requirements for these types of users fall into two areas: metadata and operating system.

The metadata requirements are very well documented and include:

  • an individual metadata identity.
  • membership in appropriate groups (for example: Visual Analytics Data Administrators for SAS Visual Analytics suite level administration; Visual Data Builder Administrators for data preparation tasks; SAS Administrators for platform level administration).
  • access to certain metadata (refer to the SAS Visual Analytics 7.3: Administration Guide for metadata permission requirements).

Operating System Requirements

Users who need to import data, load data, or start a SAS LASR Analytic Server need the ability to authenticate to the SAS LASR Analytic Server host and write access to some specific locations.

If the SAS LASR Analytic Server is distributed, users need passwordless secure shell (SSH) access to all of the machines in the cluster.

If the compute tier (the machine where the SAS Workspace Server runs) is on Windows, users need the Log on as a batch job user right on the compute machine.

In addition, users need write access to the signature files directory, the path for the last action logs for the SAS LASR Analytic Server, and the PIDs directory in the monitoring path for the SAS LASR Analytic Server.

Signature Files

There are two types of signature files: server signature files and table signature files. Server signature files are created when a SAS LASR Analytic Server is started. Table signature files are created when a table is loaded into memory. The location of the signature files for a specific SAS LASR Analytic Server can be found on the Advanced properties of the SAS LASR Analytic Server in SAS Management Console.


On Linux, if your signature files are in /tmp you may want to consider relocating them to a different location.

Last Action Logs and the Monitoring Path

In the SAS Visual Analytics Administrator application, logs of interactive actions for a SAS LASR Analytic Server are written to the designated last action log path. The standard location is on the middle tier host in <SAS_CONFIG_ROOT>/Lev1/Applications/SASVisualAnalytics/VisualAnalyticsAdministrator/Monitoring/Logs. The va.lastActionLogPath property is specified in the SAS Visual Analytics suite level properties. You can access the SAS Visual Analytics suite level properties in SAS Management Console under the Configuration Manager: expand SAS Application Infrastructure, right-click on Visual Analytics 7.3 to open the properties, and select the Advanced tab.

The va.monitoringPath property specifies the location of certain monitoring process ID files and logs. The standard location is on the compute tier in <SAS_CONFIG_ROOT>/Lev1/Applications/SASVisualAnalytics/VisualAnalyticsAdministrator/Monitoring/. This location includes two subdirectories: Logs and PIDs. You can override the default monitoring path by adding the va.monitoringPath extended attribute to the SAS LASR Analytic Server properties.

Host Account and Group

For activities like starting the SAS LASR Analytic Server you might want to use a dedicated account such as lasradm or assign the access to existing users. If you opt to create the lasradm account, you will need to also create the related metadata identity.

For group level security on Linux, it is recommended that you create a new group, for example sasusers, to reserve the broader access provided by the sas group to only platform level administrators. Be sure to include in the membership of this sasusers group any users who need to start the SAS LASR Analytic Server or that need to load or import data to the SAS LASR Analytic Server.

Since the last action log path, the monitoring path, and the autoload scripts location all fall under <SAS_CONFIG_ROOT>/Lev1/Applications/SASVisualAnalytics/VisualAnalyticsAdministrator, you can modify the ownership of this folder to get the right access pattern.
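Here is a minimal sketch of one way to do that on Linux, assuming the sasusers group described above; substitute your own <SAS_CONFIG_ROOT> and check the resulting permissions against your site's security policy:

chgrp -R sasusers <SAS_CONFIG_ROOT>/Lev1/Applications/SASVisualAnalytics/VisualAnalyticsAdministrator
chmod -R 770      <SAS_CONFIG_ROOT>/Lev1/Applications/SASVisualAnalytics/VisualAnalyticsAdministrator   # rwx for owner and group only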

A similar pattern can also be applied to the back-end store location for the data provider library that supports reload-on-start.

Don’t forget to change the ownership of your signature files location too!

SAS Admin Notebook: Managing SAS Configuration Directory Security for SAS Visual Analytics was published on SAS Users.

December 19, 2017
 

Compressing a data set is a process that reduces the number of bytes that are required to represent each observation in a file. You might choose to enable compression to reduce the storage requirements for the file and to lessen the number of I/O operations that are needed to read from or write to the data during processing.

Compression is enabled by the COMPRESS= system option, the COMPRESS= option in the LIBNAME statement, and the COMPRESS= data set option. The COMPRESS= system option compresses all data sets that are created during a SAS session, and the COMPRESS= option in the LIBNAME statement compresses all data sets for a particular SAS® library. The COMPRESS= data set option is the most popular of these methods because you can compress data sets individually as they are created.
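For reference, minimal sketches of the first two methods follow (the library path is hypothetical); the data set option is shown in the example below:

options compress=yes;                      /* compress all data sets created in this session */
 
libname mylib '/data/mylib' compress=yes;  /* compress all data sets written to this library */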

The COMPRESS= data set option can be set to CHAR (or YES), NO, or BINARY. The following example illustrates using COMPRESS=YES:

data new(compress=yes);
set old;
run;

 

While compression is a useful tool in your programming toolbox, it isn't a tool that you should use on every data set. When you request compression by using the COMPRESS= option, SAS considers the following information:

  • the header information of the data set, to determine how many variables are in the program data vector
  • whether the variables are character or numeric
  • the lengths of the variables

SAS doesn't consider data values at all. The compression overhead for Microsoft 32-bit Windows and 64-bit Windows is 12 bytes, whereas 64-bit UNIX hosts require 24 bytes of overhead. When SAS determines that it is possible to recoup the 12 or 24 bytes of overhead per observation that compression requires, then SAS attempts to compress the data. If that 12 or 24 bytes per observation can't be recouped, the data set size is increased when the compression is completed. So, you should determine ahead of time whether your data set is a good candidate for compression.

In the following example, a data set is created in the Windows operating environment with two variables having lengths, respectively, of 3 and 5 bytes. Because it is impossible to recoup the 12 bytes that are needed per observation for compression overhead, SAS automatically disables compression and a note is written to the SAS log that indicates the same.

571  data new(compress=char);
572     x='abc';
573     y='defgh';
574  run;
 
NOTE: Compression was disabled for data set WORK.NEW because compression overhead would increase
      the size of the data set.
NOTE: The data set WORK.NEW has 1 observations and 2 variables.

 

The compression process doesn’t recognize individual variables within an observation. Instead, the process sees each observation as a large collection of bytes that are run together end to end. In the COMPRESS= data set option, you enable compression by specifying either CHAR (YES) or BINARY. These values differ slightly in the types of data values that they target for compression.

Using the COMPRESS=CHAR|YES option

Specifying COMPRESS=CHAR (or YES) targets data with repeating single characters and variables with stored lengths that are longer than most of their values. Such variables compress well because blank spaces pad the end of any value that is shorter than its storage length, and those repeated blanks are easy to compress.

In thinking about conserving space, customers often shorten the storage lengths of variables by using a LENGTH statement. When you shorten the lengths of your variables, you remove the best opportunity for SAS to compress. For example, if a numeric variable can be stored accurately in 4 bytes, the remaining 4 bytes (in an 8-byte variable) will all be zeros. This situation is perfect for compression. However, when you shorten the length to 4 bytes, the layout of the value is no longer suitable for compression. The only reason to truncate the storage length by using the LENGTH statement is to save disk space. All values are expanded to the full size of 8 bytes in the program data vector to perform computations in DATA and PROC steps. You'll use extra CPU resources to uncompress the data set as well as to expand variables back to 8 bytes.

Using the COMPRESS=BINARY option

When you use COMPRESS=BINARY, patterns of multiple characters across the entire observation are compressed. Binary compression uses two techniques at the same time. This option searches for the following:

  1. Repeating byte sequences (for example, 10 blank spaces or 10 zero bytes in a row)
  2. Repeating byte patterns (for example, the repeated pattern in the hexadecimal value 0102030405FAF10102030405FBF20102030405FCF3)

With that in mind, you can see that the bytes in a numeric variable are just as likely to be compressed as those in a character variable because the compression process does not consider those bytes to be numeric or character. They are just viewed as bytes. Consider a missing value that is represented in hexadecimal notation as FFFF000000000001. In the middle of that value is a string of five zero bytes (0x00) that can be replaced by two compression code-bytes. So, what starts as a sequence of 8 bytes ends up as a sequence of 5 bytes.

Keep in mind

As mentioned earlier, although compression saves space and is a great tool to keep handy in your SAS toolbox, it’s not meant for all your data sets. Some data sets are not going to compress well and the data set will grow larger, so know your data. Also, you’ll want to consider the extra CPU resources that are required to read a compressed file due to the overhead of uncompressing each observation.

What can compression do for you? was published on SAS Users.

December 19, 2017
 

Most of us put on a few pounds over the holidays. It's difficult not to, with the holiday dinners, and then all the special desserts, snacks, and candies. But there might also be another source of extra calories that you hadn't thought about - fast-food! Let's have a look at [...]

The post Counting fast-food calories over the holiday appeared first on SAS Learning Post.

December 19, 2017
 

If you’ve ever used Amazon or Netflix, you’ve experienced the value of recommendation systems firsthand. These sophisticated systems identify recommendations autonomously for individual users based on past purchases and searches, as well as other behaviors. By supporting an automated cross-selling approach, they empower brands to offer additional products or services [...]

Customer Intelligence 360: The digital shapeshifter of recommendation systems was published on Customer Intelligence Blog.