Tech

7月 102018
 

Did you know that 80 percent of an analytics life cycle is time spent on data preparation? For many SAS users and administrators, data preparation is what you live and breathe day in and day out.

Your analysis is only as good as your data, and that's why we wanted to shine a light on the importance of data preparation. I reached out to some of our superstar SAS users in the Friends of SAS community (for Canadian SAS customers and partners) for the inside scoop on different kinds of data preparation tasks they deal with on a daily basis.

Meet Kirby Wu, Actuarial Analyst for TD Insurance

At TD Insurance, SAS is used by many different teams for many different functions.

Kirby uses SAS mainly for its data preparation capabilities. This includes joining tables, cleaning the data, summarizations, segmentation and then sharing this ready-to-use data with the appropriate departments. A day in the life of Kirby includes tackling massive data sets containing billions of claim records. He needs powerful software to perform ETL (extract, transform and load) tasks and manage this data, and that’s where SAS comes in.

"SAS is the first step in our job to access good quality data," says Kirby. "Being an actuary, we use SAS to not only pick up data, but to do profiling, tech inquires and, most importantly, for data quality control purposes. Then we present the data to various teams to take advantage of the findings to improve the business."

Many actuaries have some basic SAS skills to understand the data set. Once they output the data, it is shared across departments and teams for others to make use of.

Prior to his work at TD Insurance, Kirby also used SAS for analytics. He ran GLM analysis where he encountered huge data sets. "When data comes out, we want to understand it, as well as performing statistical analysis on it," says Kirby. "SAS largely influences what direction to go in, and what variable we think is good to use."

Kirby left us with four reasons why he prefers SAS for data preparation.

  1. SAS is an enterprise solution, and the application itself is tried and proven.
  2. Working in insurance, there are many security concerns dealing with sensitive data. SAS provides reassurance in terms of data security.
  3. SAS has been serving the market for many years, and its capabilities and reputation are timeless.
  4. SAS offers a drag-and-drop GUI, as well as programming interfaces for users of varying skill levels.

Meet John Lam of CIBC

Since John joined the bank 15 years ago, he has been using SAS for ETL processes. John saves both time and money using Base SAS when he solves complex ETL tasks in his day-to-day work, and is mainly responsible for data preparation.

John accesses data from multiple source systems and transforms it for business consumption. He works on the technical side and passes on the transformed data within the same organization to the business side. The source data typically comes in at the beginning of each month, but the number of files varies month to month.

"SAS is a great tool," shares John. "The development time is a lot less and helps us save a lot of time on many projects."

John also shared with us some past experiences with complex issues where SAS would have come in handy. He once encountered a situation where he needed to calculate the length of time it would take for someone to receive benefits. However, this calculation method is very complicated and varies greatly depending on how the gap is structured.

"When I look back, the process that took us two to three weeks would have only taken us two to three days if we had used SAS," says John. "SAS would have provided a less complex way of figuring out the problem using date functions!"

Meet Horst Wolter, Manager at TD Bank

"My bread and butter is Base SAS," explains Horst. "The bank has data all over the place in multiple platforms and in multiple forms. We encounter a lot of data – from mainframe to Unix to PC, and flat files or mainframe SAS data sets."

Regardless of the platform or data he is dealing with, users always request slices and dices of the data. Horst takes all available data and finds ways of matching and merging different pieces together to create something that is relevant and easy to understand.

The majority of work Horst does is with credit card data. "I check database views that has millions of rows, which includes historical data."

The bank deals with millions of customers over many years, resulting in many records. Needless to say, the sizes of data he deals with are quite large! Accessing, processing and managing this data for business insight is a battle SAS helps Horst fight every day.

Sharing Is Caring!

How are you using SAS? Share in a few sentences in the comments!

About Friends of SAS

If you’re not familiar with Friends of SAS, it is an exclusive online community available only to our Canadian SAS customers and partners to recognize and show our appreciation for their affinity to SAS. Members complete activities called "challenges" and earn points that can be redeemed for rewards. There are opportunities to build powerful connections, gain privileged access to SAS resources and events, and boost your learning and development of SAS, all in a fun environment.

Interested in learning more about Friends of SAS? Feel free to email Natasha.Ulanowski@sas.com or Martha.Casanova@sas.com with any questions or to get more details.

No typical SAS user: how three professionals prep data using SAS was published on SAS Users.

7月 062018
 

SAS Visual Text Analytics provides dictionary-based and non-domain-specific tokenization functionality for Chinese documents, however sometimes you still want to get N-gram tokens. This can be especially helpful when the documents are domain-specific and most of the tokens are not included into the SAS-provided Chinese dictionary.

What is an N-gram?

An N-gram is a sequence of N items from a given text with n representing any positive integer starting from 1. When n is 1, it refers to a unigram; when n is 2, it refers to a bigram; when n is 3, it refers to a trigram. For example, suppose we have a text in Chinese "我爱中国。", which means "I love China." Its N-gram sequence looks like the following:

n Size N-gram Sequence
1 [我], [爱], [中], [国], [。]
2 [我爱], [爱中], [中国], [国。]
3 [我爱中], [爱中国], [中国。]

How many N-gram tokens are in a given sentence?

If Token_Count_of_Sentence is number of words in a given sentence, then the number of N-grams would be:

Count of N-grams = Token_Count_of_Sentence – ( n - 1 )

The following table shows the N-gram token count of "我爱中国。" with different n sizes.

n Size N-gram Sequence Token Count
1 [我], [爱], [中], [国], [。] 5 = 5- (1-1)
2 [我爱], [爱中], [中国], [国。] 4 = 5- (2-1)
3 [我爱中], [爱中国], [中国。] 3 = 5- (3-1)

In real actual language processing (NLP) tasks, we often want to get unigram, bigram and trigram together when we set N as 3. Similarly, when we set N as 4, we want to get unigram, bigram, trigram, and four-gram together.

N-gram theory is very simple and under some conditions it has big advantage over dictionary-based tokenization method, especially when the corpus you are working on has many vocabularies out of the dictionary or you don't have a dictionary at all.

How to get N-grams with SAS?

SAS is a powerful programming language when you manipulate data. Below you'll find a program I wrote, using the DATA step to get N-grams.

data data_test;
   infile cards dlm='|' missover;
   input _document_ text :$100.;
cards;
1|我爱中国。
;
run;
 
data NGRAMS;
   set data_test;
   _tmpStr_ = text;
   do while (klength(_tmpStr_)>0);  
      _maxN_=min(klength(_tmpStr_), 3);  
      do _i_=1 to _maxN_;
         _term_ = ksubstr(_tmpStr_, 1, _i_);
         output;  
      end;  
      if klength(_tmpStr_)>1 then _tmpStr_ = ksubstr(_tmpStr_, 2);  
      else _tmpStr_ = '';
   end;
   keep _document_ _term_ _i_;
run;

Let's see the SAS results.

proc sort data=NGRAMS;
   by _document_ _i_;
run;
 
proc print; run;

N-gram results

N-grams tokenization is the first step of NLP tasks. For most NLP tasks the second step is to calculate the term frequency–inverse document frequency (TF-IDF). Here's the approach:

tfidf(t,d,D) = tf(t,d) * idf(t,D)
IDF(t) = log_e(total number of documents / number of documents that contain term t)

Where t denotes the terms; d denotes each document; D denotes the collection of documents.

Suppose that you need to handle process lots of documents -- let me show you how to do it using SAS Viya. I used these four steps.

Step 1: Start CAS Server and create a CAS library.

cas casauto host="host.example.com" port=5570;
libname mycas cas;
 
<h4>Step 2: Load your data into CAS. </h4>
Here to simply the code, I only tried 3 sentences for demo purpose. 
data mycas.data_test;
   infile cards dlm='|' missover;
   input _document_ fact :$100.;
cards;
1|我爱中国。
2|我是中国人。
3|我是山西人。
;
run;

Once the data in loaded to CAS, you may run following code to check the column information and record count of your corpus.

proc cas;
  table.columnInfo / table="data_test";
run;
 
  table.recordCount / table="data_test";
run;
quit;

Step 3: Tokenize texts into N-grams

%macro TextToNgram(dsin=, docvar=, textvar=, N=, dsout=);
proc cas;
   loadactionset "dataStep";
   dscode =
      "data &dsout;
         set &dsin;
         length _term_ varchar(&N);
         _tmpStr_ = &textvar;
         do while (klength(_tmpStr_)>0);
            _maxN_=min(klength(_tmpStr_), &N);
            do _i_=1 to _maxN_;
              _term_ = ksubstr(_tmpStr_, 1, _i_);
              output;
            end;
            if klength(_tmpStr_)>1 then _tmpStr_ = ksubstr(_tmpStr_, 2); 
            else _tmpStr_ = ''; 
         end;
         keep &docvar _term_;
      run;";
   runCode code = dscode; 
run;
quit;
%mend TextToNgram;
 
%TextToNgram(dsin=data_test, docvar=_document_, textvar=text, N=3, dsout=NGRAMS);

Step 4: Calculate TF-IDF.

%macro NgramTfidfCount(dsin=, docvar=, termvar=, dsout=);
proc cas;
simple.groupBy / table={name="&dsin"}
                 inputs={"&docvar", "&termvar"}
                 aggregator="n" 
                 casout={name="NGRAMS_Count", replace=true};
run;
quit;
 
proc cas;
simple.groupBy / table={name="&dsin"}
                 inputs={"&docvar", "&termvar"}
                 casout={name="term_doc_nodup", replace=true};
run;
 
simple.groupBy / table={name="term_doc_nodup"}
                 inputs={"&docvar"}
                 casout={name="doc_nodup", replace=true};
run;
numRows result=r/ table={name="doc_nodup"};
totalDocs = r.numrows;
run;
 
simple.groupBy / table={name="term_doc_nodup"}
                 inputs={"&termvar"}
                 aggregator="n" 
                 casout={name="term_numdocs", replace=true};
run;
 
mergePgm = 
    "data &dsout;"
      || "merge NGRAMS_Count(keep=&docvar &termvar _score_ rename=(_score_=tf))
            term_numdocs(keep=&termvar _score_ rename=(_score_=numDocs));"
      || "by &termvar;"
      || "idf=log("||totalDocs||"/numDocs);"
      || "tfidf=tf*idf;"
      || "run;";
print mergePgm;
dataStep.runCode / code=mergePgm;
run;
quit;
%mend NgramTfidfCount;
 
%NgramTfidfCount(dsin=NGRAMS, docvar=_document_, termvar=_term_, dsout=NGRAMS_TFIDF);

Now let's see the TFIDF result of the first sentence.

proc print data=sascas1.NGRAMS_TFIDF;
   where _document_=1;
run;

ngram results

These N-gram methods are not designed only for Chinese documents; and documents in any language can be tokenized with this method. However, the tokenization granularity of English documents is different from Chinese documents, which is word-based rather than character-based. To handle English documents, you only need to make small changes to my code.

How to get N-grams and TF-IDF count from Chinese documents was published on SAS Users.

6月 302018
 

Introduced with SAS Visual Analytics 8.2 is a new object named: Key Value. The intent of this object is call attention to an aggregated value for a measure, a category, or both. For additional specifics,

I’ve mocked up several reports to show some of the combinations available to give you an idea of what the Key Value object can look like. Toward the end of the blog, I will add additional reports to provide design ideas for placement and action assignments. Click on any image to enlarge.

Text Style with Measure Value Highlight

Here you can see in the report, I am highlighting two measure values. Both are representing the Highest value but I selected one to show the aggregation and the other not. I was able to mimic similar headings by renaming my data item. Be sure to look at the Options and Roles pane for the assignments to understand how I accomplished this.

Infographic Style with Measure Value Highlight

Here is the same information represented using the Infographic style. Seeing the same report using the different styles allows you to quickly determine the most powerful and appropriate visualization to meet your needs. We cannot control the size of the circle, only the color. In this case, the circles are different thicknesses because of the number of characters used to represent the measure values inside the visual.

Text Style displaying both Measure Value and Category

In this report, I have shown how to use the Text style to display both a Category value and a Measure value. As you can see, only one can be Highlighted, i.e. given the largest font. Notice that in this report I used the object’s Title to help explain the key value being displayed, this is a recommended best practice.

Infographic Style displaying both Measure Value and Category

Below I am using the Infographic style and notice in the Options pane below that I had to use the Additional information attribute to better label the data. Make sure that when you are in the design phase and toggling between text and infographic to review and test the available Key Value Style attributes to better label the visual.

Text Style with Category Value Highlight

In this report, I show how to highlight the Category value using the Text style. Since I chose to not use any of the available label attributes it is critical that I use the object’s Title to better explain the key value displayed.

Infographic Style with Category Value Highlight

In this report, you can see how I changed the layout from the previous report to make the Key Value object side-by-side the other report objects. If you are interested in using the Infographic style with the circle enabled then you may have to adjust your report design to accommodate for the space the circle needs to display. Remember not to shy away from adding white space to your report, it can assist when adding emphasis to a particular visual, in this case a key value.

Summary

Some important things to remember about using the Key Value object in your reports:

  • Use the Key Value object’s Title to inform your users what the number or category value represents so there is no ambiguity as to if they are looking at the maximum or minimum value.
  • When determining which style you prefer, Text or Infographic, it may be easier to make a duplicate of the Page and then adjust the style attributes till you find the desired combination.
  • Take time to adjust the arrangement of objects on your report to get the most pleasing configuration. Don’t shy away from leaving white space in your report. You can also experiment with the Container object using the Precision container type to layer the Key Value object.
  • The Key Value object will be affected by Report and Page Prompts like any other report object and you can even define Actions to filter the Key Value object.

Here are some additional examples of using the Key Value object:

In this example, you can see from the Actions Diagram how the Key Value object is being filtered. First, by the two page prompts and second, there is a direct filter action defined from the List Control object

In this example, we can see from the Actions Diagram that I used the Container object. I then selected the Precision container type and overlaid the Key Value object on the Line Chart. The only filters applied to these report objects are the page prompts.

And in this last example, you can see how I have no Report or Page prompts or any other filters impacting the Key Value objects. Therefore, these values are representative for the entire data.

Key Value Object in SAS Visual Analytics was published on SAS Users.

6月 232018
 

Once upon a Time

TranscodingOnce upon a time, Oliver S. Füßling merely occupied a line in a SAS® program. But one day, he lost his last name, and a quest began to help our hero find the rest of his name.

Our Story Begins

The SAS Training Center wanted to re-create course data for the "Introduction to Programming 1" class. The updated class uses SAS® Studio, a new programming environment that incorporates a UTF-8 SAS session encoding. However, the course data sets contained national language characters, which are not available on an English keyboard. As a result, depending on how those programs were submitted in the new environment, they experienced the following transcoding problems:

  • character substitution
  • data truncation
  • invalid-data errors

Like the Training Center, you might encounter similar transcoding issues if you have programs that:

  • contain national language characters
  • are created in the WLatin-1 SAS session encoding
  • you move to a UTF-8 SAS session encoding.

This story explains how you can move such programs successfully to a UTF-8 environment and avoid substitution characters, data truncation, and invalid-data errors.

The programs in the "Introduction to Programming 1" class were originally submitted via an earlier English edition of the SAS® Foundation. However, the sample program in this story is created in SAS® 9.4 (English).

When the program is opened in the Enhanced Editor window, this is how a shortened version of the program looks:

Note: If you would like a copy of this program for your own testing, see the Epilogue heading at the end of this post.

In SAS 9.4 (English) for the Windows environment, the default session encoding is WLatin-1. You can see the encoding in the log by running either of the following sets of statements:

  • PROC OPTIONS OPTION=ENCODING;
    RUN;
  • %PUT ENCODING= %SYSFUNC(getOption(ENCODING));

When programs are saved from the Enhanced Editor window, the encoding for the program file defaults to Default - Western (Windows), as shown below.

When the program file that is shown earlier, which contains the name Oliver S. Füßling, is uploaded and included into the SAS Studio code editor, Oliver's last name displays replacement characters rather than the expected national language characters.

Note: This display shows SAS Studio open in a Google Chrome browser. In this browser, you see two characters (diamonds with white question marks) that are substituted for the national language characters.  If you use SAS Studio in Microsoft Internet Explorer, the display shows only one diamond, and it truncates the remainder of the name.

To begin resolving the display problem, you need to look at the code-editor status bar (bottom of the window).

Notice that there is a text-encoding setting that informs SAS Studio of the encoding of the external file. That setting is shown to the right on the status bar. In the display above, that encoding is UTF-8.

Be aware that this text-encoding setting differs from the UTF-8 SAS session encoding that is displayed by the SAS ENCODING system option, which is generated in the log when you run the OPTIONS procedure. In SAS Studio, the default text encoding is UTF-8, regardless of the session encoding. Because the pgm.sas program was saved originally from the Enhanced Editor in the default Western-Windows encoding, it is not in the encoding that the SAS Studio code editor expects.

To fix the display issue, you can use either of the following options:

Option 1

1.  Right-click the program file and select Open with text encoding.

2.  In the Select Text Encoding dialog box, select the windows-1252 encoding value from the Navigation Pane menu.

The Windows code page 1252 represents the character set that is used by Western European languages, including English, in Microsoft Windows operating environments. The WLatin-1 encoding is the SAS equivalent for the 1252 Windows code page.[1]

3.  Click OK to save your selection before you exit the dialog box.

Option 2

From the General tab in the Preferences dialog box, select a value for the default text encoding.

When you set the value in this way, the change is not reflected immediately in the existing code- editor window. You must close the program and re-open it for the setting to take effect. Any programs that you open later will retain the same setting unless you change the setting or override it by selecting another value in the Select Text Encoding dialog box.

Oliver's last name is displayed correctly in the code editor when you use the windows-1252 setting to open the file, as shown below:

However, Oliver's last name is truncated on the HTML Results tab when you submit the program.

The Plot Thickens

Although the problem is fixed in the code editor when you submit the program, Oliver's last name is truncated as Füßli in the output. However, you do not receive any notes or warnings in the log about that truncation. So, why does the truncation happen?  The ü (U-umlaut) and the ß (German Eszett) are stored as single-byte characters (SBCS) in WLatin-1, but those characters require two bytes in UTF-8. As a result, there is not adequate space to print the remaining characters in the name.

When you submit programs that contain national language characters from a single-byte encoding to a UTF-8 environment, you must be prepared to modify the program to use wider informats when you create your variables. Otherwise, character truncation can occur.

You can correct this problem easily by enlarging the column to accommodate the extra bytes that are used to store the characters in UTF-8.

Here is the modified INPUT statement that successfully reads the data in a UTF-8 SAS session. The character informat for the Lname variable is increased from $7. to $9.

input StudID $12. Age Fname $6. Mi :$2. Lname $9.;

After you increase the informat, Oliver's last name is correct when you view it on the HTML Results tab.

A Subplot Appears

What if the program file is included and executed by using the %INCLUDE statement rather than by submitting it from the code editor?

In this situation, the program stops processing with the following errors:

NOTE: The data set WORK.TEST has 1 observations and 5 variables.
NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       cpu time            0.00 seconds
 
ERROR: Invalid characters were present in the data.
ERROR: An error occurred while processing text data.
NOTE: The SAS System stopped processing this step because of errors.
 
NOTE: There were 1 observations read from the data set WORK.TEST.

In this case, the HTML Results tab does not display a last name at all.

To eliminate this error, you need to use the ENCODING= option in the %INCLUDE statement, as shown below.

%include "your-directory/pgm.sas" /encoding="windows-1252";

By including the ENCODING="windows-1252" option in the %INCLUDE statement, the program now executes successfully, as shown by the notes in the log:

 NOTE: The data set WORK.TEST has 1 observations and 5 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.01 seconds
       cpu time            0.01 seconds
 
 
 NOTE: There were 1 observations read from the data set WORK.TEST.

Happily Ever After (or, The End)!

The moral of this story is that there are many ways to avoid transcoding problems when you have national language characters in SAS programs that you save from a SAS®9 (English) session and move to a UTF-8 environment. Hopefully, you can use the tips that are provided to avoid such issues. However, if you still have problems, you can call on another hero, SAS Technical Support, for help!

Epilogue

The following program is the one used throughout this story. You can copy and paste it for your own use.

data test;
   input StudID $12. Age Fname $6. Mi :$2. Lname $7.;
   datalines;
120400310496 15 Oliver S. Füβling
;
 
proc print;
run;

Additional Resources

6月 232018
 

Once upon a Time

TranscodingOnce upon a time, Oliver S. Füßling merely occupied a line in a SAS® program. But one day, he lost his last name, and a quest began to help our hero find the rest of his name.

Our Story Begins

The SAS Training Center wanted to re-create course data for the "Introduction to Programming 1" class. The updated class uses SAS® Studio, a new programming environment that incorporates a UTF-8 SAS session encoding. However, the course data sets contained national language characters, which are not available on an English keyboard. As a result, depending on how those programs were submitted in the new environment, they experienced the following transcoding problems:

  • character substitution
  • data truncation
  • invalid-data errors

Like the Training Center, you might encounter similar transcoding issues if you have programs that:

  • contain national language characters
  • are created in the WLatin-1 SAS session encoding
  • you move to a UTF-8 SAS session encoding.

This story explains how you can move such programs successfully to a UTF-8 environment and avoid substitution characters, data truncation, and invalid-data errors.

The programs in the "Introduction to Programming 1" class were originally submitted via an earlier English edition of the SAS® Foundation. However, the sample program in this story is created in SAS® 9.4 (English).

When the program is opened in the Enhanced Editor window, this is how a shortened version of the program looks:

Note: If you would like a copy of this program for your own testing, see the Epilogue heading at the end of this post.

In SAS 9.4 (English) for the Windows environment, the default session encoding is WLatin-1. You can see the encoding in the log by running either of the following sets of statements:

  • PROC OPTIONS OPTION=ENCODING;
    RUN;
  • %PUT ENCODING= %SYSFUNC(getOption(ENCODING));

When programs are saved from the Enhanced Editor window, the encoding for the program file defaults to Default - Western (Windows), as shown below.

When the program file that is shown earlier, which contains the name Oliver S. Füßling, is uploaded and included into the SAS Studio code editor, Oliver's last name displays replacement characters rather than the expected national language characters.

Note: This display shows SAS Studio open in a Google Chrome browser. In this browser, you see two characters (diamonds with white question marks) that are substituted for the national language characters.  If you use SAS Studio in Microsoft Internet Explorer, the display shows only one diamond, and it truncates the remainder of the name.

To begin resolving the display problem, you need to look at the code-editor status bar (bottom of the window).

Notice that there is a text-encoding setting that informs SAS Studio of the encoding of the external file. That setting is shown to the right on the status bar. In the display above, that encoding is UTF-8.

Be aware that this text-encoding setting differs from the UTF-8 SAS session encoding that is displayed by the SAS ENCODING system option, which is generated in the log when you run the OPTIONS procedure. In SAS Studio, the default text encoding is UTF-8, regardless of the session encoding. Because the pgm.sas program was saved originally from the Enhanced Editor in the default Western-Windows encoding, it is not in the encoding that the SAS Studio code editor expects.

To fix the display issue, you can use either of the following options:

Option 1

1.  Right-click the program file and select Open with text encoding.

2.  In the Select Text Encoding dialog box, select the windows-1252 encoding value from the Navigation Pane menu.

The Windows code page 1252 represents the character set that is used by Western European languages, including English, in Microsoft Windows operating environments. The WLatin-1 encoding is the SAS equivalent for the 1252 Windows code page.[1]

3.  Click OK to save your selection before you exit the dialog box.

Option 2

From the General tab in the Preferences dialog box, select a value for the default text encoding.

When you set the value in this way, the change is not reflected immediately in the existing code- editor window. You must close the program and re-open it for the setting to take effect. Any programs that you open later will retain the same setting unless you change the setting or override it by selecting another value in the Select Text Encoding dialog box.

Oliver's last name is displayed correctly in the code editor when you use the windows-1252 setting to open the file, as shown below:

However, Oliver's last name is truncated on the HTML Results tab when you submit the program.

The Plot Thickens

Although the problem is fixed in the code editor when you submit the program, Oliver's last name is truncated as Füßli in the output. However, you do not receive any notes or warnings in the log about that truncation. So, why does the truncation happen?  The ü (U-umlaut) and the ß (German Eszett) are stored as single-byte characters (SBCS) in WLatin-1, but those characters require two bytes in UTF-8. As a result, there is not adequate space to print the remaining characters in the name.

When you submit programs that contain national language characters from a single-byte encoding to a UTF-8 environment, you must be prepared to modify the program to use wider informats when you create your variables. Otherwise, character truncation can occur.

You can correct this problem easily by enlarging the column to accommodate the extra bytes that are used to store the characters in UTF-8.

Here is the modified INPUT statement that successfully reads the data in a UTF-8 SAS session. The character informat for the Lname variable is increased from $7. to $9.

input StudID $12. Age Fname $6. Mi :$2. Lname $9.;

After you increase the informat, Oliver's last name is correct when you view it on the HTML Results tab.

A Subplot Appears

What if the program file is included and executed by using the %INCLUDE statement rather than by submitting it from the code editor?

In this situation, the program stops processing with the following errors:

NOTE: The data set WORK.TEST has 1 observations and 5 variables.
NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       cpu time            0.00 seconds
 
ERROR: Invalid characters were present in the data.
ERROR: An error occurred while processing text data.
NOTE: The SAS System stopped processing this step because of errors.
 
NOTE: There were 1 observations read from the data set WORK.TEST.

In this case, the HTML Results tab does not display a last name at all.

To eliminate this error, you need to use the ENCODING= option in the %INCLUDE statement, as shown below.

%include "your-directory/pgm.sas" /encoding="windows-1252";

By including the ENCODING="windows-1252" option in the %INCLUDE statement, the program now executes successfully, as shown by the notes in the log:

 NOTE: The data set WORK.TEST has 1 observations and 5 variables.
 NOTE: DATA statement used (Total process time):
       real time           0.01 seconds
       cpu time            0.01 seconds
 
 
 NOTE: There were 1 observations read from the data set WORK.TEST.

Happily Ever After (or, The End)!

The moral of this story is that there are many ways to avoid transcoding problems when you have national language characters in SAS programs that you save from a SAS®9 (English) session and move to a UTF-8 environment. Hopefully, you can use the tips that are provided to avoid such issues. However, if you still have problems, you can call on another hero, SAS Technical Support, for help!

Epilogue

The following program is the one used throughout this story. You can copy and paste it for your own use.

data test;
   input StudID $12. Age Fname $6. Mi :$2. Lname $7.;
   datalines;
120400310496 15 Oliver S. Füβling
;
 
proc print;
run;

Additional Resources

6月 222018
 

reference CAS tables using a one-level nameReferencing tables

With the SAS language the way one references a table in code is by using a two-level name or a one-level name. With two-level names, one supplies the libref as well as the table name i.e. PROC MEANS DATA = WORK.TABLE;. By default, all one-level names also refer to SASWORK i.e. PROC MEANS DATA = TABLE;

CAS

To reference CAS tables using a one-level name we will issue two statements that alter which libref houses the tables referenced as one-level names. First, we will create a CAS libref (line 77) followed by the “options user” statement on line 80. It is line 80 that changes the default location from SASWORK to the CAS libref i.e. CASWORK.Figure 1. Statements to alter the default location for one-level names

How to reference one-level CAS tables

From this point on all one-level names referenced in code will be tables managed by CAS. In figure 2 we are creating a one-level CAS table called baseball by reading the two-level named table SASHELP.BASEBALL. This step executes in the SAS Programing Runtime Engine (a SAS Viya based workspace server) and creates the table CASWORK.BASEBALL. Because of the “options user” statement we can now also reference that table using a one-level name i.e. BASEBALL.

Figure 2. Loading a two-level named table into a one-level named table that is managed by CAS

In Figure 3 we will use a DATA Step to read a CAS table and write a CAS table using one-level names. We can also see by reviewing the notes in the SAS log that this DATA Step ran in CAS using multiple threads.

Figure 3. DATA Step referencing one-level named tables

In Figure 4 we observe this PROC MEANS is processing the one-level named table BASEBALL. By reviewing the notes in the SAS log we can see this PROC MEANS ran distributed in CAS.

Figure 4. PROC MEANS referencing one-level named CAS table

Because the default location for one-level names is SASWORK, all tables in SASWORK are automatically deleted when the SAS session ends. When one changes the default location for one-level names, like we just did, it is a best practice to use a PROC DELETE as the last statement in your code to delete all one-level tables managed by CAS, figure 5.

Figure 5. PROC DELETE deleting all one-level CAS tables

Conclusion

It is a very common SAS coding technique to read a source table from a non-SAS data store and write it to SASWORK. By using the technique describe in this blog one now has options on where to store the one-level tables names. As for me, I prefer storing them in CAS so I benefit from the distributed process (faster runtimes) that CAS offers.

How to reference CAS tables using a one-level name was published on SAS Users.

6月 222018
 

As a follow up to my previous blog, I want to address connecting to SAS Viya 3.3 using a One-Time-Password generated by SAS 9.4. I will talk about how this authentication flow operates and when we are likely to require it.

To start with, a One-Time-Password is generated by a SAS 9.4 Metadata Server when we connect to a resource via the metadata. For example, whenever we connect to the SAS 9.4 Stored Process Server we leverage a One-Time-Password. Sometimes this is referred to as a “trusted connection,” in that the resource we are connecting to is configured to “trust” the single-use credential generated by the SAS 9.4 Metadata Server.

To make the connection, the client application connects to the SAS 9.4 Metadata Server and requests the One-Time-Password (OTP). This OTP is sent by the client to the resource along with the username that has “@!*(generatedpassworddomain)*!” appended to it. The resource then connects back to the SAS 9.4 Metadata Server to validate the OTP and allow access.

What Does OTP mean for SAS Viya?

First and foremost, we cannot use the OTP to access the SAS Viya 3.3 Visual Interfaces. OTP is not a mechanism to allow SAS Viya 3.3 to be authenticated by SAS 9.4.

The One-Time-Password enables a process running in SAS 9.4 Maintenance 5 (M5), that does not have the end-user credentials, to access SAS Cloud Analytic Services running on SAS Viya 3.3. The easiest and clearest example is that a SAS 9.4 M5 Stored Process can now access the advanced analytics features of SAS Cloud Analytic Services. Equally, the same process would work with a SAS 9.4 M5 Workspace Server that has been configured for “trusted authentication,” where the operating system process runs as a launch credential rather than the end user.

How Does the OTP Work?

If we continue the example of a SAS 9.4 M5 Stored Process, the SAS code in the Stored Process includes a CAS statement or CAS LIBNAME. In the CAS statement the authdomain is specified as _sasmeta_; this tells the Stored Process to connect to SAS 9.4 M5 Metadata to obtain credentials. The SAS 9.4 M5 Metadata returns a One-Time-Password to the Stored Process and this is used in the connection to SAS Cloud Analytic Services.

SAS Cloud Analytic Services authenticates the incoming connection using the OTP. Since the user is flagged with “@!*(generatedpassworddomain)*!” SAS Cloud Analytic Services knows not to authenticate the user against the PAM stack on the host. SAS Cloud Analytic Services instead connects to the SAS Viya 3.3 SAS Logon Manager to obtain an internal OAuth token to authenticate the connection.

The SAS Viya 3.3 SAS Logon Manager has been configured with information about the SAS 9.4 M5 environment, specifically, the host running the SAS Web Infrastructure Platform, in the form of a URL. Since the user is “@!*(generatedpassworddomain)*!”, SAS Viya 3.3 SAS Logon Manager knows to send this to the SAS 9.4 M5 Web Infrastructure Platform to validate the OTP. Once the OTP is validated, the SAS Viya 3.3 Logon Manager can generate an internal OAuth token, including retrieving the end-users group information from the Identities microservice. This internal OAuth token is returned to SAS Cloud Analytic Services and the session launched.

The diagram below describes these steps:

SAS Viya connecting with SAS 9.4

The general steps include:

1.     The SAS 9.4 M5 SAS Server, running with a launch credential (Stored Process, Pooled Workspace, or Workspace Server) requests a One-Time Password from the Metadata Server for the connection to SAS Cloud Analytic Services.
2.     The SAS 9.4 M5 SAS Server connects to the CAS Server Controller, sending the One-Time Password.
3.     The CAS Controller connects to SAS Logon Manager to obtain an internal OAuth token using the One-Time Password.
4.     SAS Logon Manager connects via the SAS 9.4 M5 Middle-Tier to validate the One-Time Password.
5.     SAS 9.4 M5 Middle-Tier connects to the Metadata Server to validate the One-Time Password.
6.     SAS Logon Manager connects to the identities microservice to fetch custom and LDAP group information for the validated End-User.
7.     The identities microservice either looks up the validated End-User in its cache or connects to Active Directory using the LDAP Service Account to update the cache.
8.     SAS Logon Manager returns a valid internal OAuth token to the SAS CAS Server Controller.
9.     SAS CAS Server Controller launches the CAS Session Controller as the service account for the End-User.

Note that none of the processes are running as the end-user. The SAS 9.4 process is running with a launch credential, either sassrv or some other account, whilehe SAS Cloud Analytic Services session runs as the account starting the SAS Cloud Analytic Services process, by default the CAS account.

What do we need to configure?

Now that we understand how the process operates, we can look at what we need to configure to make this work correctly. We need to make changes on both the SAS 9.4 M5 side and the SAS Viya 3.3 side. For SAS 9.4 M5 we need to:

1.     Register the SAS CAS Server in Metadata. As of SAS 9.4 M5, the templates for adding a server include SAS Cloud Analytic Services.
2.     Optionally we might also register libraries against the SAS CAS Server in the SAS 9.4 M5 Metadata.

For SAS Viya 3.3 we need to:

1.     Configure SAS Logon Manager with the information about the SAS 9.4 M5 Web Infrastructure Platform, under sas.logon.sas9, as shown below.
2.     Ensure the usernames from SAS 9.4 M5 are the same as those returned by the SAS Identities microservice.

The SAS Viya 3.3 SAS Logon Manager will need to be restarted after adding the definition shown here:

Conclusion

By leveraging the One-Time-Password, we make the power of SAS Cloud Analytic Services directly available to a wider range of SAS 9.4 M5 server process. This means our end-users, whether they are using SAS Stored Process Server, Pooled Workspace Server, or even a Workspace Server using a launched credential, can now directly access SAS Cloud Analytic Services.

SAS Viya connecting with SAS 9.4 One-Time-Passwords was published on SAS Users.

6月 192018
 

CAS DATA StepCloud Analytic Services (CAS) is really exciting. It’s open. It’s multi-threaded. It’s distributed. And, best of all for SAS programmers, it’s SAS. It looks like SAS. It feels like SAS. In fact, you can even run DATA Step in CAS. But, how does DATA Step work in a multi-threaded, distributed context? What’s new? What’s different? If I’m a SAS programming wizard, am I automatically a CAS programming wizard?

While there are certain _n_ automatic variable as shown below:

DATA tableWithUniqueID;
SET tableWithOutUniqueID; 
 
        uniqueID = _n_;
 
run;

CAS DATA Step

Creating a unique ID in CAS DATA Step is a bit more complicated. Each thread maintains its own _n_. So, if we just use _n_, we’ll get duplicate IDs. Each thread will produce an uniqueID field value of 1, 2..and so on. …. When the thread output is combined, we’ll have a bunch of records with an uniqueID of 1 and a bunch with an uniqueID of 2…. This is not useful.

To produce a truly unique ID, you need to augment _n_ with something else. _threadID_ automatic variable can help us get our unique ID as shown below:

DATA tableWithUniqueID;
SET tableWithOutUniqueID;
 
        uniqueID = put(_threadid_,8.) || || '_' || Put(_n_,8.);
 
run;

While there are surely other ways of doing it, concatenating _threadID_ with _n_ ensures uniqueness because the _threadID_ uniquely identifies a single thread and _n_ uniquely identifies a single row output by that thread.

Aggregation with DATA Step

Now, let’s look at “whole table” aggregation (no BY Groups).

SAS DATA Step

Aggregating an entire table in SAS DATA Step usually looks something like below. We create an aggregator field (totSalesAmt) and then add the detail records’ amount field (SaleAmt) to it as we process each record. Finally, when there are no more records (eof), we output the single aggregate row.

DATA aggregatedTable ;
SET detailedTable end=eof;
 
      retain totSalesAmt 0;
      totSalesAmt = totSalesAmt + SaleAmt;
      keep totSalesAmt;
      if eof then output;
 
run;

CAS DATA Step

While the above code returns one row in single-engine SAS, the same code returns multiple rows in CAS — one per thread. When I ran this code against a table in my environment, I got 28 rows (because CAS used 28 threads in this example).

As with the unique ID logic, producing a total aggregate is just a little more complicated in CAS. To make it work in CAS, we need a post-process step to bring the results together. So, our code would look like this:

DATA aggregatedTable ;
SET detailedTable end=eof;
 
      retain threadSalesAmt 0;
      threadSalesAmt = threadSalesAmt + SaleAmt;
      keep threadSalesAmt;
      if eof then output;
 
run;
 
DATA aggregatedTable / single=yes;
SET aggregatedTable end=eof;
 
      retain totSalesAmt 0;
      totSalesAmt = totSalesAmt + threadSalesAmt;
      if eof then output;
 
run;

In the first data step in the above example, we ran basically the same code as in the SAS DATA Step example. In that step, we let CAS do its distributed, multi-threaded processing because our table is large. Spreading the work over multiple threads makes the aggregation much quicker. After this, we execute a second DATA Step but here we force CAS to use only one thread with the single=yes option. This ensures we only get one output row because CAS only uses one thread. Using a single thread in this case is optimal because we’ll only have a few input records (one per thread from the previous step).

BY-GROUP Aggregation

Individual threads are then assigned to individual BY-Groups. Since each BY-Group is processed by one and only one thread, when we aggregate, we won’t see multiple output rows for a BY-Group. So, there shouldn’t be a need to consolidate the thread results like there was with “whole table” aggregation above.

Consequently, BY-Group aggregation DATA Step code should look exactly the same in CAS and SAS (at least for the basic stuff).

Concluding Thoughts

Coding DATA Step in CAS is very similar to coding DATA Step in SAS. If you’re a wizard in one, you’re likely a wizard in the other. The major difference is accounting for CAS’ massively parallel processing capabilities (which manifest as threads). For more insight into data processing with CAS, check out the SAS Global Forum paper.

Threads and CAS DATA Step was published on SAS Users.

6月 142018
 

In SAS Visual Analytics 8.2 on SAS Viya 3.3, there are a number of new data features available. Some of these features are completely new, and some are features from the 7.x release that had not yet been included in the 8.1 release.  I’ll cover a few of these new features in this post.

First of all, the Data pane interface has changed to enable users to access actions via fewer and better organized menus.

Data item properties can also be displayed for viewing or editing with a single click.

The new Change data source action displays a Repair report window if report data items are not in the new data source.  The window enables you to replace the missing data items with replacement data items from the new data source before continuing with the change.

Speaking of mapping, in SAS Visual Analytics 8.2, linked selections and filters can automatically be add to objects, and the objects may use different data sources. In that case, you can manually map data sources from the data pane.  The + icon enables you to add additional pairs of mappings.

When you create a new Geography data item in SAS Visual Analytics 8.2, in addition to using Predefined names and codes or your own custom latitude and longitude data items, you can now also use custom polygon shapes to display your own custom regions. Once you select Custom polygon shapes, you specify, in additional dialogs, the characteristics of your polygon provider.  You can use a CAS table or an Esri Feature Service.

For more information on custom polygons, see my previous blog here.

If you need to use and Esri shape file for your polygon data, there are macros available in VA 8.2 to convert the data to a SAS dataset and to load the data into CAS.

  • %SHPCNTNT display the contents of the shape file
  • %SHPIMPRT converts the shapefile into a SAS dataset and loads it into CAS.

The Custom Sort feature is also back in SAS Visual Analytics 8.2. Just right-click the data item, select Custom sort, and then select and order your data values.

For creating a new derived data item, there are several new calculations available for measures:

And speaking of creating calculated data items, you’ll want to check out three useful new operators that are available in SAS Visual Analytics 8.2:

A look at the new data pane and data item features in SAS Visual Analytics 8.2 was published on SAS Users.

6月 092018
 

SAS Studio is the latest way you can access SAS. This newer interface allows users to reach SAS through a web browser, offering a number of unique ways that SAS can be optimized. At SAS Global Forum 2018, Lora Delwiche (SAS) and Susan J Slaughter (Avocet Solutions) gave the presentation, “SAS Studio: A New Way to Program in SAS.” This post reviews the paper, offering you insights of how to enhance your SAS Studio programming performance.

This new interface is a popular one, as it is included in Base SAS and used for SAS University Edition and SAS OnDemand for Academics. It can be considered a self-serving system, since you write programs in SAS Studio itself that are then processed through SAS and delivered results. Its ease of accessibility from a range of computers is putting it in high demand – which is why you should learn how to optimize its use.

How to operate

A SAS server processes your coding and returns the results to your browser, in order to make the programs run successfully. By operating in Programmer mode, you are given the capabilities to view Code, Log, and Results. On the right side of the screen you can write your code, and the toolbar allows you to access the many different tools that are offered.

SAS Studio

Libraries are used to access your SAS data sets, where you can also see the variables contained in each set. You can create your own libraries, and set the path for your folder through SAS Studio.

In order to view each data set, the navigation pane can also be used. Right click on the data set name and select “Open” to access files through this method. These datasets can be adjusted in a number of ways: columns can be shifted around by dragging the headings; column sizes can be adjusted; the top right corner has arrows to view more information; clicking on the column heading will sort that data.

 

In order to control your data easily, filters can be used. Filters are accessed by right-clicking the column heading and selecting the filter that best fits your needs.

How to successfully code

A unique feature to SAS Studio is its code editor that will automatically format your code. Clicking on the icon will properly format each statement and put it on its own line. Additionally, syntax help pops up as you type to give you possible suggestions in your syntax, a tool that can be turned on or off through the Preferences window.

One tool that’s particularly useful is the snippet tool, where you can copy and paste frequently used code.

Implementing and Results

After code is written, the Log tool can help you review your code, whereas Results will generate your code carried out after it has been processed. The Results tab will give you shareable items that can be saved or printed for analysis purposes.

Conclusion

These insights offer just a glimpse of all of the capabilities in programming through SAS Studio. Through easy browser access, your code can be shared and analyzed with a few clicks.

Additional Resources

Additional SAS Global Forum Proceedings
SAS Studio Videos
SAS Studio Courses
SAS Studio Programming Starter Guide
SAS Studio Blogs
SAS Studio Community

Other SAS Global Forum Programming Papers of Interest

Code Like It Matters: Writing Code That's Readable and Shareable
Paul Kaefer

Identifying Duplicate Variables in a SAS ® Data Set
Bruce Gilsen

Macros I Use Every Day (And You Can, Too!)
Joe DeShon

Merge with Caution: How to Avoid Common Problems when Combining SAS Datasets
Joshua M. Horstman

SAS Studio: A new way to program in SAS was published on SAS Users.