SAS Programming

September 17, 2020
 

Unquote by removing matching quotes

Before we delve into unquoting SAS character variables, let’s briefly review existing SAS functionality related to quoting and unquoting character strings.

%QUOTE and %UNQUOTE macro functions

Don’t be fooled by these macro functions’ names. They have nothing to do with quoting or un-quoting character variables’ values. Moreover, they have nothing to do with quoting or un-quoting even macro variables’ values. According to the %QUOTE Macro Function documentation, it masks special characters and mnemonic operators in a resolved value at macro execution. The %UNQUOTE Macro Function unmasks all special characters and mnemonic operators so they are interpreted as macro language elements instead of as text. There are many other SAS “macro quoting functions” (%SUPERQ, %BQUOTE, %NRBQUOTE, all macro functions whose names start with %Q: %QSCAN, %QSUBSTR, %QSYSFUNC, etc.) that perform some action including masking.

Historically, however, SAS Macro Language uses terms “quote” and “unquote” to denote “mask” and “unmask”. Keep that in mind when reading SAS Macro documentation.
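
For a quick feel of what masking and unmasking mean at the macro level, here is a minimal sketch (it operates on macro text only, not on character variables):

%let val = %nrstr(&sysdate);      /* %NRSTR masks the ampersand, so &sysdate is stored as plain text */
%put Masked:   &val;              /* writes the literal text &sysdate */
%put Unmasked: %unquote(&val);    /* unmasking lets the macro processor resolve &sysdate to the date */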

QUOTE function

Most SAS programmers are familiar with the QUOTE function that adds quotation marks around a character value. It can add double quotation marks (by default) or single quotation marks if you specify that in its second argument.

This function goes even further as it doubles any quotation mark that already existed within the value to make sure that an embedded quotation mark is escaped (not treated as an opening or closing quotation mark) during parsing.
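
Here is a minimal sketch illustrating both behaviors:

data _null_;
   x  = 'Tom''s "new" car';
   y1 = quote(x);         /* default: "Tom's ""new"" car" - embedded " is doubled */
   y2 = quote(x, "'");    /* single quotation marks: 'Tom''s "new" car' - embedded ' is doubled */
   put x= / y1= / y2=;
run;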

DEQUOTE function

There is also a complementary DEQUOTE function that removes matching quotation marks from a character string that begins with a quotation mark. But be warned that it also deletes all characters to the right of the first matching quotation mark. In my view, deleting those characters is overkill because when writing a SAS program, we may not know what is going to be in the data and whether it’s okay to delete its part outside the first matching quotes. That is why you need to be extra careful if you decide to use this function. Here is an example of what I mean. If you run the following code:

data a;
   input x $ 1-50;
   datalines;
'This is what you get. Let's be careful.'
;
 
data _null_;
   set a;
   y = dequote(x);
   put x= / y=;
run;

you will get the following in the SAS log:

y=This is what you get. Let

This is hardly what you really wanted as you have just lost valuable information – part of the y character value got deleted: 's be careful. I would rather not remove the quotation marks at all than remove them at the expense of losing meaningful information.

$QUOTE informat

The $QUOTE informat does exactly what the DEQUOTE() function does, that is, it removes matching quotation marks from a character string that begins with a quotation mark. You can use it in the example above by replacing

y = dequote(x);

with the INPUT() function

y = input(x, $quote50.);

Or you can use it directly in the INPUT statement when reading raw data from datalines or an external file:

input x $quote50.;

Both the $QUOTE informat and the DEQUOTE() function, in addition to removing all characters to the right of the closing quotation mark, do the following unconventional, peculiar things:

  • Remove a lone quotation mark (either double or single) when it’s the only character in the string; apparently, the lone quotation mark is matched to itself.
  • Match single quotation mark with double quotation mark as if they are the same.
  • Remove matching quotation marks from a character string that begins with a quotation mark; if your string has one or more leading blanks (that is, a quotation mark is not the first character), nothing gets removed (un-quoted).

If the described behavior matches your use case, you are welcome to use either $QUOTE informat or DEQUOTE() function. Otherwise, please read on.
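
If you want to observe these quirks yourself, a quick sketch like this will do (the inputs correspond to the three bullets above):

data _null_;
   a = dequote('"');                   /* lone quotation mark               */
   b = dequote('"mixed quotes''');     /* starts with " but ends with '     */
   c = dequote('   "leading blanks"'); /* quotation mark is not character 1 */
   put a= / b= / c=;
run;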

UNQUOTE function definition

Up to this point such a function did not exist, but we are about to create one to justify the title. Let’s keep it simple and straightforward. Here is what I propose our new UNQUOTE() function should do:

  • If first and last non-blank characters of a character string value are matching quotation marks, we will remove them. We will not consider quotation marks matching if one of them is a single quotation mark and another is a double quotation mark.
  • We will remove those matching quotation marks whether they are both single quotation marks OR both double quotation marks.
  • We are not going to remove or change any other quotation marks that may be present within those matching quotation marks that we remove.
  • We will remove leading and trailing blanks outside the matching quotation marks that we delete.
  • However, we will not remove any leading or trailing blanks within the matching quotation marks that we delete. You may additionally apply the STRIP() function if you need to do that.

To summarize these specifications, our new UNQUOTE() function will extract a character substring within matching quotation marks if they are the first and the last non-blank characters in a character string. Otherwise, it returns the character argument unchanged.
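
As an aside, if compiling a function is not an option, a Perl-regular-expression one-liner can approximate the same specification directly in a DATA step. This is only a sketch of an alternative and is not part of the FCMP solution that follows:

data _null_;
   length s new_s $100;
   s = " 'How ""about"" this?' ";
   /* remove first/last matching quotation marks (same type) and the blanks outside them;
      strings that do not match the pattern are returned unchanged */
   new_s = prxchange("s/^\s*([""'])(.*)\1\s*$/$2/", 1, s);
   put s= / new_s=;
run;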

UNQUOTE function implementation

Here is how such a function can be implemented using PROC FCMP:

libname funclib 'c:\projects\functions';
 
proc fcmp outlib=funclib.userfuncs.v1; /* outlib=libname.dataset.package */
   function unquote(x $) $32767;
      pos1 = notspace(x); *<- first non-blank character position;
      if pos1=0 then return (x); *<- empty string;
 
      char1 = char(x, pos1); *<- first non-blank character;
      if char1 not in ('"', "'") then return (x); *<- first non-blank character is not " or ' ;
 
      posL = notspace(x, -length(x)); *<- last non-blank character position;
 
      if pos1=posL then return (x); *<- single character string;
 
      charL = char(x, posL); *<- last non-blank character;
      if charL^=char1 then return (x); *<- last non-blank character does not match first;
 
      /* at this point we should have matching quotation marks */
      return (substrn(x, pos1 + 1, posL - pos1 - 1)); *<- remove first and last quotation character;
   endfunc; 
run;

Here are the highlights of this implementation:

We use multiple RETURN statements: we sequentially check for different special conditions and if one of them is met we return the argument value intact. The RETURN statement does not just return the value, but also stops any further function execution.

At the very end, after making sure that none of the special conditions is met, we strip the argument value from the matching quotation marks along with the leading and trailing blanks outside of them.

NOTE: SAS user-defined functions are stored in a SAS data set specified in the OUTLIB= option of the PROC FCMP statement. It requires a 3-level name (libref.datasetname.packagename) for the function definition location to allow several versions of the same-name function to be stored there.

However, when a user-defined function is used in a SAS DATA Step, only a 2-level name can be specified (libref.datasetname). If that data set has several same-name functions stored in different packages the DATA Step uses the latest function definition (found in a package closest to the bottom of the data set).

UNQUOTE function results

Let’s use the following code to test our newly minted user-defined function UNQUOTE():

libname funclib 'c:\projects\functions';
options cmplib=funclib.userfuncs;
 
data A;
   infile datalines truncover;
   input @1 S $char100.;
   datalines;
'
"
How about this?
    How about this?
"How about this?"
'How about this?'
"How about this?'
'How about this?"
"   How about this?"
'      How about this?'
'      How "about" this?'
'      How 'about' this?'
   "     How about this?"
   "     How "about" this?"
   "     How 'about' this?"
   '     How about this?'
;
 
data B;
   set A;
   length NEW_S $100;
   label NEW_S = 'unquote(S)';
   NEW_S = unquote(S);
run;

This code produces the following output table:

Example of character string unquoting
As you can see it does exactly what we wanted it to do – removing matching first and last quotation marks as well as stripping out blanks outside the matching quotation marks.

DSD (Delimiter-Sensitive Data) option

This INFILE statement option is particularly useful when using LIST input to read and un-quote comma-delimited raw data. In addition to removing enclosing quotation marks from character values, the DSD option specifies that when data values are enclosed in quotation marks, delimiters within the value are masked, that is, treated as character data (not as delimiters). It also sets the default delimiter to a comma and treats two consecutive delimiters as a missing value.

In contrast with the above UNQUOTE() function, the DSD option will not remove enclosing quotation marks if additional quotation marks of the same kind are present inside the character value. When the DSD option does strip enclosing quotation marks, it also strips leading and trailing blanks both outside and within the removed quotation marks.
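
Here is a minimal sketch of the DSD option at work (comma delimiters inside quotation marks are kept as data, and two consecutive commas yield a missing value):

data dsd_demo;
   infile datalines dsd truncover;   /* DSD also sets the default delimiter to a comma */
   input name :$20. city :$15. amount;
   datalines;
"Smith, John",Raleigh,100
"Doe, Jane",,200
;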

Additional Resources

Your thoughts?

Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments below.

How to unquote SAS character variable values was published on SAS Users.

September 2, 2020
 

SAS offering free learning resources in celebration of programmers

For more than 40 years, SAS programmers have crafted software and solutions that transform the world. From statistics to data science, to analytics and artificial intelligence, people writing code have architected a new economy with incredible opportunities. SAS Programmer Week honors those people by offering free learning resources available for everyone, from students to early career professionals to SAS veterans.

Running from Sept. 7-11, SAS Programmer Week leads up to the international Day of the Programmer on Saturday, Sept. 12. Training resources will be available for free through a variety of YouTube and video tutorials, webinars, blogs and documentation.

There will be three different tracks for new, experienced and analytics-focused users, with new content released each day. The week culminates with SAS certification prep content that will have participants ready to pursue a valuable SAS credential.

For instance, Tech Republic named SAS as one of 7 data science certifications to boost your resume and salary. CIO Magazine puts SAS among the top 11 big data and analytics certifications for 2020.

Since SAS programmers are busy and may not have all day to engage with the materials, SAS Programmer Week is flexible. Participants can access the material when they want to learn a specific skill related to the day’s topic or consume the material in snippets when they have time.

Interested participants can visit the SAS Programmer Week website to register today, preview the materials and schedule, and jump-start their career journeys.

 

All hail the SAS programmer! was published on SAS Users.

August 24, 2020
 

I got a lot of feedback about my recent article about how to find roots of nonlinear functions by using the SOLVE function in PROC FCMP. A colleague asked how the FCMP procedure stores the functions. Specifically, why does the OUTLIB= option on the PROC FCMP statement use a three-level syntax: OUTLIB=libref.DataSetName.PackageName? The three levels are a libref, a data set name, and a package name. The documentation is terse about what the third level (the package name) is used for and why it is important. This article describes how the FCMP-defined functions are stored, and how you can use the package name to call different versions of a function.

This article is my attempt to "reverse engineer" how PROC FCMP stores functions based on what I have read and observed. In addition to the FCMP documentation, I recommend reading Secosky (2007) and Eberhardt (2009). Feel free to add your own knowledge in the comments.

How FCMP functions are stored

I started writing about the capabilities of the FCMP procedure in 2012, but the procedure itself goes back to SAS 9.2. Modern versions of SAS store functions in an analytic store (which is read by using PROC ASTORE) or in an item store (which is read by using PROC PLM). But these binary storage formats had not yet been developed back in the pre-9.2 days. So PROC FCMP stores functions in a SAS data set. That means you can use PROC PRINT to investigate how PROC FCMP stores functions.

When you use the OUTLIB= option in PROC FCMP, you specify a three-level name: OUTLIB=libref.DataSetName.PackageName. The first two levels specify the name of a SAS data set. This data set is created if it doesn't exist, or it is modified if it already exists. The third level is used as a text field in a variable named _KEY_, which enables one data set to contain functions that belong to different packages. The package name becomes important if two packages define a function that has the same name.

To demonstrate, let's define some functions and store them in a data set named Work.MyFuncs. The following statements create two functions (A and B) that belong to the 'PROD' (for 'Production') package and one function (A) that belongs to the 'DEV' (for 'Development') package. Notice that both packages have a function named 'A'. The following statements define the functions and use PROC PRINT to display a portion of the Work.MyFuncs data set:

/* Store all functions in the data set WORK.MyFuncs */
/* Define functions in 'PROD' package */
proc fcmp outlib=work.MyFuncs.Prod;
   function A(x);
      return( x );      /* in the 'Prod' pkg, A(x) equals x */
   endsub;
   function B(x);
      return( x > 0 );
   endsub;
quit;
 
/* Define functions in 'DEV' package */
proc fcmp outlib=work.MyFuncs.Dev;
   function A(x);
      return( 2*x );    /* the 'Dev' pkg uses a different definition for A(x) */
   endsub;
quit;
 
proc print data=work.MyFuncs;
   var _Key_ Sequence Type Subtype Name;
run;

The output from PROC PRINT is shown. The data set contains 20 rows. I have put a red rectangle around rows 1–13 and another around rows 14–20. Each rectangle defines a package. The names of the packages are defined by the observations where Subtype='Package', which are highlighted in yellow. The Type, Subtype, and Name columns indicate how the FCMP statements that define the functions are stored in the data set. The _KEY_ column identifies which rows define which functions. There are other columns (not shown) that store the actual content of each function.

This output shows how the third level of the OUTLIB= option is used. The _KEY_ column records the package name and appends each function name ('A' or 'B') to the name of the package. So PROC FCMP knows that there are three stored functions whose full names are PROD.A, PROD.B, and DEV.A.

Calling a function from the DATA step

Since there are two functions called 'A', what happens if I call 'A' from a SAS DATA step? The answer is that the DATA step uses the most recent definition, which in this example is DEV.A. To alert you to the fact that calling 'A' is ambiguous, the SAS log displays a warning. You can also use the _DISPLAYLOC_ flag on the CMPLIB= system option to display the origin of each call to an FCMP function, as follows:

/* Tell the DATA step where to look for unresolved functions.
   The _DISPLAYLOC_ flag shows the full name for each call to an FCMP function */
options cmplib=(work.MyFuncs _DISPLAYLOC_); 
data Want;
   x = 1; y = A(x);   /* y is the result of the latest definition */
run;
 
proc print data=Want noobs; run;
WARNING: Function 'A' was defined in a previous package. 'A' in current
         package DEV will be used as default when the package name is not
         specified.
 
NOTE: Function 'A' loaded from work.MyFuncs.DEV.

The values of the X and Y variables make it clear that the function DEV.A was called (because A(x)=2*x in that definition). The WARNING and NOTE in the SAS log reinforce this fact.

Choosing which package to call

The WARNING in the previous section says that the current (most recent) package "will be used as default when the package name is not specified." This message seems to imply that you can somehow call PROD.A, which is the other stored function that is named 'A'. This is, in fact, true. The PROC FCMP documentation states, "to select a specific subroutine when there is ambiguity, use the package name and a period as the prefix to the subroutine name."

You cannot specify the package name directly in the DATA step, but you can specify the package name in an FCMP function. So, for example, you can define a function called 'ChooseA' that includes a flag that indicates which package to use. The following PROC FCMP statements define a function that will call either PROD.A or DEV.A, depending on the value of a parameter. This wrapper function can then be called in the DATA step:

/* In PROC FCMP, you can "dis-ambiguate" by using a two-level function name */
proc fcmp outlib=work.MyFuncs.Choose;
   function ChooseA(x, choice $);
      if upcase(choice)="DEV" then
        return( Dev.A(x) );
      else
        return( Prod.A(x) );
   endsub;
quit;
 
data WantChoice;
   x = 1;
   y_Dev  = ChooseA(x, "Dev");   /* call Dev.A */
   y_Prod = ChooseA(x, "Prod");  /* call Prod.A */
run;
 
proc print data=WantChoice noobs; run;

From the definitions of DEV.A and PROD.A, you can verify that each function was called correctly. Because the _DISPLAYLOC_ option is still active, the SAS log also indicates that each function was called.

Summary

This article was motivated by a question about how the FCMP procedure stores functions. The answer is that the OUTLIB= option on the PROC FCMP statement requires a libref, a data set name, and a package name. In most circumstances, you do not need to use the package name. The package name becomes important only if two different packages each support a function that has the same name. In that case, you can use the package name to disambiguate the function call.

Personally, I prefer to avoid having two packages that define the same function, but if you cannot avoid it, this trick shows you how to handle it. Eberhardt (2009, p. 15) discusses a related issue, which is how to call functions that are stored in two (or more) different data sets.
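
If you ever need that, the CMPLIB= system option accepts a list of data sets, so functions compiled into several FCMP data sets can be made visible at the same time. A quick sketch (the second data set name is hypothetical):

options cmplib=(work.MyFuncs work.MoreFuncs);  /* make functions from both FCMP data sets available */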

The post How does PROC FCMP store functions? appeared first on The DO Loop.

August 19, 2020
 

Finding the root (or zero) of a nonlinear function is an important computational task. In the case of a one-variable function, you can use the SOLVE function in PROC FCMP to find roots of nonlinear functions in the DATA step. This article shows how to use the SOLVE function to find the roots of a user-defined function from the SAS DATA step.

Some ways to find the root of a nonlinear function in SAS

I have previously blogged about several ways to find the roots of nonlinear equations in SAS, including:

The DATA step does not have a built-in root-finding function, but you can implement one by using PROC FCMP.

PROC FCMP: Implement user-defined functions in SAS

PROC FCMP enables you to write user-defined functions that can be called from the DATA step and from SAS procedures that support DATA step programming statements (such as PROC NLIN, PROC NLMIXED, and PROC MCMC). To demonstrate finding the roots of a custom function, let's use the same function that I used to show how to find roots by using SAS/IML. The following statements use PROC FCMP to define a function (FUNC1) and to store the function to WORK.FUNCS. You must use the CMPLIB= system option to tell the DATA step to look in WORK.FUNCS for unknown functions. The DATA step evaluates the function for input arguments in the interval [-3, 3]. The SGPLOT procedure creates a graph of the function:

/* use PROC FCMP to create a user-defined function (FUNC1) */
proc fcmp outlib=work.funcs.Roots;
   function Func1(x);         /* create and store a user-defined function */
      return( exp(-x**2) - x**3 + 5*x +1 );
   endsub;
quit;
options cmplib=work.funcs;   /* DATA step will look here for unresolved functions */
 
data PlotIt;   
do x = -3 to 3 by 0.1;       /* evaluate function on [-3, 3] */
   y = Func1(x);
   output;
end;
run;
 
title "Graph of User-Defined Function";
proc sgplot data=PlotIt;
   series x=x y=y;
   refline 0 / axis=y;
   xaxis grid;
run;

From the graph, it appears that the function has three roots (zeros). The approximate locations of the roots are {-2.13, -0.38, 2.33}, as found in my previous article. The next section shows how to use the SOLVE function in PROC FCMP to find these roots.

Use the SOLVE function to find roots

The SOLVE function is not a DATA step function. It is a "special function" in PROC FCMP. According to the documentation for the special functions, "you cannot call these functions directly from the DATA step. To use these functions in a DATA step, you must wrap the special function inside another user-defined FCMP function." You can, however, call the SOLVE function directly from procedures such as NLIN, NLMIXED, and MCMC.

I was confused when I first read the documentation for the SOLVE function, so let me give an overview of the syntax. The SOLVE function enables you to find a value x such that f(x) = y0 for a "target value", y0. Thus, the SOLVE function enables you to find roots of the function g(x) = f(x) – y0. However, in this article, I will set y0 = 0 so that x will be a root of f.

Because the function might have multiple roots, you need to provide a guess (x0) that is close to the root. The SOLVE function will start with your initial guess and apply an iterative algorithm to obtain the root. Thus, you need to provide the following information to the SOLVE function:

  • The name of an FCMP function (such as "Func1") that will evaluate the function.
  • An initial guess, x0, that is near a root. (You can also provide convergence criteria that tells the iterative algorithm when it has found an approximate root.)
  • The parameters to the function. A missing value indicates the (one) parameter that you are solving for. For a function of one variable, you will specify a missing value for x, which tells the SOLVE function to solve for x. In general, a function can take multiple parameters, so use a missing value in the parameter list to indicate which parameter you are solving for.

The following SAS program defines a new FCMP function called Root_Func1. That function takes one argument, which is the initial guess for a root. Inside the function, the SOLVE function finds the root of the "Func1" function (defined earlier). If it was successful, it returns the root to the caller. To test the Root_Func1 function, I call it from a DATA step and pass in the values -2, 0, and +2. I obtained these guesses by looking at the graph of the Func1 function.

/* to call SOLVE in the DATA step, wrap it inside an FCMP function */
proc fcmp outlib=work.funcs.Roots;
   function Root_Func1(x0);
      array opts[5] init absconv relconv maxiter status; /* initialize to missing */
      opts[1] = x0;        /* opts[1] is the initial guess */
      z = solve("Func1",   /* name of function */
                opts,      /* initial condition, convergence criteria, and status */
                0,         /* target: find x so that f(x) = 0 */
                .);        /* Params for f. A missing value indicates param to solve for */
      if status=0 then 
         return( z );      /* root found */
      else return (.);     /* a root was not found */
   endsub;
quit;
 
data Roots1;
input x0 @@;   
root = Root_Func1(x0);     /* x0 is a "guess" for a root of Func1 */
datalines;
-2 0 2
;
 
proc print data=Roots1 noobs; run;

The output shows that the function has found three roots, which are approximately {-2.13, -0.38, 2.33}. These are the same values that I found by using the FROOT function in SAS/IML.
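
Incidentally, the target value does not have to be zero. As a sketch (my own variant, not part of the original example), you could pass the target through the wrapper so that it finds x such that Func1(x) = y0:

proc fcmp outlib=work.funcs.Roots;
   function Solve_Func1(x0, y0);
      array opts[5] init absconv relconv maxiter status; /* initialize to missing */
      opts[1] = x0;            /* initial guess */
      z = solve("Func1",       /* name of function */
                opts,          /* initial condition, convergence criteria, and status */
                y0,            /* target: find x so that Func1(x) = y0 */
                .);            /* missing value marks the parameter to solve for */
      if status=0 then 
         return( z );          /* solution found */
      else return (.);         /* no solution found */
   endsub;
quit;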

The roots of functions that have parameters

Sometimes it is useful to include parameters in a function. For example, the following function of x contains two parameters, a and b:
    g(x; a, b) = exp(-x^2) + a*x^3 + b*x + 1
Notice that the first function (Func1) is a special case of this function because f(x) = g(x; -1, 5). You can include parameters in the FCMP functions that define the function and that find the roots. When you call the SOLVE function, the last arguments are a list of parameters (x, a, b) and you should give specific values to the a and b parameters and use a missing value to indicate that the x variable is the variable that you want to solve for. This is shown in the following program:

/* you can also define a function that depends on parameters */
proc fcmp outlib=work.funcs.Roots;
   /* define the function. Note Func1(x) = Func2(x; a=-1, b=5) */
   function Func2(x, a, b);
      return( exp(-x**2) + a*x**3 + b*x +1 );
   endsub;
 
   function Root_Func2(x0, a, b);
      array opts[5] init absconv relconv maxiter status; /* initialize missing */
      init = x0;           /* pass in initial guess */
      z = solve("Func2",   /* name of function */
                opts,      /* initial condition, convergence opts, status */
                0,         /* Find x so that f(x) = 0 */
                ., a, b ); /* Params for f. A missing value indicates param to solve for */
      if status=0 then 
         return( z );      /* root found */
      else return (.);     /* a root was not found */
   endsub;
quit;
 
/* Evaluate Func2 for x in [-3,3] for three pairs of (a,b) values:
   (-1,5), (-1,3), and (-2,1) */
data PlotIt;
do x = -3 to 3 by 0.1;
   y = Func2(x, -1, 5); lbl="a=-1; b=5";  output;
   y = Func2(x, -1, 3); lbl="a=-1; b=3";  output;
   y = Func2(x, -2, 1); lbl="a=-2; b=1";  output;
end;
run;
 
title "Graph of User-Defined Functions";
proc sgplot data=PlotIt;
   series x=x y=y / group=lbl;
   refline 0 / axis=y;
   xaxis grid;
   yaxis min=-10 max=10;
run;

The three graphs are shown. You can see that two of the graphs have three roots, but the third graph has only one root. You can call the Root_Func2 function from the DATA step to find the roots for each function:

/* find the roots of Func2 for (a,b) values (-1,5), (-1,3), and (-2,1) */
data Roots2;
input a b x0;
root = Root_Func2(x0, a, b);
datalines;
-1 5 -2 
-1 5  0 
-1 5  2 
-1 3 -2 
-1 3  0 
-1 3  2 
-2 1  0
-2 1  2
;
 
proc print data=Roots2 noobs; run;

Notice that the roots for (a,b) = (-1, 5) are exactly the same as the roots for Func1. The roots for the function when (a,b) = (-1, 3) are approximately {-1.51, -0.64, 1.88}. The root for the function when (a,b) = (-2, 1) is approximately 1.06.

Notice that the initial guess x0 = 0 did not converge to a root for the function that has (a,b) = (-2, 1). It is well known that Newton-type methods do not always converge. This is in contrast to Brent's method (used by the FROOT function in SAS/IML), which always converges to a root, if one exists.

Summary

In summary, this article shows how to use the SOLVE function in Base SAS to find the roots (zeros) of a nonlinear function of one variable. You first define a nonlinear function by using PROC FCMP. You can call the SOLVE function directly from some SAS procedures. However, if you want to use it in the DATA step, you need to define a second FCMP function that takes the function parameters and an initial guess and calls the SOLVE function. This article shows two examples of using the SOLVE function to find roots in SAS.

The post Find the root of a function by using the SAS DATA step appeared first on The DO Loop.

August 12, 2020
 

CVP engine as a magnifying glass

In my earlier blog post, Changing variable type and variable length in SAS datasets, I showed how you can effectively change variable lengths in a SAS data set. That approach works fine when you need to change the length attribute for a few variables, on a case-by-case basis. But what if you need to change lengths for all character variables in a data set? Or if you need to do this for all data sets in a data library? For example, you need to expand (increase) all your character variable lengths by 50%. Well, then the case-by-case approach becomes too laborious and inefficient.

What is a character variable’s length attribute?

Before reading any further, let’s take a quick quiz:

Q: A character variable length attribute represents a number of:

  1. Bits
  2. Bytes
  3. Centimeters
  4. Characters

If your answer is anything but 2 (bytes), it’s incorrect. According to the SAS documentation, length refers to the number of bytes used to store each of the variable's values in a SAS data set. You can use a LENGTH statement to set the length of both numeric and character variables.

It is true though that for some older encoding systems (ASCII, ISO/IEC 8859, EBCDIC, etc.) there was no difference between the number of bytes and the number of characters, as those systems were based on exactly one byte per character. They are even called Single Byte Character Sets (SBCS) for that reason. The problem is they can accommodate only a maximum of 2^8 = 256 symbols, which is not nearly enough to cover all the variety of natural languages, special characters, emojis, etc.

Why would we want to expand character variable lengths?

Use case 1. Expanding character values range

For this scenario, let’s consider Internet traffic analysis where your data contains multiple character columns for Internet Protocol addresses (IP addresses) in the 32-bit version 4 format (IPv4, e.g. ‘125.255.201.180’). You transition to the newer 128-bit IPv6 standard (e.g. ‘2001:0000:3238:DFE1:0063:0000:0000:FEFB’) and need to modify your data structure to accommodate the new standard with longer character values.

Use case 2. Migrating SAS data to multi-byte encoding environment

In this scenario, you migrate/move SAS data sets from older SBCS environments to newer Multi-Byte-Character Set (MBCS) encoding environments. For such a case, the ability to increase character variables lengths in bulk with a simple action becomes especially significant and critical.

Currently, the most commonly used MBCS is Unicode, which is supported by all modern operating systems, databases and web browsers. Of the different flavors of Unicode (UTF-8, UTF-16, UTF-32) the most popular is UTF-8. UTF-8 (8-bit Unicode Transformation Format) is a variable-width character set that uses from 1 to 4 one-byte (8-bit) code units per character; it is capable of encoding 1,112,064 characters, covering most modern languages, including Arabic and Hebrew characters, hieroglyphs and emojis, as well as many other special characters.
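
To see the difference between bytes and characters for yourself, compare the LENGTH function (bytes) with the KLENGTH function (characters). A minimal sketch, assuming the program is submitted in a UTF-8 session:

data _null_;
   x = 'Résumé';            /* two accented characters */
   bytes = length(x);       /* 8 in UTF-8: each é occupies 2 bytes */
   chars = klength(x);      /* 6 characters */
   put bytes= chars=;
run;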

Since each UTF-8 encoded character may require somewhere between one and four bytes, and not all SBCS characters are represented by one byte in UTF-8, data migration from SBCS to UTF-8 may cause data truncation and subsequently data loss.

When SAS reads an SBCS-encoded data set and writes its records into UTF-8-encoded data set it throws an ERROR message in the log and stops execution:

ERROR: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

When SAS reads an SBCS-encoded data set and produces a UTF-8-encoded printed report only (without generating a UTF-8-encoded output data set) it generates a WARNING message (with identical description as the above ERROR message) while continuing execution:

WARNING: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.

Either ERROR or WARNING is unacceptable and must be properly addressed.

How to expand all character variables lengths?

Regardless of character transcoding, SAS’ CVP engine is a short and effective answer to this question. CVP stands for Character Variable Padding, which is exactly what this special-purpose engine does – it pads, or expands, character variables by a number of bytes. The CVP engine is part of Base SAS and does not require any additional licensing.

The CVP engine is a read-only engine for SAS data sets only. You can think of it as a magnifying glass: it creates an expanded view of the character data descriptors (lengths) without changing them. Still, we can use the CVP engine to actually change a data set or a whole data library to its expanded-character-variables version. All we need to do is define our source library as a CVP library, for example:

libname inlib cvp 'c:\source_folder';

Then use PROC COPY to create expanded versions of our original data sets in a target library:

libname outlib 'c:\target_folder';
proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;

Or, if we need to expand character variable lengths for the whole library, then we use the same PROC COPY without the SELECT statement:

proc copy in=inlib out=outlib noclone;
run;

It’s that easy. And the icing on the cake is that the CVP engine automatically adjusts the variables’ format widths to match the expanded byte lengths for all converted character variables.
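
If you want to verify the expansion, comparing the descriptor portions of the source and the copy is a quick check. A sketch (the PLAIN libref name is mine):

libname plain 'c:\source_folder';      /* same folder, read without the CVP engine */
title 'Original lengths';
proc contents data=plain.dataset1;  run;
title 'Expanded lengths';
proc contents data=outlib.dataset1; run;
title;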

Avoiding character data truncation by using the CVP Engine

CVP Engine is a near-perfect SAS solution to the problem of potential data truncation when data is transcoded during migration or move from SBCS-based to MBCS-based systems.

To avoid data loss from possible data truncation during transcoding we can use the above code with a slight but important modification – define the target library with outencoding='UTF-8' option. It will result in our target data not only expanded lengthwise but properly encoded as well. Then we run this modified code in the old SBCS environment before moving/migrating our data sets to the new MBCS environment:

libname inlib cvp 'c:\source_folder';
libname outlib 'c:\utf8_target_folder' outencoding='UTF-8';
proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;

Again, if you need to expand character variable lengths for the whole library, then you can use the same PROC COPY without the SELECT statement:

proc copy in=inlib out=outlib noclone;
run;

After that we can safely move our expanded, UTF-8-encoded data to the new UTF-8 environment.

Code notes

  • The code above will create a different version of your original data sets, with the desired encoding and with character variable lengths expanded by 50% (the default). As shown below, this default behavior can be changed by using the CVPBYTES= or CVPMULTIPLIER= options, which explicitly define the bytes expansion rate.
  • It is important to note that the CVP option is specified on the input library since the CVP engine is a read-only engine, thus available for input (read) processing only.
  • For the output library you specify your desired encoding option, in this case outencoding='UTF-8'.
  • The noclone option specifies not to copy data set attributes. This is needed to make sure the attributes are recreated rather than duplicated.
  • If you want to migrate your data sets using PROC MIGRATE, you should expand column lengths before using PROC COPY as shown above since the CVP engine is not currently supported with PROC MIGRATE.
  • The CVP engine supports only SAS data files (no SAS views, catalogs, item stores, and so on).

CVP Engine options

There are several options available with the CVP Engine. Here are the most widely used:

CVPBYTES=bytes - specifies the number of bytes by which to expand character variable lengths. The lengths of character variables are increased by adding the specified bytes value to the current length.

Example: libname inlib 'SAS data-library' cvpbytes=5;

The CVPBYTES= option implicitly specifies the CVP engine, that is if you specify the CVPBYTES= option you don’t have to specify CVP engine explicitly as SAS will use it automatically.

CVPMULTIPLIER=multiplier - specifies a multiplier value that expands character variable lengths. The lengths of character variables are increased by multiplying the current length by the specified multiplier value. You can specify a multiplier value from 1 to 5, or you can specify 0 and then the CVP engine determines the multiplier automatically.

Example: libname inlib 'SAS data-library' cvpmultiplier=2.5;

The CVPMULTIPLIER= option also implicitly specifies the CVP engine, that is if you specify the CVPMULTIPLIER= option, you don’t have to specify CVP engine explicitly as SAS will use it automatically.

Note:

  • You cannot specify both the CVPMULTIPLIER= option and the CVPBYTES= option. Specify only one of these options.
  • If you explicitly assign the CVP engine but do not specify either CVPBYTES= or CVPMULTIPLIER=, then SAS defaults to using CVPMULTIPLIER=1.5 to increase the lengths of the character variables.

Additional Resources

Your thoughts?

Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments section below.

Expanding lengths of all character variables in SAS data sets was published on SAS Users.

August 10, 2020
 

A common operation in statistical data analysis is to center and scale a numerical variable. This operation is conceptually easy: you subtract the mean of the variable and divide by the variable's standard deviation. Recently, I wanted to perform a slight variation of the usual standardization:

  • Perform a different standardization for each level of a grouping variable.
  • Instead of using the sample mean and sample standard deviation, I wanted to center and scale the data by using values that I specify.

Although you can write SAS DATA step code to center and scale the data, PROC STDIZE in SAS contains options that handle both these cases.

Simulate data from two groups

Let's start by creating some example data. The following DATA step simulates samples drawn from two groups. The distribution of the first group is N(mu1, sigma) and the distribution of the second group is N(mu2, sigma). The first sample size is n1=20; the second is n2=30. Notice that the data are ordered by the Group variable.

%let mu1 = 15;
%let mu2 = 16;
%let sigma = 5;
/* Create two samples N(mu1, sigma) and N(mu2, sigma), with samples sizes n1 and n2, respectively */
data TwoSample;
call streaminit(54321);
n1 = 20; n2 = 30;
Group = 1;
do i = 1 to n1;
   x = round(rand('Normal', &mu1, &sigma), 0.01);  output;
end;
Group = 2;
do i = 1 to n2;
   x = round(rand('Normal', &mu2, &sigma), 0.01);  output;
end;
keep Group x;
run;
 
ods graphics / width=500px height=360px;
title "Normal Data";
title2 "N(&mu1, &sigma) and N(&mu2, &sigma)";
proc sgpanel data=TwoSample noautolegend;
   panelby Group / columns=1;
   histogram x;
   density x / type=normal;
   colaxis grid;
run;

The graph shows a comparative histogram of the distribution of each group and overlays the density of the normal curve that best fits the data. You can run PROC MEANS to calculate that the estimates for the first group are mean=14.9 and StdDev=5.64. For the second group, the estimates are mean=15.15 and StdDev=4.27.

The "standard" standardization

Typically, you standardize data by using the sample mean and the sample standard deviation. You can do this by using PROC STDIZE and specify the METHOD=STD method (which is the default method). You can use the BY statement to apply the standardization separately to each group. The following statements standardize the data in each group by using the sample statistics:

/* use PROC STDIZE to standardize by the sample means and sample std dev */
proc stdize data=TwoSample out=ScaleSample method=std;
   by Group;
   var x;
run;
 
proc means data=ScaleSample Mean StdDev Min Max ndec=2; 
   class Group;
   var x;
run;

As expected, each sample now has mean=0 and unit standard deviation. Many multivariate analyses center and scale data in order to adjust for the fact that different variables are measured on different scales. PROC STDIZE has many other options for standardizing data.
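
For instance, METHOD=RANGE rescales each group to the interval [0, 1] by subtracting the group minimum and dividing by the group range. A minimal sketch (the output data set name is mine):

proc stdize data=TwoSample out=Range01 method=range;
   by Group;
   var x;
run;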

Center and scale by using the DATA step

A second way to standardize the data is to use the DATA step to center and scale each variable and each group. You can overwrite the contents of each column, or (as I've done below), you can create a new variable that contains the standardized values. You need to specify values for the location and scale parameters. For these simulated data, I know the population values of the location and scale parameters. The following DATA step uses the parameters to center and scale the data:

/* because we know the population parameters, we can use them to center and scale the data */
/* DATA step method */
data ScaleData;
array mu[2] (&mu1 &mu2);    /* put parameter values into an array */
set TwoSample;
y = (x-mu[Group]) / &sigma;
run;
 
proc means data=ScaleData Mean StdDev Min Max ndec=2;
   class Group;
   var y;
run;

Because I use the population parameters instead of the sample estimates, the transformed variable has neither zero mean nor unit variance.

Wide form: Use the LOCATION and SCALE statements in PROC STDIZE

You can perform the same calculations by using PROC STDIZE. If you look up the documentation for the METHOD= option on the PROC STDIZE statement, you will see that the procedure supports the METHOD=IN(ds) method, which enables you to read parameters from a data set. I will focus on the LOCATION and SCALE parameters; see the documentation for other options.

You can specify the parameters in "wide form" or in "long form". In wide form, each column specifies a parameter and you can include multiple rows if you want to perform a BY-group analysis. You can use the LOCATION and SCALE statements to specify the name of the variables that contain the parameters. The following DATA step creates a data set that has three variables: Group, Mu, and Sigma. Each row specifies the location and scale parameter for centering and scaling data in the levels of the Group variable.

data ScaleWide;
Group=1; Mu=&mu1; Sigma=&sigma; output;
Group=2; Mu=&mu2; Sigma=&sigma; output;
run;
 
proc stdize data=TwoSample out=StdWide method=in(ScaleWide);
   by Group;
   var x;
   location mu;    /* column name METHOD=IN data set */
   scale    sigma; /* column name METHOD=IN data set */
run;
 
proc means data=StdWide Mean StdDev Min Max ndec=2; 
   class Group;
   var x;
run;

The output is identical to the output from the previous section and is not shown.

Long form: Use the _TYPE_ variable

You can also specify the location and scale parameters in long form. For a long-form specification, you need to create a data set that contains a variable named _TYPE_. The parameters for centering and scaling are specified in rows where _TYPE_="LOCATION" and _TYPE_="SCALE", respectively. You can also include one or more BY-group variables (if you are using the BY statement) and one or more columns whose names are the same as the variables you intend to transform.

For example, the example data in this article has a BY-group variable named Group and an analysis variable named X. The following DATA step specifies the values to use to center (_TYPE_='LOCATION') and scale (_TYPE_='SCALE') the X variable for each group:

data ScaleParam;
length _TYPE_ $14;
input Group _TYPE_ x;
datalines;
1 LOCATION 15
1 SCALE     5
2 LOCATION 16
2 SCALE     5
;
 
proc stdize data=TwoSample out=Scale2 method=in(ScaleParam);
   by Group;
   var x;
run;
 
proc means data=Scale2; 
   class Group;
   var x;
run;

Again, the result (not shown) is the same as when the DATA step is used to center and scale the data.

Summary

This article gives examples of using PROC STDIZE in SAS/STAT to center and scale data. Most analysts want to standardize data in groups by using the sample means and sample standard deviations of each group. This is the default behavior of PROC STDIZE. However, you can also center and scale data by using values that are different from the group statistics. For example, you might want to use a grand mean and a pooled standard deviation for the groups. Or, perhaps you are using simulated data and you know the population mean and variance of each group. Regardless, there are three ways to center and scale the data when you know the location and scale parameters:

  • Manually: You can use the DATA step or PROC IML or PROC SQL to write a formula that centers and scales the data.
  • Wide Form: PROC STDIZE supports the LOCATION and SCALE statements. If you use the METHOD=IN(ds) option on the PROC STDIZE statement, you can read the parameters from a data set.
  • Long Form: If the input data set contains a variable named _TYPE_, PROC STDIZE will use the parameter values where _TYPE_='LOCATION' to center the data and where _TYPE_='SCALE' to scale the data.

The post 4 ways to standardize data in SAS appeared first on The DO Loop.

July 23, 2020
 

Splitting one into smaller pieces

In his blog post, How to split one data set into many, Chris Hemedinger showed how to subset or split SAS data sets based on the values of categorical variables. For example, based on a value of variable REGION you may split a data set MARKETING into MARKETING_ASIA, MARKETING_AMERICA, MARKETING_EUROPE, and so on.

In some cases, however, we need to split a large data set into many smaller ones – not by the values of a subsetting variable, but by the number of observations, in order to produce smaller, more manageable data sets. Such an approach can be dictated by restrictions on the data set size imposed by hardware (memory size, transmission channel bandwidth, etc.), processing time, or user interface convenience (e.g. search results displayed by pages).

We might need to split a data set into smaller tables of K observations or less each; or to split a data set into S equal (or approximately equal) pieces.

We might need to split a data set into sequentially selected subsets where the first K observations go into the first data set, the second K observations go into the second data set, and so on. Alternatively, we might need to randomly select observations from a data set while splitting it into smaller tables.

This blog post provides possible coding solutions for such scenarios.

Splitting a data set into smaller data sets sequentially

Let’s say we need to split a data set SASHELP.CARS (number of observation N=428) into several smaller datasets. We will consider the following two sequential observation selection scenarios:

  1. Each smaller data set should have maximum of K observations.
  2. There should be S smaller data sets of approximately same size.

Ideally, we would like to split a data set into pieces of exactly K observations each, but that is not always possible because the number of observations N in the original data set is not always evenly divisible by K. Therefore, we will split it into several smaller data sets of K observations each, except that the last smaller data set will have a number of observations equal to the remainder of dividing N by K.

Similarly, with scenario 2, we will split the source data set into several smaller data sets of the same size, but the last smaller data set will have a number of observations equal to the remainder of dividing N by K.

Below is a SAS macro code that covers both these scenarios.

%macro split (SRC_DATASET=, OUT_PREFIX=, SPLIT_NUM=, SPLIT_DEF=);
/* Parameters:
/*   SRC_DATASET - name of the source data set     */
/*   OUT_PREFIX - prefix of the output data sets   */
/*   SPLIT_NUM - split number                      */
/*   SPLIT_DEF - split definition (=SETS or =NOBS) */
 
   %local I K S TLIST;
   %let K = 0;   /* defaults let the %IF check below exit the macro cleanly */
   %let S = 0;   /* when SPLIT_DEF is neither SETS nor NOBS                 */
 
   /* number of observations &K, number of smaller datasets &S */
   data _null_;
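      /* "if 0 then set" does not read any observations, but the NOBS= option */
      /* makes the total number of observations N available at compile time   */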
      if 0 then set &SRC_DATASET nobs=N;
      if upcase("&SPLIT_DEF")='NOBS' then
         do;
            call symputx('K',&SPLIT_NUM); 
            call symputx('S',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into datasets of no more than &SPLIT_NUM observations";
         end;
         else if upcase("&SPLIT_DEF")='SETS' then
         do;
            call symputx('S',&SPLIT_NUM); 
            call symputx('K',ceil(N/&SPLIT_NUM));
            put "***MACRO SPLIT: Splitting into &SPLIT_NUM datasets";
         end;
         else put "***MACRO SPLIT: Incorrect SPLIT_DEF=&SPLIT_DEF value. Must be either SETS or NOBS.";
      stop; 
   run;
 
 
   /* terminate macro if nothing to split */
   %if (&K le 0) or (&S le 0) %then %return;
 
   /* generate list of smaller dataset names */
   %do I=1 %to &S;
      %let TLIST = &TLIST &OUT_PREFIX._&I;
   %end;
 
   /* split source dataset into smaller datasets */
   data &TLIST;
      set &SRC_DATASET;
      select;
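         /* WHEN conditions are evaluated in order; the first one that is true */
         /* determines which output data set receives the current observation  */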
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;
 
%mend split;

The following are examples of the macro invocations:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);

These invocations will produce the following SAS logs:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
***MACRO SPLIT: Incorrect SPLIT_DEF=SET value. Must be either SETS or NOBS.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
***MACRO SPLIT: Splitting into datasets of no more than 100 observations
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_4 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_5 has 28 observations and 15 variables.
 
%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);
***MACRO SPLIT: Splitting into 3 datasets
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 142 observations and 15 variables.

Splitting a data set into smaller data sets randomly

For randomly splitting a data set into many smaller data sets we can use the same approach as above with a slight modification. In essence, we are going to randomly shuffle observations of our source data set first, and then apply the sequential splitting.

In order to implement this, we just need to replace the last data step in the above macro with the following 3 steps:

   /* generate random numbers, R */
   data;
      set &SRC_DATASET;
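      /* a fixed seed makes the random shuffle, and therefore the split, reproducible */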
      call streaminit(1234);
      R = rand('uniform');
   run;
 
   /* sort data in R order */
   proc sort;
      by R;
   run;
 
   /* split source dataset into smaller datasets */
   data &TLIST;
      set;       /* reads the most recently created data set (_LAST_), i.e. the data sorted by R */
      drop R;    /* the DROP statement applies to every output data set listed in the DATA statement */
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I; 
         %end;
      end;
   run;

This modified code will produce similar results (with the same information in the SAS log); however, the smaller data sets will have their observations randomly selected from the source data set.

DATAn naming convention

You may have noticed that in this random splitting code I did not specify data set names in the DATA statement of the first DATA step, in PROC SORT, or in the SET statement of the last DATA step. Not only are these shortcuts possible thanks to SAS’ DATAn naming convention, but they are also a very robust way of dynamically assigning temporary data set names. This method is especially useful in SAS macros because it guarantees that you do not accidentally overwrite a data set with the same name in the SAS program that invokes your macro. Think about it: as a macro developer, you need to make sure that the names of any temporary data sets created within your macro are unique within the SAS session, so they do not interfere with data sets created in the calling SAS program outside of your macro.

Here are the defaults in SAS’ DATAn naming convention (a short sketch using them follows the list):

  • If you do not specify a name for the output data set in a DATA statement, SAS automatically assigns the default names WORK.DATA1, WORK.DATA2, and so on, to each successive data set that you create.
  • If you do not specify a name for the input data set in a SET statement, SAS automatically uses the last data set that was created. SAS keeps track of the most recently created data set through the reserved name _LAST_. When you execute a DATA or PROC step without specifying an input data set, by default, SAS uses the _LAST_ data set.
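
As an illustration only (not from the original post), the following sketch relies on both defaults: the unnamed DATA step creates a WORK.DATAn data set, and PROC SORT and the unnamed SET statement then operate on _LAST_:

data;                      /* SAS assigns the next available name, e.g. WORK.DATA1 */
   set sashelp.class;
   ratio = weight / height;
run;
 
proc sort;                 /* sorts the most recently created data set (_LAST_)    */
   by descending ratio;
run;
 
data top5;
   set;                    /* reads _LAST_, i.e. the sorted WORK.DATAn data set    */
   if _n_ <= 5;            /* keep the first five observations                     */
run;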

For more information on this useful SAS coding technique, see the SAS documentation on special data set names and examples, as well as the warning on using special data set names.

Your thoughts?

Do you find this post useful? Have you ever split data sets into smaller ones based on a number of observations? Do you use special data set names and DATAn naming convention in your SAS coding? Please share your thoughts in the comments section below.

Splitting a data set into smaller data sets was published on SAS Users.

7月 212020
 

A client recently asked if there's a programmatic way to reduce memory requirements of a CAS table. In this post, you'll learn how to accomplish that when your SASWORK data set has lengthy character data.

What is SASWORK?

The SASWORK library is the temporary library automatically defined by SAS at the start of each SAS session or job. The SASWORK library stores temporary SAS files that you create, such as data sets, catalogs, formats, etc., as well as files created internally by SAS. To access files in the SASWORK library, specify a one-level name for the file, or use a two-level name with the WORK libref.
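
For example (a minimal illustration with a hypothetical data set name), the one-level and two-level names below refer to the same temporary file:

data class_copy;                  /* one-level name: stored as WORK.CLASS_COPY   */
   set sashelp.class;
run;
 
proc print data=work.class_copy;  /* two-level name: the same temporary data set */
run;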

A programmatic CASLIB to SASWORK

It is a common practice for SAS programmers to read a source table and then store it in SASWORK. When the SASWORK data set has long character data, you can significantly reduce memory requirements by creating a path-based CASLIB to SASWORK and loading the table with PROC CASUTIL, using the CASDATA= option together with the VARCHARCONVERSION=16 import option. The VARCHARCONVERSION=16 option automatically converts all character variables with a length of 16 or greater to the VARCHAR data type.

Code in Figure 1 below creates a macro variable called WORKPATH, which we will use when defining the CASLIB:

/* The macro variable WORKPATH contains the path to SASWORK */
%let workpath = %sysfunc(quote(%sysfunc(pathname(work))));
%put &workpath;

(Figure 1: Creating the Macro Variable WORKPATH)

Figure 2 below is the resulting SAS log:


(Figure 2: SAS log of creating the Macro Variable WORKPATH)

Code in Figure 3 below creates our SASWORK data set that we will lift into CAS:

data saswork_table_with_char300;
   length a $ 300 b $ 15 c $ 16;
   a='a300'; b='b15' ; c='c16' ; 
   output;
   a='a300300'; b='b151515'; c='c161616'; 
   output;
   c='c161616161616161'; 
   b='b15151515151515';
   a="a300qzwsxedcrfvtgbyhnujmiklopqazwsxedcrfvtgbyhnujmikolp1234567890123456789012345678901234567890"; 
   output;
run;
 
proc contents data=saswork_table_with_char300;
title "Contents of WORK.SASWORK_TABLE_WITH_CHAR300";
run;

(Figure 3: Code to create a SASWORK data set using a one-level name, i.e. saswork_table_with_char300)

Notice in Figure 4 below the lengths of the character variables a, b, and c:


(Figure 4: Results from PROC CONTENTS)

Code in Figure 5 below creates a path-based CASLIB to SASWORK by using our macro variable WORKPATH:

proc cas;
   file log;
   table.dropCaslib /
      caslib='sas7bdat' quiet=true;
   run;
 
   table.addCaslib /
      datasource={srctype="path"}
      name="sas7bdat"
      path=&workpath;
   run;
quit;

(Figure 5: Path-based CASLIB to SASWORK)

Figure 6 below is resulting SAS log:


(Figure 6: SAS log of path-based CASLIB to SASWORK)

Code in Figure 7 below lifts the SASWORK data set into CAS and ensures the CAS table will contain VARCHAR data types for any character data with a length of 16 or greater:

proc casutil;
   load casdata="saswork_table_with_char300.sas7bdat"
        casout="cas_table_with_varchar"
        outcaslib="casuser"
        importoptions=(filetype="basesas", dtm="auto", debug="dmsglvli",
                       varcharconversion=16);
run;
quit;

(Figure 7: code that lifts the SASWORK data set into CAS)

Figure 8 below is the resulting SAS log:


(Figure 8: SAS log of code to lift the SASWORK data set into CAS)

Code in Figure 9 below displays the characteristics of our CAS table, cas_table_with_varchar. Notice in Figure 10 that variables a and c have a data type of VARCHAR, since their lengths are 16 or greater, while variable b (length 15) falls below the conversion threshold.

proc cas;
   sessionProp.setSessOpt /
      caslib="casuser";
   run;
 
   table.columnInfo / table="cas_table_with_varchar";
   run;
quit;

(Figure 9: SAS code to display information of a CAS table)

Figure 10 below reveals CAS table characteristics:


(Figure 10: CAS table characteristics)

Conclusion: Conserve memory with smaller CAS tables

By leveraging the coding techniques in this blog, we can lift into CAS any data sets we previously stored in SASWORK. In addition, we convert all character data types with a length of 16 or greater into the VARCHAR data type, which can significantly reduce the size of our CAS tables.

LEARN MORE | SAS® Cloud Analytic Services 3.5: User’s Guide

How to create a path-based CASLIB to SASWORK for speedier analysis with SAS® Viya® was published on SAS Users.

7月 202020
 

I think we can all agree that lifelong learning is the future, for all of us. We know that we need to learn and develop all the time, simply to stay abreast. The world is changing fast, and we must change with it. Investing in analytics talent is an investment in your future.

Create a corporate learning culture

Organizations may embrace the concept of data in their work cultures, yet they fail to commit to developing the analytics skills their teams need to harness the data. Organizations often don’t know what skills their team has, the skills they’re lacking, or even where they need their employees to go. Corporate training brings new avenues for career development and professional training, and that can have a huge impact on retention.

Companies that understand the importance of continued skill development for their employees will set out a vision and develop a corporate learning culture. Leaders are needed who can drive a cultural change towards continued learning in the organization. Hiring curious people and rewarding that curiosity develops the right mindset. Revealing the knowledge gap that employees have is a good trigger to get the learning started.

When an organization wants to close the talent gap for Analytics and AI, SAS can offer support on the whole journey.

Company objectives and goals will determine how the skills development program will be shaped. Once the roles and desired skills levels are determined, SAS will provide a Learning Needs Assessment to uncover the skills gaps.

Based on the findings from the assessment, a program for learning and development will be proposed for the different target audiences and to work towards company goals.

With readily available digital assets, the first steps to a learning platform can be realized quickly. By dynamically developing more assets and creating domain and company specific materials, continuous learning is encouraged and facilitated.

Power of online learning

Today most of us are forced to work from home due to COVID-19. We know the future of work will look different; we are getting used to the digital channel for communicating and interacting. Though the digitization of learning has been going on for many years, learning content is now moving to the cloud, becoming accessible across multiple devices and teaching environments and often being generated, shared, and continually updated.

Millennials feel most comfortable with this digitization, but the wider workforce will ultimately benefit from access to digital learning assets. Integrated cloud-based platforms enable more than just new computer programs or smartphone apps. Smart organizations are now expanding their use of cloud-based learning to run personalized online courses, small tailored live web sessions, instructional videos, e-coaching, communities, virtual classrooms, and simulations games. SAS is responding to this need by offering several free options for learning including e-Learning, online tutorials, lab time, SAS Academy for Data Science, and a SAS Learning Subscription that offers more than 100 e-Learning courses. There is something for everyone!

Finding analytics talent within your workforce can be challenging. It can be hard to assess the skills and identify the talent you need for specific solutions. With an unparalleled depth of understanding in the industry, SAS can help identify, cultivate and grow analytics talent in your organization. In addition, we’ll help you to build a talent pipeline in partnership with local academic institutions, if needed.

Do you want to start today with learning essential skills? Take a look at how SAS can help your organization to get ahead, through our learning solutions customized to help you train your team. Or transform it.

Shaping the future of lifelong learning was published on SAS Users.