Leonid Batkhan

6月 162022
 

MS Excel logo No matter which powerful analytical tools data professionals use for their data processing, MS Excel remains the output of choice for many users and whole industries.

In banking and finance, for example, I have seen many SAS users create quite sophisticated data queries and data analysis projects in SAS Enterprise Guide. Yet, at the end, when the final datasets have been produced and validated, comes the manual part when users export those tables into Excel, and then combine and rearrange them by copying-pasting into a desired workbook for storing and distributing.

However, this heavily manual process can be not just fully automated, but also enhanced compared with the point-and-click “export to Excel” and “copy-paste” interactive process. Here is how.

Creating a single simple Excel sheet

Suppose, we want to convert SASHELP.CLASS data table to Excel. Here is a bare-bone solution using SAS Output Delivery System:

ods excel file='C:\Projects\SAS_to_Excel\Single_sheet.xlsx';
 
proc print data=SASHELP.CLASS noobs;
run;
 
ods excel close;

This code is pretty much self-explanatory. It will produce Single_sheet.xlsx Excel workbook file in the folder C:\Projects\SAS_to_Excel. When opened in Excel, it will look as follows:

If you browse through the ODS EXCEL documentation you will find a variety of options that allow you to customize Excel output. Let’s get a little fancy by utilizing some of them.

Creating several customized sheets in Excel workbook

The following code example creates two sheets (tabs) in a single workbook. In addition, it demonstrates some other features to enhance data visualization.

/* -------------------------------------------- */
/* Two sheets workbook with enhanced appearance */
/* -------------------------------------------- */
 
/* Formats for background & foreground coloring */
proc format;
   value hbg 
      50 <- 60   = #66FF99
      60 <- 70   = #FFFF99
      70 <- high = #FF6666
      ;
   value hfg
      low -< 50, 70 <- high = white;
run;
 
/* Define custom font style for ODS TEXT */
proc template;
   define style styles.MyStyle;
   parent=styles.htmlblue;
   style usertext from usertext /
      foreground  = #FF33CC
      font_weight = bold
      ;
   end;
run;
 
/* ODS Excel output file destination */
ods excel file = 'C:\Projects\SAS_to_Excel\Two_sheets_fancy.xlsx';
 
   /* Excel options for 1st sheet (tab) */
   ods excel options
      ( sheet_name      = 'SASHELP.CLASS'
        frozen_headers  = 'on'
        embedded_titles = 'on' )
      style = styles.MyStyle;
 
   title justify=left color='#4D7EBF' 'This is TITLE for SASHELP.CLASS';
 
   ods text='This is TEXT for SASHELP.CLASS';
 
   proc print data=SASHELP.CLASS noobs;
      var NAME;
      var SEX AGE / style = {just=C};
      var HEIGHT  / style = {background=hbg. foreground=hfg.};
      var WEIGHT;
   run;
 
   /* Excel options for 2nd sheet (tab) */
   ods excel options
      ( sheet_name      = 'SASHELP.CARS'
        frozen_headers  = 'on'
        embedded_titles = 'on' );
 
   title 'This is TITLE for SASHELP.CARS';
 
   proc print data=SASHELP.CARS noobs;
   run;
 
ods excel close;

Here are the code highlights:

  • PROC FORMAT creates two formats, HBG and HFG, which are used in the first PROC PRINT to illustrate cell text and background coloring.

    NOTE: In SAS, colors are specified as hexadecimal RGB values (cxRRGGBB). However, I found (although it seems undocumented) that in PROC FORMAT and PROC TEMPLATE these colors can be written as quoted (double or single) or unquoted values, as well as prefixed with either ‘#’ or ‘cx’ (for hexadecimal). For example, the following are all valid values: #FFAADD, '#FFAADD', cxFFAADD, 'cxFFAADD'.
  • PROC TEMPLATE defines usertext custom font style for ODS TEXT, which used in ODS EXCEL along with and as alternative to the TITLE statement.
  • ODS EXCEL FILE=file-specification statement opens the EXCEL destination and specifies the output file for ODS EXCEL. The output file-specification can be either a physical file name (must be in quotes) or a fileref (without quotes) assigned with FILENAME statement. It can point to a location on the machine where SAS is run (SAS server), or a network drive accessible from the SAS server.

    This statement follows by two separate ODS EXCEL OPTIONS statements – one per corresponding sheet (tab).
  • ODS EXCEL OPTIONS statement specifies destination-specific suboptions with space-delimited name='value' pairs. In particular, we use the following options:

    • SHEET_NAME= specifies the name for the next worksheet (worksheet names can be up to 28 characters long).
    • FROZEN_HEADERS='ON' specifies that headers are not scrolled when table is vertically scrolled (default is OFF). It is very convenient feature that keeps the title(s) and column names in view while user scrolls through the table rows.
    • EMBEDDED_TITLES='ON' specifies whether titles should appear in the worksheet (default is OFF).

    There is a variety of other useful ODS EXCEL options allowing further customization of your Excel workbook appearance and functionality.

  • TITLE statement is highly customizable with Output Delivery System as shown in this custom TITLE with ODS example.
  • ODS TEXT= statement inserts text into your ODS output. Unlike TITLE for which ODS EXCEL merges several cells, ODS TEXT places its text in a single cell (see screenshot below). The UserText style element that we modified using PROC TEMPLATE controls the font style, font color, and other attributes of the text that the ODS TEXT= statement produces.
  • In PROC PRINT, we use multiple VAR statements to select variables, determine their order and apply styles (text and background colors) to the printed values. There are much more ODS styles with PROC PRINT available for further customizations.
  • The next section of the code ( /* Excel options for 2nd sheet (tab) */ ) creates the second sheet in the same Excel workbook. Similarly, you can create as many sheets/tabs as you wish.
  • The last statement ods excel close; closes the ODS Excel destination so nothing more is written to the output file.

The following are the two screenshots illustrating the two sheets in the produced Excel workbook:

SAS creates Excel workbook with several sheets/tabs

Questions

Do you find this post useful? Do you have questions, comments, suggestions, other tips or tricks about creating MS Excel workbooks in SAS? Please share with us below.

Additional Resources

TUNE IN NOW | LEARN HOW TO READ AND WRITE EXCEL FILES WITH SAS

Automating Excel workbooks creation using SAS was published on SAS Users.

3月 142022
 

Irony of using square wheelsEvery year on March 14, we celebrate Pi Day in recognition of the famed mathematical constant π. This date in the format MM/DD is 3/14 which coincides with the first three digits of the π value 3.14.

As you all know, π represents the ratio of a circle's circumference to its diameter. Therefore, we can derive a circle's circumference C from its diameter d using the following famous formula:

C = πd (or C = 2πr using radius r).

While it is impossible to overstate the significance of π in its fundamental and legendary role of being a constant representing the circumference/diameter ratio, its usage for area and volume measurements is not as much of a “settled science” as you may think. In this post, I dare to debunk the notion of π being an inherent component of the omnipresent circle area and sphere volumes formulas:

A = πr2 = π/4 ⋅ d2

V = 4π/3 ⋅ r3 =  π/6 ⋅ d3

Believe it or not,  we can significantly simplify these formulas by getting rid of π altogether. Wanna know how? Just sit down, relax, open your mind and fasten your seat belt.

Misnomer alert!

(Definition: A misnomer is a wrong or inaccurate use of a name or term.)

Let’s start with some questionable and overreaching semantics in algebra. Have you ever noticed that we call a number to the power of 2 as the number squared? Similarly, we call a number to the power of 3 as the number cubed.  However, besides 2 and 3, we treat all other exponents Np equally by saying “N to the power of p”.

Here is how Wikipedia explains this exponentiation phenomenon:

“The expression b2 = bb is called "the square of b" or "b squared", because the area of a square with side-length b is b2.

Similarly, the expression b3 = bbb is called "the cube of b" or "b cubed", because the volume of a cube with side-length b is b3.

Wait a minute! Exponents b2 and b3 are mathematical operations from the Algebra Department. They have nothing to do with geometrical shapes and exist independently from geometry and area/volume calculations.

The same semantical problem persists with the reverse terms square root and cube root. All natural numbers N have their corresponding root operations, which we call their Nth root. So why do we need this suggestive verbal association of the area/volume measurement units to squares and cubes?

Moreover, the very word “area” itself is often replaced by square footage, square mileage, or square something thus further implying that areas and squares are synonymous. To me this is like calling a round wheel “a rolling square”.

Overall, this square/cube-centric language insidiously conditions our minds into thinking only in terms of squares and cubes regarding the units of area and volume measurement.

Why do we measure area of a circle in squares?

We all know from middle school’s elementary geometry that area of a circle of radius r is A = πr2 (or A = πd2/4 using diameter d since r = d/2). Many famous ancient mathematicians from Archimedes to Eudoxus of Cnidus to Hippocrates of Chios have proven this formula. There is also a multitude of its proofs using modern methods rooted in trigonometry and calculus.

However, all these methods have one funny thing in common. It is the presumption that an area must be measured in squares (square inches, square meters, square miles, etc.). Think about it for a minute: for thousands of years we have been measuring round areas by the number of squares with a side=1 covering that round area. C’mon! We invented the wheel, electricity, engines, airplanes, telephone, radio, TV, computers, spaceships, artificial intelligence and more, but we are still measuring round areas in squares!

It makes perfect sense measuring rectilinear areas in square units – they are a good fit for each other. But measuring round areas in squares?! What would you say if we were measuring square areas by the number of round coins covering them? How is that not as good as that the opposite?

Introducing circular units of areas

Defining units of measurement is totally arbitrary. All we need is a well-defined gauge. For example, we have a variety of length units such as foot (derived from the size of the human foot), inch (derived from the width of the human thumb), meter (derived from the Greek verb μετρέω  - measure, count) and so on. They are all convertible to each other, but their definitions are quite different.

When it comes to the units of area, it can be a square, a rectangle, a circle, an oval, a star, even a snowflake or any other two-dimensional shape of a specified size as long as its definition is clear and unambiguous.

Without further ado let’s define a circular unit of area as an area of a circle with the diameter d=1 unit of length (inch, foot, meter, etc.).

Measuring circle areas in circular units

In order to measure areas of different sized circles, we do not need to count the number of these single unit circles within the measured circle. We just place these two circles concentrically and expand or shrink our gauge circle until they align.  Here is a visual to illustrate this:

For a circle with the diameter d, we can express its area measured in the circular units Ac using the following formula:Ac(d) = d ⋅ d = d2

I would not call d2 as “d squared” here. The power of 2 is an indication of two-dimensional area while d is a one-dimensional length.

Let’s prove this formula. Suppose As(d) is a circle area with diameter d in square units and Ac(d) is the circle area in circular units.

As we know As(d) = πr2 = π (d/2)2 =  (π/4)d2. For the same circle its area in circular units is Ac(d).

For d=1, we have As(1) = πr2 = π (d/2)2 =  (π/4)d2 = π/4. The same circle area represents 1 circular area unit Ac(1)=1.

Therefore, Ac(d) =  Ac(1) d2 =  d2, that is we effectively eliminated π from the circle area calculation.

Measuring squares in circular units

Circle and square of equal size
At the same time, π is not the kind of number to easily get rid of. Similar to the law of conservation of energy, there is probably a law of conservation of π: whenever we get rid of it in one place, it appears in another place.

Let’s suppose we have a square with side length of x. Then its area in square units will be As(x) = x2. If we take a circle of the same area we get the following equation As(x) = x2 = (π/4)d2. From here d2 = (4/π)x2. Since d2 represents the number of circular units Ac we can say that the area of a square with side length of x in circular units is Ac(x) = (4/π)x2.

Notice that π-wise these two formulas for area, Circle_A=(π/4)d2 [square units]  and Square_A=(4/π)x2 [circular units], are literally upside-down version of each other.

Spheric units for 3D geometry

We can apply exact same approach to measuring volumes of three-dimensional objects. In the cubic unit’s paradigm, we define one cubic unit of volume as a cube with a side of one unit of length. Then the volume of a cube with a side of x is Cube_V=x3 [cubic units]  and the volume of a sphere with a diameter of d=2r is Sphere_V = (4/3)πr3 = (π/6)d3 [cubic units].

Alternatively, in the spheric unit’s paradigm we define one spheric unit of volume as a sphere with a diameter of one unit of length. Then the volume of a sphere with a diameter of d is Sphere_V=d3 [spheric units]  and the volume of a cube with a side of x is Cube_V=(6/π)x3 [spheric units].

Again, here I would not call d3 and x3 as “d cubed” and “x cubed” since the power of 3 is indicative of three-dimensional space while d and x are one-dimensional units of length.

Similarly to area calculations, π-wise these two formulas for volume, Sphere_V=(π/6)d3 [cubic units]  and Cube_V=(6/π)x3 [spheric units], are literally upside-down version of each other.

Is this practical?

If you still doubt the validity and/or practicality of my geometric deductions, check out circular mil units of area adopted for wire size measurement by the US National Electrical Code (NEC) and the Canadian Electrical Code (CEC).

A circular mil (cmil) is a unit of area equal to the area of a circle with a diameter of one mil (one thousandth of an inch or approximately 0.0254 mm).

While applied just to one particular unit of length – mil, it is exactly the same circular units of area concept as described in the previous sections.

As you can see, electrical engineers have figured out a use for circular units, perhaps because they are dealing with electrical wires, which are predominantly round in their cross section. Since electrical conductivity of wires is directly proportional to their cross-section area, it is just more convenient to use circular units (without π) for calculating the cross section and ultimately electrical conductivity. The same is true for the throughput capacity while transporting any substance via pipes (water, gas, oil, etc.) or moving blood through the blood vessels.

There are many other applications dealing with the circular shapes and spherical objects. In all these cases, it makes more sense to measure areas in circular units rather than squares and measure object volumes in spheric units rather than cubes. Take for example physics (gravitational, electrical and magnetic fields), astronomy (planets, stars, comets, and orbits), geology (areas around earthquake epicenters), telephony (areas of cellular tower coverage), etc.

Your thoughts, comments

I would like to hear from you. What do you think about all this? Please share your thoughts in the Comments section below.

ACCESS FREE SAS® SOFTWARE | SAS® OnDemand for Academics

Calculating circle areas and sphere volumes without π was published on SAS Users.

2月 162022
 

Editor’s note: SASjs, while free and useful for anyone, is a gateway to services from 4GL Apps, a company founded and operated by longtime SAS user Allan Bowe.Open code sharing


I would like to start this post with a well-known quote by George Bernard Shaw: “If you have an apple and I have an apple and we exchange these apples then you and I will still each have one apple. But if you have an idea and I have an idea and we exchange these ideas, then each of us will have two ideas.”

SAS Software provides vast functionality delivered with it, but perhaps the biggest asset of the SAS ecosystem is its ever-growing community of SAS users around the globe readily willing to share their SAS code and solutions to various problems. Take for example, SAS Communities, SAS blog or SAS Software on GitHub.

With open-source code taking the world by storm, SAS maintains and nurtures a viable mix of its rock-solid backbone – commercial SAS Software backed by its unparalleled Technical Support and open-source SAS solutions developed and shared by SAS users.

An introduction to SASjs

Here, I am going to introduce you to one more publicly available open source treasure called SASjs (also available on GitHub), which stands out from many others because of its thought-provoking, multifaceted approach.

In a nutshell, the SASjs is a collection of tools aimed at enabling and accelerating development and operations (DevOps) for a broad range of SAS applications, including SAS web development.

SASjs consists of the following three major components:

  1. SAS macro library (MacroCore)
  2. JavaScript library (Adapter)
  3. Command line interface (CLI)

SASjs MacroCore library

The MacroCore library is a centerpiece of the SASjs framework. It includes several hundred SAS macros facilitating various aspects of SAS coding, application development and deployments. All these macros are production quality and free to use (MIT-licensed). The contents of the MacroCore library organized by the following sections:

  • BASE library (SAS9/Viya) - prefixes: mf (macro functions), mp (macro procedures - macros).
  • FCMP library (SAS9/Viya) - prefix: mcf; used to generate user-defined functions using PROC FCMP.
  • META library (SAS9 only) - prefix: mm; used in SAS EBI environments that connect to the metadata server.
  • SERVER library (@sasjs/server only) - prefix: ms; used for building applications using @sasjs/server - an open source REST API for Desktop SAS.
  • VIYA library (Viya only) - prefixes: mv, mvf; used for interfacing with SAS Viya.
  • METAX library (SAS9 only) - prefixes: mmw, mmu, mmx.
  • DDL Library - prefix: mddl; provides the structure for any permanent table used by the macros.
  • LUA library - prefix: ml

All these macros (except METAX) are Operating System independent and operate under NOXCMD option. That option prevents the SAS program from executing OS commands and is often implemented by SAS Administrators for security reasons.

DOCUMENTATION PAGE – here you can find detailed descriptions for all of the SASjs MacroCore macros.

DOWNLOAD PAGE – here you can grab all these macros FREE and deploy (save) anywhere in your SAS environment.

Alternatively, you can make all these macros available in your SAS program by %INCLUDE-ing them directly from the Web – just add this code snippet to the beginning of your SAS program:

filename mcore url "https://raw.githubusercontent.com/sasjs/core/main/all.sas";
%inc mcore;

To give you a taste of what these macros are here is a small sample of them with examples of usage and brief descriptions:

  • %mp_copyfolder(&rootdir,&copydir) - copies the whole directory structure including files and subdirectories from one location (&rootdir) to another (&copydir). Uses only internal SAS functions without OS commands (NOXCMD option).
  • %mp_dirlist(path=/some/location, outds=myTable, maxdepth=MAX) - returns SAS table myTable listing all files and sub-directories up to specified depth. It also uses only internal SAS functions without OS commands (NOXCMD option).
  • %mp_csv2ds(inref=mycsv,outds=myds,baseds=template_ds) - imports relevant columns from mycsv CSV file to myds data table using a template dataset template_ds to provide the types and lengths.
  • %mf_getvarcount(my_dataset) - this macro function returns the number of variables in a dataset.
  • %mf_getvarlist(sashelp.class,dlm=%str(,),quote=double) - this macro function returns the dataset variable list directly from its header. The list can be separated by blanks, commas or any other delimiter (dlm=), and be unquoted or quoted (quote=).

SASjs Javascript library (Adapter)

In my earlier posts about integrating SAS with Google Maps, I employed and implemented the idea of using SAS for generating data-driven HTML and Javascript code for creating web pages.

SASjs, on the other hand, turns on its head this SAS-to-JavaScript paradigm. In a total reverse, it uses JavaScript to generate SAS code for running on SAS servers as jobs or stored processes.

Previously, I naively believed that JavaScript is a web browser language that exists and used only within web browsers to enhance interactions with HTML web pages. SASjs proved me wrong. It turns out JavaScript is a general-purpose language that can be used everywhere outside of the web browser.

SASjs extensively employs this JavaScript functionality. Its second component, SASjs Adapter, is a server-side JavaScript library (and a set of SAS macros) to handle the communication between the frontend HTML5 or client application and the SAS 9 or Viya backend SAS services.

SASjs uses the following two client-side components:

  • Node.js JavaScript runtime environment.
  • Visual Studio Code (VS Code) source code editor.

Node.js is an asynchronous event-driven JavaScript runtime designed to build scalable network applications.

Visual Studio Code (VS Code) is a lightweight but powerful source code editor, which runs on your desktop and is available for Windows, macOS and Linux. It comes with built-in support for JavaScript, TypeScript and Node.js and has a rich collection of extensions for other languages (such as C++, C#, Java, Python, PHP, Go, etc.) and runtimes (such as .NET and Unity). In essence, it is more than just a source code editor, it is an Integrated Development Environment (IDE) allowing you to perform programming along with development as well as debugging and testing the application.

SASjs command line interface (CLI)

The SASjs CLI is the third major component of SASjs that ties together and facilitates creating, building, deploying and executing various SAS web applications (for SAS 9 and SAS Viya).

It is a powerful and flexible tool covering a wide range of features and functionalities to support successful applications development and operations (DevOps). CLI’s primary actions include:

  • Create a template SASjs Git project repository.
  • Compile dependent macros, macro variables and pre/post code.
  • Create/build the master SAS deployment, including Jobs, Services and Tests.
  • Deploy/execute local scripts and remote SAS programs to create your app on the SAS Server.
  • Configure and generate Doxygen documentation for all SAS content.

There is also a feature to let you deploy your frontend as a service, bypassing the need to access the SAS Web Server.

SASjs support

Even though SASjs is an open source platform, it provides different levels of support for its products.

Report defects. Through this GitHub web page SAS users can raise issues with SASjs such as report defects, suggest improvements, etc.

Engage with SASjs developers. SAS users can also communicate with SASjs developers, receive free clarifications and consultations on various aspects of SASjs usage, or even hire them for involvement that is more extensive.

Questions? Thoughts? Comments?

Do you find this post useful? Have you learned anything new? Do you have any other SAS open-source assets you can share? Do you have questions, concerns, comments? Please share with us below.

SAS open-source treasures from around the world: SASjs was published on SAS Users.

1月 132022
 

3 ways of calculating length of overlapping of two integer intervalsIn this post, we tackle the problem of calculating the length of overlap of two date/time intervals. For example, we might have two events — each lasting several days — and need to calculate the number of shared days between both events. In other words, we need to calculate the length of the events’ overlap.

Such tasks are typical in clinical trials programming, health (and non-health) insurance claim processing, project management, planning and scheduling, etc.

The length of the overlap can be measured not only in days, but also in any other units of the date/time dimension: hours, minutes, seconds, weeks, months, years and so on. Moreover, the date/time application is just one special case of a broader task of calculating the overlap length of two integer intervals. An integer interval [x .. y] is a set of all consecutive integer numbers beginning with x and ending with y (boundaries included and x ≤ y).

Instead of suggesting a single “best” way of solving this problem, I offer three different strategies. This allows you to compare and weigh their pros and cons and decide for yourself which approach is most suitable for your circumstances. All three solutions presented in this blog post are applicable to the date/time use case, as well as its superset of integer intervals’ overlap length calculation.

Problem description

Suppose we have two events A [A1 .. A2] and B [B1 .. B2]. Event A lasts from date A1 until date A2 (A1 ≤ A2), and event B lasts from date B1 until date B2 (B1 ≤ B2).  Both starting and ending dates are included in the corresponding event intervals.

We need to calculate the overlap between these two events, defined as the number of days that belong to both events A and B.

Overlapping intervals: sample data

Before solving the problem, let’s create sample data to which we are going to apply our possible solutions.

The following data step creates data set EVENTS representing such a data sample:

data EVENTS;
   input A1 A2 B1 B2;
   format A1 A2 B1 B2 date9.;
   informat A1 A2 B1 B2 mmddyy10.;
   lines;
01/02/2022 01/05/2022 01/06/2022 01/10/2022
01/22/2022 01/30/2022 01/16/2022 01/18/2022
01/02/2022 01/05/2022 01/03/2022 01/10/2022
01/02/2022 01/05/2022 01/03/2022 01/04/2022
01/10/2022 01/15/2022 01/06/2022 01/14/2022
01/01/2022 01/05/2022 01/05/2022 01/09/2022
01/07/2022 01/13/2022 01/10/2022 01/13/2022
;

Now let’s go over the following three distinct possible solutions, each with its own merit and downside. At the end, you decide which solution you like the most.

Solution 1: Brute force

In general, brute force algorithms solve problems by exhaustive iteration through all possible choices until a solution is found. It rarely results in a clever, efficient solution. However, it is a straightforward, valid and important algorithm design strategy that is applicable to a wide variety of problems.

For integer intervals overlap calculation, a brute force approach would involve the following steps:

  1. Determine the earliest date in the definitions of our two intervals
  2. Determine the latest date in the definitions of our two intervals
  3. Create an iterative loop with index variable ranging from the earliest to the latest date
  4. Within this loop, increment a counter by one if index variable falls within both intervals
  5. At the end of this loop, the counter will equal the desired value for the overlap.

Here is the code implementation of the described above algorithm:

data RESULTS;
   set EVENTS;
   OVERLAP = 0;
   do i=min(A1,B1) to max(A2,B2);
      if (A1<=i<=A2) and (B1<=i<=B2) then OVERLAP + 1;
   end;
run;

As you can see, the code implementation of the brute force solution is quite simple. However, it is hardly efficient as it involves extensive looping.

Solution 2: Exhaustive logic

Exhaustive logic algorithms solve problems by splitting them into several smaller problems (logical units or classes). This way, instead of working on a problem in its entirety, you can work separately on smaller and simpler problems that are easier to comprehend, digest and solve.

Here is an illustration of various arrangements of integer intervals representing separate logical units (numbered 1...5) when calculating the intervals overlaps:

Different arrangements of date/time integer intervals

The following SAS code demonstrates an implementation of the exhaustive logic algorithm (each IF-THEN statement corresponds to the numbered logical unit in the above figure):

data RESULTS;
   set EVENTS;
   if A1<=B1 and A2>=B2    then OVERLAP = B2 - B1 + 1; else
   if A1>=B1 and A2<=B2    then OVERLAP = A2 - A1 + 1; else
   if A1<=B1<=A2 and B2>A2 then OVERLAP = A2 - B1 + 1; else
   if B1<=A1<=B2 and A2>B2 then OVERLAP = B2 - A1 + 1; else
   if B1>A2 or A1>B2       then OVERLAP = 0;
run;

Notice that we need to add 1 to each of the two days difference. That is because the number of days spanned is always 1 less than the number of days contained in an interval. Here is an illustration:

Days spanned vs. number of days

Solution 3: Holistic Math

Finally, a holistic math method looks at the problem in its entirety and condenses it to a comprehensive mathematical formula. While it is usually not the most obvious approach, and often requires some brain stretching and internalization, the effort is often rewarded by achieving the best possible solution.

The exhaustive logic solution may help in better understanding the problem and arriving at the generalized mathematical formula.

If you dwell on the exhaustive logic solution for a while, you may come to realization that all the variety of the logical units boils down to the following single simple formula:

Overlap = min(A2, B2) - max(A1, B1) + 1.

In other words, the overlap of two integer intervals is a difference between the minimum value of the two upper boundaries and the maximum value of the two lower boundaries, plus 1.

Positive OVERLAP value represents actual overlap days, while zero or negative value means that intervals do not overlap at all. The higher absolute value of the negative result – the farther apart the intervals are.  To correct this “negativity” effect we can just set all negative values to zero.

The following SAS code implements the described holistic math approach:

data RESULTS;
   set EVENTS;
   OVERLAP = min(A2,B2) - max(A1, B1) + 1;
   if OVERLAP<0 then OVERLAP = 0;
run;

As you can see, this in the most concise, elegant, and at the same time the most efficient solution with no iterative looping and complicated exhaustive (and exhausting) logic.

Output

Here is the output data table RESULTS showing our sample intervals (columns A1, A2, B1 and B2) and the resulting column OVERLAP (all three solutions described above should produce identical OVERLAP values):

Output data set with calculated days overlaps

Questions? Thoughts? Comments?

Which of the presented here three methods do you like most? Which one would you prefer under different circumstances? Can you come up with yet another way of calculating date intervals overlaps? Do you have questions, concerns, comments? Please share with us below.

Additional resources

Calculating the overlap of date/time intervals was published on SAS Users.

10月 142021
 

Trimming strings left and right

I am pretty sure you have never heard of the TRIMS function, and I would be genuinely surprised if you told me otherwise. This is because this function does not exist (at least at the time of this writing).

But don’t worry, the difference between "nonexistence" and "existence" is only a matter of time, and from now it is less than a blog away. Let me explain. Recently, I published two complementary blog posts:

[1] Removing leading characters from SAS strings

[2] Removing trailing characters from SAS strings

While working on these pieces and researching “prior art” I stumbled upon a multipurpose function in the SAS FedSQL Language that alone does either one or both of these things – remove leading or/and trailing characters from SAS strings.

FedSQL Language and Proc FedSQL

The FedSQL language is the SAS proprietary implementation of the ANSI SQL:1999 core standard. Expectedly, the FedSQL language is implemented in SAS by means of the FedSQL procedure (PROC FEDSQL). This procedure enables you to submit FedSQL language statements from a Base SAS session, and it is supported in both SAS 9.4 and SAS Viya.

Using the FEDSQL procedure, you can submit FedSQL language statements to SAS and third-party data sources that are accessed with SAS and SAS/ACCESS library engines. Or, if you have SAS Cloud Analytic Services (CAS) configured, you can submit FedSQL language statements to the CAS server.

FedSQL TRIM function

FedSQL language has its own vast FedSQL Functions library with hundreds of functions many of which replicate SAS 9.4 Functions. Many, but not all. Deep inside this FedSQL functions library, there is a unique treasure modestly called TRIM Function which is quite different from the BASE SAS Language TRIM() function.

While SAS 9.4 BASE TRIM() function capabilities are quite limited - it removes just trailing blanks from a character string, the FedSQL TRIM() function is way much more powerful. This triple-action function can remove not just trailing blanks, but also leading blanks, as well as both, leading and trailing blanks. On top of it, it can remove not just blanks, but any characters (although one character at a time). See for yourself, this function has the following pretty self-explanatory syntax:

TRIM( [BOTH | LEADING | TRAILING] [trim-character] FROM column)

Here trim-character specifies one character (in single quotations marks) to remove from column. If trim-character is not specified, the function removes blanks.

While being called a function, it does not look like a regular SAS function where arguments are separated by commas.  It looks more like an SQL statement (which it understandably is – it is part of the FedSQL language). However, this function is available only in PROC FEDSQL; it’s not available in SAS DATA steps or other PROC steps. Still, it gives us pretty good idea of what such a universal function may look like.

User-defined function TRIMS to remove leading or/and trailing characters in SAS strings

Let’s build such a function by means of the PROC FCMP for the outside the FedSQL usage (it is worth noticing that the FCMP procedure is not supported for FedSQL). To avoid confusion with the existing TRIM function we will call our new function TRIMS (with an ‘S’ at the end) which suits our purpose quite well denoting its plural purpose. First, we define what we are going to create.

User-defined TRIMS function

TRIMS Function

Removes leading characters, trailing characters, or both from a character string.

Syntax

TRIMS(function-modifier, string, trim-list, trim-list-modifier)

Required Arguments

  • function-modifier is a case-insensitive character constant, variable, or expression that specifies one of three possible operations:
    'L' or 'l' – removes leading characters.
    'T' or 't' – removes trailing characters.
    'B' or 'b' – removes both, leading and trailing characters.
  • string is a case-sensitive character constant, variable, or expression that specifies the character string to be trimmed.
  • trim-list is a case-sensitive character constant, variable, or expression that specifies character(s) to remove from the string.
  • trim-list-modifier is a case-insensitive character constant variable, or expression that supplements the trim-list.
    The valid values are those modifiers of the FINDC function that “add” groups of characters (e.g. 'a' or 'A', 'c' or 'C', 'd' or 'D', etc.) to the trim-list.

The following user-defined function implementation is based on the coding techniques described in the two previous posts, [1] and [2] that I mentioned above. Here goes.

 
libname funclib 'c:\projects\functions';
 
/* delete previous function definition during debugging */
options cmplib=funclib.userfuncs;
proc fcmp outlib=funclib.userfuncs.package1;
   deletefunc trims;
run;
 
/* new function defintion */
proc fcmp outlib=funclib.userfuncs.package1;
   function trims(f $, str $, clist $, mod $) $32767;
      from = 1;
      last = length(str);
      if upcase(f) in ('L', 'B') then from = findc(str, clist, 'K'||mod);
      if from=0 then return('');
      if upcase(f) in ('T', 'B') then last = findc(str, clist, 'K'||mod, -last); 
      if last=0 then return('');
      return(substr(str, from, last-from+1));      
   endfunc; 
run;

Code highlights

  • In the function definition, we first assign initial values of the target substring positions as from=1 and last=length(str).
  • Then for Leading or Both character removal, we calculate an adjusted value of from as a position of the first character in str that is not listed in clist and not defined by the mod
  • If from=0 then we return blank and stop further calculations as this means that ALL characters are to be removed.
  • Then for Trailing or Both character removal, we calculate an adjusted value of last as a position of the last character in str that is not listed in clist and not defined by the mod
  • If last=0 then we return blank and stop further calculations as this means that ALL characters are to be removed.
  • And finally, we return a substring of str starting at the from position and ending at the last position, that is with the length of last-from+1.

TRIMS function usage

Let’s define SAS data set SOURCE as follows:

data SOURCE;
   input X $ 1-30;
   datalines;
*00It's done*2*1**-
*--*1****9*55
94*Clean record-*00
;

In the following DATA step, we will create three new variables with removed leading (variable XL), trailing (variable XT) and both - leading and trailing (variable XB) characters '*' and '-' as well as any digits:

options cmplib=funclib.userfuncs;
data TARGET;
   set SOURCE;
   length XB XL XT $30;
   XB = trims('b', X, '*-', 'd');
   XL = trims('L', X, '*-', 'd');
   XT = trims('t', X, '*-', 'd');
run;

In this code we use the TRIM function three times, each time with a different first argument to illustrate how this affects the outcome.

Arguments usage highlights

  • The first argument of the TRIMS function specifies whether we remove characters from both leading and trailing positions ('b'), from leading positions only ('L'), or from trailing positions only ('t'). This argument is case-insensitive. (I prefer using capital 'L' for clarity since lowercase 'l' looks like digit '1').
  • The second argument specifies the name of the variable (X) that we are going to remove characters from (variable X is coming from the dataset SOURCE).
  • The third argument '*-' specifies which character (or characters) to remove. In our example we are removing '*' and '-'. If you do not need to explicitly specify any character here, you still must supply a null value ('') since it is a required argument. In this case, the fourth argument (trim-list-modifier) will determine the set of characters to be removed.
  • And finally, the fourth argument (case-insensitive) of the TRIMS function specifies the FINDC function modifier(s) to remove certain characters in bulk (in our example 'd' will remove all digits). If such modifier is not needed, you still must supply a null value ('') since all four arguments of the TRIMS function are positional and required.

Here is the output data table TARGRET showing the original string X and the resulting strings XB (Both leading and trailing characters removed), XL (Leading characters removed) and XT (Trailing characters removed) side by side:

Result of leading and trailing characters trimming

Conclusion

The new TRIMS function presented in this blog post goes ways further the ubiquitous LEFT and TRIM functions that remove the leading (LEFT) or trailing (TRIM) blanks. The TRIMS function handles ANY characters, not just blanks. It also expands the character deletion functionality of the powerful  FedSQL TRIM function beyond just removing any single leading and/or trailing character. The TRIMS function single-handedly removes any number of explicitly specified characters from either leading, trailing or both (leading and trailing) positions. Plus, it removes in bulk many implicitly specified characters. For example 'd' modifier removes all digits, 'du' modifier removes all digits ('d') and all uppercase letters ('u'), 'dup' modifier removes all digits ('d'), all uppercase letters ('u') and all punctuation marks ('p'); and so on as described by the FINDC function modifiers. The order in which modifier characters are listed does not matter.

Additional resources

Questions? Thoughts? Comments?

Do you find this post useful? Please share your thoughts with us below.

Introducing TRIMS function to remove any leading and/or trailing characters from SAS strings was published on SAS Users.

9月 022021
 

Deleting any trailing characters in SAS stringsIn my previous post, we solved the task of removing specified leading characters from SAS strings. In this post, we tackle the complementary task of removing trailing characters.

While removing trailing blanks is well covered in SAS by the TRIM() and TRIMN() functions, removing non-blank trailing characters remains a bit of a mystery that can pop up during text string processing.

For example, you may need to clean up the following strings by removing all trailing x's from them:

012345x
012345xxx
012345xx

These extra characters can result from data entry errors, prior data manipulations, etc. No matter how you get them in, you want them out.

How to remove any trailing characters

For instance, let’s remove all occurrences of the arbitrary trailing character '*'. The following diagram illustrates what we are going to achieve and how:
Diagram: Deleting trailing characters
In order to remove a specified character (in this example '*') from all trailing positions in a string, we need to search our string from right to left starting from the rightmost non-blank character position and find the position p of the first character in that string that is not equal to the specified character. Note, that despite our right-to-left direction of search the position p=10 is still enumerated from left-to-right. Then we can extract the substring starting from position 1 with the length equal to the found position p.

Unlike in our leading characters removal solution, out of two contenders for our search functionality, VERIFY and FINDC, the VERIFY function has to be dropped from the competition as it does not provide right-to-left search functionality. However, the FINDC function stays on track. Here is a possible solution using the FINDC() function.

Using FINDC() function with negative start-position

The FINDC(X, C, ’K’, -LENGTH(X)) function searches string X from right to left starting from the last non-blank character position determined by the optional start-position argument equal to LENGTH(X), and returns the position P of the first character that does not appear in C.

Here we use the K modifier that switches the default behavior of searching for any character that appears in C to searching for any character that does not appear in C.

The direction of search is defined by the minus sign in front of the start-position (a negative start-position argument translates into searching from right to left.)

Then we can apply the SUBSTR(X, 1, P) function that extracts a substring of X starting from position 1 with a length of P which is effectively a substring of the first P characters in X.

Special considerations

Before we proceed to the code implementation of the outlined solution let’s consider the following edge case.

If our string X consists of all '*' characters and nothing else the FINDC() function will find no character (and therefore no position) that is not equal to '*'. In this case it will return 0. However, 0 is not a valid third argument value for the SUBSTR(X, 1, P) function. Valid values are 1 . . . through VLENGTH(X) – the length attribute of X. Having a 0 value for the third argument will trigger the automatic data step variable _ERROR_=1 and the following note generated in the SAS log:

NOTE: Invalid third argument to function SUBSTR at line ## column #.

Therefore, we need to handle this special case separately, conditionally using SUBSTR(X, 1, P) for P>0 and assigning blank ('') otherwise.

Code implementation for removing trailing characters

Now we are ready to put everything together. First, let’s create a test data table:

data TEST;
   input X $ 1-20;
   datalines;
*It's done***
*********
**01234*ABC**
No trailing *'s
;

Then we apply the logic described above. The following DATA step illustrates our coding solution for deleting trailing characters:

data CLEAN (keep=X Y);
   set TEST;
   C = '*'; *<- trailing character(s) to be removed;
 
   P = findc(X, C, 'K', -length(X));
   if P then Y = substr(X, 1, P); 
        else Y = '';
 
   put _n_= / X= / P= / Y= /;
run;

The SAS log will show interim and final results by the DATA step iterations:

_N_=1
X=*It's done***
P=10
Y=*It's done
 
_N_=2
X=*********
P=0
Y=
 
_N_=3
X=**01234*ABC**
P=11
Y=**01234*ABC
 
_N_=4
X=No trailing *'s
P=15
Y=No trailing *'s

Here is the output data table CLEAN showing the original and the resulting strings X and Y side by side:
Removing any trailing characters in SAS strings

Conclusion

The solution presented in this blog post expands trailing character deletion functionality beyond solely blanks (which are handled by the TRIM and TRIMN functions). Moreover, using this coding technique, we can simultaneously remove a variety of trailing characters. For example, if we have a string X='012345xxx.%' and specify C = 'x.%' (the order of characters listed within the value of C does not matter), then all three characters 'x', '.', and '%' will be removed from all trailing positions of X. The resulting string will be Y='012345'.

In addition, numerous modifiers of the FINDC() function allow specifying many characters in bulk, without explicitly listing them one by one. For example, we may augment a list of characters being removed by adding the D modifier as in P = FINDC(X, C, 'KD', -LENGTH(X)) which will remove all trailing digits in addition to those characters specified in C. Similarly, we may throw in the U modifier as in P = FINDC(X, C, 'KDU', -LENGTH(X)) which adds all uppercase letters to the list of trailing characters to be removed. And so on.

Additional resources

Questions? Thoughts? Comments?

Do you find this post useful? Do you have questions, concerns, comments? Please share with us below.

Removing trailing characters from SAS strings was published on SAS Users.

8月 232021
 

Illustration for trimming leading characters in SAS stringsAs in many other programming languages, there is a very useful SAS function that removes leading blanks in character strings. It is the ubiquitous LEFT function.

The LEFT(x) function left-aligns a character string x, which effectively removes leading blanks.

However, in many SAS applications we need a similar but more versatile data cleansing functionality allowing for removal of other leading characters, not just blanks. For example, consider some bank account numbers that are stored as the following character strings:

123456789
0123456789
000123456789

These strings represent the same account number recorded with either no, one, or several leading zeros. One way of standardizing this data is by removing the leading 0's. And while we're at it, why don’t we address the leading character removal functionality for any leading characters, not just zeros.

How to remove any leading characters

For example, let’s remove all occurrences of the arbitrary leading character '*'. The following diagram illustrates what we are going to achieve and how:

In order to remove a specified character (in this example '*') from all leading positions in a string, we need to search our string from left to right and find the position of the first character in that string that is not equal to the specified character. In this case, it’s a blank character in position 4. Then we can extract a substring starting from that position till the end of the string.

I can see two possible solutions.

Solution 1: Using VERIFY() function

The VERIFY (X, C) function searches string X from left to right and returns the position P of the first character that does not appear in the value of C.

Then we can apply the SUBSTR(X,P) function that extracts a substring of X starting from position P till the end of the string X.

Solution 2: Using FINDC() function


The FINDC(X, C, ‘K’) function also searches string X from left to right and returns the position P of the first character that does not appear in C. (The modifier ‘K’ switches the default behavior of searching for any character that appears in C to searching for any character that does not appear in C.)

Then, as with the VERIFY() function, we can apply the SUBSTR(X,P) function that extracts a substring of X starting from position P till the end of the string X.

Special considerations

So far so good, and everything will be just hunky-dory, right? Not really - unless we cover our bases by handling edge cases.

Have we thought of what would happen if our string X consisted of all '*' characters and nothing else? In this special case, both the verify() function and findc() function will find no position of the character that is not equal to '*' and thus return 0.

However, 0 is not a valid second argument value for the SUBSTR(X,P) function. Valid values are 1 . . . through length(X). Having a 0 value for the second argument will trigger the automatic data step variable _ERROR_=1 and the following note generated in the SAS log:

NOTE: Invalid second argument to function SUBSTR at line ## column #.

Therefore, we need to handle this special case separately, conditionally using SUBSTR(X,P) for P>0 and assigning blank ('') otherwise.

Code implementation for removing leading characters

Let’s put everything together. First, we'll create a test data table:

data TEST;
   input X $ 1-20;
   datalines;
*** It's done*
*********
**01234*ABC**
No leading *'s
;

Then we apply the logic described above. The following DATA step illustrates our two implemented coding solutions for removing leading characters:

data CLEAN (keep=X Y Z);
   set TEST;
   C = '*'; *<- leading character(s) to be removed;
 
   P1 = verify(X,C); *<- Solution 1;
   if P1 then Y = substr(X, P1);    
         else Y = '';
 
   P2 = findc(X,C,'K'); *<- Solution 2;
   if P2 then Z = substr(X, P2); 
         else Z = '';
 
   put _n_= / X= / P1= / Y= / P2= / Z= /;
run;

Alternatively, we can replace the IF-THEN-ELSE construct with this IFC() function one-liner:

data CLEAN (keep=X Y Z);
   set TEST;
   C='*'; *<- leading character(s) to be removed;
 
   P1 = verify(X,C); *<- Solution 1;
   Y = ifc(P1, substr(X, P1), '');
 
   P2 = findc(X,C,'K'); *<- Solution 2;
   Z = ifc(P2, substr(X, P2), '');
 
   put _n_= / X= / P1= / Y= / P2= / Z= /;
run;

The SAS log will show interim and final results by the DATA step iterations:

_N_=1
X=*** It's done*
P1=4
Y=It's done*
P2=4
Z=It's done*
 
_N_=2
X=*********
P1=10
Y=
P2=10
Z=
 
_N_=3
X=**01234*ABC**
P1=3
Y=01234*ABC**
P2=3
Z=01234*ABC**
 
_N_=4
X=No leading *'s
P1=1
Y=No leading *'s
P2=1
Z=No leading *'s

Here is the output data table CLEAN showing the original string X, and resulting strings Y (solution 1) and Z (solution 2) side by side:
Removing any leading characters in SAS strings
As you can see, both solutions (1 & 2) produce identical results.

Conclusion

Compared to the LEFT() function, the solution presented in this blog post not only expands leading character removal/cleansing functionality beyond the blank character exclusively. Using this coding technique we can simultaneously remove a variety of leading characters (including but not limited to blank). For example, if we have a string X=' 0.000 12345' and specify C = ' 0.' (the order of characters listed within the value of C does not matter), then all three characters ' ', '0', and '.' will be removed from all leading positions of X. The resulting string will be Y='12345'.

Additional resources

Questions? Thoughts? Comments?

Do you find this post useful? Do you have questions, concerns, comments? Please share with us below.

Removing leading characters from SAS strings was published on SAS Users.

7月 062021
 

Complex loops in SAS programmingIterative loops are one of the most powerful and imperative features of any programming language, allowing blocks of code to be automatically executed repeatedly with some variations. In SAS we call them DO-loops because they are defined by the iterative DO statements. These statements come in three distinct forms:

  • DO with index variable
  • DO UNTIL
  • DO WHILE

In this blog post we will focus on the versatile iterative DO loops with index variable pertaining to SAS DATA steps, as opposed to its modest IML’s DO loops subset.

Iterative DO statement with index variable

The syntax of the DATA step’s iterative DO statement with index variable is remarkably simple yet powerful:

DO statement with index-variable

DO index-variable=specification-1 <, ...specification-n>;

...more SAS statements...

END;

It executes a block of code between the DO and END statements repeatedly, controlled by the value of an index variable. Given that angle brackets (< and >) denote “optional”, notice how index-variable requires at least one specification (specification-1) yet allows for multiple additional optional specifications (<, ...specification-n>) separated by commas.

Now, let’s look into the DO statement’s index-variable specifications.

Index-variable specification


Each specification denotes an expression, or a series of expressions as follows:

start-expression <TO stop-expression> <BY increment-expression> <WHILE (expression) | UNTIL (expression)>

Note that only start-expression is required here whereas <TO stop-expression>, <BY increment-expression>, and <WHILE (expression) or UNTIL (expression)> are optional.

Start-expression may be of either Numeric or Character type, while stop-expression and increment-expression may only be Numeric complementing Numeric start-expression.

Expressions in <WHILE (expression) | UNTIL (expression)> are Boolean Numeric expressions (numeric value other than 0 or missing is TRUE and a value of 0 or missing is FALSE).

Other iterative DO statements

For comparison, here is a brief description of the other two forms of iterative DO statement:

  • The DO UNTIL statement executes statements in a DO loop repetitively until a condition is true, checking the condition after each iteration of the DO loop. In other words, if the condition is true at the end of the current loop it will not iterate anymore, and processing continues with the next statement after END. Otherwise, it will iterate.
  • The DO WHILE statement executes statements in a DO loop repetitively while a condition is true, checking the condition before each iteration of the DO loop. That is if the condition is true at the beginning of the current loop it will iterate, otherwise it will not, and processing continues with the next statement after the END.

Looping over a list of index variable values/expressions

DO loops can iterate over a list of index variable values. For example, the following DO-loop will iterate its index variable values over a list of 7, 13, 5, 1 in the order they are specified:

data A; 
   do i=7, 13, 5, 1;
      put i=;
      output;
   end;
run;

This is not yet another form of iterative DO loop as it is fully covered by the iterative DO statement with index variable definition. In this case, the first value (7) is the required start expression of the required first specification, and all subsequent values (13, 5 and 1) are required start expressions of the additional optional specifications.

Similarly, the following example illustrates looping over a list of index variable character values:

data A1;
   length j $4;
   do j='a', 'bcd', 'efgh', 'xyz';
      put j=;
      output;
   end;
run;

Since DO loop specifications denote expressions (values are just instances or subsets of expressions), we can expand our example to a list of actual expressions:

data B;
   p = constant('pi');
   do i=round(sin(p)), sin(p/2), sin(p/3);
      put i=;
      output;
   end;
run;

In this code DO-loop will iterate its index variable over a list of values defined by the following expressions: round(sin(p)), sin(p/2), sin(p/3).

Infinite loops

Since <TO stop> is optional for the index-variable specification, the following code is perfectly syntactically correct:

data C;
   do j=1 by 1;
      output;
   end;
run;

It will result in an infinite (endless) loop in which resulting data set will be growing indefinitely.

While unintentional infinite looping is considered to be a bug and programmers’ anathema, sometimes it may be used intentionally. For example, to find out what happens when data set size reaches the disk space capacity… Or instead of supplying a “big enough” hard-coded number (which is not a good programming practice) for the loop’s TO expression, we may want to define an infinite DO-loop and take care of its termination and exit inside the loop. For example, you can use IF exit-condition THEN LEAVE; or IF exit-condition THEN STOP; construct.

LEAVE statement immediately stops processing the current DO-loop and resumes with the next statement after its END.

STOP statement immediately stops execution of the current DATA step and SAS resumes processing statements after the end of the current DATA step.

The exit-condition may be unrelated to the index-variable and be based on some events occurrence. For instance, the following code will continue running syntactically “infinite” loop, but the IF-THEN-LEAVE statement will limit it to 200 seconds:

data D;
   start = datetime();
   do k=1 by 1;
      if datetime()-start gt 200 then leave;
      /* ... some processing ...*/
      output; 
   end;
run;

You can also create endless loop using DO UNTIL(0); or DO WHILE(1); statement, but again you would need to take care of its termination inside the loop.

Changing “TO stop” within DO-loop will not affect the number of iterations

If you think you can break out of your DO loop prematurely by adjusting TO stop expression value from within the loop, you may want to run the following code snippet to prove to yourself it’s not going to happen:

data E;
   n = 4;
   do i=1 to n;
      if i eq 2 then n = 2;
      put i=;
      output;
   end;
run;

This code will execute DO-loop 4 times despite that you change value of n from 4 to 2 within the loop.

According to the iterative DO statement documentation, any changes to stop made within the DO group do not affect the number of iterations. Instead, in order to stop iteration of DO-loop before index variable surpasses stop, change the value of index-variable so that it becomes equal to the value of stop, or use LEAVE statement to jump out of the loop. The following two examples will do just that:

data F;
   do i=1 to 4;
      put i=;
      if i eq 2 then i = 4;
      output;
   end;
run;
 
data G;
   do i=1 to 4;
      put i=;
      if i eq 2 then leave;
      output;
   end;
run;

Know thy DO-loop specifications

Here is a little attention/comprehension test for you.

How many times will the following DO-loop iterate?

data H;
   do i=1, 7, 3, 6, 2 until (i>3);
      put i=;
      output;
   end;
run;

If your answer is 2, you need to re-read the whole post from the beginning (I am only partly joking here).

You may easily find out the correct answer by running this code snippet in SAS. If you are surprised by the result, just take a closer look at the DO statement: there are 5 specifications for the index variable here (separated by commas) whereas UNTIL (expression) belongs to the last specification where i=2. Thus, UNTIL only applies to a single value of i=2 (not to any previous specifications of i =1,7,3,6); therefore, it has no effect as it is evaluated at the end of each iteration.

Now consider the following DO-loop definition:

data Z;
   pi = constant('pi');
   do x=3 while(x>pi), 10 to 1 by -pi*3, 20, 30 to 35 until(pi);
      put x=;
      output;
   end;
run;

I hope after reading this blog post you can easily identify the index variable list of values the DO-loop will iterate over. Feel free to share your solution and explanation in the comments section below.

Additional resources

Questions? Thoughts? Comments?

Do you find this post useful? Do you have questions, other secrets, tips or tricks about the DO loop? Please share with us below.

Little known secrets of DO-loops with index variables was published on SAS Users.

6月 152021
 

Dealing with big dataIn this fast-paced data age, when the sheer volume of data (generated, collected, and waiting to be processed and analyzed) grows at a breathtaking rate, the speed of data processing becomes critically important. In many cases, if data is not processed within an allotted time frame, we lose all its value as it becomes obsolete and ultimately irrelevant. That is why computing power becomes of the essence.

However, computing power itself does not guarantee timely processing. How we use that power makes all the difference. Way too often good old sequential processing just does not cut it anymore and different computing methods are required. One  such method is parallel processing.

In my previous post Using shell scripts for massively parallel processing I demonstrated a script-centered technique of running in parallel multiple independent SAS processes in SAS environments lacking SAS/CONNECT.

In this post, we will take a shot at a slightly different task and solution. Instead of having several totally independent processes, now we have some common “pre-processing” part, then we run several independent processes in parallel, and then we combine the results of parallel processing in the “post-processing” portion of our program.

Problem: monthly data ingestion use case

For simplification, we are going to use a scenario similar to one in the previous blog post:

Each month, shortly after the end of the previous month we needed to ingest a number of CSV files pertinent to transactions during the previous month and produce daily SAS data tables for each day of the previous month. Only now, we will go a step further: combining all those daily tables into a monthly table.

Solution: combining sequential and parallel processing

The solution is comprised of the three major components:

  • Shell script running the main SAS program.
  • Main SAS program, consisting of three parts: pre-parallel processing, parallel processing, and post-parallel processing.
  • Single thread SAS program responsible for a single day data ingestion.

1. Shell script running main SAS program

Below shell script mainprog.sh runs the main SAS program mainprog.sas:

#!/bin/sh
 
# HOW TO CALL:
# nohup sh /path/mainprog.sh YYYYMM &
 
now=$(date +%Y.%m.%d_%H.%M.%S)
 
# getting YYYYMM as a parameter in script call
ym=$1
 
pgmname=/path/mainprog.sas
logname=/path/saslogs/mainprog_$now.log
sas $pgmname -log $logname -set inDate $ym -set logname $logname

The script is run a background mode as indicated by the ampersand at the end of its invocation command:

nohup sh /path/mainprog.sh YYYYMM &

We pass a parameter YYYYMM (e.g. 202106) indicating year and month of our request.

When we call SAS program mainprog.sas within the script we indicate the name of the SAS log file to be created (-log $logname) and also pass on inDate parameter (-set inDate $ym, which has the same value YYYYMM as parameter specified in the script calling command), and logname parameter (-set logname $logname). As you will see further, we are going to use these two parameters within mainprog.sas program.

2. Main SAS program

Here is an abridged version of the mainprog.sas program:

/* ======= pre-processing ======= */
 
/* parameters passed from shell script */
%let inDate = %sysget(inDate);
%let logname = %sysget(logname);
 
/* year and month */
%let yyyy = %substr(inDate,1,4);
%let mm = %substr(inDate,5,2);
 
/* output data library */
libname SASDL '/data/target';
 
/* number of days in month mm of year yyyy */
%let days = %sysfunc(day(%sysfunc(mdy(&mm+1,1,&yyyy))-1));
 
/* ======= parallel processing ======= */
%macro loop;
   %local threadprog looplogdir logdt workpath tasklist i z threadlog cmd;
   %let threadprog = /path/thread.sas;
   %let looplogdir = %substr(&logname,1,%length(&logname)-4)_logs;
   x "mkdir &looplogdir"; *<- directory for loop logs;
   %let logdt = %substr(&logname,%length(&logname)-22,19);
   %let workpath = %sysfunc(pathname(WORK));
   %let tasklist=;
   %do i=1 %to &days;
      %let z = %sysfunc(putn(&i,z2.));
      %let threadlog = &looplogdir/thread_&z._&logdt..log;
      %let tasklist = &tasklist DAY&i;
      %let cmd = sas &threadprog -log &threadlog -set i &i -set workpath &workpath -set inDate &inDate;
      systask command "&cmd" taskname=DAY&i;
   %end;
 
   waitfor _all_ &tasklist;
 
%mend loop;
%loop
 
/* ======= post-processing ======= */
 
/* combine daily tables into one monthly table */
data SASDL.TARGET_&inDate;
   set WORK.TARGET_&inDate._1 - WORK.TARGET_&inDate._&days;
run;

The key highlights of this program are:

  • We capture values of the parameters passed to the program (inDate and logname).
  • Based on these parameters, assign source directory and target data library SASDL.
  • Calculate number of days in a specific month defined by year and month.
  • Create a directory to hold SAS logs of all parallel threads; the directory name is matching the log name of the mainprog.sas.
  • Capture the WORK library location of the main SAS session running mainprog.sas as:

    %let workpath = %sysfunc(pathname(WORK));We use that location in the thread sessions to pass back to the main session data produced by the thread sessions.

  • Macro %do-loop generates a series of SYSTASK statements to spawn additional SAS sessions in the background mode, each ingesting data for a single day of a month:

    systask command "&cmd" taskname=DAY&i;The SYSTASK statement enables you to execute host-specific commands from within your SAS session or application. Unlike the X statement, the SYSTASK statement runs these commands as asynchronous tasks, which means that these tasks execute independently of all other tasks that are currently running. Asynchronous tasks run in the background, so you can perform additional tasks (including launching other asynchronous tasks) while the asynchronous task is still running.

    Restriction: SYSTASK statement is not supported on the CAS server.

  • Also, we generate a cumulative list of all tasknames assigned to each thread sessions:

    %let tasklist = &tasklist DAY&i;

  • Outside the macro %do-loop we use WAITFOR statement which suspends execution of the main SAS session until the specified tasks finish executing. Since we created a list of all daily thread sessions (&tasklist), this will synchronize all our parallel threads and continue mainprog.sas session only when all threads finished executing.
  • At the end of the main SAS session we concatenate all our daily data tables that have been created by parallel threads in the location of the WORK library of the main SAS session.

Using SAS macro loop to generate a series of SYSTASK statements for parallel processing is not the only method available. Alternatively, you can achieve this within a data step using CALL EXECUTE. In this case, each data step iteration will generate a single global SYSTASK statement and push it out of the data step boundaries where they will be sequentially executed (just like in the case of macro implementation). Since option NOWAIT is the default for SYSTASK statements, despite all of them being launched sequentially, their corresponding OS commands will be still running in parallel.

3. Single thread SAS program

Here is an abridged version of the thread.sas program:

/* inDate parameter */
%let inDate = %sysget(inDate);
 
/* parent program's WORK library */
%let workpath = %sysget(workpath);
libname MAINWORK "&workpath";
 
/* thread number */
%let i = %sysget(i);
 
/* year and month */
%let yyyy = %substr(inDate,1,4);
%let mm = %substr(inDate,5,2);
 
/* source data directory */
%let srcdir = /datapath/&yyyy/&mm;
 
/* create varlist macro variable to list all input variable names */
proc sql noprint;
   select name into :varlist separated by ' ' from SASHELP.VCOLUMN
   where libname='PARMSDL' and memname='DATA_TEMPLATE';
quit;
 
/* create fileref inf for the source file */
filename inf "&srcdir/source_data_&inDate._day&i..cvs";
 
/* create daily output data set */
data MAINWORK.TARGET_&inDate._&i; 
   if 0 then set PARMSDL.DATA_TEMPLATE;
   infile inf missover dsd encoding='UTF-8' firstobs=2 obs=max;
   input &varlist;
run;

This program ingests a single .csv file corresponding to the &i-th day of &inDate (year and month) and creates a SAS data table MAINWORK.TARGET_&inDate._&i. To be available in the main SAS session the MAINWORK library is defined here in the same physical location as the WORK library of the main parental SAS session.

We also use a pre-created SAS data template PARMSDL.DATA_TEMPLATE - a zero-observations data set that contains descriptions of all the variables and their attributes.

Additional resources

Thoughts? Comments?

Do you find this post useful? Do you have processes that may benefit from parallelization? Please share with us below.

Using SYSTASK and SAS macro loops for massively parallel processing was published on SAS Users.

4月 142021
 

Improving programming jobs performance with massively parallel processingUntil recently, I used UNIX/Linux shell scripts in a very limited capacity, mostly as vehicle of submitting SAS batch jobs. All heavy lifting (conditional processing logic, looping, macro processing, etc.) was done in SAS and by SAS.  If there was a need for parallel processing and synchronization, it was also implemented in SAS. I even wrote a blog post  Running SAS programs in parallel using SAS/CONNECT®, which I proudly shared with my customers.

The post caught their attention and I was asked if I could implement the same approach to speed up processes that were taking too long to run.

However, it turned out that SAS/CONNECT was not licensed at their site and procuring the license wasn’t going to happen any time soon. Bummer!

Or boon? You should never be discouraged by obstacles. In fact, encountering an obstacle might be a stroke of luck. Just add a mixture of curiosity, creativity, and tenacity – and you get a recipe for new opportunity and success. That’s exactly what happened when I turned to exploring shell scripting as an alternative way of implementing parallel processing.

Running several batch jobs in parallel

UNIX/Linux OS allows running several scripts in parallel. Let’s say we have three SAS batch jobs controlled by their own scripts script1.sh, script2.sh, and script3.sh. We can run them concurrently (in parallel) by submitting these shell scripts one after another in background mode using & at the end. Just put them in a wrapper “parent” script allthree.sh and run it in background mode as:

$ nohup allthree.sh &

Here what is inside the allthree.sh: 

#!/bin/sh
script1.sh &
script2.sh &
script3.sh &
wait

With such an arrangement, allthree.sh “parent” script starts all three background tasks (and corresponding SAS programs) that will run by the server concurrently (as far as resources would allow.) Depending on the server capacity (mainly, the number of CPU’s) these jobs will run in parallel, or quasi parallel competing for the server shared resources with the Operating System taking charge for orchestrating their co-existence and load balancing.

The wait command at the end is responsible for the “parent” script’s synchronization. Since no process id or job id is specified with wait command, it will wait for all current “child” processes to complete. Once all three tasks completed, the parent script allthree.sh will continue past the wait command.

Get the UNIX/Linux server information

To evaluate server capabilities as it relates to the parallel processing, we would like to know the number of CPU’s.

To get this information we can ran the the lscpu command as it provides an overview of the CPU architectural characteristics such as number of CPU’s, number of CPU cores, vendor ID, model, model name, speed of each core, and lots more. Here is what I got:

Ha! 56 CPUs! This is not bad, not bad at all! I don’t even have to usurp the whole server after all. I can just grab about 50% of its capacity and be a nice guy leaving another 50% to all other users.

Problem: monthly data ingestion use case

Here is a simplified description of the problem I was facing.

Each month, shortly after the end of the previous month we needed to ingest a number of CSV files pertinent to transactions during the previous month and produce daily SAS data tables for each day of the previous month.  The existing process sequentially looped through all the CSV files, which (given the data volume) took about an hour to run.

This task was a perfect candidate for parallel processing since data ingestions of individual days were fully independent of each other.

Solution: massively parallel process

The solution is comprised of the two parts:

  • Single thread SAS program responsible for a single day data ingestion.
  • Shell script running multiple instances of this SAS program concurrently.

Single thread SAS process

The first thing I did was re-writing the SAS program from looping through all of the days to ingesting just a single day of a month-year. Here is a bare-bones version of the SAS program:

/* capture parameter &sysparm passed from OS command */ 
%let YYYYMMDD = &sysparm;
 
/* create varlist macro variable to list all input variable names */
proc sql noprint;
   select name into :varlist separated by ' ' from SASHELP.VCOLUMN
   where libname='PARMSDL' and memname='DATA_TEMPLATE';
quit;
 
/* create fileref inf for the source file */
filename inf "/cvspath/rawdata&YYYYMMDD..cvs";
 
/* create daily output data set */
data SASDL.DATA&YYYYMMDD; 
   if 0 then set PARMSDL.DATA_TEMPLATE;
   infile inf missover dsd encoding='UTF-8' firstobs=2 obs=max;
   input &varlist;
run;

This SAS program (let’s call it oneday.sas) can be run in batch using the following OS command:

sas oneday.sas -log oneday.log -sysparm 202103

Note, that we pass a parameter (e.g. 202103 means year 2021, month 03) defining the requested year and month YYYYMM as -sysparm value.

That value becomes available in the SAS program as a macro variable reference &sysparm.

We also use a pre-created data template PARMSDL.DATA_TEMPLATE - a zero-observations data set that contains descriptions of all the variables and their attributes (see Simplify data preparation using SAS data templates).

Shell script running the whole process in parallel

Below shell script month_parallel_driver.sh puts everything together. It spawns and runs concurrently as many daily processes as there are days in a specified month-of-year and synchronizes all single day processes (threads) at the end by waiting them all to complete. It logs all its treads and calculates (and prints) the total processing duration. As you can see, shell script as a programming language is a quite versatile and powerful. Here it is:

#!/bin/sh
 
# HOW TO RUN:
# cd /projpath/scripts
# nohup sh month_parallel_driver.sh &
 
# Project path
proj=/projpath
 
# Program file name
prgm=oneday
pgmname=$proj/programs/$prgm.sas
 
# Current date/time stamp
now=$(date +%Y.%m.%d_%H.%M.%S)
echo 'Start time:'$now
 
# Reset timer
SECONDS=0
 
# Get YYYYMM as the script parameter
par=$1
 
# Extract year and month from $par
y=${par:0:4}
m=${par:4:2}
 
# Get number of days in month $m of year $y
days=$(cal $m $y | awk 'NF {DAYS = $NF}; END {print DAYS}')
 
# Create log directory
logdir=$proj/saslogs/${prgm}_${y}${m}_${now}_logs
mkdir $logdir
 
# Loop through all days of month $m of year $y
for i in $(seq -f "%02g" 1 $days)
do
   # Assign log name for a single day thread
   logname=$logdir/${prgm}_${y}${m}_thread${i}_$now.log
 
   # Run single day thread
   /SASHome/SASFoundation/9.4/sas $pgmname -log $logname -sysparm $par$i &
done
 
# Wait until all threads are finished
wait
 
# Calculate and print duration
end=$(date +%Y.%m.%d_%H.%M.%S)
echo 'End time:'$end
hh=$(($SECONDS/3600))
mm=$(( $(($SECONDS - $hh * 3600)) / 60 ))
ss=$(($SECONDS - $hh * 3600 - $mm * 60))
printf " Total Duration: %02d:%02d:%02d\n" $hh $mm $ss
echo '------- End of job -------'

This script is self-described by detail comments and can be run as:

cd /projpath/scripts
nohup sh month_parallel_driver.sh &

Results

The results were as expected as they were stunning. The overall duration was cut roughly by a factor of 25, so now this whole task completes in about two minutes vs. one hour before. Actually, now it is even fun to watch how SAS logs and output data sets are being updated in real time.

What is more, this script-centric approach can be used for running not just SAS processes, but non-SAS, open source and/or hybrid processes as well. This makes it a powerful amplifier and integrator for heterogeneous software applications development.

SAS Consulting Services

The solution presented in this post is a stripped-down version of the original production quality solution. This better serves our educational objective of communicating the key concepts and coding techniques. If you believe your organization’s computational powers are underutilized and may benefit from a SAS Consulting Services engagement, please reach out to us through your SAS representative, and we will be happy to help.

Additional resources

Thoughts? Comments?

Do you find this post useful? Do you have processes that may benefit from parallelization? Please share with us below.

Using shell scripts for massively parallel processing was published on SAS Users.