SAS programmers

7月 222022
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS serverhow to execute CAS actions, and how to filter CAS tables. Now it's time to focus on creating calculated columns on a CAS table.

Load and explore data

First, I'll load the cars.csv file into memory using my CAS connection object conn, followed by the upload_file method from the SWATpackage. The upload_file method loads a client-side file into memory on the CAS server and returns a reference to the CAS table. I'll confirm this by using the display function to display the value of the tbl object. Remember, the data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data.

tbl = conn.upload_file('https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv', 
                       casout={'name':'cars', 
                               'caslib':'casuser'})
display(type)
# and the command results:
NOTE: Cloud Analytic Services made the uploaded file available as table CARS in caslib CASUSER(Peter).
NOTE: The table CARS has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.
CASTable('CARS', caslib='CASUSER(Peter)')

The results show the cars.csv file was successfully loaded to the CAS server and tbl is a CASTable object referencing the CAS table CARS.

Next, let's preview the CARS CAS table using the SWAT package head method on the tbl object.

tbl.head()

The CAS server returns 5 rows from the CAS table to the Python client as a DataFrame, as expected.

Create Calculated Columns

Now that we have created the table in the caslib and are familiar with the data, let's learn how to create calculated columns in CAS tables!

Python Technique

First, I'll begin with the Pandas bracket notation to create new columns. Since the SWAT package tries to blend the world of Pandas and SAS, this is a good starting point. Here, I'll create two columns. The first statement creates a column with upper case values from the Origin column, the second calculated column will find the percent increase of the Invoice price to the MSRP price.

tbl['NewCol1'] = tbl.Origin.str.upper()
tbl['NewCol2'] = ((tbl.MSRP - tbl.Invoice) / tbl.Invoice).mul(100).round()
tbl.head()

The results show the creation of two new columns, NewCol1 and NewCol2. Be aware, behind the scenes the SWAT package adds a variety of parameters to the CASTable object to create the columns. You can view the parameters by displaying the tbl object. These parameters are stored on the CASTable object and are applied to the table when an action or method is executed on the object.

display(tbl)
# and the command results:
{'name': 'CARS',
 'caslib': 'CASUSER(Peter)',
 'computedvars': ['NewCol1', 'NewCol2'],
 'computedvarsprogram': 'length _upper_0_ varchar(*); _upper_0_ = upcase(Origin); NewCol1 = ....

eval Method

You can also use the eval method from the SWAT package to create calculated columns. This works similarly to the Pandas eval method.

Below I'll create the same two columns, but before I do, I'll reset the CASTable object to reference the CARS CAS table, removing the parameters that were stored on the object from above. Then I'll use the eval method to create the columns. The eval method requires SAS assignment statements and functions.

tbl = conn.CASTable('cars', caslib='casuser')
 
tbl.eval('NewCol1 = upcase(Origin)')
tbl.eval('NewCol2 = round(((MSRP - Invoice) / Invoice) * 100)')
tbl.head()


The results show the two new columns NewCol1 and NewCol2 were created successfully.

CAS Table COMPUTEDVARSPROGRAM Parameter

Lastly, I'll use the computedVarsProgram parameter to create calculated columns. This method is specific to SAS Viya.

I'll begin by storing the SAS expressions as string variables. Notice the required semicolon at the end of each expression. Then, I'll add the computedVarsProgram parameter to the CASTable object tbl, and set the value equal to the two string expressions.

tbl = conn.CASTable('cars', caslib='casuser')
createNewCol1 = 'NewCol1 = upcase(Origin);'
createNewCol2 = 'NewCol2 = round(((MSRP - Invoice) / Invoice) * 100);'
tbl.computedVarsProgram = createNewCol1 + createNewCol2
tbl.head()


The results show the two new columns NewCol1 and NewCol2 were created successfully.

Summary

SAS Viya offers various ways to create new columns in CAS tables. The SWAT package blends the world of Pandas and SAS, making available many familiar Pandas API techniques. You also have specific CAS techniques handy when working with the CASTable object directly. Choosing the correct method may depend on your greater code structure, but you now have the right tools to make that decision.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 8 - Creating Calculated Columns was published on SAS Users.

7月 222022
 

Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS serverhow to execute CAS actions, and how to filter CAS tables. Now it's time to focus on creating calculated columns on a CAS table.

Load and explore data

First, I'll load the cars.csv file into memory using my CAS connection object conn, followed by the upload_file method from the SWATpackage. The upload_file method loads a client-side file into memory on the CAS server and returns a reference to the CAS table. I'll confirm this by using the display function to display the value of the tbl object. Remember, the data in this example is small for training purposes. Processing data in the CAS server's massively parallel processing environment is typically reserved for larger data.

tbl = conn.upload_file('https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv', 
                       casout={'name':'cars', 
                               'caslib':'casuser'})
display(type)
# and the command results:
NOTE: Cloud Analytic Services made the uploaded file available as table CARS in caslib CASUSER(Peter).
NOTE: The table CARS has been created in caslib CASUSER(Peter) from binary data uploaded to Cloud Analytic Services.
CASTable('CARS', caslib='CASUSER(Peter)')

The results show the cars.csv file was successfully loaded to the CAS server and tbl is a CASTable object referencing the CAS table CARS.

Next, let's preview the CARS CAS table using the SWAT package head method on the tbl object.

tbl.head()

The CAS server returns 5 rows from the CAS table to the Python client as a DataFrame, as expected.

Create Calculated Columns

Now that we have created the table in the caslib and are familiar with the data, let's learn how to create calculated columns in CAS tables!

Python Technique

First, I'll begin with the Pandas bracket notation to create new columns. Since the SWAT package tries to blend the world of Pandas and SAS, this is a good starting point. Here, I'll create two columns. The first statement creates a column with upper case values from the Origin column, the second calculated column will find the percent increase of the Invoice price to the MSRP price.

tbl['NewCol1'] = tbl.Origin.str.upper()
tbl['NewCol2'] = ((tbl.MSRP - tbl.Invoice) / tbl.Invoice).mul(100).round()
tbl.head()

The results show the creation of two new columns, NewCol1 and NewCol2. Be aware, behind the scenes the SWAT package adds a variety of parameters to the CASTable object to create the columns. You can view the parameters by displaying the tbl object. These parameters are stored on the CASTable object and are applied to the table when an action or method is executed on the object.

display(tbl)
# and the command results:
{'name': 'CARS',
 'caslib': 'CASUSER(Peter)',
 'computedvars': ['NewCol1', 'NewCol2'],
 'computedvarsprogram': 'length _upper_0_ varchar(*); _upper_0_ = upcase(Origin); NewCol1 = ....

eval Method

You can also use the eval method from the SWAT package to create calculated columns. This works similarly to the Pandas eval method.

Below I'll create the same two columns, but before I do, I'll reset the CASTable object to reference the CARS CAS table, removing the parameters that were stored on the object from above. Then I'll use the eval method to create the columns. The eval method requires SAS assignment statements and functions.

tbl = conn.CASTable('cars', caslib='casuser')
 
tbl.eval('NewCol1 = upcase(Origin)')
tbl.eval('NewCol2 = round(((MSRP - Invoice) / Invoice) * 100)')
tbl.head()


The results show the two new columns NewCol1 and NewCol2 were created successfully.

CAS Table COMPUTEDVARSPROGRAM Parameter

Lastly, I'll use the computedVarsProgram parameter to create calculated columns. This method is specific to SAS Viya.

I'll begin by storing the SAS expressions as string variables. Notice the required semicolon at the end of each expression. Then, I'll add the computedVarsProgram parameter to the CASTable object tbl, and set the value equal to the two string expressions.

tbl = conn.CASTable('cars', caslib='casuser')
createNewCol1 = 'NewCol1 = upcase(Origin);'
createNewCol2 = 'NewCol2 = round(((MSRP - Invoice) / Invoice) * 100);'
tbl.computedVarsProgram = createNewCol1 + createNewCol2
tbl.head()


The results show the two new columns NewCol1 and NewCol2 were created successfully.

Summary

SAS Viya offers various ways to create new columns in CAS tables. The SWAT package blends the world of Pandas and SAS, making available many familiar Pandas API techniques. You also have specific CAS techniques handy when working with the CASTable object directly. Choosing the correct method may depend on your greater code structure, but you now have the right tools to make that decision.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Part 8 - Creating Calculated Columns was published on SAS Users.

2月 162022
 

Editor’s note: SASjs, while free and useful for anyone, is a gateway to services from 4GL Apps, a company founded and operated by longtime SAS user Allan Bowe.Open code sharing


I would like to start this post with a well-known quote by George Bernard Shaw: “If you have an apple and I have an apple and we exchange these apples then you and I will still each have one apple. But if you have an idea and I have an idea and we exchange these ideas, then each of us will have two ideas.”

SAS Software provides vast functionality delivered with it, but perhaps the biggest asset of the SAS ecosystem is its ever-growing community of SAS users around the globe readily willing to share their SAS code and solutions to various problems. Take for example, SAS Communities, SAS blog or SAS Software on GitHub.

With open-source code taking the world by storm, SAS maintains and nurtures a viable mix of its rock-solid backbone – commercial SAS Software backed by its unparalleled Technical Support and open-source SAS solutions developed and shared by SAS users.

An introduction to SASjs

Here, I am going to introduce you to one more publicly available open source treasure called SASjs (also available on GitHub), which stands out from many others because of its thought-provoking, multifaceted approach.

In a nutshell, the SASjs is a collection of tools aimed at enabling and accelerating development and operations (DevOps) for a broad range of SAS applications, including SAS web development.

SASjs consists of the following three major components:

  1. SAS macro library (MacroCore)
  2. JavaScript library (Adapter)
  3. Command line interface (CLI)

SASjs MacroCore library

The MacroCore library is a centerpiece of the SASjs framework. It includes several hundred SAS macros facilitating various aspects of SAS coding, application development and deployments. All these macros are production quality and free to use (MIT-licensed). The contents of the MacroCore library organized by the following sections:

  • BASE library (SAS9/Viya) - prefixes: mf (macro functions), mp (macro procedures - macros).
  • FCMP library (SAS9/Viya) - prefix: mcf; used to generate user-defined functions using PROC FCMP.
  • META library (SAS9 only) - prefix: mm; used in SAS EBI environments that connect to the metadata server.
  • SERVER library (@sasjs/server only) - prefix: ms; used for building applications using @sasjs/server - an open source REST API for Desktop SAS.
  • VIYA library (Viya only) - prefixes: mv, mvf; used for interfacing with SAS Viya.
  • METAX library (SAS9 only) - prefixes: mmw, mmu, mmx.
  • DDL Library - prefix: mddl; provides the structure for any permanent table used by the macros.
  • LUA library - prefix: ml

All these macros (except METAX) are Operating System independent and operate under NOXCMD option. That option prevents the SAS program from executing OS commands and is often implemented by SAS Administrators for security reasons.

DOCUMENTATION PAGE – here you can find detailed descriptions for all of the SASjs MacroCore macros.

DOWNLOAD PAGE – here you can grab all these macros FREE and deploy (save) anywhere in your SAS environment.

Alternatively, you can make all these macros available in your SAS program by %INCLUDE-ing them directly from the Web – just add this code snippet to the beginning of your SAS program:

filename mcore url "https://raw.githubusercontent.com/sasjs/core/main/all.sas";
%inc mcore;

To give you a taste of what these macros are here is a small sample of them with examples of usage and brief descriptions:

  • %mp_copyfolder(&rootdir,&copydir) - copies the whole directory structure including files and subdirectories from one location (&rootdir) to another (&copydir). Uses only internal SAS functions without OS commands (NOXCMD option).
  • %mp_dirlist(path=/some/location, outds=myTable, maxdepth=MAX) - returns SAS table myTable listing all files and sub-directories up to specified depth. It also uses only internal SAS functions without OS commands (NOXCMD option).
  • %mp_csv2ds(inref=mycsv,outds=myds,baseds=template_ds) - imports relevant columns from mycsv CSV file to myds data table using a template dataset template_ds to provide the types and lengths.
  • %mf_getvarcount(my_dataset) - this macro function returns the number of variables in a dataset.
  • %mf_getvarlist(sashelp.class,dlm=%str(,),quote=double) - this macro function returns the dataset variable list directly from its header. The list can be separated by blanks, commas or any other delimiter (dlm=), and be unquoted or quoted (quote=).

SASjs Javascript library (Adapter)

In my earlier posts about integrating SAS with Google Maps, I employed and implemented the idea of using SAS for generating data-driven HTML and Javascript code for creating web pages.

SASjs, on the other hand, turns on its head this SAS-to-JavaScript paradigm. In a total reverse, it uses JavaScript to generate SAS code for running on SAS servers as jobs or stored processes.

Previously, I naively believed that JavaScript is a web browser language that exists and used only within web browsers to enhance interactions with HTML web pages. SASjs proved me wrong. It turns out JavaScript is a general-purpose language that can be used everywhere outside of the web browser.

SASjs extensively employs this JavaScript functionality. Its second component, SASjs Adapter, is a server-side JavaScript library (and a set of SAS macros) to handle the communication between the frontend HTML5 or client application and the SAS 9 or Viya backend SAS services.

SASjs uses the following two client-side components:

  • Node.js JavaScript runtime environment.
  • Visual Studio Code (VS Code) source code editor.

Node.js is an asynchronous event-driven JavaScript runtime designed to build scalable network applications.

Visual Studio Code (VS Code) is a lightweight but powerful source code editor, which runs on your desktop and is available for Windows, macOS and Linux. It comes with built-in support for JavaScript, TypeScript and Node.js and has a rich collection of extensions for other languages (such as C++, C#, Java, Python, PHP, Go, etc.) and runtimes (such as .NET and Unity). In essence, it is more than just a source code editor, it is an Integrated Development Environment (IDE) allowing you to perform programming along with development as well as debugging and testing the application.

SASjs command line interface (CLI)

The SASjs CLI is the third major component of SASjs that ties together and facilitates creating, building, deploying and executing various SAS web applications (for SAS 9 and SAS Viya).

It is a powerful and flexible tool covering a wide range of features and functionalities to support successful applications development and operations (DevOps). CLI’s primary actions include:

  • Create a template SASjs Git project repository.
  • Compile dependent macros, macro variables and pre/post code.
  • Create/build the master SAS deployment, including Jobs, Services and Tests.
  • Deploy/execute local scripts and remote SAS programs to create your app on the SAS Server.
  • Configure and generate Doxygen documentation for all SAS content.

There is also a feature to let you deploy your frontend as a service, bypassing the need to access the SAS Web Server.

SASjs support

Even though SASjs is an open source platform, it provides different levels of support for its products.

Report defects. Through this GitHub web page SAS users can raise issues with SASjs such as report defects, suggest improvements, etc.

Engage with SASjs developers. SAS users can also communicate with SASjs developers, receive free clarifications and consultations on various aspects of SASjs usage, or even hire them for involvement that is more extensive.

Questions? Thoughts? Comments?

Do you find this post useful? Have you learned anything new? Do you have any other SAS open-source assets you can share? Do you have questions, concerns, comments? Please share with us below.

SAS open-source treasures from around the world: SASjs was published on SAS Users.

2月 152022
 

The Missionaries and Cannibals Problem (MCP) is a classic river-crossing logic puzzle that derives from the famous Jealous Husbands problem. The earliest version of the MCP problem was described by Pocock in 1891. In the article “The jealous husbands and the missionaries and cannibals” issued by Ian Pressman and David Singmaster on The Mathematical Gazette73 ( JSTOR 3619658), the following theorem was stated as the 4th theorem without proof for this river crossing problem:

THEOREM 4. The minimal number of crossings to ferry n >= 3 missionaries and n cannibals across a river with an island, using a two-person boat and bank-to- bank crossings, is 4n - 1.

The Missionaries and Cannibals Problem is usually defined as follows:

On one bank of a river are 3 missionaries and 3 cannibals. There is 1 boat available that can carry at most 2 people and that they would like to use to cross the river. If the cannibals ever outnumber the missionaries on either of the river’s banks or on the boat, the missionaries will get eaten. How can the boat be used to carry all the missionaries and cannibals across the river safely? The boat cannot cross the river by itself with no people on board and there is no island in the middle of the river.

Solving the problem

To build a system to solve this problem, we can define how to represent the state of the system and how the states will change from the actions applied. We also need to define the initial state and the final state, so the problem solving is abstracted as finding a path from the initial state to the final state. Through this method, we can solve the problem with the help of computer graph theory knowledge to find a connected one-way graph path. The detailed algorithm can be described as such:

  1. The primary argument for the system is the number of Missionaries (M), the number of Cannibals (C) and the capacity of the boat (B). For the upper problem, the M=3, C=3 and B=2.
  2. The system state can be uniquely defined by the state of missionaries, cannibals and the boat on the left bank. In other words, [m=3, c=3, b=1] indicates there are 3 missionaries, 3 cannibals and a one-person boat on the left bank. For the state of the other bank, it’s uniquely determined by the left bank after crossing. Both banks need to always abide by the game's rules for all [m, c, b], m>=0, c>=0, m>=c if m>0.
  3. The number of valid crossing actions depends on the capacity of the boat and the state of the departure ferry. If the capacity of a boat is 2, the possible states of the boat need to meet all of the following conditions of the rules defined in #2:
    • (p+q)<=B : a boat can carry at most B people.
    • (p=0 OR (p>0 AND p>=q)): cannibals can’t outnumber the missionaries on the boat if there is any missionary.
    • NOT (p=0 AND q=0): the boat cannot cross the river by itself with no people.

So, we get following actions table for river crossing:

(p, q) 0 1 2
0 (0, 0) (0, 1) (0, 2)
1 (1,0) (1,1) (1,2)
2 (2,0) (2,1) (2,2)

 

  1. Starting from the initial system state, we can use Breadth First Search (BFS) algorithm to drive state space transition. The starting point is the initial state, while the end point is the target state. To avoid reentrant visits, we need to use a dictionary to record the nodes that have been visited. We also need a queue to pool the newfound system state which has not been visited yet. So, we can apply the actions defined in #3 until the state space is traversed.
  2. The output of #4 is the path segment for a final solution. Whenever we find a solution, we need to dump out the full path. And all these paths form a Directed Acyclic Graph (DAG). For those endpoints that are not the final goal state, we need to remove them to build a single clean graph with the final state as the endpoint.

Visualizing the solution

The following result is generated by the upper algorithm implementation by SAS for the MCP problem (M=3, C=3 and B=2). It’s a directed acyclic graph that can represent all possible solutions on one page. The NETDRAW procedure in SAS was designed to draw a network diagram of the activities in a project, but we use it to visualize nodes and relationships for a directed acyclic graph (DAG) here (click here to download the precompiled code).We also can generate the step description for a solution (top-most path) in that directed acyclic graph. For example, the first intuitive solution for (M=3, C=3, B=2) is listed below. It is one of the 4 possible solutions revealed by the upper directed acyclic graph. There are three other variants for (M=3, C=3 and B=2) besides the following solution.

Generally, if the boat's capacity is 2 (B=2) and the number of missionaries and cannibals is equal (C=M), we draw the following conclusions:

  • When M = 1, there is one and only one solution, that is, N(M=1, C=1, B=2) =1.
  • When M = 2, there are 4 different solutions, that is, N(M=2, C=2, B=2) =4. Each solution needs 5 trips.
  • When M = 3, there are 4 different solutions, that is, N(M=3, C=3, B=2) =4. Each solution needs 11 tips. In fact, this is the only case meet the 4n-1 statement of THEOREM #4.
  • When M>=4, there is no solution, that is, N(M>=4, C=M, B=2)=0.

When the number of cannibals is less than that of missionaries, such as 1 less, that is (C=M-1), all values of M have solutions. This is because fewer cannibals weaken the constraints, so there will be more solutions.

  • When M = 1, there is one and only one solution, that is, N(M=1, C=0, B=2)=1.
  • When M = 2, there are 3 different solutions, that is, N(M=2, C=1, B=2)=3. Each solution needs 3 trips.
  • When M = 3, there are 9 different solutions, that is, N(M=3, C=2, B=2) =9. Each solution needs 7 trips.
  • When M = 4, there are 25 different solutions, that is, N(M=4, C=3, B=2)=25. Each solution needs 11 trips.

If the number of missionaries and cannibals is equal (C=M) when the boat's capacity is 3 (B=3), then:

  • When M = 1, there is one and only one solution, that is, N(M=1, C=1, B=3) = 1.
  • When M = 2, there are 5 different solutions, that is, N(M=2, C=2, B=3) = 5. Each solution needs 3 trips.
  • When M = 3, there are 6 different solutions, that is, N(M=3, C=3, B=3) = 6. Each solution needs 5 trips.
  • When M = 4, there are 32 different solutions, that is, N(M=4, C=4, B=3) = 32. Each solution needs 9 trips.
  • When M = 5, there are 25 different solutions, that is, N(M=5, C=5, B=3) = 25. Each solution needs 11 trips.
  • When M>=6, there is no solution, that is, N(M>=6, C=M, B=3) = 0.

For solving an upper missionaries and cannibals Problem (M=5, C=5, B=3), the step description of a solution also can be generated by SAS as below:

In the same way, when the number of cannibals is less than that of the missionaries, such as 1 less (C=M-1), then all values of M can be solved because fewer cannibals weaken the restriction conditions.

When the capacity of boat B is greater than or equal to 4, there are solutions for all values of M if the number of missionaries and cannibals are equal (C=M). E.g., here is a list of all solutions for MCP(M=5, C=5, B=4) and the step description of a solution below:

Furthermore, the following table lists the statistics of all possible MCP solutions when M<=16, C=M, B=1 to 6. The SolutionsNum column indicates the number of solutions while MinTrips and MaxTrips indicate the minimum and maximum trips needed, respectively. The DistinctTripsLength column indicates whether the number of trips is variable; the distinct trips length is either 1 or 2.

The maximum number of trips across the river is not monotonically increasing, they show the following correlation. And when other conditions are the same, B=4 requires the greatest number of trips if M>=6.

 

The first case with a variable number of trips is MCP(M=2, C=2, B=4), it has 6 solutions with 1 or 3 trips; the second case with a variable number of trips is MCP(M=3, C=3, B=4), it has 25 solutions with 3 or 5 trips, you can see more details below:

Summary and more resources

We have described how to solve the classic Missionaries and Cannibals problem with SAS, visualized solutions with SAS NETDRAW procedure and generated steps description automatically for any MCP problem. The statistics of all possible MCP solutions when M<=16 proved that MCP(M=3, C=3, B=2) is the only case that conforms to Theorem 4. This is just one example of how powerful SAS can be for problem-solving and data visualization. For more examples, please check out some of my other articles:

Solve the Missionaries and Cannibals Problem with SAS was published on SAS Users.

2月 112022
 

Ever had to work with data that had crazy column names, custom formats you didn’t have access to, or text columns sized way too big for the actual data they contained? Annoying, isn’t it? Well, you’re not alone and, as it turns out, it’s not that hard to fix. Let’s take a peek at renaming, reformatting, relabeling and resizing columns in your data using PROC SQL and PROC DATASETS. PROC SQL could handle most of these tasks, but you can’t rename a column in SQL without creating a new copy of the table. PROC DATASETS can do it all, but the syntax can be trickier. When we resize a column, we're going to have to re-write the data anyway, so I'll use PROC SQL for resizing, and PROC DATASETS for everything else. For example, we have this existing table:

Year Make Model MSRP Invoice Engine Size (L)
2004 Acura MDX $36,945 $33,337 3.5
2004 Audi A4 1.8T 4dr $25,940 $23,508 1.8
2004 Buick Rainier $37,895 $34,357 4.2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP num 8 DOLLAR8.
Invoice num 8 DOLLAR8.
EngineSize Engine Size (L) num 8

 
I want to modify the table to use European formats for the numeric values, to rename the EngineSize column as Size leaving its label unchanged, and to add descriptive labels for MSRP and Invoice.
 
The general syntax for this in PROC DATASETS is:
 
PROC DATASETS LIBRARY=;
MODIFY sas-dataset <(options)>;
FORMAT variable-1 <format-1> <variable-2 <format-2> ...>;
LABEL variable-1=<'label-1' | ' '> <variable-2=<'label-2' | ' '> ...>;
RENAME variable-1=new-name-1 <variable-2=new-name-2 ...>;

After running this code:

proc datasets library=work nolist nodetails;
   modify cars;
      format MSRP eurox8. Invoice eurox8. EngineSize commax5.1;
      label MSRP="Sticker Price" Invoice="Wholesale Price" ;
      rename EngineSize=Size;
run; quit;

 
the data now looks like this:
 

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
In the report, the longest value in the Make column is 5 characters long but the table metadata shows a column width of 13 characters. I can shorten the Make column to 5 characters without truncating the values, making my table size smaller on disk and in memory.

PROC SQL with an ALTER TABLE statement and a MODIFY clause will change a table’s descriptor portion (metadata) for one or more columns. Changing the length of a column will automatically rewrite the data portion of the table, too. The syntax looks like this:

ALTER TABLE table-name MODIFY column-definition-1 <, column-definition-2, ..., column-definition-n>;

After running this code:

proc sql;
alter table work.cars
   modify Make char(5)
;
quit;

The data is still correct:

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 
and the column size is now smaller:

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 5
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
Now, unless you want to write this type of code by hand every time you need it, you’ll want macros to do the work for you. And macros we’ve got! Check out my SAS tutorial on YouTube titled “Resize, Rename and Reformat Data with SAS Macro” for more details on how this works. The video is also embedded below. Links below the embedded video lead to the data and programs used in the video and to my personal macro trove on GitHub, including a macro that quickly strips labels, formats and informats from a data set.

Until next time, may the SAS be with you!
 
Mark

 

DOWNLOAD NOW | VIDEO DATA PACKAGE ON GITHUB
GET MACROS NOW | MY MACRO STASH ON GITHUB

Jedi SAS tricks: resizing, renaming and reformatting your data was published on SAS Users.

2月 112022
 

Ever had to work with data that had crazy column names, custom formats you didn’t have access to, or text columns sized way too big for the actual data they contained? Annoying, isn’t it? Well, you’re not alone and, as it turns out, it’s not that hard to fix. Let’s take a peek at renaming, reformatting, relabeling and resizing columns in your data using PROC SQL and PROC DATASETS. PROC SQL could handle most of these tasks, but you can’t rename a column in SQL without creating a new copy of the table. PROC DATASETS can do it all, but the syntax can be trickier. When we resize a column, we're going to have to re-write the data anyway, so I'll use PROC SQL for resizing, and PROC DATASETS for everything else. For example, we have this existing table:

Year Make Model MSRP Invoice Engine Size (L)
2004 Acura MDX $36,945 $33,337 3.5
2004 Audi A4 1.8T 4dr $25,940 $23,508 1.8
2004 Buick Rainier $37,895 $34,357 4.2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP num 8 DOLLAR8.
Invoice num 8 DOLLAR8.
EngineSize Engine Size (L) num 8

 
I want to modify the table to use European formats for the numeric values, to rename the EngineSize column as Size leaving its label unchanged, and to add descriptive labels for MSRP and Invoice.
 
The general syntax for this in PROC DATASETS is:
 
PROC DATASETS LIBRARY=;
MODIFY sas-dataset <(options)>;
FORMAT variable-1 <format-1> <variable-2 <format-2> ...>;
LABEL variable-1=<'label-1' | ' '> <variable-2=<'label-2' | ' '> ...>;
RENAME variable-1=new-name-1 <variable-2=new-name-2 ...>;

After running this code:

proc datasets library=work nolist nodetails;
   modify cars;
      format MSRP eurox8. Invoice eurox8. EngineSize commax5.1;
      label MSRP="Sticker Price" Invoice="Wholesale Price" ;
      rename EngineSize=Size;
run; quit;

 
the data now looks like this:
 

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 

with this descriptor portion:

 

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 13
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
In the report, the longest value in the Make column is 5 characters long but the table metadata shows a column width of 13 characters. I can shorten the Make column to 5 characters without truncating the values, making my table size smaller on disk and in memory.

PROC SQL with an ALTER TABLE statement and a MODIFY clause will change a table’s descriptor portion (metadata) for one or more columns. Changing the length of a column will automatically rewrite the data portion of the table, too. The syntax looks like this:

ALTER TABLE table-name MODIFY column-definition-1 <, column-definition-2, ..., column-definition-n>;

After running this code:

proc sql;
alter table work.cars
   modify Make char(5)
;
quit;

The data is still correct:

Year Make Model Sticker Price Wholesale Price Engine Size
(L)
2004 Acura MDX €36.945 €33.337 3,5
2004 Audi A4 1.8T 4dr €25.940 €23.508 1,8
2004 Buick Rainier €37.895 €34.357 4,2

 
and the column size is now smaller:

Column Name Column Label Column Type Column Length Column Format
Year num 8
Make char 5
Model char 40
MSRP Sticker Price num 8 EUROX8.
Invoice Wholesale Price num 8 EUROX8.
Size Engine Size (L) num 8 COMMAX5.1

 
Now, unless you want to write this type of code by hand every time you need it, you’ll want macros to do the work for you. And macros we’ve got! Check out my SAS tutorial on YouTube titled “Resize, Rename and Reformat Data with SAS Macro” for more details on how this works. The video is also embedded below. Links below the embedded video lead to the data and programs used in the video and to my personal macro trove on GitHub, including a macro that quickly strips labels, formats and informats from a data set.

Until next time, may the SAS be with you!
 
Mark

 

DOWNLOAD NOW | VIDEO DATA PACKAGE ON GITHUB
GET MACROS NOW | MY MACRO STASH ON GITHUB

Jedi SAS tricks: resizing, renaming and reformatting your data was published on SAS Users.

1月 282022
 

In May of 1999, Ian Stewart issued the article “A Puzzle for Pirates” in that month's edition of Scientific American (pp.98-99). This mathematical game then went on to become a worldwide hit, especially for puzzlers who like a good challenge. The rules of the pirate game are described below:

Five rational pirates have found 100 gold coins on an island and need to decide how to distribute them. They are a democratic bunch and they ask the fiercest pirate to propose a plan of distribution. Then, all of the living pirates (including the proposer) vote on whether to accept the distribution plan. If the majority (50% or more) accepts that proposal, then coins are dispersed and the game ends. Otherwise, the proposer is thrown overboard and killed, then the procedure is repeated with the next fiercest pirate until a proposal is accepted or only one pirate is left. If there is a tie vote, then the proposer (say, the fiercest pirate) has the casting vote.

At first glance, the proposer may offer the most favorable terms to all voters to pass the proposal, but this is not the case. The pirates are not only rational but also greedy and want to maximize their personal gain. Due to the game's rules and how the order of pirate fierceness is known to all in advance, these rational pirates prioritize surviving first, then maximizing gold coins for themselves. They would love to kill off as many pirates as possible to make more coins for themselves through future distribution proposals. This logic may become surprisingly complicated due to the changes in the numbers of pirates and coins.

Suppose N is the number of pirates, and G is the number of gold coins. We number the pirates in order from meekest to fiercest, the greater the number, the fiercer the pirate. So P1 is the least fierce pirate, and PN is the fiercest pirate. Due to there being too many branches from the beginning, we can think of the problem backward from the end, that is, starting from the game results to reverse the entire distribution process.

If there are only two pirates (N=2), then obviously the fierce pirate is going to propose G coins for himself, and none for the meek pirate (P1=0, P2=G). Any proposal will pass no matter what it is for this case, so the fierce pirate would take all the coins while the meek pirate has no choice but to accept it. The fierce pirate doesn’t need to please meek pirate at all!

If there are three pirates (N=3), and all of them are aware of an inevitable positive result, then the fiercest pirate P3 is going to make a proposal pass to guarantee his own survival first, so he needs at least one vote from P1 or P2. Due to P2 having the possibility of getting G coins, so he can’t be bribed with gold coins; P1 knows he is going to get 0 coins if the game goes onto the next round, and P3 knows what P1 is thinking, so P3 can offer the least amount of coins, e.g., just 1 coin to buy the vote of P1, and P1 also would love to support P3 accordingly as he can get a better result: survival plus coins. So, the final proposal by P3 is P1=1, P2=0, P3=G-1, where G>1.

From the two trial calculations above, we can conclude the following decision-making patterns or rules for a player of the pirate game. The rules apply to all pirates and it is the rationality of a pirate, that is the pirate game recursive solving algorithm (click here to download the precompiled code).

  1. For any N, G, both N and G must be greater than zero for game, to avoid the possibility of unlimited bribing, we usually define them as both integer and 2*G < N, but this is not required.
  2. If the number of pirates N is 2, then proposal is P1=0, P2=G.
  3. Otherwise, count the proposer’s vote out and let the rest of the pirates (N-1) to distribute G gold coins. For its trial result proposal, count the number of votes supported:
    • The fiercest pirate whose proposal is impossible to pass in the next round. Because this pirate is going to die in the next round, he is the inevitable supporter of the current proposal to avoid making a proposal and being killed in the next round.
    • The pirates who are going to receive 0 coins in the next round. Try to buy this kind of pirate as a supporter with the least coins (e.g., 1 coin) from total G coins. For those pirates who received coins in the trial calculation, take their coins away!
  4. Check if the current proposal is going to pass the vote and return that flag.
    • If the number of votes supported plus 1 vote of the current proposer is equal to or greater than the N/2, then the proposal is passed. So the current proposer takes all the remaining gold coins after buying supporters in 3.b and make all those pirates going to die in the next round alive because they avoid the further distribution process.
    • If the current proposal cannot be passed, then throw the current proposer overboard, marking PN as a must-die pirate.

Please note step #3, we're calling the pirate game-solving algorithm recursively, so it would reach the game-over condition. The following proposal plan table is generated by upper algorithm implementation by SAS for pirate games with N<=20 and G=100. You can see the best solution is, the fiercest pirate chooses the pirates with the same parity as him in the list ordered by fierce as his supporter. E.g., P3 would choose P1 as his supporter for a N=3 pirate game, and P5 would choose P3, P1 as his supporter for a N=5 game.

Will this pattern continue for any N pirate games with constant G gold coins? The answer is no because we have limited gold coins (G) to buy supporters. Anyway, it is true for N<= 2*G game, e.g., The P200 can take 1 coin and buy 99 supporters with another 99 coins, the detailed proposal plan for N=200 pirate game is show as below:

Interesting things happen while the number of pirates N exceed the 2*G. e.g., P201 and P202 pirates still can survive if he is taking no gold coins for himself and offering 1 gold coin each to 100 pirates with the same parity in the list ordered by fierceness. The detailed proposal for N=201, N=202 cases is show as below:

Unfortunately, P203 has no coins to buy supporters anymore in the pirate game (G=100, N=203), so P201 and P202 in this round would love to throw P203 overboard to earn more coins through future distribution proposals. In other words, P203 must die for this case (We use -1 to indicate death in detailed proposal).

P204 is a lucky guy who can survive. Since P203 must die in the next round, P203 is going to support any proposal made by P204 in the current round. P204 will apply the same strategy as P202 used in the N=202 pirate game above, so P204 takes no coins, and he gets support from P203 in this current round. So, P201, P202, P203 and P204 all survive with no coins. The detailed proposal is shown as below:

Accordingly, P205, P206 and P207 are all unlucky and doomed to die in their round, just like P203 in pirate game (G=100, N=203). The key reason is that they cannot get sufficient votes to make their proposals pass. The four rational pirates (P201 to P204) know they can survive and would love to throw these three pirates (P205 to P207) overboard to earn more coins through future distribution proposals. The detailed proposal for the game (G=100, N=207) is shown below:

Just as lucky as P204 in pirate game (G=100, N=204), P208 is also lucky enough to make a proposal pass because he can get sufficient support from P205, P206 and P207 who are doomed to die. P205, P206 and P207 must support P208 to avoid making a proposal and getting killed in the next round. The detailed proposal for the game (G=100, N=208) is shown below:

In general, suppose the number of the pirate I-th (I<= N) is PI in a pirate game (N, G), we have the following conclusion based on the recursive solving algorithm result:

  1. When I<=2*G, all pirates will survive, and they also have 50% possibility to get 1 or more gold coins.
  2. When I >2*G and I=N, the pirate PI (=PN, proposer) is going to survive only when (N - 2*G is the power of 2, otherwise they will die.
  3. When I>2*G and I<N, all pirates with number I <= (2*G + M) will survive but will receive no coins, all pirates whose number I>(2*G+M) will die, where M is the highest power of 2 that does not exceed N - 2*G. Please note that the survival for the PI is a possibility related to N value, those PI will survive again If N exceed 2*G + 2*M.

e.g., G=100, N=(205, 206, 207), N-2*G=(5, 6, 7), the M=4 (2^2<=4), so 2*G+M=204, so P201, P202, P203 and P204 will survive in pirate games N=(205, 206, 207). P205 must die in pirate game N=(205, 206, 207), and P206 must die in pirate game N=(206, 207), and P207 must die in pirate game N=207. But P205, P206, P207 will survive again if N is equal to or greater than 208.

In fact, there is no unique solution to distribute gold coins if a pirate already guarantees his survival. This means that the fiercest pirate has some freedom of selection when I>=2G+2, but the simplest solution is to select those pirates with the same parity by order of fierceness.

From the proposer’s perspective (I=N), if he attends a pirate game with different N, then his personal interest is different. For a game with a number of pirates that is less than or equal to 2*G, the proposer will always survive with coins. For games with greater than 2*G pirates, most proposers can’t survive except whose number minus 2*G happens to be an integer power of 2. We can visualize the solution for pirate game(N=500, G=100) with SAS as seen below.

Accordingly, we also can calculate the possibility of survival of a pirate game with N pirates (N<=500) and constant G=100. The survival curve below indicates how likely a pirate (proposer) is to survive if he participates in a game with N pirates. As expected, the curve is a non-linear line.

For the least fierce pirate, he has no chance to distribute gold coins, but he has most chances to vote on proposals. In other words, he can survive longer than other pirates. For the fiercest pirate, the survival rate appears as a cliff-like fall at some points for the pirate whose number PI is greater than 2*G. The survival curve below indicates how likely a pirate PI is to survive if he participates in a game with N pirates (N<=500).

e.g., For a game with 500 pirates, the first 44 pirates are doomed to be thrown overboard (P457 to P500), so the blue survival line falls to 0 at P456 (2*G+2^Y=2*100+2^8=456), accordingly the death rate(red) is suddenly increased to 100%.

Summary

We have figured out how to solve the pirate game with a recursive solving algorithm in SAS, and how to analyze and visualize the law behind the complex logic of the pirate game. Now the next time you join a pirate game, you will know your destiny ahead of time and reap the benefits while avoiding getting thrown overboard.

Solve "A Puzzle for Pirates" with SAS was published on SAS Users.

1月 072022
 

Welcome to the sixth installment in my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS serverhow to execute CAS actions, and how to work with the results. Now it's time to generate simple descriptive statistics of a CAS table.

Let's begin by confirming the cars table is loaded into memory. With a connection to CAS established, execute the tableInfo action to view available in-memory tables. If necessary, you can execute the following code in SAS Studio to load the sashelp.cars table into memory.

conn.tableinfo(caslib="casuser")

The results show the cars table is loaded into memory and available for processing. Next, reference the cars table in the variable tbl. Then use the print function to show the value of the variable.

tbl = conn.CASTable('cars', caslib='casuser')
print(tbl)
CASTable('cars', caslib='casuser')

The results show that the tbl variable references the cars table in the CAS server.

Preview the CAS Table

First things first. Remember, the SWAT package blends the world of Pandas and CAS into one. So you can begin with the traditional head method to preview the CAS table.

tbl.head()

The SWAT head method returns five rows from the CAS server to the client as expected.

The Describe Method

Next, let's retrieve descriptive statistics of all numeric columns by using the familiar describe method on the CAS table.

tbl.describe()

The SWAT describe method returns the same descriptive statistics as the Pandas describe method. The only difference is that the SWAT version uses the CAS API to convert the describe method into CAS actions behind the scenes to process the data on the distributed CAS server. CAS processes the data and returns summarized results back to the client as a SASDataFrame, which is a subclass of the Pandas DataFrame. You can now work with the results as you would a Pandas DataFrame.

Summary CAS Action

Instead of using the familiar describe method, let's use a CAS action to do something similar. Here I'll use the summary CAS action.

tbl.summary()

Summary CAS Action

The results of the summary action return a CASResults object (Python dictionary) to the client. The CASResults object contains a single key named Summary with a SASDataFrame as the value. The SASDataFrame shows a variety of descriptive statistics.  While the summary action does not return exactly the same statistics as the describe method, it can provide additional insights into your data.

What if we don't want all the statistics for all of the data?

Selecting Columns and Summary Statistics with the Summary Action

Let's add additional parameters to the summary action. I'll add the inputs parameter to specify the columns to analyze in the CAS server.

tbl.summary(inputs = ['MPG_City','MPG_Highway'])

The results show only the MPG_City and MPG_Highway columns were analyzed.

Next, I'll use the subSet parameter to specify the summary statistics to produce. Here I'll obtain the MEAN, MIN and MAX.

tbl.summary(inputs = ['MPG_City','MPG_Highway'],
                       subSet = ['mean','min','max'])

The results processed only the MPG_City and MPG_Highway columns, and returned only the specified summary statistics to the client.

Creating a Calculated Column

Lastly, let's create a calculated column within the summary action. There are a variety of ways to do this. I like to add it as a parameter to the CASTable object. You can do that by specifying the tbl object, then computedVarsProgram parameter. Within computedVarsProgram you can use SAS assignment statements with most SAS functions. Here we will create a new column name MPG_Avg that takes the mean of MPG_City and MPG_Highway. Lastly, add the new column to the inputs parameter.

tbl.computedVarsProgram = 'MPG_Avg = mean(MPG_City, MPG_Highway);'
tbl.summary(inputs = ['MPG_City','MPG_Highway', 'MPG_Avg'],
                       subSet = ['mean','min','max'])

In the results I see the calculated column and requested summary statistics.

Summary

The SWAT package blends the world of Pandas and CAS. You can use many of the familiar Pandas methods within the SWAT package, or the flexible, highly optimized CAS actions like summary to easily obtain summary statistics of your data in the massively parallel processing CAS engine.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Index
SWAT API Reference
CAS Action Documentation
SAS® Cloud Analytic Services: Fundamentals
SAS Scripting Wrapper for Analytics Transfer (SWAT)
CAS Action! - a series on fundamentals
Execute the following code in SAS Studio to load the sashelp.cars table into memory

Getting Started with Python Integration to SAS® Viya® - Part 6 - Descriptive Statistics was published on SAS Users.

10月 142021
 

Trimming strings left and right

I am pretty sure you have never heard of the TRIMS function, and I would be genuinely surprised if you told me otherwise. This is because this function does not exist (at least at the time of this writing).

But don’t worry, the difference between "nonexistence" and "existence" is only a matter of time, and from now it is less than a blog away. Let me explain. Recently, I published two complementary blog posts:

[1] Removing leading characters from SAS strings

[2] Removing trailing characters from SAS strings

While working on these pieces and researching “prior art” I stumbled upon a multipurpose function in the SAS FedSQL Language that alone does either one or both of these things – remove leading or/and trailing characters from SAS strings.

FedSQL Language and Proc FedSQL

The FedSQL language is the SAS proprietary implementation of the ANSI SQL:1999 core standard. Expectedly, the FedSQL language is implemented in SAS by means of the FedSQL procedure (PROC FEDSQL). This procedure enables you to submit FedSQL language statements from a Base SAS session, and it is supported in both SAS 9.4 and SAS Viya.

Using the FEDSQL procedure, you can submit FedSQL language statements to SAS and third-party data sources that are accessed with SAS and SAS/ACCESS library engines. Or, if you have SAS Cloud Analytic Services (CAS) configured, you can submit FedSQL language statements to the CAS server.

FedSQL TRIM function

FedSQL language has its own vast FedSQL Functions library with hundreds of functions many of which replicate SAS 9.4 Functions. Many, but not all. Deep inside this FedSQL functions library, there is a unique treasure modestly called TRIM Function which is quite different from the BASE SAS Language TRIM() function.

While SAS 9.4 BASE TRIM() function capabilities are quite limited - it removes just trailing blanks from a character string, the FedSQL TRIM() function is way much more powerful. This triple-action function can remove not just trailing blanks, but also leading blanks, as well as both, leading and trailing blanks. On top of it, it can remove not just blanks, but any characters (although one character at a time). See for yourself, this function has the following pretty self-explanatory syntax:

TRIM( [BOTH | LEADING | TRAILING] [trim-character] FROM column)

Here trim-character specifies one character (in single quotations marks) to remove from column. If trim-character is not specified, the function removes blanks.

While being called a function, it does not look like a regular SAS function where arguments are separated by commas.  It looks more like an SQL statement (which it understandably is – it is part of the FedSQL language). However, this function is available only in PROC FEDSQL; it’s not available in SAS DATA steps or other PROC steps. Still, it gives us pretty good idea of what such a universal function may look like.

User-defined function TRIMS to remove leading or/and trailing characters in SAS strings

Let’s build such a function by means of the PROC FCMP for the outside the FedSQL usage (it is worth noticing that the FCMP procedure is not supported for FedSQL). To avoid confusion with the existing TRIM function we will call our new function TRIMS (with an ‘S’ at the end) which suits our purpose quite well denoting its plural purpose. First, we define what we are going to create.

User-defined TRIMS function

TRIMS Function

Removes leading characters, trailing characters, or both from a character string.

Syntax

TRIMS(function-modifier, string, trim-list, trim-list-modifier)

Required Arguments

  • function-modifier is a case-insensitive character constant, variable, or expression that specifies one of three possible operations:
    'L' or 'l' – removes leading characters.
    'T' or 't' – removes trailing characters.
    'B' or 'b' – removes both, leading and trailing characters.
  • string is a case-sensitive character constant, variable, or expression that specifies the character string to be trimmed.
  • trim-list is a case-sensitive character constant, variable, or expression that specifies character(s) to remove from the string.
  • trim-list-modifier is a case-insensitive character constant variable, or expression that supplements the trim-list.
    The valid values are those modifiers of the FINDC function that “add” groups of characters (e.g. 'a' or 'A', 'c' or 'C', 'd' or 'D', etc.) to the trim-list.

The following user-defined function implementation is based on the coding techniques described in the two previous posts, [1] and [2] that I mentioned above. Here goes.

 
libname funclib 'c:\projects\functions';
 
/* delete previous function definition during debugging */
options cmplib=funclib.userfuncs;
proc fcmp outlib=funclib.userfuncs.package1;
   deletefunc trims;
run;
 
/* new function defintion */
proc fcmp outlib=funclib.userfuncs.package1;
   function trims(f $, str $, clist $, mod $) $32767;
      from = 1;
      last = length(str);
      if upcase(f) in ('L', 'B') then from = findc(str, clist, 'K'||mod);
      if from=0 then return('');
      if upcase(f) in ('T', 'B') then last = findc(str, clist, 'K'||mod, -last); 
      if last=0 then return('');
      return(substr(str, from, last-from+1));      
   endfunc; 
run;

Code highlights

  • In the function definition, we first assign initial values of the target substring positions as from=1 and last=length(str).
  • Then for Leading or Both character removal, we calculate an adjusted value of from as a position of the first character in str that is not listed in clist and not defined by the mod
  • If from=0 then we return blank and stop further calculations as this means that ALL characters are to be removed.
  • Then for Trailing or Both character removal, we calculate an adjusted value of last as a position of the last character in str that is not listed in clist and not defined by the mod
  • If last=0 then we return blank and stop further calculations as this means that ALL characters are to be removed.
  • And finally, we return a substring of str starting at the from position and ending at the last position, that is with the length of last-from+1.

TRIMS function usage

Let’s define SAS data set SOURCE as follows:

data SOURCE;
   input X $ 1-30;
   datalines;
*00It's done*2*1**-
*--*1****9*55
94*Clean record-*00
;

In the following DATA step, we will create three new variables with removed leading (variable XL), trailing (variable XT) and both - leading and trailing (variable XB) characters '*' and '-' as well as any digits:

options cmplib=funclib.userfuncs;
data TARGET;
   set SOURCE;
   length XB XL XT $30;
   XB = trims('b', X, '*-', 'd');
   XL = trims('L', X, '*-', 'd');
   XT = trims('t', X, '*-', 'd');
run;

In this code we use the TRIM function three times, each time with a different first argument to illustrate how this affects the outcome.

Arguments usage highlights

  • The first argument of the TRIMS function specifies whether we remove characters from both leading and trailing positions ('b'), from leading positions only ('L'), or from trailing positions only ('t'). This argument is case-insensitive. (I prefer using capital 'L' for clarity since lowercase 'l' looks like digit '1').
  • The second argument specifies the name of the variable (X) that we are going to remove characters from (variable X is coming from the dataset SOURCE).
  • The third argument '*-' specifies which character (or characters) to remove. In our example we are removing '*' and '-'. If you do not need to explicitly specify any character here, you still must supply a null value ('') since it is a required argument. In this case, the fourth argument (trim-list-modifier) will determine the set of characters to be removed.
  • And finally, the fourth argument (case-insensitive) of the TRIMS function specifies the FINDC function modifier(s) to remove certain characters in bulk (in our example 'd' will remove all digits). If such modifier is not needed, you still must supply a null value ('') since all four arguments of the TRIMS function are positional and required.

Here is the output data table TARGRET showing the original string X and the resulting strings XB (Both leading and trailing characters removed), XL (Leading characters removed) and XT (Trailing characters removed) side by side:

Result of leading and trailing characters trimming

Conclusion

The new TRIMS function presented in this blog post goes ways further the ubiquitous LEFT and TRIM functions that remove the leading (LEFT) or trailing (TRIM) blanks. The TRIMS function handles ANY characters, not just blanks. It also expands the character deletion functionality of the powerful  FedSQL TRIM function beyond just removing any single leading and/or trailing character. The TRIMS function single-handedly removes any number of explicitly specified characters from either leading, trailing or both (leading and trailing) positions. Plus, it removes in bulk many implicitly specified characters. For example 'd' modifier removes all digits, 'du' modifier removes all digits ('d') and all uppercase letters ('u'), 'dup' modifier removes all digits ('d'), all uppercase letters ('u') and all punctuation marks ('p'); and so on as described by the FINDC function modifiers. The order in which modifier characters are listed does not matter.

Additional resources

Questions? Thoughts? Comments?

Do you find this post useful? Please share your thoughts with us below.

Introducing TRIMS function to remove any leading and/or trailing characters from SAS strings was published on SAS Users.

10月 012021
 

I'll admit it - I'm often impatient while waiting for results, so my code needs to run as fast as possible! In this video, I show you how you can get faster results, too.

I'll demonstrate several different techniques that produce identical results, and compare processing speeds. For a more robust assessment, I'll test the techniques while reading from both SAS data sets and database tables. Analysis of the results clearly shows that when the techniques produce identical results, these choices usually produce faster run times:

  1. Use a WHERE statement instead of a subsetting IF statement.
  2. Use the KEEP= dataset option on input data sets instead of a KEEP statement.
  3. Should I use SQL or DATA step?
    • If producing a single result, use SQL.
    • If producing multiple results, use the DATA step.
  4. If your process is CPU bound:
    • If you have access to CAS, run it in CAS.
    • Otherwise, refactor in DS2.

The programs I used to create the video are available for download from GitHub if you want to experiment.

Until next time, may the SAS be with you!
Mark

Jedi SAS Tricks: 5 Ways to Make Your SAS Code Run Faster was published on SAS Users.