In this blog post we are going to tackle a data cleansing task of removing unwanted repeated characters in SAS character variables.
Character repetition can stem from various stages of data life cycle: from data collection, to data transmission, to data transformation. It can be accidental or intentional by design. It can be sporadic or consistent. In either case, it needs to be addressed by robust data cleansing processes to ensure adequate data quality that is imperative for the data usability.
Character repetition examples
Example 1. Data entry, especially manual data entry, can be a high-risk factor for accidental character duplication. Have you ever pressed a key on your computer keyboard for a bit longer than intended, so it started automatically typing multiple characters???????????????
Tip: You can adjust your Keyboard Properties to control “Repeat delay” and “Repeat rate” settings (on Windows computer, search for “Keyboard” and click on Keyboard in Control Panel).
Example 2. Recently, I had to deal with the data that contained multiple consecutive double quotation marks all over the character string values. Even though we don’t know the exact cause of it, still for each of these duplicated quotation marks occurrences we needed to replace them with a single quotation mark.
Removing repeated blanks
data D; c = ','; *<- character to un-duplicate; cc = c||c; *<- double character; string = 'Many,,,,,, commas,,,,, in,,, this,, sentence.,'; *<- source string; put 'BEFORE:' string=; *<- output initial string; do while (find(string,cc)); *<- loop through while there are doubles; string = tranwrd(string,cc,c); *<- replace double with a single character; end; put 'AFTER: ' string=; *<- output unduplicated string; run;
This code will produce the following in the SAS log:
BEFORE:string=Many,,,,,, commas,,,,, in,,, this,, sentence.,
AFTER: string=Many, commas, in, this, sentence.,
which shows that this approach correctly un-duplicates the source string removing and replacing all repeated characters (commas in our example) with a single one.
User-defined SAS function for removing any repeated characters
libname funclib 'c:\projects\functions'; proc fcmp outlib=funclib.userfuncs.package1; function undupc(str $, clist $) $; length x $32767 c $1 cc $2; x = str; do i=1 to length(clist); c = char(clist,i); cc = c||c; do while (find(trim(x),cc)); x = tranwrd(x,cc,c); end; end; return (x); endfunc; run;
- We introduce an interim variable x to which we will iteratively apply replacing double characters with a single one.
- We assign length attribute of this variable to be maximum allowable character length of 32767 bytes to accommodate any character length used in the calling program.
- Outer do-loop loops through the clist containing characters we want to unduplicate.
- Variable c is assigned a single character from clist, variable cc is assigned double of the cc value.
- Inner do-loop iterates through trimmed characters in x while doubles are found; using trim(x) is essential as it not only speeds up processing while searching through a shorter string (without trailing blanks), it also prevents from falling into an infinite loop in case clist contains blank character to unduplicate (cc equals to double blanks which are always going to be found among trailing blanks).
Let’s test our newly minted UNDUPC function on the following data:
data SOURCE; infile datalines truncover; input str $50.; datalines; """Repeated "double quotes"""" Repeated,,,,,commas,,,,,,,,,,, [[[""Mixed""]] characters,,, ;
options cmplib=funclib.userfuncs; data TARGET; set SOURCE; length new_str $50; new_str = undupc(str, ' ,"]['); run;