Reading an external file that contains delimiters (commas, tabs, or other characters such as a pipe character or an exclamation point) is easy when you use the IMPORT procedure. It's easy in that variable names are on row 1, the data starts on row 2, and the first 20 rows are a good sample of your data. Unfortunately, most delimited files are not created with those restrictions in mind. So how do you read files that do not follow those restrictions?
You can still use PROC IMPORT to read the comma-, tab-, or otherwise-delimited files. However, depending on the circumstances, you might have to add the GUESSINGROWS= statement to PROC IMPORT or you might need to pre-process the delimited file before you use PROC IMPORT.
Note: PROC IMPORT is available only for use in the Microsoft Windows, UNIX, or Linux operating environments.
The following sections explain four different scenarios for using PROC IMPORT to read files that contain the delimiters that are listed above.
In this scenario, I use PROC IMPORT to read a comma-delimited file that has variable names on row 1 and data starting on row 2, as shown below:
proc import datafile='c:\temp\classdata.csv' out=class dbms=csv replace; run;
When I submit this code, the following message appears in my SAS® log:
NOTE: Invalid data for Age in line 28 9-10. RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--- 28 Janet,F,NA,62.5,112.5 21 Name=Janet Sex=F Age=. Height=62.5 Weight=112.5 _ERROR_=1 _N_=27 NOTE: 38 records were read from the infile 'c:\temp\classdata.csv'. The minimum record length was 17. The maximum record length was 21. NOTE: The data set WORK.CLASS has 38 observations and 5 variables.
In this situation, how do you prevent the Invalid Data message in the SAS log?
By default, SAS scans the first 20 rows to determine variable attributes (type and length) when it reads a comma-, tab-, or otherwise-delimited file. Beginning in SAS® 9.1, a new statement (GUESSINGROWS=) is available in PROC IMPORT that enables you to tell SAS how many rows you want it to scan in order to determine variable attributes. In SAS 9.1 and SAS® 9.2, the GUESSINGROWS= value can range from 1 to 32767. Beginning in SAS® 9.3, the GUESSINGROWS= value can range from 1 to 2147483647. Keep in mind that the more rows you scan, the longer it takes for the PROC IMPORT to run.
The following program illustrates the use of the GUESSINGROWS= statement in PROC IMPORT:
proc import datafile='c:\temp\classdata.csv' out=class dbms=csv replace; guessingrows=100; run;
The example above includes the statement GUESSINGROWS=100, which instructs SAS to scan the first 100 rows of the external file for variable attributes. You might need to increase the GUESSINGROWS= value to something greater than 100 to obtain the results that you want.
In this scenario, my delimited file has the variable names on row 4 and the data starts on row 5. When you use PROC IMPORT, you can specify the record number at which SAS should begin reading. Although you can specify which record to start with in PROC IMPORT, you cannot extract the variable names from any other row except the first row of an external file that is comma-, tab-, or an otherwise-delimited.
Then how do you program PROC IMPORT so that it begins reading from a specified row?
To do that, you need to allow SAS to assign the variable names in the form VARx (where x is a sequential number). The following code illustrates how you can skip the first rows of data and start reading from row 4 by allowing SAS to assign the variable names:
proc import datafile='c:\temp\class.csv' out=class dbms=csv replace; getnames=no; datarow=4; run;
In this scenario, I want to read only records 6–15 (inclusive) in the delimited file. So the question here is how can you set PROC IMPORT to read just a section of a delimited file?
To do that, you need to use the OBS= option before you execute PROC IMPORT and use the DATAROW= option within PROC IMPORT.
The following example reads the middle ten rows of a CSV file, starting at row 6:
options obs=15; proc import out=work.test2 datafile= "c:\temp\class.csv" dbms=csv replace; getnames=yes; datarow=6; run; options obs=max; run;
Notice that I reset the OBS= option to MAX after the IMPORT procedure to ensure that any code that I run after the procedure processes all observations.
In this scenario, I again use PROC IMPORT to read my external file. However, I receive more observations in my SAS data set than there are data rows in my delimited file. The external file looks fine when it is opened with Microsoft Excel. However, when I use Microsoft Windows Notepad or TextPad to view some records, my data spans multiple rows for values that are enclosed in quotation marks. Here is a snapshot of what the file looks like in both Microsoft Excel and TextPad, respectively:
The question for this scenario is how can I use PROC IMPORT to read this data so that the observations in my SAS data set match the number of rows in my delimited file?
In this case, the external file contains embedded carriage return (CR) and line feed (LF) characters in the middle of the data value within a quoted string. The CRLF is an end-of-record marker, so the remaining text in the string becomes the next record. Here are the results from reading the CSV file that is illustrated in the Excel and TextPad files that are shown earlier:
That behavior is why you receive more observations than you expect. Anytime SAS encounters a CRLF, SAS considers that a new record regardless of where it is found.
A sample program that removes a CRLF character (as long as it is part of a quoted text string) is available in SAS Note 26065, "Remove carriage return and line feed characters within quoted strings."
After you run the code (from the Full Code tab) in SAS Note 26065 to pre-process the external file and remove the erroneous CR/LF characters, you should be able to use PROC IMPORT to read the external file with no problems.
For more information about PROC IMPORT, see "Chapter 35, The IMPORT Procedure" in the Base SAS® 9.4 Procedures Guide, Seventh Edition.