David Loshin describes some steps you can take to ensure that self-service data preparation improves collaboration.
The post Does self-service data preparation improve collaboration? appeared first on The Data Roundtable.
David Loshin describes some steps you can take to ensure that self-service data preparation improves collaboration.
The post Does self-service data preparation improve collaboration? appeared first on The Data Roundtable.
If you've visited SAS documentation (also known as the "SAS Help Center") lately, you may have noticed that we've made some fairly significant changes in the documentation for SAS products and solutions. The new site is organized in a new way, search works a little differently, and the user interface has changed. These changes are part of our continuous pursuit to provide you with the best possible experience on our website.
Below you'll find a quick summary of what's new. Check out the SAS Help Center and let us know what you think.
(You'll find ways to provide feedback at the end of this post. We'd love to hear from you!)
For starters, SAS documentation has a new location on the web: http://documentation.sas.com and a new name: the “SAS Help Center.” You'll notice the SAS Help Center homepage serves as a gateway to documentation for a number of SAS products and solutions. We've highlighted a few of the major product documentation sets at the top of the page, with a full listing of available documentation immediately following. The user interface contains many of the same features as the documentation you used on support.sas.com, but there are a few little differences. Perhaps the most significant - search works a little differently. More on that in a bit.
SAS documentation is now organized into topic-focused collections. For example, SAS Viya Administration docs are together in another collection. You'll find collections for a number of different topic areas, with each collection containing all the documentation for that specific topic area. For a list of all topic areas, see the Products Index A – Z .
When you use search in the new SAS Help Center, be aware that you're only searching the specific documentation collection that you are using at the time. For example, if you're inside the SAS Viya 3.2 Administration documentation and initiate a search, you will only see results for the doc within the SAS Viya 3.2 Administration collection. If you prefer to search all doc collections at once, you can use the search on support.sas.com or use a third-party search tool, such as Google or Bing. (For tips and guidelines on using search, visit our “email@example.com.
Suppose you want a list of car manufacturers from the CARS dataset. Easy! Call the %CHARLIST macro from a %PUT statement, like this: The CHARLIST macro generates a list of unique values of a selected variable from a selected dataset. So does PROC FREQ. But, if you don't need statistics, the CHARLIST [...]
This article describes the advantages and disadvantages of principal component regression (PCR). This article also presents alternative techniques to PCR.
In a previous article, I showed how to compute a principal component regression in SAS. Recall that principal component regression is a technique for handling near collinearities among the regression variables in a linear regression. The PCR algorithm in most statistical software is more correctly called "incomplete" PCR because it uses only a subset of the principal components. Incomplete PCR means that you compute the principal components for the explanatory variables, keep only the first k principal components (which explain most of the variance among the regressors), and regress the response variable onto those k components.
The principal components that are dropped correspond to the near collinearities. Consequently, the standard errors of the parameter estimates are reduced, although the tradeoff is that the estimates are biased, and "the bias increases as more [principal components]are droppped" (Jackson, p. 276).
Some arguments in this article are from J. E. Jackson's excellent book, A User's Guide to Principal Components, (1991, pp. 271–278). Jackson introduces PCR and then immediately cautions against using it (p. 271). He writes that PCR "is a widely used technique," but "it also has some serious drawbacks." Let's examine the advantages and disadvantages of principal component regression.
Principal component regression is a popular and widely used method. Advantages of PCR include the following:
The algorithm that is currently known as PCR is actually a misinterpretation of the original ideas behind PCR (Jolliffe, 1982, p. 201). When Kendall and Hotelling first proposed PCR in the 1950s, they proposed "complete" PCR, which means replacing the original variables by all the principal components, thereby stabilizing the numerical computations. Which principal components are included in the final model is determined by looking at the significance of the parameter estimates. By the early 1980s, the term PCR had changed to mean "incomplete PCR."
The primary argument against using (incomplete) principal component regression can be summarized in a single sentence: Principal component regression does not consider the response variable when deciding which principal components to drop. The decision to drop components is based only on the magnitude of the variance of the components.
There is no a priori reason to believe that the principal components with the largest variance are the components that best predict the response. In fact, it is trivial to construct an artificial example in which the best predictor is the last component, which will surely be dropped from the analysis. (Just define the response to be the last principal component!) More damning, Jolliffe (1982, p. 302) presents four examples from published papers that advocate PCR, and he shows that some of the low-variance components (which were dropped) have greater predictive power than the high-variance components that were kept. Jolliffe concludes that "it is not necessary to find obscure or bizarre data in order for the last few principal components to be important in principal component regression. Rather it seems that such examples may be rather common in practice."
There is a hybrid version of PCR that enables you to use cross validation and the predicted residual sum of squares (PRESS) criterion to select how many components to keep. (In SAS, the syntax is proc pls method=PCR cv=one cvtest(stat=press).) Although this partially addresses the issue by including the response variable in the selection of components, it is still the case that the first k components are selected and the last p – k are dropped. The method never keeps the first, third, and sixth components, for example.
Some alternatives to principal component regression include the following:
In summary, principal component regression is a technique for computing regressions when the explanatory variables are highly correlated. It has several advantages, but the main drawback of PCR is that the decision about how many principal components to keep does not depend on the response variable. Consequently, some of the variables that you keep might not be strong predictors of the response, and some of the components that you drop might be excellent predictors. A good alternative is partial least squares regression, which I recommend. In SAS, you can run partial least squares regression by using PROC PLS with METHOD=PLS.
Quick Quiz! Where might you hear the following conversation? ... Waitress: "What would you like to drink, honey?" Customer: "I'll have a coke." Waitress: "What kind?" Customer: "Diet Pepsi." If you answered somewhere between Texas and Georgia, you would be correct! To those of us not from that area, it [...]
A common question on discussion forums is how to compute a principal component regression in SAS. One reason people give for wanting to run a principal component regression is that the explanatory variables in the model are highly correlated which each other, a condition known as multicollinearity. Although principal component regression (PCR) is a popular technique for dealing with almost collinear data, PCR is not a cure-all. This article shows how to compute a principal component regression in SAS; a subsequent article discusses the problems with PCR and presents alternative techniques.
Near collinearity among the explanatory variables in a regression model requires special handling because:
Principal component regression keeps only the most important principal components and discards the others. This means that you compute the principal components for the explanatory variables and drop the components that correspond to the smallest eigenvalues of X`X. If you keep k principal components, then those components enable you to form a rank-k approximation to the crossproduct matrix. If you regress the response variable onto those k components, you obtain a PCR. Usually the parameter estimates are expressed in terms of the original variables, rather than in terms of the principal components.
In SAS there are two easy ways to compute principal component regression:
Notice that neither of these methods calls PROC PRINCOMP. You could call PROC PRINCOMP, but it would be more complicated than the previous methods. You would have to extract the first principal components (PCs), then use PROC REG to compute the regression coefficients for the PCs, then use matrix computations to convert the parameter estimates from the PCs to the original variables.
Principal component regression is also sometimes used for general dimension reduction. Instead of projecting the response variable onto a p-dimensional space of raw variables, PCR projects the response onto a k-dimensional space where k is less than p. For dimension reduction, you might want to consider another approach such as variable selection by using PROC GLMSELECT or PROC HPGENSELECT. The reason is that the PCR model retains all of the original variables whereas variable selection procedures result in models that have fewer variables.
I recommend using the PLS procedure to compute a principal component regression in SAS. As mentioned previously, you need to use the METHOD=PCR and NFAC= options. The following data for 31 men at a fitness center is from the documentation for PROC REG. The goal of the study is to predict oxygen consumption from age, weight, and various physiological measurements before and during exercise. The following call to PROC PLS computes a PCR that keeps four principal components:
data fitness; input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@; datalines; 44 89.47 44.609 11.37 62 178 182 40 75.07 45.313 10.07 62 185 185 44 85.84 54.297 8.65 45 156 168 42 68.15 59.571 8.17 40 166 172 38 89.02 49.874 9.22 55 178 180 47 77.45 44.811 11.63 58 176 176 40 75.98 45.681 11.95 70 176 180 43 81.19 49.091 10.85 64 162 170 44 81.42 39.442 13.08 63 174 176 38 81.87 60.055 8.63 48 170 186 44 73.03 50.541 10.13 45 168 168 45 87.66 37.388 14.03 56 186 192 45 66.45 44.754 11.12 51 176 176 47 79.15 47.273 10.60 47 162 164 54 83.12 51.855 10.33 50 166 170 49 81.42 49.156 8.95 44 180 185 51 69.63 40.836 10.95 57 168 172 51 77.91 46.672 10.00 48 162 168 48 91.63 46.774 10.25 48 162 164 49 73.37 50.388 10.08 67 168 168 57 73.37 39.407 12.63 58 174 176 54 79.38 46.080 11.17 62 156 165 52 76.32 45.441 9.63 48 164 166 50 70.87 54.625 8.92 48 146 155 51 67.25 45.118 11.08 48 172 172 54 91.63 39.203 12.88 44 168 172 51 73.71 45.790 10.47 59 186 188 57 59.08 50.545 9.93 49 148 155 49 76.32 48.673 9.40 56 186 188 48 61.24 47.920 11.50 52 170 176 52 82.78 47.467 10.50 53 170 172 ; proc pls data=fitness method=PCR nfac=4; /* PCR onto 4 factors */ model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse / solution; run;
The output includes the parameter estimates table, which gives the estimates for the four-component regression in terms of the original variables. Another table (not shown) shows that the first four principal components explain 93% of the variation in the explanatory variables and 78% of the variation in the response variable.
For another example of using PROC PLS to combat collinearity, see Yu (2011), "Principal Component Regression as a Countermeasure against Collinearity."
I recommend PROC PLS for principal component regression, but you can also compute a PCR by using the PCOMIT= option on the MODEL statement in PROC REG. However, the parameter estimates are not displayed in any table but must be written to OUTEST= data set, as follows:
proc reg data=fitness plots=none outest=PE; /* write PCR estimates to PE data set */ model Oxygen=Age Weight RunTime RunPulse RestPulse MaxPulse / PCOmit=2; /* omit 2 PCs ==> keep 6-2=4 PCs */ quit; proc print data=PE(where=(_Type_="IPC")) noobs; var Intercept--MaxPulse; run;
Notice that the PCOMIT=2 option specifies that two PCs should be dropped, which is equivalent to keeping four components in this six-variable model. The parameter estimates are written to the PE data set and are displayed by PROC PRINT. The estimates the same as those found by PROC PLS. In the PE data, the PCR estimates are indicated by the value "IPC" for the _TYPE_ variable, which stands for incomplete principal component regression. The word "incomplete" indicates that not all the principal components are used.
It is worth noting that even though the principal components themselves are based on centered and scaled data, the parameter estimates are reported for the original (raw) variables. It is also worth noting that you can use the OUTSEB option on the PROC REG statement to obtain standard errors for the parameter estimates.
This article shows you how to perform principal component regression in SAS by using PROC PLS with METHOD=PCR. However, I must point out that there are statistical drawbacks to using principal component regression. The primary issue is that principal component regression does not use any information about the response variable when choosing the principal components. Before you decide to use PCR, I urge you to read my next post about the drawbacks with the technique. You can then make an informed decision about whether you want to use principal component regression for your data.
To make accurate predictions, it is necessary that the sample data you use for model development is compatible with the target population. The distribution of each input used in the model should be similar in the sample and the target population. In your model you should include only those variables [...]
The post Compatibility between model data and the target population appeared first on SAS Learning Post.
There are many compelling reasons existing SAS users might want to start integrating SAS Viya into their SAS9 programs and applications. For me, it comes down to ease-of-use, speed, and faster time-to-value. With the ability to traverse the (necessarily iterative) analytics lifecycle faster than before, we are now able to generate output quicker – better supporting vital decision-making in a reduced timeframe. In addition to the positive impacts this can have on productivity, it can also change the way we look at current business challenges and how we design possible solutions.
Earlier this year I wrote about how SAS Viya provides a robust analytics environment to handle all of your big data processing needs. Since then, I’ve been involved in testing the new SAS Viya 3.3 software that will be released near the end of 2017 and found some additional advantages I think warrant attention. In this article, I rank order the main advantages of SAS Viya processing and new capabilities coming to SAS Viya 3.3 products. While the new SAS Viya feature list is too long to list everything individually, I’ve put together the top reasons why you might want to start taking advantage of SAS Viya capabilities of the SAS platform.
In SAS Viya, everything that can run multi-threaded - does. This is the single-most important aspect of the SAS Viya architecture for existing SAS customers. As part of this new holistic approach to data processing, SAS has enabled the highly flexible DATA step to run multi-threaded, requiring very little modification of code in order to begin taking advantage of this significant new capability (more on that in soon-to-be-released blog). Migrating to SAS Viya is important especially in those cases where long-running jobs consist of very long DATA steps that act as processing bottle-necks where constraints exist because of older single-threading configurations.
While not 100% true, most sort routines can be removed from your existing SAS programs. Ask yourself the question: “What portion of my runtimes are due strictly to sorting?” The answer is likely around 10-25%, maybe more. In general, the concept of sorting goes away with in-memory processing. SAS Viya does its own internal memory shuffling as a replacement. The SAS Viya CAS engine takes care of partitioning and organizing the data so you don’t have to. So, take those sorts out your existing code!
Not available in SAS 9.4, the VARCHAR informat/format allows you to store byte information without having to allocate room for blank spaces. Because storage for columnar (input) values varies by row, you have the potential to achieve an enormous amount of (blank space) savings, which is especially important if you are using expensive (fast) disk storage space. This represents a huge value in terms of potential data storage size reduction.
SAS Viya can leverage Hive/HDFS and Teradata platforms by loading (lifting) data up and writing data back down in parallel using CAS pooled memory. Data I/O, namely reading data from disk and converting it into a SAS binary format needed for processing, is the single most limiting factor of SAS 9.4. Once you speed up your data loading, especially for extremely large data sets, you will be able to generate faster time to results for all analyses and projects.
Similar to SAS LASR, CAS can be structured to persist large data sets in memory, indefinitely. This allows users to access the same data at the same time and eliminates redundancy and repetitive I/O, potentially saving valuable compute cycles. Essentially, you can load the data once and then as many people (or processing steps) can reuse it as many times as needed thereafter.
All the most popular ML techniques are represented giving you the flexibility to customize model tournaments to include those techniques most appropriate for your given data and problem set. We also provide assessment capabilities, thus saving you valuable time to get the types of information you need to make valid model comparisons (like ROC charts, lift charts, etc.) and pick your champion models. We do not have extreme Gradient Boosting, Factorization Machines, or a specific Assessment procedure in SAS 9.4. Also, GPU processing is supported in SAS Viya 3.3, for Deep Neural Networks and Convolutional Neural Networks (this has not be available previously).
The task of transposing data amounts to about 80% of any model building exercise, since predictive analytics requires a specialized data set called a ‘one-row-per-subject’ Analytic Base Table (ABT). SAS Viya allows you transpose in a fraction of the time that it used to take to develop the critical ABT outputs. A phenomenal time-saver procedure that now runs entirely multi-threaded, in-memory.
The ability to code from external interfaces gives coders the flexibility they need in today’s fast-moving programming world. SAS Viya supports native language bindings for Lua, Java, Python and R. This means, for example, that you can launch SAS processes from a Jupyter Notebook while staying within a Python coding environment. SAS also provide a REST API for use in data science and IT departments.
The core of SAS Viya machine learning techniques support auto-tuning. SAS has the most effective hyper-parameter search and optimization routines, allowing data scientists to arrive at the correct algorithm settings with higher probability and speed, giving them better answers with less effort. And because ML scoring code output is significantly more complex, SAS Viya Data Mining and Machine Learning allows you to deploy compact binary score files (called Astore files) into databases to help facilitate scoring. These binary files do not require compilation and can be pushed to ESP-supported edge analytics. Additionally, training within event streams is being examined for a future release.
A. Less coding – SAS Viya acts as a code generator, producing batch code for repeatability and score code for easier deployment. Both batch code and score code can be produced in a variety of formats, including SAS, Java, and Python.
B. Improved data integration between SAS Viya visual analytics products – you can now edit your data in-memory and pass it effortlessly through to reporting, modeling, text, and forecasting applications (new tabs in a single application interface).
C. Ability to compare modeling pipelines – now data scientists can compare champion models from any number of pipelines (think of SAS9 EM projects or data flows) they’ve created.
D. Best practices and white box templates – once only available as part of SAS 9 Rapid Predictive Modeler, Model Studio now gives you easy access to basic, intermediate and advanced model templates.
E. Reusable components – Users can save their best work (including pipelines and individual nodes) and share it with others. Collaborating is easier than ever.
You can load big data without having all that data fit into memory. Before in HPA or LASR engines, the memory environment had to be sized exactly to fit all the data. That prior requirement has been removed using CAS technology – a really nice feature.
SAS Viya seeks to standardize on common algorithms and techniques provided within every analytic technique so that you don’t get different answers when attempting to do things using alternate procedures or methods. For instance, our deployment of Stochastic Gradient Descent is now the same in every technique that uses that method. Consistency also applies to the interfaces, as SAS Viya attempts to standardize the look-and-feel of various interfaces to reduce your learning curve when using a new capability.
The net result of these Top 12 advantages is that you have access to state-of-the-art technology, jobs finish faster, and you ultimately get faster time-to-value. While this idea has been articulated in some of the above points, it is important to re-emphasize because SAS Viya benefits, when added together, result in higher throughputs of work, a greater flexibility in terms of options, and the ability to keep running when other systems would have failed. You just have a much greater efficiency/productivity level when using SAS Viya as compared to before. So why not use it?
Reading an external file that contains delimiters (commas, tabs, or other characters such as a pipe character or an exclamation point) is easy when you use the IMPORT procedure. It's easy in that variable names are on row 1, the data starts on row 2, and the first 20 rows are a good sample of your data. Unfortunately, most delimited files are not created with those restrictions in mind. So how do you read files that do not follow those restrictions?
You can still use PROC IMPORT to read the comma-, tab-, or otherwise-delimited files. However, depending on the circumstances, you might have to add the GUESSINGROWS= statement to PROC IMPORT or you might need to pre-process the delimited file before you use PROC IMPORT.
Note: PROC IMPORT is available only for use in the Microsoft Windows, UNIX, or Linux operating environments.
The following sections explain four different scenarios for using PROC IMPORT to read files that contain the delimiters that are listed above.
In this scenario, I use PROC IMPORT to read a comma-delimited file that has variable names on row 1 and data starting on row 2, as shown below:
proc import datafile='c:\temp\classdata.csv' out=class dbms=csv replace; run;
When I submit this code, the following message appears in my SAS® log:
NOTE: Invalid data for Age in line 28 9-10. RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--- 28 Janet,F,NA,62.5,112.5 21 Name=Janet Sex=F Age=. Height=62.5 Weight=112.5 _ERROR_=1 _N_=27 NOTE: 38 records were read from the infile 'c:\temp\classdata.csv'. The minimum record length was 17. The maximum record length was 21. NOTE: The data set WORK.CLASS has 38 observations and 5 variables.
In this situation, how do you prevent the Invalid Data message in the SAS log?
By default, SAS scans the first 20 rows to determine variable attributes (type and length) when it reads a comma-, tab-, or otherwise-delimited file. Beginning in SAS® 9.1, a new statement (GUESSINGROWS=) is available in PROC IMPORT that enables you to tell SAS how many rows you want it to scan in order to determine variable attributes. In SAS 9.1 and SAS® 9.2, the GUESSINGROWS= value can range from 1 to 32767. Beginning in SAS® 9.3, the GUESSINGROWS= value can range from 1 to 2147483647. Keep in mind that the more rows you scan, the longer it takes for the PROC IMPORT to run.
The following program illustrates the use of the GUESSINGROWS= statement in PROC IMPORT:
proc import datafile='c:\temp\classdata.csv' out=class dbms=csv replace; guessingrows=100; run;
The example above includes the statement GUESSINGROWS=100, which instructs SAS to scan the first 100 rows of the external file for variable attributes. You might need to increase the GUESSINGROWS= value to something greater than 100 to obtain the results that you want.
In this scenario, my delimited file has the variable names on row 4 and the data starts on row 5. When you use PROC IMPORT, you can specify the record number at which SAS should begin reading. Although you can specify which record to start with in PROC IMPORT, you cannot extract the variable names from any other row except the first row of an external file that is comma-, tab-, or an otherwise-delimited.
Then how do you program PROC IMPORT so that it begins reading from a specified row?
To do that, you need to allow SAS to assign the variable names in the form VARx (where x is a sequential number). The following code illustrates how you can skip the first rows of data and start reading from row 4 by allowing SAS to assign the variable names:
proc import datafile='c:\temp\class.csv' out=class dbms=csv replace; getnames=no; datarow=4; run;
In this scenario, I want to read only records 6–15 (inclusive) in the delimited file. So the question here is how can you set PROC IMPORT to read just a section of a delimited file?
To do that, you need to use the OBS= option before you execute PROC IMPORT and use the DATAROW= option within PROC IMPORT.
The following example reads the middle ten rows of a CSV file, starting at row 6:
options obs=15; proc import out=work.test2 datafile= "c:\temp\class.csv" dbms=csv replace; getnames=yes; datarow=6; run; options obs=max; run;
Notice that I reset the OBS= option to MAX after the IMPORT procedure to ensure that any code that I run after the procedure processes all observations.
In this scenario, I again use PROC IMPORT to read my external file. However, I receive more observations in my SAS data set than there are data rows in my delimited file. The external file looks fine when it is opened with Microsoft Excel. However, when I use Microsoft Windows Notepad or TextPad to view some records, my data spans multiple rows for values that are enclosed in quotation marks. Here is a snapshot of what the file looks like in both Microsoft Excel and TextPad, respectively:
The question for this scenario is how can I use PROC IMPORT to read this data so that the observations in my SAS data set match the number of rows in my delimited file?
In this case, the external file contains embedded carriage return (CR) and line feed (LF) characters in the middle of the data value within a quoted string. The CRLF is an end-of-record marker, so the remaining text in the string becomes the next record. Here are the results from reading the CSV file that is illustrated in the Excel and TextPad files that are shown earlier:
That behavior is why you receive more observations than you expect. Anytime SAS encounters a CRLF, SAS considers that a new record regardless of where it is found.
A sample program that removes a CRLF character (as long as it is part of a quoted text string) is available in SAS Note 26065, "Remove carriage return and line feed characters within quoted strings."
After you run the code (from the Full Code tab) in SAS Note 26065 to pre-process the external file and remove the erroneous CR/LF characters, you should be able to use PROC IMPORT to read the external file with no problems.
For more information about PROC IMPORT, see "Chapter 35, The IMPORT Procedure" in the Base SAS® 9.4 Procedures Guide, Seventh Edition.