4月 082019
 

Have you ever wondered if love at first sight really exists? And if it exists, what qualities are people drawn too? Watch any romantic comedy and you’ll see this phenomenon play out on the big screen. Which begs the question, “If it can happen to them why not me?” Let’s [...]

Love at first sight: authentic or absurd? was published on SAS Voices by Melanie Carey

4月 052019
 

You are a data scientist, in your office, doing data scientist-y things when, your manager's, manager's, manager makes an impossible request. She wants you take a raw data set from the stem cell research team, scrub the data, create and score models, and be ready to rescore when new data comes is available. And she wants it in a week. WHAT?! Your company doesn't own an analytics software license, and a spreadsheet is not going to work on this data with millions of records. Even if you received funding, how could you ever create and maintain an environment under your tight deadline? Take a deep breath, conjure your inner data scientist acumen, and realize SAS has the answer.

SAS Machine Learning on SAS Analytics Cloud provides on-demand programming access to machine learning algorithms in the cloud. No downloads, no install, no infrastructure, no maintenance. This solution provides a multithreaded, multiuser environment for concurrent access to data in memory. The solution is designed for data scientists (and others) coding in SAS or Python and allows them on-demand programmatic access to SAS Viya. You can find more details on Analytic Cloud in the fact sheet. You can even try it for free! The rest of this article will walk you through the features of this new SAS offering and outline how it can help you complete the task bestowed upon you.

Register and get started

Literally, to sign up for the trial, all you need are a SAS Profile, an email address, and a PC. You will be coding in SAS in less than a minute. From the SAS Cloud Analytics page, select the Get Free Trial button. This takes you to the SAS Profile login page (note you can create your SAS Profile here if you do not have one).

SAS Profile log in or creation

Agree to the Terms and Conditions on the License Agreement page and select the Continue button:

Trial License Agreement

You will receive an email containing a URL much like the following:

email confirmation with trial URL

Logging in

Select the link or paste it into your browser (Google Chrome 64-bit recommended) and you will see the log in screen. Enter your SAS Profile credentials and click the Sign In button.

Sign In screen

The Home screen (Applications) appears.

Home Page

We'll discuss the Data and Team pages in further detail later on in this article. You have two options for applications: SAS Studio (for SAS programming) and JupyterLab (for Python programming). This article focuses on SAS Studio. A follow up article will cover the JupyterLab use case. Select the SAS Studio button, a new tab opens to SAS Studio, and we're ready to start coding.

SAS Studio

You are familiar with the SAS language, but you need to brush up a little. Have no fear, support documentation is easily accessible. Also, the SAS Data Mining and Machine Learning Community is a great place to discover additional resources and ask questions. Finally, embedded in SAS Studio are code snippets. You decide to explore the latter.

Code snippets

In SAS Studio select the Snippets twisty in the left pane. Navigate to the SAS Viya Machine Learning section. Here you find code samples you will use to prep and analyze your data. When opening a snippet, you see code and detailed comments on what the code will accomplish. You will use these snippets as a guide when you load and prep your data and preform your analysis. Below is an image of the Prepare and Explore Data snippet. Notice each code step has accompanying comments.

You read through each snippet in the Machine Learning section. The command and structure of the code comes back to you pretty quickly and you're now ready to try it all out on your own data.

Uploading data

Now that you have an idea of what code you need to write, you need to load the data from the research department. You accomplish this by selecting the Server Files and Folders twisty and navigate to the Folder Shortcuts section. In this instance you want to upload your file into the shared/data directory (I'll explain why I chose this location in a moment). Use the Upload button to upload the research data file.

Upload file to the data directory

You're not alone

Files uploaded to shared/data are now visible by others logged into the environment. Wait, did I forget to mention this is a multi-user environment?! Well, yes, it is. You can invite others to collaborate on the project. To add and manage users, return to the Home screen (leaving SAS Studio open). Select the Team section in the left pane. The Team page lists users and displays an Invite button, used to send an invitation for system access to others.

Teams page

To invite others, click the button and enter the email address of the new user. This generates and sends an invitation in an email. The new user accepts the invite and now has access to the system. Using the URL provided in the email the new user logs in using their own SAS Profile credentials. The default role for new users is ‘User.’ A user with admin privileges can change the role to ‘Admin. In the free trial, you are permitted to have a total of five users.

Shared data

You may have guessed by now the Data section lists directories and files located in the shared directory in SAS Studio.

Data page

You also notice here you have 5 GB of storage space. This includes shared and non-shared files.

I love this. How do I get more?

Now you know your way around the system and are ready to start coding. Return to SAS Studio, open a new program, and commence your analysis of the stem cell data. When you successfully deliver the project and impress your management chain, you can mention how the SAS Analytics Cloud solution made it all possible (and simple). You now have a case for the departmental procurement of the solution opening your organization up to add more users, access more storage, and gain more power to run advanced machine learning algorithms on your data.

Your turn

In this article I've outlined how to easily register for the SAS Machine Learning trial and start coding in the matter of a minute. Try it out yourself. Register, load your data, get coding, and solve your problem.

Related Resources

For more details on the development of SAS Analytics Cloud, check out Missy Hannah's interview with two UI developers on the project.

Zero to SAS in 60 Seconds- SAS Machine Learning on SAS Analytics Cloud was published on SAS Users.

4月 042019
 

According to the World Cancer Research Fund, Breast cancer is one of the most common cancers worldwide, with 12.3% of new cancer patients in 2018 suffering from breast cancer. Early detection can significantly improve treatment value, however, the interpretation of cancer images heavily depends on the experience of doctors and technicians. The [...]

Malignant or benign? Cancer detection with SAS Viya and NVIDIA GPUs was published on SAS Voices by David Tareen

4月 022019
 

There's been a lot of hype regarding using machine learning (ML) for demand forecasting, and rightfully so, given the advancements in data collection, storage, and processing along with improvements in technology. There's no reason why machine learning can't be utilized as another forecasting method among the collection of forecasting methods [...]

Is machine learning practical for statistical forecasting? was published on SAS Voices by Charlie Chase

3月 282019
 

Over the past couple of years, I've written about a variety of use cases and value props regarding SAS® Customer Intelligence 360. As powerful as words and images can be, let's transition to the ultimate show – the demo. In the forty minute video below, observe how SAS Customer Intelligence [...]

SAS Customer Intelligence 360 Meets SAS Viya: Show me the demo was published on Customer Intelligence Blog.

2月 192019
 

When it comes to forecasting new product launches, executives say that it's a frustrating, almost futile, effort. The reason? Minimal data, limited analytic capabilities and a general uncertainty surrounding a new product launch. Not to mention the ever-changing marketplace. Nevertheless, companies cannot disregard the need for a new product forecast [...]

Practical approaches to new product forecasting using structured and unstructured data was published on SAS Voices by Charlie Chase

2月 192019
 

When it comes to forecasting new product launches, executives say that it's a frustrating, almost futile, effort. The reason? Minimal data, limited analytic capabilities and a general uncertainty surrounding a new product launch. Not to mention the ever-changing marketplace. Nevertheless, companies cannot disregard the need for a new product forecast [...]

Practical approaches to new product forecasting using structured and unstructured data was published on SAS Voices by Charlie Chase

2月 072019
 

Each day, more than 130 Americans die from opioid overdoses. Combating the opioid epidemic begins with understanding it, and that begins with data. SAS recently partnered with graduate students from Carnegie Mellon University (CMU) 's Heinz College of Information Systems and Public Policy to understand how data mining and machine [...]

An unexpected weapon in the fight against the opioid epidemic: Graduate students was published on SAS Voices by Manuel Figallo

2月 062019
 

Feature generation (also known as feature creation) is the process of creating new features to use for training machine learning models. This article focuses on regression models. The new features (which statisticians call variables) are typically nonlinear transformations of existing variables or combinations of two or more existing variables. This article argues that a naive approach to feature generation can lead to many correlated features (variables) that increase the cost of fitting a model without adding much to the model's predictive power.

Feature generation in traditional statistics

Feature generation is not new. Classical regression often uses transformations of the original variables. In the past, I have written about applying a logarithmic transformation when a variable's values span several orders of magnitude. Statisticians generate spline effects from explanatory variables to handle general nonlinear relationships. Polynomial effects can model quadratic dependence and interactions between variables. Other classical transformations in statistics include the square-root and inverse transformations.

SAS/STAT procedures provide several ways to generate new features from existing variables, including the EFFECT statement and the "stars and bars" notation. However, an undisciplined approach to feature generation can lead to a geometric explosion of features. For example, if you generate all pairwise quadratic interactions of N continuous variables, you obtain "N choose 2" or N*(N-1)/2 new features. For N=100 variables, this leads to 4950 pairwise quadratic effects!

Generated features might be highly correlated

In addition to the sheer number of features that you can generate, another problem with generating features willy-nilly is that some of the generated effects might be highly correlated with each other. This can lead to difficulties if you use automated model-building methods to select the "important" features from among thousands of candidates.

I was reminded of this fact recently when I wrote an article about model building with PROC GLMSELECT in SAS. The data were simulated: X from a uniform distribution on [-3, 3] and Y from a cubic function of X (plus random noise). I generated the polynomial effects x, x^2, ..., x^7, and the procedure had to build a regression model from these candidates. The stepwise selection method added the x^7 effect first (after the intercept). Later it added the x^5 effect. Of course, the polynomials x^7 and x^5 have a similar shape to x^3, but I was surprised that those effects entered the model before x^3 because the data were simulated from a cubic formula.

After thinking about it, I realized that the odd-degree polynomial effects are highly correlated with each other and have high correlations with the target (response) variable. The same is true for the even-degree polynomial effects. Here is the DATA step that generates 1000 observations from a cubic regression model, along with the correlations between the effects (x1-x7) and the target variable (y):

%let d = 7;
%let xMean = 0;
data Poly;
call streaminit(54321);
array x[&d];
do i = 1 to 1000;
   x[1] = rand("Normal", &xMean);  /* x1 ~ U(-3, 3] */
   do j = 2 to &d;
      x[j] = x[j-1] * x[1];        /* x[i] = x1**i, i = 2..7 */
   end;
   y = 2 - 1.105*x1 - 0.2*x2 + 0.5*x3 + rand("Normal");  /* response is cubic function of x1 */
   output;
end;
drop i j;
run;
 
proc corr data=Poly nosimple noprob;
   var y;
   with x:;
run;
Correlations between the target variable and polynimal effects

You can see from the output that the x^5 and x^7 effects have the highest correlations with the response variable. Because the squared correlations are the R-square values for the regression of Y onto each effect, it makes intuitive sense that these effects are added to the model early in the process. Towards the end of the model-building process, the x^3 effect enters the model and the x^5 and x^7 effects are removed. To the model-building algorithm, these effects have similar predictive power because they are highly correlated with each other, as shown in the following correlation matrix:

proc corr data=Poly nosimple noprob plots(MAXPOINTS=NONE)=matrix(NVAR=ALL);
   var x:;
run;
Correlations among polynomial effects

I've highlighted certain cells in the lower triangular correlation matrix to emphasize the large correlations. Notice that the correlations between the even- and odd-degree effects are close to zero and are not highlighted. The table is a little hard to read, but you can use PROC IML to generate a heat map for which cells are shaded according to the pairwise correlations:

Heat map of correlations among polynomial effects

This checkerboard pattern shows the large correlations that occur between the polynomial effects in this problem. This image shows that many of the generated features do not add much new information to the set of explanatory variables. This can also happen with other transformations, so think carefully before you generate thousands of new features. Are you producing new effects or redundant ones?

Generated features are not independent

Notice that the generated effects are not statistically independent. They are all generated from the same X variable, so they are functionally dependent. In fact, a scatter plot matrix of the data will show that the pairwise relationships are restricted to parametric curves in the plane. The graph of (x, x^2) is quadratic, the graph of (x, x^3) is cubic, the graph of (x^2, x^3) is an algebraic cusp, and so forth. The PROC CORR statement in the previous section created a scatter plot matrix, which is shown below: Again, I have highlighted cells that have highly correlated variables.

Scatter plot matrix that shows the statistical dependencies between polynomial effects

I like this scatter plot matrix for two reasons. First, it is soooo pretty! Second, it visually demonstrates that two variables that have low correlation are not necessarily independent.

Summary

Feature generation is important in many areas of machine learning. It is easy to create thousands of effects by applying transformations and generating interactions between all variables. However, the example in this article demonstrates that you should be a little cautious of using a brute-force approach. When possible, you should use domain knowledge to guide your feature generation. Every feature you generate adds complexity during model fitting, so try to avoid adding a bunch of redundant highly-correlated features.

For more on this topic, see the article "Automate your feature engineering" by my colleague, Funda Gunes. She writes: "The process [of generating new features]involves thinking about structures in the data, the underlying form of the problem, and how best to expose these features to predictive modeling algorithms. The success of this tedious human-driven process depends heavily on the domain and statistical expertise of the data scientist." I agree completely. Funda goes on to discuss SAS tools that can help data scientists with this process, such as Model Studio in SAS Visual Data Mining and Machine Learning. She shows an example of feature generation in her article and in a companion article about feature generation. These tools can help you generate features in a thoughtful, principled, and problem-driven manner, rather than relying on a brute-force approach.

The post Feature generation and correlations among features in machine learning appeared first on The DO Loop.