November 8, 2019
 

Six editions is a lot! If you had told us, back when we wrote the first edition of The Little SAS Book, that someday we would write a sixth, we would have wondered how we could possibly find that much to say. After all, it is supposed to be The Little SAS Book, isn’t it? But the developers at SAS Institute are constantly hard at work inventing new and better ways of analyzing and visualizing data. And some of those ways turn out to be so fundamental that they belong even in a little book about SAS.

Interface independence

One of the biggest changes to SAS software in recent years is the proliferation of interfaces. SAS programmers have more choices than ever before. Previous editions contained some sections specific to the SAS windowing environment (also called Display Manager). We wrote this edition for all SAS programmers whether you use SAS Studio, SAS Enterprise Guide, the SAS windowing environment, or run in batch. That sounds easy, but it wasn’t. There are differences in how SAS behaves with different interfaces, and these differences can be very fundamental. In particular, the system option that sets the rules for names of variables varies depending on how you run SAS. So old sections had to be rewritten, and we added a whole new section showing how to use variable names containing blanks and special characters.

New ways to read and write Microsoft Excel files

Previous editions already covered how to read and write Microsoft Excel files, but SAS developers have created some great new ways. This edition contains new sections about the XLSX LIBNAME engine and the ODS EXCEL destination.
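
To give a flavor of these features, here is a minimal sketch of both (the file paths, sheet name, and data set names are placeholders, not examples from the book):

* Read a worksheet with the XLSX LIBNAME engine;
libname xl xlsx "/home/userid/sales.xlsx";

data work.sales;
   set xl.Sheet1;     /* each worksheet is surfaced as a data set */
run;

libname xl clear;

* Send procedure output to an Excel workbook with the ODS EXCEL destination;
ods excel file="/home/userid/class_report.xlsx";
proc means data=sashelp.class;
run;
ods excel close;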

More PROC SQL

From the very first edition, The Little SAS Book always covered PROC SQL. But it was in an appendix and over time we noticed that most people ignore appendices. So for this edition, we removed the appendix and added new sections on using PROC SQL to

  • Subset your data
  • Join data sets
  • Add summary statistics to a data set
  • Create macro variables with the INTO clause

For people who are new to SQL, these sections provide a good introduction; for people who already know SQL, they provide a model of how to leverage SQL in your SAS programs.
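
As a small taste of the INTO clause mentioned above, the following sketch (which uses the SASHELP.CARS data as a stand-in, not an example from the book) stores a summary statistic in a macro variable:

proc sql noprint;
   select mean(MSRP) format=dollar8. into :AvgPrice trimmed
   from sashelp.cars;
quit;

%put The average MSRP is &AvgPrice;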

Updates and additions throughout the book

Almost every section in this edition has been changed in some way. We added new options, made sure everything is up to date, and ran every example in every SAS interface, noting any differences. For example, PROC SGPLOT has some new options, the default ODS style for PDF has changed, and the LISTING destination behaves differently in different interfaces. Here’s a short list, in no particular order, of new or expanded topics in the sixth edition:

  • More examples with permanent SAS data sets, CSV files, or tab-delimited files
  • More log notes throughout the book showing what to look for
  • LIKE or sounds-like (=*) operators in WHERE statements
  • CROSSLIST, NOCUM, and NOPRINT options in PROC FREQ
  • Grouping data with a user-defined format and the PUT function
  • Iterative DO groups
  • DO WHILE and DO UNTIL statements
  • %DO statements

Even though we have added a lot to this edition, it is still a little book.  In fact, this edition is shorter than the last—by twelve pages! We think this is the best edition yet.

November 6, 2019
 

I have been programming SAS for a LONG time and have never seen much in the way of programming standards. For example, most SAS programmers indent DATA and PROC statements (I like three spaces). Most programmers do not like to see more than one statement on a line and most agree that there should be blank lines between program boundaries (DATA and PROC steps).

I thought I would share some of my thoughts on programming standards, with the hope that others will chime in with their ideas.

    • I like to indent all the statements in a DO group or DO loop. If there are nested groups, each one gets indented as well.
    • I prefer variable names in proper case.
    • I am not a fan of camel-case. For example, I prefer Weight_Kg to WeightKg. The reason that some programmers like camel-case is that SAS will automatically split a variable name at a capital letter in some headings.
    • I like my TITLE statements in open code, not inside a PROC. To me, that makes sense because TITLE statements are global.
    • There should be no conversion messages (character to numeric or numeric to character) in the SAS log. For example, use Num = INPUT(Char_Num,12.); instead of Num = 1*Char_Num;. The latter statement forces an automatic character-to-numeric conversion and places a message in the log.
    • I always use the statement ODS NOPROCTITLE;. This eliminates the default SAS procedure name at the top of the output.
    • Although fewer and fewer people are reading raw text data, I like my @ signs to all line up in my INPUT statement.
    • I like to use the /* and */ comments to define all macro variables. For example:
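
The original example image is not reproduced here, but the following sketch shows the kind of commented macro header I have in mind (the macro name, parameters, and data set are hypothetical):

%macro print_report(Dsn=,      /* Name of the input data set        */
                    Var=,      /* Analysis variable to summarize    */
                    Title=);   /* Title to display above the report */
   title "&Title";
   proc means data=&Dsn n mean maxdec=2;
      var &Var;
   run;
   title;
%mend print_report;

/* Call the macro using named parameters */
%print_report(Dsn=sashelp.heart, Var=Weight, Title=Weight Summary)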

Notice that I prefer named parameters in my macros, instead of positional parameters.

If this seems like too much work, SAS Studio has an automatic formatting tool that can help standardize your programs. For example, look at the code below:
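
The screenshot of the unformatted program is not reproduced here; the following hypothetical snippet gives the general idea of what such code looks like:

data new; set sashelp.class; if age lt 13 then agegroup='Child';
else agegroup='Teen'; bmi=(weight*703)/(height**2); run; proc print
data=new; var name age agegroup bmi; run;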

Really ugly, right? Here is how you can use the automatic formatting tool in SAS Studio.

When you click this icon, the program now looks like this:
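
Again, the original screenshot is not shown, but formatted roughly the way SAS Studio does it (indentation details vary by release), the hypothetical snippet above would come out something like this:

data new;
    set sashelp.class;

    if age lt 13 then
        agegroup='Child';
    else
        agegroup='Teen';
    bmi=(weight*703)/(height**2);
run;

proc print data=new;
    var name age agegroup bmi;
run;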

That’s pretty much the way I would write it. By the way, if you don't like how SAS Studio formatted your code, press Ctrl+Z to undo it.

For more tips on writing code and how to get started in SAS Studio, check out my book, Learning SAS by Example: A Programmer’s Guide, Second Edition. You can also download a free book excerpt. To learn more about SAS Press, check out the up-and-coming titles, and subscribe to the SAS Books newsletter to receive exclusive discounts.

Making your SAS code more readable was published on SAS Users.

November 6, 2019
 
COV biplot of Fisher's iris data. Computed in SAS.

Principal component analysis (PCA) is an important tool for understanding relationships in multivariate data. When the first two principal components (PCs) explain a significant portion of the variance in the data, you can visualize the data by projecting the observations onto the span of the first two PCs. In a PCA, this plot is known as a score plot. You can also project the variable vectors onto the span of the PCs, which is known as a loadings plot. See the article "How to interpret graphs in a principal component analysis" for a discussion of the score plot and the loadings plot.

A biplot overlays a score plot and a loadings plot in a single graph. An example is shown at the right. Points are the projected observations; vectors are the projected variables. If the data are well-approximated by the first two principal components, a biplot enables you to visualize high-dimensional data by using a two-dimensional graph.

In general, the score plot and the loadings plot will have different scales. Consequently, you need to rescale the vectors or observations (or both) when you overlay the score and loadings plots. There are four common choices of scaling. Each scaling emphasizes certain geometric relationships between pairs of observations (such as distances), between pairs of variables (such as angles), or between observations and variables. This article discusses the geometry behind two-dimensional biplots and shows how biplots enable you to understand relationships in multivariate data.

Some material in this blog post is based on documentation that I wrote in 2004 when I was working on the SAS/IML Studio product and writing the SAS/IML Studio User's Guide. The documentation is available online and includes references to the literature.

The Fisher iris data

A previous article shows the score plot and loadings plot for a PCA of Fisher's iris data. For these data, the first two principal components explain 96% of the variance in the four-dimensional data. Therefore, these data are well-approximated by a two-dimensional set of principal components. For convenience, the score plot (scatter plot) and the loadings plot (vector plot) are shown below for the iris data. Notice that the loadings plot has a much smaller scale than the score plot. If you overlay these plots, the vectors would appear relatively small unless you rescale one or both plots.

Score plot for Fisher's iris data; first two principal components. Loadings plot for Fisher's iris data; first two principal components.

The mathematics of the biplot

You can perform a PCA by using a singular value decomposition of a data matrix that has N rows (observations) and p columns (variables). The first step in constructing a biplot is to center and (optionally) scale the data matrix. When variables are measured in different units and have different scales, it is usually helpful to standardize the data so that each column has zero mean and unit variance. The examples in this article use standardized data.

The heart of the biplot is the singular value decomposition (SVD). If X is the centered and scaled data matrix, then the SVD of X is
X = U L V`
where U is an N x N orthogonal matrix, L is a diagonal N x p matrix, and V is an orthogonal p x p matrix. It turns out that the principal components (PCs) of X`X are the columns of V and the PC scores are the columns of U. If the first two principal components explain most of the variance, you can choose to keep only the first two columns of U and V and the first 2 x 2 submatrix of L. This is the closest rank-two approximation to X. In a slight abuse of notation,
X ≈ U L V`
where now U, L, and V all have only two columns.

Since L is a diagonal matrix, you can write L = L^c L^(1-c) for any number c in the interval [0, 1]. You can then write
X ≈ (U L^c)(L^(1-c) V`)
   = A B

This is the factorization that is used to create a biplot. The most common choices for c are 0, 1, and 1/2.
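
As a minimal illustration of this factorization (my own sketch, not code from the original article; it assumes SAS/IML is available and uses the standardized iris variables, keeping the first two PCs):

proc iml;
use sashelp.iris;
read all var {SepalLength SepalWidth PetalLength PetalWidth} into X;
close sashelp.iris;

X = (X - mean(X)) / std(X);           /* center and scale each column              */
call svd(U, q, V, X);                 /* X = U*diag(q)*V`                          */
c = 0.5;                              /* c=0: GH,  c=1: JK,  c=0.5: SYM            */
A  = U[ ,1:2] # ((q[1:2]`)##c);       /* rows of A are the projected observations  */
Bt = V[ ,1:2] # ((q[1:2]`)##(1-c));   /* Bt = B`; rows are the projected variables */
quit;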

The four types of biplots

The choice of the scaling parameter, c, will linearly scale the observations and vectors separately. In addition, you can write X ≈ (β A) (B / β) for any constant β. Each choice for c corresponds to a type of biplot:

  • When c=0, the vectors are represented faithfully. This corresponds to the GH biplot. If you also choose β = sqrt(N-1), you get the COV biplot.
  • When c=1, the observations are represented faithfully. This corresponds to the JK biplot.
  • When c=1/2, the observations and vectors are treated symmetrically. This corresponds to the SYM biplot.

The GH biplot for variables

GH biplot of Fisher's iris data. Computed in SAS.

If you choose c = 0, then A = U and B = L V`. The literature calls this biplot the GH biplot. I call it the "variable preserving" biplot because it provides the most faithful two-dimensional representation of the relationship between vectors. In particular:

  • The length of each vector (a row of B) is proportional to the variance of the corresponding variable.
  • The Euclidean distance between the i-th and j-th rows of A is proportional to the Mahalanobis distance between the i-th and j-th observations in the data.

In preserving the lengths of the vectors, this biplot distorts the Euclidean distance between points. However, the distortion is not arbitrary: it represents the Mahalanobis distance between points.

The GH biplot is shown to the right, but it is not very useful for these data. In choosing to preserve the variable relationships, the observations are projected onto a tiny region near the origin. The next section discusses an alternative scaling that is more useful for the iris data.

The COV biplot

If you choose c = 0 and β = sqrt(N-1), then A = sqrt(N-1) U and B = L V` / sqrt(N-1). The literature calls this biplot the COV biplot. This biplot is shown at the top of this article. It has two useful properties:

  • The length of each vector is equal to the variance of the corresponding variable.
  • The Euclidean distance between the i-th and j-th rows of A is equal to the Mahalanobis distance between the i-th and j-th observations in the data.

In my opinion, the COV biplot is usually superior to the GH biplot.

The JK biplot

JK biplot of Fisher's iris data. Computed in SAS.

If you choose c = 1, you get the JK biplot, which preserves the Euclidean distance between observations. Specifically, the Euclidean distance between the i-th and j-th rows of A is equal to the Euclidean distance between the i-th and j-th observations in the data.

In faithfully representing the observations, the angles between vectors are distorted by the scaling.

The SYM biplot

If you choose c = 1/2, you get the SYM biplot (also called the SQ biplot), which attempts to treat observations and variables in a symmetric manner. Although neither the observations nor the vectors are faithfully represented, often neither representation is very distorted. Consequently, some people prefer the SYM biplot as a compromise between the COV and JK biplots. The SYM biplot is shown in the next section.

How to interpret a biplot

SYM biplot of Fisher's iris data. Computed in SAS.

As discussed in the SAS/IML Studio User's Guide, you can interpret a biplot in the following ways:

  • The cosine of the angle between a vector and an axis indicates the importance of the contribution of the corresponding variable to the principal component.
  • The cosine of the angle between pairs of vectors indicates correlation between the corresponding variables. Highly correlated variables point in similar directions; uncorrelated variables are nearly perpendicular to each other.
  • Points that are close to each other in the biplot represent observations with similar values.
  • You can approximate the relative coordinates of an observation by projecting the point onto the variable vectors within the biplot. However, you cannot use these biplots to estimate the exact coordinates because the vectors have been centered and scaled. You could extend the vectors to become lines and add tick marks, but that becomes messy if you have more than a few variables.

If you want to faithfully interpret the angles between vectors, you should equate the horizontal and vertical axes of the biplot, as I have done with the plots on this page.

If you apply these facts to the standardized iris data, you can make the following interpretations:

  • The PetalLength and PetalWidth variables are the most important contributors to the first PC. The SepalWidth variable is the most important contributor to the second PC.
  • The PetalLength and PetalWidth variables are highly correlated. The SepalWidth variable is almost uncorrelated with the other variables.
  • Although I have suppressed labels on the points, you could label the points by an ID variable or by the observation number and use the relative locations to determine which flowers had measurements that were most similar to each other.

Summary

This article presents an overview of biplots. A biplot is an overlay of a score plot and a loadings plot, which are two common plots in a principal component analysis. These two plots are on different scales, but you can rescale the two plots and overlay them on a single plot. Depending upon the choice of scaling, the biplot can provide faithful information about the relationship between variables (lengths and angles) or between observations (distances). It can also reveal approximate relationships between variables and observations.

In a subsequent post, I will show how to use SAS to create the biplots in this article.

The post What are biplots? appeared first on The DO Loop.

November 5, 2019
 

After you start to use arrays, it is easy to see their usefulness and power. In this post, I will share tips and techniques that can increase the functionality of arrays and enable you to use programming shortcuts to take maximum advantage of them.

These tips will also streamline your programming, even if you are not using arrays. I'll provide complete code examples for each tip at the end of each section, as well as explain how the code is constructed along the way.

Use arrays to zoom out for greater perspective

One of the biggest uses of arrays, of course, is to reshape your data from one observation per identifying variable per data point, to one observation per ID containing all the data points for that ID. For example, consider the following data set:

Figure 1
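
The original figure is not reproduced here, so the DATA step below creates hypothetical data of the same general shape: an ID variable, an EVENT variable, and the RATING variable that is used in the bonus section later in this post. The values are made up, but they stay consistent with the log output and the minimum and maximum ratings quoted later.

data work.events;
   input id event $ rating;
   datalines;
1 aaa 5
1 bbb 7
1 ccc 3
1 ddd 6
1 aaa 9
2 bbb 2
2 ccc 8
2 ddd 4
;
run;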

Now, you want the values of the EVENT variable to become the names of the variables in the output data set, as shown here:

Figure 2

How do you know which variables to create if your data set is more complicated than shown above?

Create a macro variable

You can use the SQL procedure to create a macro variable (called EVENTLIST here) that contains all the unique values of EVENT. You then can use that macro variable in your array definition statement:

proc sql noprint;
   select distinct event into :eventlist
   separated by ' '
   from work.events;
quit;

When you run the following statement, the resulting log shows that &EVENTLIST contains the unique values for the EVENT variable:

%put &eventlist;

Here is the log information:

3365 %put &eventlist;
aaa bbb ccc ddd

You can then use the EVENTLIST macro variable in an ARRAY statement to define the variables that you are including:

array events_[*] &eventlist;

Use CALL MISSING to set all variables

However, you do not want the values for the previous ID to contaminate the next ID, so you need to reset the array variables to missing with each new ID. You can use a DO loop that uses FIRST.variable and LAST.variable processing in order to set each value to missing:

if first.id then do;
   do i = 1 to dim(events_);
      events_[i] = .;
   end;
end;

DO loops were commonly used in the past to initialize all elements of an array to missing, and a DO loop is still needed if you want to set all the elements to a non-missing value, such as 0. However, you can replace the code above with one function call: the CALL MISSING routine sets all the variables in the array to missing in one statement:

if first.id then call missing(of events_[*]);

Compare via OF variable-list syntax

The OF variable-list syntax is another helpful feature. You can use OF with a list of variables or an array reference.

So, now you have the EVENT values as variable names. How are you going to compare the value of a variable in an observation with a variable name in the array? You can reference each element of the array that you created with EVENTS_[i], but that will return the value of that element. For this comparison, you need to obtain the variable name for each element in the array.

To return the name of each variable in the array, you can use the VNAME function:

if event = vname(events_[i]) then do;

Now you can find when the value of the EVENT variable in the original data set matches the name of the variable in the array.

Additional ways to extract variable information in arrays

Other variable information functions can also be used in the same way to extract information about each of the variables in the array.

In this example, a value that is read from each observation can match only one variable name in the array, so you want to stop when you achieve that instead of continuing the DO loop.

To stop the DO loop at that point, use the LEAVE statement, which stops processing the current loop and resumes with the next statement in the DATA step.

Because you want to output only one observation per ID, you must explicitly use the OUTPUT statement at the end of each ID to override the default output that occurs at the end of each iteration of the DATA step:

if last.id then output;

Combine the above methods in a DATA step

Putting all these methods together, here is a short DATA Step that will reshape the original data set into the output data set that you want:

data events_out (drop=i event);
   set events;
   by id;
   array events_[*] &eventlist;
   retain &eventlist;
   if first.id then call missing(of events_[*]);
   do i = 1 to dim(events_);
      if event = vname(events_[i]) then do;
         events_[i] = 1;
         leave;
      end;
   end;
   if last.id then output;
run;

Bonus shortcuts using arrays

Want to use even more shortcuts with arrays? Here, your original data has a numeric variable called RATING, which is included in every observation:

Figure 3

You want to find the lowest rating and corresponding event, and also the highest rating and corresponding event, for each ID. You can start off by defining the arrays and reading each observation in the same way as in the previous example. In this case, you have two arrays, so you need to read both the EVENT and RATING variables into their own arrays:


array ratings_[5];
array events_[5] $3.;
…more SAS statements…
ratings_[count]=rating;
events_[count]=event;

After you have read all the observations for an ID, how do you find the lowest rating? Of course, you can use a DO loop to loop through the array and check each value to see whether it is the lowest and keep track of which one it is. Or you can use the OF array syntax with the MIN and MAX functions:


lowest_rating=min(of ratings_[*]);
highest_rating=max(of ratings_[*]);

This code returns the value of the lowest and highest ratings for ID 1, which are 3 and 9, respectively. But you also want to know which events were rated the highest and lowest for this ID. First, you have to determine which elements in the RATINGS_ array had the lowest and highest values. You know the values, but you do not yet know which specific elements match those values. Use the WHICHN function to determine the indexes for those values in the array. Keep in mind that if there is more than one element that matches the value using the WHICHN function, the function returns only the index of the first element in the array to match:


min_index=whichn(lowest_rating, of ratings_[*]);
max_index=whichn(highest_rating, of ratings_[*]);

For ID 1, these functions return 3 and 5, respectively. Now you can use the indexes to retrieve the corresponding elements in the EVENTS_ array:


lowest_rated_event=events_[min_index];
highest_rated_event=events_[max_index];

Putting all these methods together, here is the complete code for returning the desired results:

data rating (keep=id lowest_rated_event lowest_rating
             highest_rated_event highest_rating);
   set events;
   by id;
   array ratings_[5];
   array events_[5] $3.;
   retain ratings: events:;
   if first.id then call missing(of ratings_[*], of events_[*], count);
   count + 1;
   ratings_[count] = rating;
   events_[count] = event;
   if last.id then do;
      lowest_rating = min(of ratings_[*]);
      highest_rating = max(of ratings_[*]);
      min_index = whichn(lowest_rating, of ratings_[*]);
      max_index = whichn(highest_rating, of ratings_[*]);
      lowest_rated_event = events_[min_index];
      highest_rated_event = events_[max_index];
      output;
   end;
run;

Here is the resulting output:

Figure 4

You can expand the functionality and capability of arrays by incorporating other features of the SAS programming language in your DATA Step code. I hope these tips and links improve your array use and lead you to explore even more ways to work with arrays. The resource below complements this post and provides additional tips and tricks.

Adventures in Arrays | Learn more about array processing!

How to make the most of arrays with SAS® software and streamline your programming was published on SAS Users.

November 5, 2019
 

"Moving to cloud" is top of agenda for a lot of the customers I meet. They see the potential for agility or cost reduction. Interestingly when I was speaking to the CTO of one our customers that was an earlier adopter of cloud, he didn’t see an overall cost reduction and didn’t mind. But they did experience increased agility in IT which directly lead to business improvements and an impact on the bottom line. Food for thought.

One of the biggest costs of running an analytics system can be the storage or database cost, whether through typical databases like Oracle, DB2,  Hadoop, or Teradata or just SAN alongside your SAS implementation. Getting the storage equation right on cloud can bring your TCO (Total Cost of Ownership) right down; some of our customers have halved it.

When moving to a public cloud, some customers assume that they can "lift and shift" and still get the full advantages of cloud. Unfortunately, this often isn’t true -- some small refactoring changes nearly always make sense. Often the virtual hardware on the cloud doesn’t have the same performance characteristics as on-premise hardware, which means your SAS jobs will run slower when retrieving data. (Buyer beware: some instance types have lower I/O throughput.) To get faster data I/O (throughput) you might need to boost your SAS instance types, which increases overall cloud costs. The same is true for storage types, so you may end up with more storage volume than you need if you are to guarantee a level of I/O throughput. (For more tech details, Conor Hogan, SAS Solutions Architect, describes the cloud and storage options when using SAS in the cloud.)

In this article I'll cover two main themes. First, thinking differently about SAS Grid storage on cloud, especially with regard to databases. Then I'll look at getting access to that troublesome data that has stayed on-premise, addressing security and data on cloud, including cloud caches.

Databases on cloud: do you need one?

I often get asked which databases we support on cloud and how the access pattern works. To learn which databases are supported, view the current list for SAS Viya and the current list for SAS 9.4. Most of the traditional databases work fine as infrastructure-as-a-service and follow standard rules. It starts to get more interesting as customers use the native databases in AWS, Google Cloud, and Azure. These each have their own nuances, which customers didn’t quite understand at the start of the journey. (SAS recently released an access engine that uses native Google BigQuery APIs, with great performance.)

Prior to this, customers had been accessing cloud data via ODBC or JDBC, which have limitations in comparison to the new native access engines. However, it’s important to note that JDBC was enhanced recently to open up more options for cloud-native databases, giving SAS customers a platform-neutral option to try a number of the new native cloud database players.

We also added support for Snowflake (a scalable cloud native database) and will be furthering our support for Azure native databases like HDInsights.  SAS 9.4M6 (or higher) or SAS Viya 3.4 (or higher) is required for SAS/ACCESS Interface to Google BigQuery or Snowflake.

Incidentally, a number of the database connectors on SAS Viya also support multi-node access, which allows each node in SAS Viya to read and write data in parallel to your database, massively speeding up your loading times. There’s a useful article on the SAS Communities that tells you more.
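
As a rough sketch of what a multi-node setup looks like in code (the connection details are placeholders, and the exact data source options depend on your database and connector; treat this as an illustration rather than a reference):

cas mysess;

caslib pgdata datasource=(srctype="postgres"
                          server="mydbserver.example.com"
                          database="sales"
                          username="sasuser"
                          password="XXXXXXXX"
                          numreadnodes=10       /* read in parallel across CAS workers  */
                          numwritenodes=10);    /* write in parallel across CAS workers */

proc casutil;
   load casdata="transactions" incaslib="pgdata"
        outcaslib="casuser" casout="transactions";
quit;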

There are quite a few good papers in our conference proceedings covering connections to cloud databases like Redshift here, and examples within SAS Communities, like this example for Amazon RDS.

But do you actually need a database? You could just use the storage on the instances, but that might not be as efficient if you want to scale and take machines up and down in the cloud. You might want to use lower-cost persistent storage like AWS S3. Luckily, SAS has started to build out its options for reaching into native cloud storage buckets like S3 in Amazon, with the ability in SAS Viya to read and write directly in parallel without any additional processing layers required (caveats exist); have a look here for more details. Storage on S3 is cheap, but the catch with cloud is that moving data up and down the wire between on- and off-premise might be expensive.

Many SAS Grid Computing users are considering migrating to the cloud but wonder what to do with their clustered file system (CFS), which is mainly used as a central store for SAS data sets between processing steps, including SASWORK. There are cloud clustered file systems available (and they can be expensive), but because we are moving to cloud we need to think a bit more laterally, be more cloud native, and decouple work steps to take advantage of cloud design principles.

Let’s look at an example. The diagram shows a typical data management or analytics data flow that could be running at scale every day in a SAS grid. The great thing about this design is that it allows recovery from the last step if a node goes down, because the data is written to a file system that all the nodes can see (SASWORK on a CFS). The disadvantage is the relatively high cost on cloud. One way to overcome this could be to refactor your design, and that can be relatively simple. Let’s look at two simple ways of refactoring it:

  • Don't worry about the job 'blowing up': cloud 'hardware' fails relatively infrequently. Remember that cloud resources are disposable, so restarting from the beginning is a very viable proposition -- just spin up a new copy on demand, replacing the failed machine, maybe on a bigger instance if you need to make up time. The later versions of SAS Grid Manager for LSF, from 9.4M5 on, allow dynamic bursting/scaling on demand within AWS, so this is really easy to achieve.
  • Replace SASWORK with either a database or S3, or move the work into CAS on SAS Viya, where node redundancy will protect your data between steps in the event of a failure.

SAS has completed testing of Amazon FSx for Lustre, which offers a new off-the-shelf, highly performant clustered file system suitable for grid at a reasonable price point. However, be aware that this is sold in large blocks and may require an even larger size to achieve sufficient throughput for a SAS Grid. We understand AWS is working to reduce these constraints.

Looking forward to SAS Viya 3.5 (4Q2019), we will be introducing further direct access to cloud-native storage sources: Parquet on S3 (an extension of the caslib concept) and caslibs with direct read and write access on Azure Data Lake Storage Gen 2, supporting CSV and ORC formats for direct access.

New SAS tools to help

So far we have covered storage transitions and design-pattern considerations for lifting and shifting a SAS system to the cloud when most of the storage or data can move to the cloud with it. This could include refactoring the workload to be more cloud friendly. This design fits the ‘data gravity’ paradigm, which implies keeping the data near the processing to keep I/O and network latency low. Increasingly, though, we see cloud design patterns that separate compute and storage to allow flexible processing models, including hydrating data on demand.

Unfortunately, in real life the situation isn’t so clear cut: sometimes all the data can’t be moved to the cloud due to cost, security, or historical constraints, or maybe because the data SAS processes originates from sources that must stay on-premise next to the primary systems they support.

Luckily SAS Viya offers us some very clever solutions to help with this…

The first of these is SAS Cloud Data Exchange (CDE), which comes as part of our SAS Data Preparation product with SAS Viya. CDE is a data connection capability that securely performs high-volume data transfers (via HTTPS) from one or more on-premise data stores to a cloud-based instance of SAS Viya for use in SAS Viya applications. This is done by installing a local, on-premise SAS data agent that queries the local sources and then pulls the relevant data up to your CAS cloud instance. This allows end users with a graphical user interface to reach through and get to on-premise data, while administrators don’t have to set up multiple connections (and holes through a firewall) for each access engine to reach on-premise data. There is a great summary article on SAS Communities describing how this works.

One of the other key considerations we see from some customers is not leaving the data in the cloud permanently. This is where the SAS Viya caching and in-memory model comes into play. Used in conjunction with CDE, the CAS engine can serve as a semi-temporary store for your data -- think of it as a cache for analysts. SAS Viya’s built-in scheduling capabilities can be used to pull in large tables before users arrive, and the parallel loading capabilities of SAS Cloud Data Exchange also make on-demand requests very feasible. This does require some design and thought behind the strategies used for this in-memory cache in order to meet core situational requirements.

One of the other great advantages of this solution is that it can also offer an effective route to access data for open source developers (a clever cache) in the cloud. CAS, in conjunction with SAS Cloud Data Exchange, can provide secure and audited connectivity back to the on-premise data and expose it through native Python or R packages that support SAS. The python-swat package allows the Python client access to SAS Cloud Analytic Services. It allows users to execute CAS actions, access data in CAS, and process the results, all from Python.

In fact, python-swat mimics much of the API of the Pandas package, so that using CAS should feel familiar to current Pandas users. And if they so choose, they can download the data from CAS as required.

This rounds off my initial thoughts on considerations of data when migrating SAS to cloud -- thoughts and discussion are welcome as ever.

We’ve all gone to cloud -- but what about my SAS data(base)? was published on SAS Users.

November 5, 2019
 

Editor’s note: This is the third article in a series by Conor Hogan, a Solutions Architect at SAS, on SAS and database and storage options on cloud technologies. This article covers the SAS offerings available to connect to and interact with the various database options available in Microsoft Azure. Access all the articles in the series here.

The series

This is the next iteration of a series covering database as a service (DBaaS) and storage offerings in the cloud, this time from Microsoft Azure. I have already published two articles on Amazon Web Services. One of those articles covers the DBaaS offerings and the other covers storage offerings for Amazon Web Services. I will cover Google Cloud Platform in future articles. The goal of these articles is to supply a breakdown of these services to better understand the business requirements of these offerings and how they relate to SAS. I would encourage you to read all the articles in the series even if you are already using a specific cloud provider. Many of the core technologies and services are offered across the different cloud providers. These articles focus primarily on SAS Data Connectors as part of SAS Viya, but all the same functionality is available using a SAS/ACCESS Interface in SAS 9.4. SAS In-Database technologies in SAS Viya, called the SAS Data Connect Accelerator, are synonymous with the SAS Embedded Process.

As companies move their computing to the cloud, they are also moving their storage to the cloud. Just like compute in the cloud, data storage in the cloud is elastic and responds to demand, and you pay only for what you use. As more technologies move to a cloud-based architecture, companies must consider questions like: Where do I store my data? What cloud services best meet my business requirements? Which cloud vendor should I use? Can I migrate my applications to the cloud? If you are looking to migrate your SAS infrastructure to Azure, look at the SAS Viya QuickStart Template for Azure to see a rapid deployment pattern to get the SAS Viya platform up and running in Azure.

SAS integration with Azure

SAS has extended SAS Data Connectors and SAS In-Database Technologies support to Azure database variants. A database running in Azure is much like your on-premise database, but instead Microsoft manages the software and hardware. Azure’s DBaaS offerings take care of the scalability and high availability of the database with minimal user input. SAS integrates with your cloud database even if SAS is running on-premise or with a different cloud provider.

Azure databases

Azure offers database service technologies familiar to many users. If you read my previous article on SAS Data Connectors and Amazon Web Services, you are sure to see many parallels. It is important to understand the terminology and how the different database services in Azure best meet the demands of your specific application. Many common databases already in use are being refactored and provided as service offerings to customers in Azure. The advantages for customers are clear: no hardware to manage and no software to install. Databases that scale automatically to meet demand and software that updates and creates backups means customers can spend more time creating value from their data and less time managing their infrastructure.

For the rest of this article I cover various database management systems, the Azure offering for each database type, and SAS integration. First, let's consider the diagram below depicting a decision flow chart to determine integration points between Azure database services and SAS. Trace your path in the diagram and read on to learn more about connection details.

Integration points between Azure database services and SAS

Relational Database Management System (RDBMS)

In the simplest possible terms, an RDBMS is a collection of managed tables with rows and columns. You can divide relational databases into two functional groups: online transaction processing (OLTP) and online analytical processing (OLAP). These two methods serve two distinct purposes and are optimized depending on how you plan to use the data in the database.

Transactional Databases (OLTP)

Transactional databases are good at processing reads, inserts, updates, and deletes. These queries usually have minimal complexity and occur in large volumes. Transactional databases are not optimized for business intelligence or reporting. Data processing typically involves gathering input information, processing the data, and updating existing data to reflect the collected and processed information. Transactional databases prevent two users from accessing the same data concurrently. Examples include order entry, retail sales, and financial transaction systems. Azure offers several types of transactional database services. You can organize the Azure transactional database services into three categories: enterprise licenses, open source, and cloud native.

Enterprise License

Many customers have workloads built around an enterprise database. Azure is an interesting use case because Microsoft is also a traditional enterprise database vendor. Amazon, for example, does not have existing on-premise enterprise database customers. Oracle cloud is the other big player in the enterprise market looking to migrate existing customers to their cloud. Slightly off topic, but it may be of interest to some: SAS does support customers running their Oracle database on Oracle Cloud Platform using the SAS Data Connector to Oracle. Azure offers a solution for customers looking to continue their relationship with Microsoft without refactoring their existing workflows. Customers can bring their existing enterprise database licenses to Azure and run SQL Server on Virtual Machines. SAS has extended SAS Data Connector support for SQL Server on Virtual Machines. You can also use your existing SAS license for SAS Data Connector to Oracle or SAS Data Connector to Microsoft SQL Server to interact with SQL Server on Virtual Machines.

Remember you can install and manage your own database on a virtual machine. For example, support for both SAS Data Connector to Teradata and SAS Data Connect Accelerator for Teradata is available for Teradata installed on Azure. If there is not an available database as a service offering, the traditional backup and update responsibilities are left to the customer.

SQL Server Stretch Database is another service available in Azure. If you are not prepared to add more storage to your existing on-premise SQL Server database, you can add capacity using the resources available in Azure. SQL Server Stretch will scale your data to Azure without having to provision any more servers on-premise. New SQL Server capacity will be running in Azure instead of in your data center.

Open Source

Azure provides service offerings for common open source databases like MySQL, MariaDB, and PostgreSQL. You can use your existing SAS license for SAS Data Connector to MySQL to connect to Azure Database for MySQL and SAS Data Connector to PostgreSQL to interface with Azure Database for PostgreSQL. SAS has not yet formally supported Azure Database for MariaDB. MariaDB is a variant of MySQL, so validation of support for SAS Data Connector is coming soon. If you need support for MariaDB in Azure database, please comment below and I will share your feedback with product management and testing.
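
For example, a connection to Azure Database for PostgreSQL looks much like any other PostgreSQL connection; the sketch below uses placeholder server, database, and credential values (Azure typically expects the user name in user@servername form):

libname azpg postgres
   server="myserver.postgres.database.azure.com"
   port=5432
   database="salesdb"
   user="sasadmin@myserver"
   password="XXXXXXXX";

proc sql;
   select count(*) as n_customers from azpg.customers;
quit;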

Cloud Native

Azure SQL Database is an iteration of Microsoft SQL Server built for the cloud, combining the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases. SAS has extended SAS Data Connector support for Azure SQL Database. You can use your existing license for SAS Data Connector to Microsoft SQL Server to connect to Azure SQL Database.

Analytical Databases (OLAP)

Analytical databases optimize on read performance. These databases work best with complex queries in smaller volumes. When working with an analytical database, you are typically doing analysis on multidimensional data interactively from multiple perspectives. Azure SQL Data Warehouse is the analytical database service offered by Azure. The SAS Data Connector to ODBC combined with a recent version of the Microsoft-supplied ODBC driver is currently the best way to interact with Azure SQL Data Warehouse. Look for the SAS Data Connector to Microsoft SQL Server to support SQL Data Warehouse soon.
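
For illustration, an ODBC connection might look like the following (the driver name, server, and credentials are placeholders; use whichever ODBC driver is installed in your environment):

libname azdw odbc
   noprompt="Driver={ODBC Driver 17 for SQL Server};Server=mydw.database.windows.net;Database=salesdw;Uid=sasuser;Pwd=XXXXXXXX;"
   schema=dbo;

proc contents data=azdw._all_ nods;
run;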

NoSQL Databases

A non-relational or NoSQL database is any database not conforming to the relational database model. These databases are more easily scalable to a cluster of machines. NoSQL databases are a more natural fit for the cloud because the loose dependencies make the data easier to distribute and scale. The different NoSQL databases are designed to solve a specific business problem. Some of the most common data structures are key-value, column, document, and graph databases. If you want a brief overview of these database structures, I cover them in my AWS database blog.

For Microsoft Azure, CosmosDB is the option available for NoSQL databases. CosmosDB is multi-model, meaning you can build out your databases to fit the NoSQL model you prefer. Use the SAS Data Connector to ODBC to interact with your data in Azure CosmosDB.

Hadoop

The traditional deployment of Hadoop is changing dramatically with the cloud. Traditional Hadoop vendors may have a tough time keeping up with the service offerings available in the cloud. Hadoop still offers reliable replicated storage across nodes and powerful parallel processing of large jobs without much data movement. Azure offers HDInsights as their Hadoop as a service offering. Azure HDInsights supports both SAS Data Connector to Hadoop and SAS Data Connect Accelerator for Hadoop.

Finally

It is important to think about the use case for your database and the type of data you plan to store before you select an Azure database service. Understanding your workloads is critical to getting the right performance and cost. When dealing with cloud databases, remember that you will be charged for the storage you use and for the data that you move out of the database. Performing analysis and reporting on your data may require data transfer. Be aware of these costs and think about how you can lower them by keeping frequently accessed data cached somewhere or keeping it on-premise. Another strategy I’ve seen becoming more popular is taking advantage of SAS Micro Analytic Service to move the models you have built to run in the cloud provider where your data is stored. Data transfer is cheaper if that data moves between cloud services instead of outside of the cloud provider. SAS Micro Analytic Service allows you to score the data in place, without moving it out of the cloud provider and without having to do a full install of SAS.

Additional Resources
1. Support for Databases in SAS® Viya® 3.4
2. Support for Cloud and Database Variants in SAS® 9.4

Accessing Databases in the Cloud – SAS Data Connectors and Microsoft Azure was published on SAS Users.


November 4, 2019
 

According to the PricewaterhouseCoopers (PwC) 2018 Global Economic Crime and Fraud Survey, the reported rate of economic crime is on the rise, up to 49% in 2018. That makes the use case I want to share particularly relevant, no matter what industry or sector you're in. This use case [...]

How using analytics and AI to detect payment fraud netted an immediate $16 million ROI was published on SAS Voices by David Pope

November 4, 2019
 

Understanding multivariate statistics requires mastery of high-dimensional geometry and concepts in linear algebra such as matrix factorizations, basis vectors, and linear subspaces. Graphs can help to summarize what a multivariate analysis is telling us about the data. This article looks at four graphs that are often part of a principal component analysis of multivariate data. The four plots are the scree plot, the profile plot, the score plot, and the pattern plot.

The graphs are shown for a principal component analysis of the 150 flowers in the Fisher iris data set. In SAS, you can create the graphs by using PROC PRINCOMP. By default, the scatter plots that display markers also label the markers by using an ID variable (such as name, state, patient ID, ...) or by using the observation number if you do not specify an ID variable. To suppress the labels, you can create an ID variable that contains a blank string, as follows:

data iris;
set Sashelp.iris;
id = " ";                  /* create empty label variable */
run;

The following statements create the four plots as part of a principal component analysis. The data are the measurements (in millimeters) of length and width of the sepal and petal for 150 iris flowers:

ods graphics on;
proc princomp data=iris         /* use N= option to specify number of PCs */
              STD               /* optional: stdize PC scores to unit variance */
              out=PCOut         /* only needed to demonstrate corr(PC, orig vars) */
              plots=(scree profile pattern score);
   var SepalLength SepalWidth PetalLength PetalWidth;  /* or use _NUMERIC_ */
   ID id;                       /* use blank ID to avoid labeling by obs number */
   ods output Eigenvectors=EV;  /* to create loadings plot, output this table */
run;

The principal components are linear combinations of the original data variables. Before we discuss the graph, let's identify the principal components and interpret their relationship to the original variables. The linear coefficients for the PCs (sometimes called the "loadings") are shown in the columns of the Eigenvectors table.

  • The first PC is the linear combination PC1 = 0.52*SepalLength – 0.27*SepalWidth + 0.58*PetalLength + 0.56*PetalWidth. You can interpret this as a contrast between the SepalWidth variable and an equally weighted sum of the other variables.
  • For the second PC, the coefficients for the PetalLength and PetalWidth variables are very small. Therefore, the second PC is approximately PC2 ≈ 0.38*SepalLength + 0.92*SepalWidth. You can interpret this weighted sum as a vector that points mostly in the direction of the SepalWidth variable but has a small component in the direction of the SepalLength variable.
  • In a similar way, the third PC is primarily a weighted contrast between the SepalLength and PetalWidth variables, with smaller contributions from the other variables.
  • The fourth PC is a weighted contrast between the SepalWidth and PetalLength variables (with positive coefficients) and the SepalLength and PetalWidth variables (with negative coefficients).

Note that the principal components (which are based on eigenvectors of the correlation matrix) are not unique. If v is a PC vector, then so is -v. If you compare PCs from two different software packages, you might notice that a PC from one package is the negative of the same PC from another package.

You could present this table graphically by creating a "loadings plot," as shown in the last section of this article.
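As a side check (not part of the original analysis), you can see what the loadings mean by computing a raw PC1 score directly from the standardized variables. Because the PROC PRINCOMP call above uses the STD option, the Prin1 variable in the OUT= data set is additionally rescaled to unit variance, so the hand-computed score below is proportional to Prin1 rather than equal to it. The rounded coefficients are taken from the first column of the Eigenvectors table:

/* Sketch: compute an approximate raw PC1 score by hand.
   PROC STDIZE centers each variable at its mean and scales it to unit variance,
   which matches a PCA of the correlation matrix. */
proc stdize data=iris method=std out=IrisStd;
   var SepalLength SepalWidth PetalLength PetalWidth;
run;

data CheckPC1;
   set IrisStd;
   rawPC1 = 0.52*SepalLength - 0.27*SepalWidth + 0.58*PetalLength + 0.56*PetalWidth;
run;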

The scree plot

Recall that the main idea behind principal component analysis (PCA) is that most of the variance in high-dimensional data can be captured in a lower-dimensional subspace that is spanned by the first few principal components. You can therefore "reduce the dimension" by choosing a small number of principal components to retain. But how many PCs should you retain? The scree plot is a line plot of the eigenvalues of the correlation matrix, ordered from largest to smallest. (If you use the COV option, it is a plot of the eigenvalues of the covariance matrix.)

You can use the scree plot as a graphical tool to help you choose how many PCs to retain. In the scree plot for the iris data, you can see (on the "Variance Explained" plot) that the first two eigenvalues explain about 96% of the variance in the four-dimensional data. This suggests that you should retain the first two PCs, and that a projection of the data onto the first two PCs will give you a good way to visualize the data in a low-dimensional linear subspace.
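If you want the exact proportions of variance rather than reading them off the plot, you can save the Eigenvalues table. This step is an addition to the original program (you could also add the ODS OUTPUT statement to the earlier PROC PRINCOMP call instead of rerunning the procedure):

/* Optional: capture the Eigenvalues table, which contains the proportion and
   cumulative proportion of variance explained by each PC. */
proc princomp data=iris plots=none;
   var SepalLength SepalWidth PetalLength PetalWidth;
   ods output Eigenvalues=EigVals;
run;

proc print data=EigVals noobs;
run;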

The profile plot

The profile plot shows the correlations between each PC and the original variables. To some extent, you can guess the sign and the approximate magnitude of the correlations by looking at the coefficients that define each PC as a linear combination of the original variables. The correlations are shown in the following "Component Pattern Profiles" plot.

The profile plot reveals the following facts about the PCs:

  • The first PC (solid blue line) is strongly positively correlated with SepalLength, PetalLength, and PetalWidth. It is moderately negatively correlated with SepalWidth.
  • The second PC (dashed reddish line) is positively correlated with SepalLength and SepalWidth.
  • The third and fourth PCs have only small correlations with the original variables.

The component pattern plot shows all pairwise correlations at a glance. However, if you are analyzing many variables (say 10 or more), the plot will become very crowded and hard to read. In that situation, the pattern plots are easier to read, as shown in the next section.

The pattern plots

The output from PROC PRINCOMP includes six "component pattern" plots, which show the correlations between the principal components and the original variables. Because there are four PCs, a component pattern plot is created for each pairwise combination of PCs: (PC1, PC2), (PC1, PC3), (PC1, PC4), (PC2, PC3), (PC2, PC4), and (PC3, PC4). In general, if there are k principal components, there are k(k-1)/2 pairwise combinations of PCs.

Each plot shows the correlations between the original variables and the PCs. For example, the graph shown to the right shows the correlations between the original variables and the first two PCs. Each point represents the correlations between an original variable and two PCs. The correlations with the first PC are plotted on the horizontal axis; the correlations with the second PC are plotted on the vertical axis.

Six graphs require a lot of space for what is essentially a grid of sixteen numbers. If you want to see the actual correlations in a table, you can call PROC CORR on the OUT= data set, as follows:

/* what are the correlations between PCs and orig vars? */
proc corr data=PCOUT noprob nosimple;
   var SepalLength SepalWidth PetalLength PetalWidth;
   with Prin1-Prin4;
run;

If you know you want to keep only two PCs, you can use the N=2 option on the PROC PRINCOMP statement, which will reduce the number of graphs that are created.
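For example, a minimal call that keeps only the first two PCs might look like this (a sketch, requesting only the scree and score plots):

/* Keep only the first two principal components */
proc princomp data=iris N=2 plots=(scree score);
   var SepalLength SepalWidth PetalLength PetalWidth;
run;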

The score plots

The score plots indicate the projection of the data onto the span of the principal components. As in the previous section, this four-dimensional example results in six score plots, one for each pairwise combination of PCs. You will get different plots if you create PCs for the covariance matrix (the COV option) as opposed to the correlation matrix (the default). Similarly, if you standardize the PCs (the STD option) or do not standardize them (the default), the corresponding score plot will be different.

The score plot for the first two PCs is shown. Notice that it uses equal scales for the axes. The graph shows that the first principal component separates the data into two clusters. The left cluster contains the flowers from the Iris setosa species. You can see a few outliers, such as one setosa flower whose second PC score (about -2.5) is much smaller than that of the other setosa flowers.

ODS graphics provide an easy way to generate a quick look at the data. However, if you want more control over the graphs, it is often simplest to output the results to a SAS data set and customize the plots by hand. You can use the OUT= option to write the PCs to a data set. The following call to PROC SGPLOT creates the same score plot but colors the markers by the Species variable and adds a grid of reference lines.

title "Score Plot";
title2 "Observations Projected onto PC1 and PC2";
proc sgplot data=PCOut aspect=1;
   scatter x=Prin1 y=Prin2 / group=species;
   xaxis grid label="Component 1 (72.96%)";
   yaxis grid label="Component 2 (22.85%)";
run;

The loadings plots

A loadings plot is a plot of two columns of the Eigenvectors table. PROC PRINCOMP does not create a loadings plot automatically, but there are two ways to create one. One way is to use the ODS OUTPUT statement to write the Eigenvectors table to a SAS data set. The previous call to PROC PRINCOMP created a data set named EV. The following call to PROC SGPLOT creates a loadings plot that projects the original variables onto the first two PCs.

title "Loadings Plot";
title2 "Variables Projected onto PC1 and PC2";
proc sgplot data=EV aspect=1;
   vector x=Prin1 y=Prin2 / datalabel=Variable;
   xaxis grid label="Component 1 (72.96%)";
   yaxis grid label="Component 2 (22.85%)";
run;

The loadings plot shows the relationship between the PCs and the original variables. You can use the graph to show how the original variables relate to the PCs, or the other way around. For example, the graph indicates that the PetalWidth and PetalLength variables point in the same direction as PC1. The graph also shows that the second PC is primarily in the direction of the SepalWidth variable, with a small shift towards the direction of the SepalLength variable. The second PC is essentially orthogonal to the PetalWidth and PetalLength variables.

The second way to create a loadings plot is to use PROC FACTOR, as shown by the following statements. The documentation for PROC FACTOR compares the PROC FACTOR analysis to the PROC PRINCOMP analysis.

proc factor data=iris N=2         /* use N= option to specify number of PCs */
     method=principal       
     plots=(initloadings(vector));
   var SepalLength SepalWidth PetalLength PetalWidth;  /* or use _NUMERIC_ */
run;

Summary

In summary, PROC PRINCOMP can create many graphs that are associated with a principal component analysis. This article shows how to interpret the most commonly used graphs. The scree plot is useful for determining the number of PCs to keep. The component pattern profile plot shows the correlations between the PCs and the original variables in a single display. The component pattern plots show similar information, but each plot displays the correlations between the original variables and a pair of PCs. The score plots project the observations onto a pair of PCs. The loadings plot projects the original variables onto a pair of PCs.

When you analyze many variables, the number of graphs can be overwhelming. I suggest that you use the WHERE= option in the ODS SELECT statement to restrict the number of pattern plots and score plots. For example, the following statement creates only two pattern plots and two score plots:

   ods select Eigenvectors 
              ScreePlot PatternProfilePlot
              where=(_label_ in: ('2 by 1','4 by 3'));  /* limit pattern plots and score plots */

There is one more plot that is sometimes used. It is called a "biplot" and it combines the information in a score plot and a loadings plot. I will discuss the biplot in a subsequent article.

The post How to interpret graphs in a principal component analysis appeared first on The DO Loop.

11月 012019
 

When I started working in data and analytics 30 years ago, information security wasn’t high on the agenda for organizations. That's changed with the rise of the Internet, and now that cloud is becoming more and more prevalent in organizations, information security is no longer just the domain of specialists [...]

Cloud security: Questions you need to ask when choosing a cloud provider was published on SAS Voices by David Annis