
Aug 8, 2019
 

Some key components of CASL are the action statements. These statements perform tasks that range from configuring the connection to the server, to summarizing large amounts of data, to processing image files. Each action has its own purpose. However, there is some overlapping functionality between actions. For example, more than one action can summarize numeric variables.
This blog looks at three actions: SIMPLE.SUMMARY, AGGREGATION.AGGREGATE, and DATAPREPROCESS.RUSTATS. Each of these actions generates summary statistics. Though there might be more actions that generate the same statistics, these three are a good place to start as you learn CASL.

Create a CAS table for these examples

The following step generates a table called mydata, stored in the casuser caslib, that will be used for the examples in this blog.

cas;                                    /* start a CAS session */
libname myuser cas caslib='casuser';    /* libref pointing to the casuser caslib */
data myuser.mydata;
   length color $8;
   array X{100};
   do k=1 to 9000;
      do i=1 to 50;                     /* x1-x50: mean 0, std 4000 */
         X{i} = rand('Normal', 0, 4000);
      end;
      do i=51 to 100;                   /* x51-x100: mean 100000, std 1000000 */
         X{i} = rand('Normal', 100000, 1000000);
      end;
      if x1 < 0 then color='red';
      else if x1 < 3000 then color='blue';
      else color='green';
      output;
   end;
run;

SIMPLE.SUMMARY

The purpose of the Simple Analytics action set is to perform basic analytical functions. One of the actions in the action set is the SUMMARY action, used for generating descriptive statistics like the minimum, maximum, mean, and sum.
This example demonstrates obtaining the sum, mean, and n statistics for five variables (x1–x5) and grouping the results by color. The numeric input variables are specified in the INPUTS parameter. The desired statistics are specified in the SUBSET parameter.

proc cas;
   simple.summary / 
      inputs={"x1","x2","x3","x4","x5"},
      subset={"sum","mean","n"},
      table={caslib="casuser",name="mydata",groupBy={"color"}},
      casout={caslib="casuser", name="mydata_summary", replace=true};
run;
   table.fetch /
      table={caslib="casuser", name="mydata_summary"};
run;
quit;

The SUMMARY action creates a table that is named mydata_summary. The TABLE.FETCH action is included to show the contents of the table.

The mydata_summary table can be used as input for other actions, its variable names can be changed, or it can be transposed. Now that you have the summary statistics, you can use them however you need to.
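
For example, here is a minimal sketch of renaming a column in the output table with the TABLE.ALTERTABLE action. The column name _Sum_ is an assumption about what the SUMMARY action produces; run TABLE.COLUMNINFO first to confirm the actual names in your table.

proc cas;
   table.columnInfo / table={caslib="casuser", name="mydata_summary"};
   table.alterTable /
      caslib="casuser", name="mydata_summary",
      /* _Sum_ is assumed; replace it with the real column name */
      columns={{name="_Sum_", rename="total_sum"}};
run;
quit;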

AGGREGATION.AGGREGATE

Many SAS® procedures have been CAS-enabled, which means you can use a CAS table as input. However, specifying a CAS table does not mean all of the processing takes place on the CAS server. Not every statement, option, or statistic is supported on the CAS server for every procedure. You need to be aware of what is not supported so that you do not run into issues if you choose to use a CAS-enabled procedure. In the documentation, refer to the CAS processing section to find the relevant details.
When a procedure is CAS-enabled, it means that, behind the scenes, it is submitting an action. The MEANS and SUMMARY procedure steps submit the AGGREGATION.AGGREGATE action.
With PROC MEANS it is common to use a BY or CLASS statement and ask for multiple statistics for each analysis variable, even different statistics for different variables. Here is an example:

proc means sum data=myuser.mydata noprint;
   by color;
   var x1 x2 x3;
   output out=test(drop=_type_ _freq_)
      sum(x1 x3)=x1_sum x3_sum
      max(x2)=x2_max
      std(x3)=x3_std;
run;

The AGGREGATE action produces the same statistics and the same structured output table as PROC MEANS.

proc cas;
   aggregation.aggregate /
      table={name="mydata", caslib="casuser", groupby={"color"}}
      casout={name="mydata_aggregate", caslib="casuser", replace=true}
      varspecs={{name="x1", summarysubset="sum", columnnames={"x1_sum"}},
                {name="x2", agg="max", columnnames={"x2_max"}},
                {name="x3", summarysubset={"sum","std"},
                 columnnames={"x3_sum","x3_std"}}}
      savegroupbyraw=true, savegroupbyformat=false, raw=true;
run;
quit;

The VARSPECS parameter might be confusing. It is where you specify the variables that you want to generate statistics for, which statistics to generate, and what the resulting columns should be called. Check the documentation: depending on the desired statistic, you need to use either the SUMMARYSUBSET or the AGG argument.

If you are using the GROUPBY parameter, you most likely want to specify SAVEGROUPBYRAW=TRUE. Otherwise, you must list every GROUPBY variable in the VARSPECS parameter. Also, SAVEGROUPBYFORMAT=FALSE prevents the output from containing _f (formatted) versions of all of the GROUPBY variables.

DATAPREPROCESS.RUSTATS

The RUSTATS action, in the Data Preprocess action set, computes univariate statistics, centralized moments, quantiles, and frequency distribution statistics. This action is extremely useful when you need to calculate percentiles. If you ask for percentiles from a procedure, all of the data will be moved to the compute server and processed there, not on the CAS server.
This example has an extra step. Actions require a list of variables, which can be cumbersome when you want to generate summary statistics for more than a handful of variables. Macro variables are a handy way to insert a list of strings, variable names in this case, without having to enter all of the names yourself. The SQL procedure step generates a macro variable containing the names of all of the numeric variables. The macro variable is referenced in the INPUTS parameter.
Like the previous actions, the RUSTATS action has TABLE and INPUTS parameters. The REQUESTPACKAGES parameter is what enables the request for percentiles.
The example also contains a bonus action, TRANSPOSE.TRANSPOSE. The goal is to have a final table, mydata_rustats2, with a structure like PROC MEANS would generate. The tricky part is the COMPUTEDVARSPROGRAM parameter.
The table generated by the RUSTATS action has a column called _Statistic_ that contains the name of the statistic. However, it contains “Percentile” multiple times. A different variable, _Arg1_, contains the value of the percentiles (1, 10, 20, and so on). The values of _Statistic_ and _Arg1_ need to be combined, and that new combined value generates the new variable names in the final table.
The COMPUTEDVARS parameter specifies that the name of the new variable will hold the concatenation of _Statistic_ and _Arg1_. The COMPUTEDVARSPROGRAM parameter tells CAS how to create the values for NEWID. The NEWID value is then used in the ID parameter to make the new variable names—pretty cool!

proc sql noprint;
   select quote(strip(name)) into :numvars separated by ','
      from dictionary.columns
      where libname='MYUSER' and memname='MYDATA' and type='num';
quit;
 
proc cas;
   dataPreprocess.rustats /
      table={name="mydata", caslib="casuser"}
      inputs={&numvars}
      requestpackages={{percentiles={1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99},
                        scales={"std"}}}
      casoutstats={name="mydata_rustats", caslib="casuser"};

   transpose.transpose /
      table={caslib="casuser", name="mydata_rustats",
             groupby={"_variable_"},
             computedvars={{name="newid", format="$20."}},
             computedvarsprogram="newid=strip(_statistic_)||compress(strip(_arg1_),'.-');"}
      transpose={"_value_"}
      id={"newid"}
      casout={caslib="casuser", name="mydata_rustats2", replace=true};
run;
quit;

The final table, mydata_rustats2, has one row per input variable and one column per statistic or percentile. Remember, you can use the TABLE.FETCH action to view the table contents as well.
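
For instance, this returns just the first five rows (the TO parameter limits how many rows FETCH brings back):

proc cas;
   table.fetch /
      table={caslib="casuser", name="mydata_rustats2"},
      to=5;   /* return only rows 1 through 5 */
run;
quit;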

Summary

Summarizing numeric data is an important step in analyzing your data. CASL provides multiple actions that generate summary statistics. This blog provided a quick overview of three of those actions: SIMPLE.SUMMARY, AGGREGATION.AGGREGATE, and DATAPREPROCESS.RUSTATS.
The wonderful part of so many choices is that you can decide which one best fits your needs. Summarizing your data with actions also ensures that all of the processing occurs on the CAS server and that you are taking full advantage of its capabilities.
Be sure to use the DROPTABLE action to delete any tables that you do not want taking up space in memory:

proc cas;
	table.droptable / caslib='casuser' name='mydata' quiet=true;
	table.droptable / caslib='casuser' name='mydata_summary' quiet=true;
	table.droptable / caslib='casuser' name='mydata_aggregate' quiet=true;
	table.droptable / caslib='casuser' name='mydata_rustats' quiet=true;
	table.droptable / caslib='casuser' name='mydata_rustats2' quiet=true;
quit;
cas casauto terminate;

Learn More

Summarization in CASL was published on SAS Users.

Mar 29, 2017
 

In 1990, the internet took on its most recognizable form. It brought connection, knowledge and speed that was previously inaccessible. Fast-forward 27 years, and I get asked a lot about the most recent form of the internet – the internet of things (IoT). And while I think the current possibilities [...]

3 best practices to prepare for the future of IoT was published on SAS Voices by Oliver Schabenberger

Mar 24, 2017
 

In a recent presentation, Jill Dyche, VP of SAS Best Practices, gave two great quotes: "Map strategy to data" and "strategy drives analytics drives data." In other words, don't wait for your data to be perfect before you invest in analytics. Don't get me wrong -- I fully understand and [...]

Don't let data issues delay analytics was published on SAS Voices by David Pope

Dec 14, 2016
 

The insurance industry is becoming increasingly focused on the digitalization of its business processes. There are many factors driving digitalization, but it’s clear that a reliable and meaningful database is the basic prerequisite for a successful digitalization strategy. Insurance companies are increasingly prioritizing digitalization, not because this issue is currently "in," but […]

Drivers for the digitalization of insurance was published on SAS Voices.

Oct 5, 2016
 

While men still outnumber women in the analytics field, there are plenty of opportunities available for women. At a recent Chief Data and Analytics forum, I was encouraged to see a well-balanced number of senior executives presenting about the business of analytics.  Speakers included 12 women and 14 men, which indicates a […]

Two tips for women starting a career in analytics was published on SAS Voices.

Sep 27, 2016
 

Recently, I was talking to a director of analytics from a large telecommunications company, and I asked her, “Do you think we have a skills shortage?” She replied, “NO, I think we’re just looking in the wrong place.” I wanted to hear more as this analytics expert may have just […]

More data journalists – not data scientists was published on SAS Voices.

Jan 15, 2016
 

As support analysts in the SAS Technical Support division, we answer many phone calls from SAS customers. As members of the SAS Foundation team, we get questions that vary significantly in content from all of the areas that we support. We offer coding tips and suggestions as well as point customers to SAS Notes and documentation. A question that we are frequently asked is how to collapse a data set with many observations for the same BY variable value into a data set that has one observation per BY variable value. Another way to phrase this question is: how do you reshape the data from “long to wide?”

Resources to Reshape Data

The following SAS Note illustrates how to use two TRANSPOSE procedure steps to collapse the observations for multiple BY variables:

Sample 44637: Double PROC TRANSPOSE method for reshaping your data set with multiple BY variables
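
To give a feel for the method, here is a minimal sketch with made-up data. The key pieces are a sequence variable (visit here) so that the second ID statement can number the repeats, and the DELIMITER= option to join the parts of the new variable names.

* Made-up long data: two measures, repeated visits per subject;
data have;
   input id $ visit var1 var2;
   datalines;
A 1 1 10
A 2 2 20
B 1 3 30
;

* Step 1: fully long -- one row per id/visit/variable;
proc transpose data=have out=step1;
   by id visit;
   var var1 var2;
run;

* Step 2: wide -- columns named var1_1, var1_2, var2_1, ...;
proc transpose data=step1 out=wide(drop=_name_) delimiter=_;
   by id;
   var col1;
   id _name_ visit;
run;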

If you prefer to use the DATA step rather than PROC TRANSPOSE, the following SAS Note provides a code sample that accomplishes the same purpose:

Sample 24579: Collapsing observations within a BY-Group into a single observation (when data set has 3 or more variables)
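
Here is a minimal DATA step sketch of the same idea, assuming the data are sorted by id and that no id has more than three scores:

data have;
   input id $ score;
   datalines;
A 10
A 20
B 30
;

data want;
   set have;
   by id;
   array s{3} score1-score3;   /* assumes at most 3 rows per id */
   retain score1-score3;
   if first.id then do;
      call missing(of s{*});   /* reset for each new id */
      seq = 0;
   end;
   seq + 1;                    /* sum statement, so seq is retained */
   s{seq} = score;
   if last.id then output;     /* one collapsed row per id */
   drop score seq;
run;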

If the data set is “wide” and you’d like to expand each observation into multiple observations to make it a “long” data set, the following SAS Note is helpful:

Sample 24794: Expand single observations into multiple observations
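
Going this direction is simpler; here is a minimal sketch with made-up names:

data wide;
   input id $ score1-score3;
   datalines;
A 10 20 30
B 40 50 .
;

data long;
   set wide;
   array s{3} score1-score3;
   do seq=1 to dim(s);
      if not missing(s{seq}) then do;
         score = s{seq};
         output;               /* one row per nonmissing score */
      end;
   end;
   keep id seq score;
run;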

Brief Overview of Some Support.sas.com Resources

Since we’ve been discussing SAS Notes from the support.sas.com site, here is a brief overview of how to use the site to find other helpful items for your future coding projects.

Documentation

  1. Go to support.sas.com
  2. On the left side of the page in the Knowledge Base section, select Documentation, Product Index A-Z, and then select your product of interest to access user guides and other documentation resources.

SAS Notes

  1. Go to support.sas.com
  2. On the left side of the page in the Knowledge Base section, select Samples & SAS Notes.
  3. From this page, you can type keywords in the Search box to get a list of relevant notes from both categories.
    Note: If you’re interested in specific types of notes (for example, samples or problem notes), select the type of note from the choices given on the left side of the page underneath Samples & SAS Notes.

It was a pleasure to work with you in 2015 when you contacted us for assistance. We look forward to another great year in 2016.


tags: data, Problem Solvers, SAS Programmers

New year refresher on reshaping data was published on SAS Users.

Dec 16, 2015
 

If you read my last post, then you know that I’m giving myself the gift of data this holiday season! For me, collecting data on my diet and fitness habits is a gift that just keeps on giving. Although I may not look at all my data sets on a […]

The post Visualizing holiday food log patterns appeared first on JMP Blog.

Jun 25, 2015
 

After doing some recent research with IDC®, I got to thinking again about the reasons that organizations of all sizes in all industries are so slow at adopting analytics as part of their ‘business as usual’ operations.

While I have no hard statistics on who is and who isn’t adopting analytics, the research shows that organizations that do leverage analytics are more successful on average than those that don’t. What we need is a new analytics experience, an experience where organizations can:

  • Make confident decisions
  • Analyze all their data where it exists
  • Seize new opportunities with analytics
  • Remove restrictions for data scientists

IDC states that “50.6% of Asia Pacific enterprises want to monetize their data in the next 18 months”. Are you one of them or are you going to let your competition get the jump on you?

Big data (or more specifically how to actually gain some sort of competitive advantage from it) is top of mind for forward-looking businesses.

Our research with IDC gives us a few clues on where to head when it comes to the monetization discussion.

In the recent Monetizing Your Data infographic (PDF) created by IDC and SAS, three key approaches to monetizing big data emerged:

  1. Data decisioning, where insights derived from big data can be used to enhance business processes;
  2. Data products, where new innovative data products can be created and sold;
  3. Data partnerships, where organizations sell or share core analytics capabilities with partners.

Organizations that adopt and combine all three key approaches to leverage analytics are twice as likely to outperform their peers.¹

If you’re looking to truly create value from the stores of data you have, then you need to look at deploying analytics.


¹ IDC APEJ Big Data MaturityScape Benchmark Survey 2014 (n=1255); IDC APEJ Big Data Pulse 2014 (n=854).

tags: analytics, big data, business analytics, business intelligence, Data, data management, data quality, data visualisation, high performance analytics, visual analytics, visualisation, visualization

Are you missing out when it comes to data monetization? was published on Left of the Date Line.

Jun 9, 2015
 

You've probably heard many times about the fantastic untapped potential of combining online and offline customer data. But relax, I’m going to cut out the fluff and address this matter in a way that makes the idea plausible and its objectives achievable. The reality is that what has been written about the benefits of online customer intelligence far outweighs what’s actually happening in most organisations today. In fact, considering how beneficial tapping the data can be, I don’t think enough has been written about what types of online customer behaviours should be tracked and how they could be used to create a better customer experience across all touch points.

So where do you begin? 

It all starts with the objectives you have set for your digital presence. Are they for visitors to register, make a transaction, sign up for a newsletter, or interact with a certain content object, whether internal or third party? Those are generally the key objectives I see organisations having. The aim is to understand the customer journey leading up to these events, as well as to track and ‘remember’ when the customer interacts with all the organisation’s available channels to the market. A key aspect is to monitor and understand how external campaigns, in-site promotions and search contribute towards those goals, and how this breaks down into behavioural segments/profiles.

Recognising a customer

The next important consideration is: how do we recognise visitors/customers we should know from previous interactions, even if they haven’t identified themselves on this occasion? Identification doesn’t have to be dependent on a log-in. It could be through an email address we can match with a satisfactory level of confidence, or it could be a tracking code coming from another digital channel where customers had earlier identified themselves. It’s of much greater value if we can match their behaviour as unknown visitors once they identify themselves, rather than having to start building our knowledge from scratch at the time of identification.

This leads to the point where we need to explore our options for weaving a visitors’ online behaviours into our offline knowledge about them and how – at the enterprise level – we can best exploit the capabilities of our broader data-driven marketing eco-system. We should ask ourselves, is it valuable to us to be able to send a follow up email to the ones that abandoned a specific form? Can our call centre colleagues enrich their conversations by knowing which customers downloaded particular content? How important is it to us as an organisation to be able to analyse text from in-site searches and combine it with insights driven of complaint data from our CRM system? What are the attributes of the various parts of the journey leading up to completing an objective?

Perhaps you wonder what I mean by the capabilities of the ‘broader data-driven marketing eco-system’. Well, my point is that it puzzles me that most organisations today can’t integrate, report on, or visualise online customer intelligence in the systems that already comprise the backbone of their information infrastructure. They don’t utilise their existing campaign management systems to make decisions on what’s relevant for the individual and to drive online personalisation, which increases online conversion rates and can be used across channels. Organisations rarely take ownership of online customer data or use their advanced analytical engines and existing analytical skills to drive next-level insights.

Not taking full advantage of campaign management systems already in place is an opportunity missed, because the deliverables of integrated online and offline customer intelligence are very real. We should be looking for them every day.

This post first appeared on marketingmag.com.au.

Take the Customer Intelligence Assessment

tags: business intelligence, CRM, customer experience, customer intelligence, Data, data management, data visualisation, marketing

All customer intelligence must be woven into CRM programs – online and offline was published on Left of the Date Line.