
July 26, 2016
 

SAS Business Data Network is SAS’ solution for managing glossaries of common business terms. It is part of the SAS Data Governance offering and is also bundled with the Advanced versions of the SAS Data Management bundles. One thing that is important for data governance in general, and for this solution in particular, is the ability to search for terms. SAS Business Data Network provides a web search that is sometimes overlooked, but that can be used in different ways to help you derive meaning and context from your data. Let’s take a look.

1 – Use a URL to get to the main web page

The main URL that brings you to the SAS Business Data Network Search page is http://<host>/SASBusinessDataNetwork/search


The search box allows you to search business terms using a specific string. The string is processed as a whole; there are no AND or OR operators, and specifying two words searches for those two words as a consecutive phrase. The following fields are searched: Name, Description, Requirements, Attributes, Links, Associated Items and Notes. The search is case insensitive, and you have to sign in if you are not already logged on.


From the search results, you can choose to open a single term or all of the returned terms in SAS Business Data Network.

2 – Customize the URL and call it from external applications

You can also parameterize the search URL and then call it from, or integrate it with, any other application.

For example, the URL http://<host>/SASBusinessDataNetwork/search?q=gender will search for the string “gender” in all business terms.

The URL http://<host>/SASBusinessDataNetwork/search?q=currency+code will search for “currency code”.

The same search results page will open.


This is very useful when you want to seamlessly integrate SAS Business Data Network with other SAS applications or external applications in order to provide more insight into the meaning of an indicator or the definition of a business term. Within SAS Visual Analytics, for example, we can easily define such parameterized URLs to help the business user understand the meaning of a measure. See the example at the end of this post.

3 – Use the bookmarklet to search for highlighted text

You can use the bookmarklet to interact with the BDN search capability in a smart way: the bookmarklet enables you to search BDN for any text you highlight in your browser.


That means that you can call SAS BDN search from any HTML page in the company. A user can browse an intranet article and call BDN for more information on a term, or open a business report and call BDN for more information on a specific value.

To install the bookmarklet, you need to follow instructions described in the SAS Business Data Network Search extras.


4 – Use the browser search provider

You can register SAS Business Data Network Search as a search provider with your browser. This allows users to search for terms from a dedicated search box in the browser toolbar.


To install the search provider, you need to follow instructions described in the SAS Business Data Network Search extras.

A final example

I talked earlier about integrating a Visual Analytics report with BDN in order to provide explanations of a specific data item: what is the meaning of that data item? How is it calculated? Who owns that data item? Has there been a change recently in the formula? What is the lineage of that data item? These are questions that BDN can help answer. The example treemap from SAS Visual Analytics below illustrates the idea.

(Screenshot: example treemap report in SAS Visual Analytics)

An external link to BDN search is easily configurable in SAS Visual Analytics.
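
The link URL itself follows the same pattern shown earlier. As a purely hypothetical example (the exact way a report value is passed into the q parameter depends on how you map parameters when you define the link):

http://<host>/SASBusinessDataNetwork/search?q={selected report value}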


BDN also supports deep linking to go directly to a term, without going through search results. This capability requires additional data from the calling SAS Visual Analytics solution.

These are just a few examples of the power that SAS Business Data Network gives you to relate and search business and technical metadata, improve discovery, and increase the sharing and reuse of your data assets.

Thanks for your attention and I hope you found this post helpful.

tags: data governance, data management, SAS Professional Services, SAS Visual Analytics

Four ways to conduct a web search for business terms with SAS Business Data Network was published on SAS Users.

July 22, 2016
 

In DataFlux Data Management Studio, the predominant component of the SAS Data Quality bundle, the data quality nodes in a data job use definitions from something called the SAS Quality Knowledge Base (QKB). The QKB supports over 25 languages and provides a set of pre-built rules, definitions and reference data that conduct operations such as parsing, standardizing and fuzzy matching to help you cleanse your data. The QKB comes with pre-built definitions for both customer and product data and allows for customization and addition of rules to accommodate new data types and rules specific to your business. (You can learn more about the QKB here.)

Sometimes you may want to work with an alternate QKB installation that contains different definitions within the same data job. For example, your default QKB may be the Contact Information QKB; however, in your data flow you may want to use a definition that exists in the Product Data QKB. The data quality nodes have the BF_PATH attribute as part of their Advanced Properties that enables you to do this.
Note: You must have the alternate QKB data installed and be licensed for any QKB locales that you plan to use in your data job.

Here is an example data job that uses the Advanced property of BF_PATH to call the Brand/Manufacturer Extraction definition from the Product Data QKB. It also calls Standardization definitions from the default QKB of Contact Info. Notice that from the data flow perspective, it is one seamless flow.

(Screenshot: the example data job, which calls the Brand/Manufacturer Extraction definition from the Product Data QKB alongside Standardization definitions from the default Contact Info QKB)

The BF_PATH advanced property setting for the Extraction node that is using the Brand/Manufacturer definition from the Product QKB contains the path where the Product Data QKB is installed.
Note: The path setting could be set as a macro variable.  The path information can be obtained from the QKB registration information in the Administration riser bar in Data Management Studio.
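
As a purely hypothetical illustration, the value entered for BF_PATH might look something like the following, either typed as a literal path or supplied through a macro variable defined for your environment (your actual installation path will differ):

BF_PATH = C:\SAS\QKB\ProductData\2.7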


Once the BF_PATH advanced property is set, you will not be able to use the user interface to make your definition selection. You will need to know the definition name and any other relevant information for the node so that you can add it using the appropriate Advanced properties.

In my example, I need to set the definition and its token output fields for the Brand/Manufacturer definition, using the PARSE_DEF Advanced property for the definition and the corresponding Advanced property for the token fields.


Note: I was able to find out the needed information by viewing the Product Data QKB information in the Administration riser bar in Data Management Studio.


Here is the user interface for the Extraction node that is using the Brand/Manufacturer definition from the Product QKB and its data preview.

Note: The definition cannot be displayed since it is not in the Active QKB.  You can select the Extraction field and Additional Output information on the user interface.

(Screenshot: the Extraction node user interface and its data preview)

In conclusion, the Advanced property of BF_PATH is useful when you want to use an Alternate QKB installation in your data job.  For more information on Advanced properties for nodes in DataFlux Data Management Studio, refer to the topic “Advanced Properties” in the DataFlux Data Management Studio 2.7: User’s Guide. For more information on the SAS Quality Knowledge Base (QKB), refer to its documentation. Finally, to learn more about SAS Data Quality, visit sas.com/dataquality.

tags: data management, data quality, DataFlux Data Management Studio

Using an alternate Quality Knowledge Base (QKB) in a DataFlux Data Management Studio data job was published on SAS Users.

July 16, 2016
 

When you work out, you probably have a routine set of exercises that you do. But if you read health-and-fitness websites or talk to a personal trainer, you know that for optimal fitness, you should vary your workout routine. Not only does adding variety to your fitness regime help you prevent injuries, it also helps build muscles by working different parts of your body. For example, if you’re a runner, you could add weights or another resistance workout once or twice a week. If you are a weight lifter, you might add yoga or Pilates to your repertoire to enhance your personal fitness. In a similar way, it can be beneficial for you as a SAS programmer to vary your coding methods by trying new things.

SAS programmers very often use the TEMPLATE procedure to create style templates that use the CLASS and STYLE statements. I am going to help you bulk up your PROC TEMPLATE muscles by showing you how to use the COMPUTE AS and TRANSLATE INTO statements to alter the structure of tables via table definitions.

The COMPUTE AS statement helps you create or modify columns of data. You can use this statement to create a column from raw data (similar to what you can do with the REPORT procedure), or you can use it with an existing table template. The following example uses a PROC TEMPLATE step that illustrates the former. This example uses an existing data set (the CARS data set that is shipped in the SASHELP library) and computes an average-miles-per-gallon variable called MPG_AVG:

proc template;
   define table mytable;
      column make model mpg_city mpg_highway mpg_avg;
      define mpg_avg;
         header="Avg MPG";
         compute as mean(mpg_city,mpg_highway);
         format=3.;
         style={font_weight=bold};
      end;
      define make;
         blank_dups=on;
      end;
   end;
run;
 
data _null_;
  set sashelp.cars(where=(cylinders=4 and drivetrain="All"));
  file print ods=(template='mytable');
  put _ods_;
run;

Here are partial results from the example code:

(Image: partial results from the example code)

You’ll notice that I just could not resist sticking with part of my normal PROC TEMPLATE routine when I added a style change to the new column MPG_AVG. I also made the values in that column stand out by using bold font. In addition, because I like to prevent duplicates from appearing, I used the BLANK_DUPS=ON column attribute for the MAKE column.

If you want to try a cool new trick that is available in SAS® 9.4, try the following ODSTABLE procedure. This code does exactly what the previous example does, only with fewer keystrokes. It's similar to combining a strength and cardio move in your exercise routine!

proc odstable data=sashelp.cars(where=(cylinders=4 and drivetrain="All"));
   column make model mpg_city mpg_highway mpg_avg;
   define mpg_avg;
      header="Avg MPG";
      compute as mean(mpg_city,mpg_highway);
      format=3.;
      style={font_weight=bold};
   end;
   define make;
      blank_dups=on;
   end;
run;

Most procedures (except for PRINT, REPORT, and TABULATE) use an existing table template. You can determine the table template that a procedure uses by reviewing the log information that is generated when the ODS TRACE ON statement is in effect. For example, based on the log information from an ODS TRACE ON statement, we know that the following MEANS procedure code uses a table template called BASE.SUMMARY.

proc means data=sashelp.cars;
     class drivetrain;
     var mpg_city;
run;
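
If you want to see this for yourself, a minimal approach is to wrap the step in ODS TRACE statements; the trace output in the SAS log then lists each output object along with the table template it uses:

ods trace on;
proc means data=sashelp.cars;
     class drivetrain;
     var mpg_city;
run;
ods trace off;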

If you submit this PROC MEANS code, you obtain the results shown below:

(Image: PROC MEANS results)

In the following example, I use the COMPUTE AS statement to edit the MEAN column by multiplying the mean statistic by 10:

proc template;
   edit base.summary;
      edit mean;
         compute as (mean * 10);
         format=6.1;
      end;
   end;
run;

Here are the results that are generated when you run the PROC MEANS step a second time (with an explanatory footnote added):

(Image: PROC MEANS results with the mean multiplied by 10)

It is important to note that the default location where PROC TEMPLATE stores edited templates is the SASUSER library. Therefore, subsequent PROC MEANS steps also multiply the mean by 10 unless you delete this table template. Just like a personal trainer recommends stretching after each workout in order to avoid sore muscles, I recommend the following PROC TEMPLATE step to delete the updated table template after the PROC MEANS step:

proc template;
   delete base.summary;
run;
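
If you are ever unsure which item stores are being searched, and where edited templates are written, the ODS PATH SHOW statement writes the current ODS search path to the SAS log:

ods path show;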

The COMPUTE AS statement is useful when you need to create or edit an entire column. However, in many of the statistical procedures, the reported statistics are stacked. Technical Support has received requests from SAS programmers to make changes to just one of the statistics in a column. Therefore, when you want to change a single cell in a table, I recommend that you exercise the TRANSLATE INTO statement.

Consider the following code in which the Parameter Estimates object is requested from the RELIABILITY procedure:

ods select ParmEst;
 
proc reliability data=fan;
   distribution Weibull;
   pplot lifetime*censor( 1 ) / covb;
run;

Note: The DATA=FAN option in the code above links to the online documentation where you can find the DATA step that builds the data set FAN.

The code above displays the results shown in this table:

(Image: the Parameter Estimates table from PROC RELIABILITY)

In this table, the Estimate column displays the parameter estimates. If you need to modify just one set of parameters (for example, just the Extreme Value [EV] parameter estimates), you can use the TRANSLATE INTO statement to do the job.

The TRANSLATE INTO statement applies conditional logic to effectively change the formatting of the Estimate column only for the EV Location and EV Scale parameters. In the following example, the Estimate column for the EV Location parameter is displayed with the 10.5 format. In addition, the EV Scale parameter is displayed with a 10.6 format, while the remaining cells are displayed with the default format applied, which is specified as 10.4.

proc template;
   edit qc.reliability.ppest;
      edit estimate;
         header="Estimate";
         format=10.4;
         translate (Parameter = 'EV Location') into put(_val_, 10.5),
                   (Parameter = 'EV Scale') into put(_val_, 10.6);
      end;
   end;
run;

When you rerun the PROC RELIABILITY step that is shown above, you get these results:

(Image: the Parameter Estimates table with the custom formats applied to the EV Location and EV Scale estimates)

The changes that you make to your statistics can be much more dramatic. Just like with your workout, the more effort you put into it, the more defined your results!

The samples and SAS Notes found on support.sas.com focus on the DEFINE, STYLE, and CLASS statements in PROC TEMPLATE. These statements are fabulous tools in the PROC TEMPLATE routine that enhance your ODS results because they help change the style of your Excel, HTML, RTF, and PDF results.  By adding the COMPUTE AS and TRANSLATE INTO statements to your tools, you can round out your ODS routine to further enhance your ODS results. But if you find that you need the help of a personal trainer (that is, a Technical Support consultant), call Technical Support or send email to support@sas.com!

Are you ready to enhance your PROC TEMPLATE programming routine even more? If so, here are some additional resources:

tags: Problem Solvers, SAS Programmers

Build your PROC TEMPLATE muscles: Using the COMPUTE AS and TRANSLATE INTO statements was published on SAS Users.

July 7, 2016
 

The Distributed SAS 9.4 LASR Analytic Server provides a massively parallel processing solution for working very quickly with huge volumes of data. LASR was built from its earliest incarnations to provide for incredible scalability and growth. One of the primary criteria that drives scalability is load management, and LASR was built with a very specific principle for managing load: the architecture assumes that maximum efficiency is achieved when each node of the cluster is identical in terms of CPU, RAM, data to process, and other ancillary processes. That approach to load management also means that external forces can affect performance. So let’s look at some of the considerations we should all know about.

Load management in LASR is based on the assumption that all host machines are equal – that is each host has the same number and kind of CPUs, same RAM, same OS, and same ongoing workload as all the other hosts which are participating in the cluster. So with all those things set, there is only one item left to manage LASR’s workload: distribution of the data. With everything being equal on our hosts, then we want to distribute the data equally to each of the LASR Workers so that when they’re given an analytic job to perform, each of them can do the same tasks on the same amount of data and therefore finish in the same elapsed time.

Load Balancing == Even Data Distribution

When data is loaded into a Distributed LASR Analytic Server from SAS, the LASR Root Node accepts the incoming data and then distributes it evenly across all of the LASR Workers. Similarly for parallel data loading from a remote data provider, the SAS Embedded Process will evenly distribute blocks of data directly over to the LASR Workers as well.

(Figure 1: Data is distributed evenly from the LASR Root Node to the LASR Workers)
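
For context, here is a minimal sketch of loading a table into a distributed LASR Analytic Server from a SAS session. The host name, port, and input table are hypothetical; they simply reuse the values that appear in the balancing example later in this post:

/* Load a table into the distributed LASR server */
/* (hypothetical host, port, and input data) */
proc lasr add data=sashelp.cars port=10010;
   performance host="root.lasr.com" nodes=all;
run;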

Growing the LASR Cluster

With Massively Parallel Processing software, like the Distributed SAS LASR Analytic Server, when you need more performance or more capacity, then the expected approach is to add more machines to the cluster. Unlike traditional SMP systems where you might add in more RAM or more disk to an existing host so that it can do more on its own, with MPP we really prefer to distribute the load across as many cheap commodity-grade machines as possible. So rack up some new server hardware, install the necessary software, and then load in the data and get to work.

The MPP approach sounds simple and the SAS High-Performance Analytics Environment software is easy to deploy. But the real world doesn’t sit still and technology progresses quickly. If you initially set up your LASR cluster a year or so ago, and now you’re shopping for additional machines to add to your cluster, there’s a good chance that the vendor is selling new models which feature faster CPU and RAM than the old ones in your cluster.

(Figure 2: The cluster expanded from four LASR Workers to six, with the two new machines offering faster CPU and RAM)

Is This a Problem?

No, it’s definitely not a problem. LASR will function just fine when hosted on machines with slightly different resources. Furthermore, you will certainly see an improvement in performance. In the illustrations above, we’ve taken our data from being distributed across only four LASR Workers and upgraded the cluster so that the data is now processed across six machine hosts. Therefore you can expect almost 50% improvement in LASR processing time (not total response time – we still have network i/o, middle-tier communications, browser rendering, and more to deal with).

You’re probably wondering why the improvement is “only” 50%. After all, the two new machines (W5 and W6) have faster CPUs than the other older four machines (W1 – W4). They should be able to knock out their specific tasks faster, right? And indeed, they do. But it’s not the new machines’ performance we need to look at – the old ones are setting the pace.

This comes back to LASR’s assumption that all machines hosting the LASR Workers are identical in terms of resources and performance. When dealing with future cluster expansion, try to keep the new machines as close as possible in terms of performance and capacity to the old machines simply because you don’t want to spend money unnecessarily for capacity which might be underutilized.

The Crux of the Matter

When LASR is distributing data to its Workers, there is no accommodation made for machines which have more CPU or RAM (or any other resources) than their cluster comrades. Each gets an equal portion of data simply based on dividing the data up among the number of machines in the cluster. Since each machine in the LASR cluster is operating on the same amount of data, machines which can crunch the numbers quicker (e.g. due to faster CPU or RAM) will finish their tasks sooner. However, the final answer is only available once the LASR Root Node has the responses from all of the Workers, so it must wait for the older/slower ones to finish as well.

Unbalanced Data Does Happen

Machine resources aren’t the only thing which can impact overall performance of LASR. On the flip side of the same coin is the amount of data being worked on. It is possible for data to get loaded in such a way that some LASR Workers will actually have more data to work on than others. That means that those with more data will need more time to complete their jobs.

For example, the SAS Embedded Process which provides SAS In-Database functionality has the ability to load data directly into LASR Workers. And the SAS EP attempts to distribute that data evenly. But at what level? In a Hadoop environment, the EP will distribute blocks of data, not rows, evenly across the LASR Workers. Depending on the size and number of blocks, it’s possible that some Workers will receive one more block than others will. This is unlikely to cause any notable performance lag for end users; however, a sysadmin monitoring low-level system performance might pick up on it. It’s not a big deal, but it can be helpful to be aware of.

Another Hadoop-based example where unbalanced data might occur is with SASHDAT tables. SASHDAT is stored as blocks in HDFS which the LASR Workers read and write themselves directly to disk. Hadoop is responsible for managing those blocks once they’re committed to disk. And if an HDFS Data Node fails or if new HDFS Data Nodes (along with associated new LASR Workers) are added to the cluster after the SASHDAT file has been written out, then Hadoop might move blocks around in line with its block replication and disk utilization schemes. Later on, when LASR attempts to read those SASHDAT blocks back into memory, some Workers might end up with more blocks of data than others.

(Figure 3: With one host offline, one LASR Worker must load two SASHDAT blocks while the others each load one)

In Figure 3, we can see that one host machine has gone offline. HDFS data is still available due to block replication. However, we can see that one LASR Worker must load up two blocks whereas the rest each get one. The result will be that LASR Worker will need twice as long to perform assigned tasks. Hence, LASR Root must wait twice as long to compile its final answer. Keep in mind that this example is contrived to make a point. In the real world, data volume, block sizes and distribution, as well as other factors will determine the impact on actual performance.

For more information about the HDFS block replication and SASHDAT, see the whitepaper The Impact of Hadoop Resiliency on SAS® LASR™ Analytic Server presented at SAS Global Forum 2015.

Balancing the Data

LASR offers built-in functionality whereby you can re-balance the data which resides in memory.
(Figure 4: The in-memory table re-balanced evenly across the LASR Workers)

Here’s some sample code which shows how to promote, balance, and save your LASR in-memory data:

/* Promote, Balance, and Save */
/* Setup a SASIOLA libref that specifies the LASR server */
/* where the unbalanced table resides */
libname example sasiola host="root.lasr.com" port=10010 tag='hps';

/* Use the data management capabilities of the IMSTAT */
/* procedure */
proc imstat immediate;

/* Specify the unbalanced table */
table example.the_table;
/* Print out the distribution stats */
distributioninfo;

/* Perform the balance – each node will be +/- 1 row */
/* A new temporary table (balanced!) is created */
balance;

/* Drop the original unbalanced table */
droptable;

/* Now reference the newly balanced temporary table */
table example.&_templast_;
/* Promote the temporary table to active status with the */
/* original table’s name */
promote the_table;

/* Now reference the new table by its permanent name */
table example.the_table;
/* Print out the distribution stats for confirmation */
/* of balance */
distributioninfo;

/* Save the LASR table back down to SASHDAT on HDFS */
/* and replace the old, unbalanced table there */
save path="/path/in/hdfs" replace;
quit;

Extending What We Know

The concepts of resource utilization and data balancing don’t only apply to LASR. Many similar challenges apply to the SAS High-Performance Analytics software (such as HP STAT) as well because it’s running on the same software foundation as LASR which is provided by the SAS High-Performance Analytics Environment software.

 

tags: LASR, load balancing, SAS Professional Services

About LASR Load Balancing was published on SAS Users.

July 6, 2016
 

Disclaimer: before you get overly excited, PROC EXPAT is not really an actual SAS procedure. Sadly, it will not transfer or translate your code based on location. But it does represent SAS’ expansion of the Customer Contact Center, and that’s good news for our users. Here’s the story behind my made-up proc.

My mission

“Buon giorno!” “Guten Tag!” “Bonjour!” Excitement is in the air, the team buzzes. I’m not at an international airport, I’m at the new SAS office in Dublin, Ireland. I’d been given a one-month assignment to help expand operations, providing training in the Customer Contact Center across channels to deliver exceptional customer support and create an enhanced customer experience around the globe. It was such a rewarding experience!

SAS is a global company with customers in 148 countries, at more than 80,000 sites. The EXPAT Procedure is the name I’ve coined for my month-long adventure in Dublin, training and supporting our newly expanded Customer Contact Center team. So, what does this mean for you? It means additional customer care and expanded hours for all your inquiries and requests. Win!

Bringing expanded customer service to Europe, Middle East and Africa

The expansion was announced last fall, when SAS revealed plans to open a new Inside Sales and Customer Contact Center in Dublin—an investment of around €40 million with a projected 150 new jobs to be created—to provide support across Europe, Middle East and Africa (EMEA).

The new office models the US Customer Contact Center (and this is where I come in), providing support for customers in their channel of choice—be it social media, Live Chat, phone, email and/or web inquiries. We field general questions about SAS software, training, certifications or resources, as well as specific issues, like errors in your SAS log. The Customer Contact Center is here to assist, and now our customers in EMEA can benefit from the added support as well.

And we’re not just answering inquiries, we’re listening to our customers. We’re always looking at ways to make things easier to navigate, simpler to find, and faster to share. And we love customer feedback, whether direct or indirect, to enhance your experience with SAS.

The new team in Dublin is made up of multilingual individuals with loads of experience in the tech industry. They have begun covering the United Kingdom, Ireland and Italy, and it’s been amazing working with such a knowledgeable, patient and fun team with a great sense of humor. I think you’ll like them, too.

While I’ve been assisting with training the team on everything SAS, I’ve gotten a little training myself, working in a new office in a different country, surrounded by colleagues from more than 15 countries across the pond. A reminder of the wide reach of SAS, impact of Big Data analytics, and importance of our worldwide SAS users.

It’s an exciting time for the Customer Contact Center, SAS and our customers. If you’re located in EMEA, don’t hesitate to reach out to us!

tags: facebook, linkedin, sas customer contact center, social media, twitter

PROC EXPAT – Expanding SAS’ global customer service was published on SAS Users.

June 30, 2016
 

I look forward to SAS Global Forum each year, and this past conference ranked right up there as one of the best I've ever attended. This year there were so many wonderful presentations, Super Demos and workshops on the topic of administration of SAS and the underlying hardware infrastructure needed for SAS applications. Below you'll find a list of the top 15 presentations based on attendance. We hope you find these papers useful.

10360 - Nine Frequently Asked Questions about Getting Started with SAS® Visual Analytics
2440 - Change Management: The Secret to a Successful SAS® Implementation
8220 - Optimizing SAS® on Red Hat Enterprise Linux (RHEL) 6 and 7
SAS6840 - Reeling Them Back In Keeping SAS® Visual Analytics Users Happy, Behind the Scenes
SAS2820 - Helpful Hints for Transitioning to SAS® 9.4
SAS6761 - Best Practices for Configuring Your I/O Subsystem for SAS®9 Applications
10861 - Best Practices in Connecting External Databases to SAS®
SAS6280 - Hands-On Workshop: SAS® Environment Manager
SAS5680 - Advanced Topics in SAS® Environment Manager
9680 - SAS Big Brother
SAS4240 - Creating a Strong Business Case for SAS® Grid Manager: Translating Grid Computing Benefits to Business Benefits
10962 - SAS® Metadata Security 201: Security Basics for a New SAS Administrator
8860 - SAS® Metadata Security 101: A Primer for SAS Administrators and Users Not Familiar with SAS
9920 - UCF SAS® Visual Analytics: Implementation, Usage, and Performance
11202 - Let the Schedule Manager Take Care of Scheduling Jobs in SAS® Management Console 9.4

Note: It's hard to believe, but we're already thinking about topics to present at SAS Global Forum 2017. If you have suggestions for SAS Administration paper topics you'd like to see at next year's conference, please share your thoughts in the comments below or contact me directly. 

tags: papers & presentations, SAS Administrators, SAS Global Forum

15 top Global Forum 2016 papers for SAS administrators was published on SAS Users.

June 29, 2016
 

I recently read an article in which the winner of a Kaggle Competition was not shy about sharing his technique for winning not one, but several of the analytical competitions.

“I always use Gradient Boosting,” he said. And then added, “but the key is Feature Engineering.”

A couple days later, a friend who read the same article called and asked, “What is this Feature Engineering that he’s talking about?”

It was a timely question, as I was in the process of developing a risk model for a client, and specifically, I was working through the stage of Feature Engineering.

The act – or I should say the “art” – of Feature Engineering takes time; in fact, it takes much more time than building the actual model. And because many clients expect a quick return on analytics, I was giving some thought to how to explain the value I perceived in the effort being put into engineering features from their data.

But first, I needed to respond to my friend’s question, and I began by describing the environments in which we typically work: “The companies we work with are not devoid of analytical talent; in fact, some institutions have dozens of highly qualified statisticians with many years of experience in their respective fields. Still, they may not be getting the results from their models they are expecting.”

I was quickly asked, “So, how do you get better results? Neural Nets? Random Forests?”

“It’s not the method that sets my results apart from theirs,” I answered. “In fact, the success of my modeling projects is generally solidified before I begin considering ‘which’ method.” I paused and pondered how to continue, “Before I consider methods, I simply derive more information from their data than they do, and that is Feature Engineering.”

Feature Engineering

Feature Engineering is a vital component in the process of attaining extraordinary Machine Learning results, but it doesn’t merit as many intriguing conversations, papers, books or talks at conferences. Regardless, I know that the success of my projects, which is typically attributed to the application of Machine Learning techniques, is actually due to effective Feature Engineering that not only improves the model, but also creates information that the clients can more readily understand.

How did I justify the time spent “Feature Engineering” to the client mentioned above? First, I developed a model with PROC LOGISTIC using only the primary data given to me and called it “Model 1.”

Model 1 was then compared to “Model 2,” a model built using the primary data plus the newly developed features. Examples of those new features that proved significant were constructed as follows (a small illustrative DATA step sketch follows the list):

  • Ratios of various events relative to time.
  • Continuous variables dimensionally reduced into categories.
  • Simple counts of events.
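
To make the list concrete, here is a small, purely illustrative DATA step; the input table and variable names (num_late_payments, account_age_days, utilization, num_overdrafts) are hypothetical:

/* Illustrative feature engineering sketch (hypothetical data) */
data work.features;
   set work.accounts;
   /* Ratio of an event count relative to time */
   if account_age_days > 0 then
      late_per_year = num_late_payments / (account_age_days / 365);
   /* Continuous variable reduced into categories */
   length util_band $4;
   if utilization < 0.30 then util_band = 'LOW';
   else if utilization < 0.70 then util_band = 'MED';
   else util_band = 'HIGH';
   /* Simple count of events */
   events_total = sum(num_late_payments, num_overdrafts, 0);
run;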

And then I proactively promoted the findings!

I often use the OUTROC option to output the necessary components such that I can create a chart of ROC curves using PROC SGPLOT. An example of this process can be viewed in Graph 1 where the value of Feature Engineering is clearly evident in the chart detailing the curves from both models: the c-statistic improved from 0.70 to 0.90.

(Graph 1: ROC curves for Model 1 and Model 2)
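
Here is a hedged sketch of that pattern. The predictor names are hypothetical, and the two OUTROC= data sets are stacked so that both curves can be drawn on one plot:

/* Sketch: fit each model and keep its ROC coordinates (hypothetical predictors) */
proc logistic data=work.model_data;
   model default(event='1') = x1 x2 x3 / outroc=work.roc1;
run;
proc logistic data=work.model_data;
   model default(event='1') = x1 x2 x3 new_feat1 new_feat2 / outroc=work.roc2;
run;
/* Stack the ROC coordinates and label each model */
data work.roc_both;
   set work.roc1(in=a) work.roc2;
   length model $8;
   model = ifc(a, 'Model 1', 'Model 2');
run;
/* Plot both ROC curves */
proc sgplot data=work.roc_both;
   series x=_1mspec_ y=_sensit_ / group=model;
   lineparm x=0 y=0 slope=1 / lineattrs=(pattern=dash);
   xaxis label="1 - Specificity";
   yaxis label="Sensitivity";
run;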

However, there is still a story to be shared, as the improvement is represented not only by a stronger c-statistic, but also in the actual classification of default. For example:

  • Model 1 correctly predicts default in 20% of accounts categorized as “most likely to default,” but it misses on 80%!
  • Model 2 correctly predicts default in 86% of those accounts it categorizes as most likely to default, but misses on only 14%.

How did this happen?

While we could drill down into a discussion of significant factors to seek an in-depth explanation of the results, a more general observation is simply this: Model 2 accomplishes its mission by correctly reclassifying a substantial portion of the population as “low risk” and leaves those with default concerns to be assessed with greater consideration.

Essentially, the default model is vastly improved by better discerning those customers who we need not worry about, leaving an uncluttered pool of customers that require attention (see Graph 2).

(Graph 2: The bulk of the population is reclassified as low risk)

Instead of assessing the default potential for 900,000 accounts, the model instead deals with 100,000 “risky” accounts. It essentially improves its chances of success by creating a more “easy to hit” target, and the result (see Graph 3) is that when it categorizes an account as “high risk” it means it!

(Graph 3: Classification results for the accounts categorized as high risk)

Thanks to Feature Engineering, the business now possesses very actionable information.

How Do I Engineer the Features Necessary to Attain Such Results?

Working from the SAS Data Step and select SAS Procedures, I work through a process in which I attempt to create useful features with the following techniques and methods:

  • Calculate statistics like the minimums, maximums, averages, medians and ranges thinking that extremes (or the lack) might help define interesting behaviors.
  • Count occurrences of events considering that I might highlight statistically interesting habitual behaviors.
  • Develop ratios seeking to add predictive value to already analytically powerful variables as well as to variables that might have previously lacked statistical vigor.
  • Develop quintiles across variables of interest, seeking to create expressive segments of the population while also dealing with extreme values (see the PROC RANK sketch after this list).
  • Apply dimensionality reduction techniques, ranks, clustering etc. expecting that grouping those with similar behaviors will be statistically beneficial.
  • Consider the element of time as an important interaction with any feature that has been developed.
  • Use regression to identify trends in continuous variables thinking that moves up or down (whether fast or slow) will be interesting.
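
As an example of the quintile technique, PROC RANK with GROUPS=5 assigns each observation to a quintile group (0 through 4) for the variables you name; the input table and variables here are hypothetical:

/* Quintiles (groups 0-4) for selected variables (hypothetical data) */
proc rank data=work.accounts groups=5 out=work.accounts_q;
   var balance utilization;
   ranks balance_q5 utilization_q5;
run;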

As with the guy who won the Kaggle competitions, I’ve now revealed some of my secret weapons regarding the building of successful models.

What’s your secret weapon?

 

 

tags: machine learning

Great machine learning starts with resourceful feature engineering was published on SAS Users.

June 28, 2016
 

It’s no secret: here at SAS, we care about our customers. In fact, we care so much that we developed a program with that exact name: our SAS Cares program is here to help you with all things SAS!

We know you may be so awesome that you never need us, but even excellent people need help sometimes (I should know). If you find yourself in a tight SAS spot, have a burning SAS question, or even just want to write some SAS poetry, we’re here to help. Our SAS Cares program is extensive and can help you with any problem, question, or comment you’d like to share with us. We make it easy for you to connect with us, no matter what you are looking for, or what channel you are on. You can find information about all the services available from the SAS Cares Program in this blog from Maggie Miller.

A little bird (tweet, tweet!) told me you’d like to know more about one of the main ways SAS connects with our customers (which also happens to be my personal favorite!), through our social customer service channels.

SAS’ social properties are excellent places for you to ask questions, make suggestions, interact with other users, and give us feedback.  We are currently on Twitter, Facebook, LinkedIn, and YouTube, constantly monitoring these channels, looking for your questions and opportunities to support you. And, because we’re only a click away, we’d like you to consider us your first stop for pretty much anything about SAS. Most of the time, we’ll be able to give you exactly what you need, and in the rare instances we don’t have the answer, we’ll point you to resources that can help you find it.

For example, if you have questions about using SAS or SAS programming, we might send you to our SAS Support Communities. Or if you are getting program errors, or have installation questions, we can help you find the right Product Documentation and SAS Notes, or, if needed, open a Technical Support track for you. If you are not sure where to start, start with us. Our social channels are perfect for real-time interaction. We are here for you, and thrilled to help!

Here’s how you can connect with us via social media:

  • Twitter: @SAS_Cares
    Our primary social customer service channel, ready to handle anything SAS related.
    Reply, mention, and direct message us. Follow us for helpful resources and links for #SASusers!
  • Facebook: SAS Software and SAS Analytics U
    We post the latest SAS news and helpful tips. Like, Follow, Comment, Message, and post to our SAS pages.
  • LinkedIn: SAS
    Primary professional landing spot for SAS users, followed by nearly 300,000 SAS professionals. Follow us and comment on our posts.
  • YouTube: SAS Software
    Subscribe, view, and comment on our videos. Many of them teach specific SAS concepts and techniques, and will help you become a more effective and efficient SAS user.

So give us a try: reach out to us when you need SAS help, have some SAS feedback, or just want to share… because who cares? SAS Cares!

tags: sas cares, sas customer contact center

SAS Cares: help just a tweet, post, or click away was published on SAS Users.

June 24, 2016
 

ProblemSolversXML has become one of the major standards for moving data across the Internet. Some of XML’s strengths are the abilities to better describe data and to be more extensible than any of its predecessors such as CSV. Due to the increased popularity of XML for moving data, I provide a few tips in this article that will help when you need to read XML files into SAS software.

Reading XML Files

You can read XML files into SAS using either the XML engine or the XMLV2 engine. The XML engine was the first engine that SAS created to read XML files. The XMLV2 engine includes new functionality and enhancements and is aliased to the XML92 engine in SAS® 9.2.

It is easy to read XML files using the XMLV2 engine when the XML file that you read uses the GENERIC markup type and conforms to a very rectangular definition. Here is an example of the XML file layout that is required to be read natively using the XMLV2 engine:

(Image: example of the rectangular XML layout that the XMLV2 engine can read natively)

If the file is not in this format, the following informative error message is generated in the SAS log:

(Image: the SAS log error message, which recommends using an XMLMap file)

The XMLMap file referenced in the error message is a SAS specific file that instructs the engine about how to create rows, columns, and other information from the XML markup. You can use either of the following methods to generate the XMLMap file:

  • the AUTOMAP= option within the SAS session beginning with SAS® 9.3 TS1M2
  • the SAS XML Mapper, which is a stand-alone Java application

Generating Dynamic Maps within the SAS Session

Using the XMLV2 engine, you can create a dynamic XMLMap file and use it to generate SAS data sets by specifying the AUTOMAP= and the XMLMAP= options in the LIBNAME statement. The AUTOMAP= option has two possible values: REPLACE and REUSE. REPLACE updates an existing XMLMap file, whereas REUSE uses an existing XMLMap file. Both values create a new XMLMap file if one does not currently exist. When the file is automapped, one or more data sets is created. This method creates a representation that is as relational as possible based on the XML markup. In addition, it generates surrogate keys that enable you to combine the data using the SQL procedure or the DATA step. The following example demonstrates this method.

Here is the XML data that is read:

(Image: the students2.xml data that is read)

Here is the SAS session code:

filename datafile 'c:\test\students2.xml';
filename mapfile "c:\test\students2.map";
libname datafile xmlv2 xmlmap=mapfile automap=replace;

proc copy in=datafile out=work;
run;

Here is the output:

(Image: the data sets created from the XML file)

For more information about using the XMLV2 engine, refer to the “SAS® 9.4 XMLV2 LIBNAME Engine Tip Sheet.”

Using SAS XML Mapper

If you answer “yes” to the following questions, you should use SAS XML Mapper to generate the XMLMap file rather than generating this file dynamically within the SAS session:

  • Do you need to generate an XMLMap from an XML schema file?
  • Do you need to generate a custom XMLMap file or map only a portion of the file?
  • Do you need to view the data to be imported before reading the XML file?
  • Do you need to add specific metadata such as data types, informats/formats, column widths, or names?
  • Do you want to generate sample code for the import of the XML file?

The following image shows SAS XML Mapper once you click the icon that runs the automapping feature, which enables you to quickly generate tables from the XML markup:

(Image: SAS XML Mapper after running the automapping feature)

Using SAS XML Mapper, you can also generate a custom XMLMap file.

For more information, refer to the following video tutorials about creating XMLMap files:

I hope this article helps you easily read XML files into SAS regardless of the file structure. This information should also help you determine the appropriate method to use when generating an XMLMap file.

tags: SAS Problem Solvers, SAS Programmers

Tips for reading XML files into SAS® software was published on SAS Users.

June 22, 2016
 

Report designers often discover after aggregating data by groups in the Visual Analytics Designer that it would also be nice to see additional aggregations of the data, for example, a maximum or minimum of that sum across groups. This means creating an ‘aggregation of an aggregation.’ If you plan your report objectives in advance of loading your data, you can do this by creating an aggregation initially in the Visual Data Builder, rather than the Designer.  And in many cases, it’s possible to get the desired report results by simply loading a small additional table to memory.

Here’s an example of a report with a sum aggregation on the profit measure and Facility Region as the category data item. The report also shows the maximum, minimum, and average of the regional profit values, along with the difference between each region’s profit sum and the average. It’s possible to assign a SUM aggregation to the Profit measure, but the challenge appears when trying then to create a MAX aggregation across the regions.

(Screenshot: report showing the regional profit sums along with the maximum, minimum, average, and difference values)

If you have awareness of the need for this type of report prior to loading the data, you can easily load a much smaller version of the data with the Profit data item already pre-aggregated. The data builder query, shown below, creates and loads a file containing only the FacilityRegion as a GROUP BY column and the Profit column as a SUM aggregation.

(Screenshots: the Visual Data Builder query and the resulting aggregated table)
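
Conceptually, the query is doing nothing more than the following SQL; the source library and table names are hypothetical:

/* Conceptual equivalent of the data builder query (hypothetical source table) */
proc sql;
   create table work.profit_by_region as
   select FacilityRegion,
          sum(Profit) as Profit_SUM
   from detail.facility_data
   group by FacilityRegion;
quit;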

In the designer, for reporting purposes, the Profit_SUM data item is renamed to Sum of Profit for Region.


Now the Sum of Profit for Region data item can be used to calculate these aggregated measures:  Max Regional Profit Sum, Min Regional Profit Sum, and Avg Regional Profit Sum, using the formulas shown below:

(Screenshot: formulas for the Max, Min, and Avg Regional Profit Sum aggregated measures)

Once you have the Avg Regional Profit Sum calculation, you can create the additional aggregated measure, Regional Profit – Average:

(Screenshot: formula for the Regional Profit – Average aggregated measure)

So, this is a list table displaying all of the information that was derived from the two columns and four rows of data.

(Screenshot: the list table displaying the regional profit aggregations)

To emphasize the region with the maximum profit and minimum profit, you may want to add the following display rules.

(Screenshot: display rules highlighting the regions with the maximum and minimum profit)

And since you have the original larger table loaded as well, you can add an additional data source and some interactions, so that the user can click on a row to view more detail data corresponding to each region of interest.


Now, for example, the user can see what facilities contributed to the max value for the Western region, and the corresponding Expenses as well!

(Screenshot: detail data for the facilities in the Western region)

So, although it may not be practical to pre-aggregate the values of every measure in your data, sometimes pre-aggregating and storing the much smaller table can give you the benefit of being able to get the additional aggregations that you need for a specialized report, while relying on a larger stored table for the details.

tags: SAS Professional Services, SAS Programmers, SAS Visual Analytics

Need an aggregation of an aggregation? Use Visual Data Builder to pre-aggregate your data was published on SAS Users.