February 15, 2017
 

Corporate compliance with an increasing number of industry regulations intended to protect personally identifiable information (PII) has made data privacy a frequent and public discussion. An inherent challenge to data privacy is, as Tamara Dull explained, “data, in and of itself, has no country, respects no law, and travels freely across borders. In the […]

The post What does the requirement for data privacy mean for data scientists, business analysts and IT? appeared first on The Data Roundtable.

February 15, 2017
 

A categorical response variable can take on k different values. If you have a random sample from a multinomial response, the sample proportions estimate the proportion of each category in the population. This article shows how to construct simultaneous confidence intervals for the proportions, following the 1997 paper "A SAS macro for constructing simultaneous confidence intervals for multinomial proportions" by Warren May and William Johnson (Computer Methods and Programs in Biomedicine, pp. 153–162).

Estimates of multinomial proportions

In their paper, May and Johnson present data for a random sample of 220 psychiatric patients who were categorized as neurotic, depressed, schizophrenic, or having a personality disorder. The observed counts for each diagnosis are as follows:

data Psych;
input Category $21. Count;
datalines;
Neurotic              91
Depressed             49
Schizophrenic         37
Personality disorder  43
;

If you divide each count by the total sample size, you obtain estimates for the proportion of patients in each category in the population. However, the researchers wanted to compute simultaneous confidence intervals (CIs) for the parameters. The next section shows several methods for computing the CIs.
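If you just want to see those point estimates, a quick way (before involving SAS/IML) is to weight a one-way frequency table by the counts. This is only a descriptive sketch; the simultaneous CIs require the methods described in the next section:

/* Sketch: sample proportions (percentages) only; this does NOT
   produce the simultaneous confidence intervals discussed below */
proc freq data=Psych order=data;
   weight Count;                 /* each row represents Count patients */
   tables Category / nocum;      /* one-way table of proportions */
run;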

Methods of computing simultaneous confidence intervals

May and Johnson discussed six different methods for computing simultaneous CIs. In the following, 1–α is the desired overall coverage probability for the confidence intervals, χ²(α, k–1) is the 1–α quantile (upper α critical value) of the χ² distribution with k–1 degrees of freedom, and π₁, π₂, ..., πₖ are the true population proportions. The methods and their references are:

  1. Quesenberry and Hurst (1964): Find the parameters πᵢ that satisfy
    N(pᵢ – πᵢ)² ≤ χ²(α, k–1) πᵢ(1 – πᵢ).
  2. Goodman (1965): Use a Bonferroni adjustment and find the parameters that satisfy
    N(pᵢ – πᵢ)² ≤ χ²(α/k, 1) πᵢ(1 – πᵢ).
    (A from-scratch sketch of this method appears after this list.)
  3. Binomial estimate of variance: For a binomial variable, you can bound the variance by using πᵢ(1 – πᵢ) ≤ 1/4. You can construct a conservative CI for the multinomial proportions by finding the parameters that satisfy
    N(pᵢ – πᵢ)² ≤ χ²(α, 1) (1/4).
  4. Fitzpatrick and Scott (1987): You can ignore the magnitude of the proportion when bounding the variance to obtain confidence intervals that are all the same length, regardless of the number of categories (k) or the observed proportions. The formula is
    N(pᵢ – πᵢ)² ≤ c² (1/4),
    where c² = χ²(α/2, 1) for α ≤ 0.016 and c² = (8/9) χ²(α/3, 1) for 0.016 < α ≤ 0.15.
  5. Q and H with sample variance: You can replace the unknown population variances by the sample variances in the Quesenberry and Hurst formula to get
    N(pᵢ – πᵢ)² ≤ χ²(α, k–1) pᵢ(1 – pᵢ).
  6. Goodman with sample variance: You can replace the unknown population variances by the sample variances in the Goodman Bonferroni-adjusted formula to get
    N(pᵢ – πᵢ)² ≤ χ²(α/k, 1) pᵢ(1 – pᵢ).
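To make the Goodman method concrete, the following SAS/IML statements solve the quadratic inequality in πᵢ directly for the psychiatric data. This is only a from-scratch sketch of method 2, not the CONINT macro or the MultCI function discussed later; it assumes the Psych data set from above and uses the QUANTILE function for the χ² critical value:

proc iml;
/* Sketch: Goodman (1965) Bonferroni-adjusted simultaneous CIs (method 2).
   Solve N(p_i - pi_i)^2 <= c * pi_i(1 - pi_i) as a quadratic in pi_i. */
use Psych;  read all var {"Category" "Count"};  close;
alpha = 0.05;
N = sum(Count);   k = nrow(Count);
p = Count / N;                               /* sample proportions */
c = quantile("ChiSquare", 1 - alpha/k, 1);   /* chi-square(alpha/k, 1) critical value */
disc  = sqrt( c # (c + 4#N#p#(1-p)) );       /* discriminant of the quadratic */
Lower = ( (2#N#p + c) - disc ) / ( 2#(N + c) );
Upper = ( (2#N#p + c) + disc ) / ( 2#(N + c) );
print Category p[format=6.3] Lower[format=6.3] Upper[format=6.3];
quit;

For the neurotic category, this sketch gives approximately [0.33, 0.50], which matches the interval reported later in this article.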

In a separate paper, May and Johnson used simulations to test the coverage probability of each of these formulas. They conclude that the simple Bonferroni-adjusted formula of Goodman (second in the list) "performs well in most practical situations when the number of categories is greater than 2 and each cell count is greater than 5, provided the number of categories is not too large." In comparison, the methods that use the sample variance (fifth and sixth in the list) are "poor." The remaining methods "perform reasonably well with respect to coverage probability but are often too wide."

A nice feature of the Q&H and Goodman methods (first and second in the list) is that they produce asymmetric intervals that always lie within the interval [0,1]. In contrast, the other intervals are symmetric and might not be a subset of [0,1] for extremely small or large sample proportions.

Computing CIs for multinomial proportions

You can download SAS/IML functions that are based on May and Johnson's paper and macro. The original macro used SAS/IML version 6, so I have updated the program to use a more modern syntax. I wrote two "driver" functions:

  • CALL MultCIPrint(Category, Count, alpha, Method) prints a table that shows the point estimates and simultaneous CIs for the counts, where the arguments to the function are as follows:
    • Category is a vector of the k categories.
    • Count is a vector of the k counts.
    • alpha is the significance level for the (1-alpha) CIs. The default value is alpha=0.05.
    • Method is a number 1, 2, ..., 6 that indicates the method for computing confidence intervals. The previous list shows the method number for each of the six methods. The default value is Method=2, which is the Goodman (1965) Bonferroni-adjusted method.
  • MultCI(Count, alpha, Method) is a function that returns a three-column matrix that contains the point estimates, lower limit, and upper limit for the CIs. The arguments are the same as above, except that the function does not use the Category vector.

Let's demonstrate how to call these functions on the psychiatric data. The following program assumes that the function definitions are stored in a file called conint.sas; you might have to specify a complete path. The PROC IML step loads the definitions and the data, then calls the MultCIPrint routine and requests the Goodman method (method=2):

%include "conint.sas";   /* specify full path name to file */
proc iml;
load module=(MultCI MultCIPrint);
use Psych; read all var {"Category" "Count"}; close;
 
alpha = 0.05;
call MultCIPrint(Category, Count, alpha, 2); /* Goodman = 2 */
[Table: Estimates and 95% simultaneous confidence intervals for multinomial proportions]

The table shows the point estimates and 95% simultaneous CIs for this sample of size 220. If the intervals are wider than you expect, remember the goal: for 95% of random samples of size 220 this method should produce a four-dimensional region that contains all four parameters.

You can visualize the width of these intervals by creating a graph. The easiest way is to write the results to a SAS data set. To get the results in a matrix, call the MultCI function, as follows:

CI = MultCI(Count, alpha, 2);  /*or simply CI = MultCI(Count) */
/* write estimates and CIs to data set */
Estimate = CI[,1];  Lower = CI[,2];  Upper = CI[,3];
create CIs var {"Category" "Estimate" "Lower" "Upper"};
append;
close;
quit;
 
ods graphics / width=600px height=240px;
title 'Simultaneous Confidence Intervals for Multinomial Proportions';
title2 'Method of Goodman (1965)';
proc sgplot data=CIs;
  scatter x=Estimate y=Category / xerrorlower=Lower xerrorupper=upper
          markerattrs=(Size=12  symbol=DiamondFilled)
          errorbarattrs=GraphOutlines(thickness=2);
  xaxis grid label="Proportion" values=(0.1 to 0.5 by 0.05);
  yaxis reverse display=(nolabel);
run;
[Figure: Estimates and simultaneous confidence intervals for multinomial proportions]

The graph shows intervals that are likely to enclose all four parameters simultaneously. The neurotic proportion in the population is probably in the range [0.33, 0.50] and at the same time the depressed proportion is in the range [0.16, 0.30] and so forth. Notice that the CIs are not symmetric about the point estimates; this is most noticeable for the smaller proportions such as the schizophrenic category.

Because the cell counts are all relatively large and because the number of categories is relatively small, Goodman's CIs should perform well.

I will mention that if you use the Fitzpatrick and Scott method (method=4), you will get different CIs from those reported in May and Johnson's paper. The original May and Johnson macro contained a bug that was corrected in a later version (personal communication with Warren May, 25FEB2016).

Conclusions

This article presents a set of SAS/IML functions that implement six methods for computing simultaneous confidence intervals for multinomial proportions. The functions are updated versions of the macro %CONINT, which was presented in May and Johnson (1997). You can use the MultCIPrint function to print a table of statistics and CIs, or you can use the MultCI function to retrieve that information into a SAS/IML matrix.

tags: Statistical Programming

The post Simultaneous confidence intervals for multinomial proportions appeared first on The DO Loop.

February 14, 2017
 

When I mentioned to friends that I'm going to Orlando for SAS Global Forum 2017, they asked if I would be taking my kids. Clearly my friends have not attended a SAS Global Forum before, as there have been years where I never even left the hotel! My kids would NOT enjoy it… but, […]

The post Learn about SAS Studio, SAS Enterprise Guide and (drumroll) SAS Viya at SAS Global Forum 2017! appeared first on SAS Learning Post.

February 14, 2017
 

In this post I wanted to shed some light on a visualization you may not be using enough: the Word Cloud. Word association exercises can often be a fun way to pass the time with friends, or they can trigger immediate action – just think of your email inbox and seeing an email from a particular person: your boss, wife, husband or child. The same can be true for information about your organization. A single word can quickly, efficiently and effectively communicate the performance of a company's metric, hence the value of using a word cloud visualization in your report.

Let’s look at some examples. Here I am using the Insight Toy data and looking at the performance of Products based on customer orders.

As the word cloud in SAS Visual Analytics 7.3 Designer has a maximum row return of 100, I have used the Rank feature to look at the top 25 Products and the bottom 25 Products. I also created a filtered interaction between the word clouds and their respective list tables below to show a bit more detail around the next level in the hierarchy after Product Make.

Notice how much more impactful these Product names are than their corresponding SKUs. Be sure to pick a meaningful category to represent your data in the word cloud.

This type of visualization could lead to a great comparison report, comparing what the top and bottom Products were for the same month in the previous year.

What if your data doesn't have the appropriate column to display on a word cloud? No problem. In this next example, I took the value of Sales Rep Rating and created a new Calculated Data Item that labels values less than or equal to 25% as Poor, values from 26% through 50% (inclusive) as Average, and everything else as Above Average.
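If you prefer to prepare the category outside of the Designer, the same binning rule is easy to express in a DATA step. The following is only a hedged sketch: the data set and variable names (InsightToy, SalesRepRating, SalesRepPerf) are hypothetical stand-ins, and the rating is assumed to be a percentage on a 0–100 scale:

/* Hypothetical sketch of the Poor / Average / Above Average binning rule */
data SalesRepPerf;
   set InsightToy;                      /* hypothetical source table */
   length Performance $13;
   if      SalesRepRating <= 25 then Performance = "Poor";
   else if SalesRepRating <= 50 then Performance = "Average";
   else                              Performance = "Above Average";
run;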

Using a word cloud for this new category data item allows you to quickly move through the different states and compare the Sales Rep Performance frequency. You could also use this new category to compare each performance group’s Order Totals.

Here is California’s Sales Rep Performance:

And here is Maryland’s Sales Rep Performance:


These are two ideas for how you might include the word cloud visualization in your reports to quickly and effectively represent the status of a company's metric beyond the standard text analytics usage.

tags: SAS Professional Services, SAS Visual Analytics

Visualization Spotlight: Visual Analytics Designer 7.3 Word Cloud was published on SAS Users.

February 14, 2017
 

Editor's note: The following post is from Shara Evans, CEO of Market Clarity Pty Ltd. Shara is a featured speaker at SAS Global Forum 2017, a globally acknowledged Keynote Speaker, and widely regarded as one of the world's Top Female Futurists.

Learn more about Shara.


In the movie Minority Report, lead character John Anderton, played by Tom Cruise, has an eye transplant in order to avoid being recognized by ubiquitous iris-scanning identification systems.

Such surgical procedures still face some fairly significant challenges, in particular connecting the optic nerve of the transplanted eye to that of the recipient. However, the concept of pervasive individual identification systems is now very close to reality, and although the surgical solution is already available, it's seriously drastic!

We’re talking face recognition here.

Many facial recognition systems are built on the concept of “cooperative systems,” where you look directly at the camera from a pre-determined distance and you are well lit, and your photo is compared against a verified image stored in a database. This type of system is used extensively for border control and physical security systems.

Facial recognition

[Image: Face in the Crowd Recognition (crowd walking towards camera in a corridor). Source: Imagus]

Where it gets really interesting is with “non-cooperative systems,” which aim to recognize faces in a crowd: in non-optimal lighting situations and from a variety of angles. These systems aim to recognize people who could be wearing spectacles, scarves or hats, and who might be on the move. An Australian company, Imagus Technology has designed a system that is capable of doing just that — recognizing faces in a crowd.

To do this, the facial recognition system compiles a statistical model of a face by looking at low-frequency textures such as bone structure. Some systems use very high-frequency features such as moles on the skin, eyelashes, wrinkles, or crow's feet at the edges of the eyes, but those features require a very high-quality image. With people walking past, there's motion blur and the camera angles are not optimal, so in that case using low-frequency information gets very good matches.

Biometrics are also gaining rapid acceptance for both convenience and fraud prevention in payment systems. The two most popular biometric markers are fingerprints and faces, and they are generally deployed as part of a two-factor authentication system. For example, MasterCard's "Selfie Pay" app was launched in Europe in late 2016 and is now being rolled out to other global locations. This application was designed to speed up and secure online purchases.

Facial recognition is particularly interesting because, while not every mobile phone in the world will be equipped with a fingerprint reader, virtually every device has a camera. We're all suffering from password overload, and biometrics - if properly secured and rolled out as part of a multi-factor authentication process - can spare us from coming up with, and remembering, complex passwords for the many apps and websites that we frequent.

It's not just about recognizing individuals

Facial recognition systems are also being used for marketing and demographics. In a store, for example, you might want to count the number of people looking at your billboard or your display. You'd like to see a breakdown of how many males and females there are, age demographics, time spent in front of the ad, and other relevant parameters.

Can you imagine a digital advertising sign equipped with facial recognition? In Australia, Digital Out-of-Home (DOOH) devices are already being used to choose the right time to display a client’s advertising. To minimize wastage in ad spend, ads are displayed only to a relevant audience demographic; for instance, playing an ad for a family pie only when it sees a mum approaching.

What if you could go beyond recognizing demographics to analyzing people’s emotions? Advances in artificial intelligence are turning this science fiction concept into reality. Robots such as “Pepper” are equipped with specialized emotion recognition software that allows them to adapt to human emotions. Again, in an advertising context, this could prove to be marketing gold.

Privacy Considerations

Of course, new technology is always a double-edged sword, and biometrics and advanced emotion detection certainly fall into this category.

For example, customers typically register for a biometric payment system in order to realize a benefit such as faster or more secure e-commerce checkouts or being fast-tracked through security checks at airports. However, the enterprise collecting and using this data must in turn satisfy the customer that their biometric reference data will be kept and managed securely, and used only for the stated purpose.

The advent of advanced facial recognition technologies provides new mechanisms for retailers and enterprises to identify customers, for example from CCTV cameras as they enter shops or as they view public advertising displays. It is when these activities are performed without the individual’s knowledge or consent that concerns arise.

Perhaps most worrisome is that emotion recognition technology would be impossible to control. For example, anyone would be able to take footage of world leaders fronting the press in apparent agreement after the outcome of major negotiations and perhaps reveal their real emotions!

From a truth perspective, maybe this would be a good thing.

But, imagine that you’re involved in intense business negotiations. In the not too distant future advanced augmented reality glasses or contacts could be used to record and analyze the emotions of everyone in the room in real time. Or, maybe you’re having a heart-to-heart talk with a family member or friend. Is there such a thing as too much information?

Most of the technology for widespread exploitation of face recognition is already in place: pervasive security cameras connected over broadband networks to vast resources of cloud computing power. The only piece missing is the software. Once that becomes reliable and readily available, hiding in plain sight will no longer be an option.

Find out more at SAS Global Forum

This is a preview of some of the concepts that Shara will explore in her speech on “Emerging Technologies: New Data Sets to Interpret and Monetize” at SAS Global Forum:

  • Emerging technologies such as advanced wearables, augmented and virtual reality, and biometrics — all of which will generate massive amounts of data
  • Smart Cities — bringing infrastructure to life with sensors, IoT connections and robots
  • Self-Driving Cars + Cars of the Future — exploring the latest in automotive technologies, robot vision, vehicle sensors, V2V comms + more
  • The Drone Revolution — looking at both the incredible benefits and the challenges we face as drones take to the skies with high-definition cameras and sensors
  • The Next Wave of Big Data — how AI will transform information silos, perform advanced voice recognition, facial recognition and emotion detection
  • A Look Into the Future — how the convergence of biotech, ICT, nanotechnologies and augmentation of our bodies may change what it means to be human

Join Shara for a ride into the future where humans are increasingly integrated with the ‘net!

About Shara Evans

Technology Futurist Shara Evans is a globally acknowledged Keynote Speaker and widely regarded as one of the world’s Top Female Futurists. Highly sought after by conference producers and media, Shara provides the latest insights and thought-provoking ideas on a broad spectrum of issues. Shara can be reached via her website: www.sharaevans.com

(Note: My new website will be launching in a few weeks. In the meantime, the URL automatically redirects to my company website – www.marketclarity.com.au )

tags: analytics, SAS Global Forum

Facial recognition: Monetizing faces in the crowd was published on SAS Users.

February 13, 2017
 

After the recent presidential election, I was updating my graphs of the voter registration data and noticed that the number of registered voters decreased after the election. At first I thought that was odd, but then I realized that maybe inactive voters were being purged. I wanted to find out […]

The post Purging inactive voter registrations in North Carolina appeared first on SAS Learning Post.

February 13, 2017
 

The term compliance is most often associated with control. It evokes visions of restrictions, regulations and security protecting something which is to remain private. The term open is most often associated with access, and it evokes visions of an absence of restrictions, regulations and security – making something available which is […]

The post Can you be open and compliant at the same time? appeared first on The Data Roundtable.

February 13, 2017
 

A common question on SAS discussion forums is how to repeat an analysis multiple times. Most programmers know that the most efficient way to analyze one model across many subsets of the data (perhaps each country or each state) is to sort the data and use a BY statement to repeat the analysis for each unique value of one or more categorical variables. But did you know that a BY-group analysis can sometimes be used to replace macro loops? This article shows how you can efficiently run hundreds or thousands of different regression models by restructuring the data.

One model: Many samples

As I've written before, BY-group analysis is also an efficient way to analyze simulated samples or bootstrapped samples. I like to tell people that you can choose "the slow way or the BY way" to analyze many samples.

In that phrase, "the slow way" refers to the act of writing a macro loop that calls a SAS procedure to analyze one sample at a time. The statistics for all the samples are later aggregated, often by using PROC APPEND. As I (and others) have written, macro loops that call a procedure hundreds or thousands of times are relatively slow.

As a general rule, if you find yourself programming a macro loop that calls the same procedure many times, you should ask yourself whether the program can be restructured to take advantage of BY-group processing.


Many models: One sample

There is another application of BY-group processing, which can be incredibly useful when it is applicable. Suppose that you have wide data with many variables: Y, X1, X2, ..., X1000. Suppose further that you want to compute the 1000 single-variable regression models of the form Y=Xi, where i = 1 to 1000.

One way to run 1000 regressions would be to write a macro that contains a %DO loop that calls PROC REG 1000 times. The basic form of the macro would look like this:

%macro RunReg(DSName, NumVars);
...
%do i = 1 %to &NumVars;                    /* repeat for each x&i */
   proc reg data=&DSName noprint
            outest=PE(rename=(x&i=Value)); /* save parameter estimates */
   model Y = x&i;                          /* model Y = x_i */
   quit;
 
   /* ...then accumulate statistics... */
%end;
%mend;

The OUTEST= option saves the parameter estimates in a data set. You can aggregate the statistics by using PROC APPEND or the DATA step.
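For completeness, the "accumulate statistics" placeholder in the macro is typically just a PROC APPEND step. Here is a minimal sketch, where AllPE is a hypothetical data set that collects the estimates across iterations (delete it before the loop starts):

/* inside the %DO loop: add this iteration's estimates to the accumulator */
proc append base=AllPE data=PE force;
run;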

If you use a macro loop to do this computation, it will take a long time for all the reasons stated in the article "The slow way or the BY way." Fortunately, there is a more efficient alternative.

The BY way for many models

An alternative way to analyze those 1000 regression models is to transpose the data to long form and use a BY-group analysis. Whereas the macro loop might take a few minutes to run, the BY-group method might complete in less than a second. You can download a test program and compare the time required for each method by using the link at the end of this article.

To run a BY-group analysis:

  1. Transpose the data from wide to long form. As part of this process, you need to create a variable (the BY-group variable) that will be unique for each model.
  2. Sort the data by the BY-group variable.
  3. Run the SAS procedure, which uses the BY statement to specify each model.

1. Transpose the data

In the following code, the explanatory variables are read into an array X. The name of each variable is stored by using the VNAME function, which returns the name of the variable in the ith element of the array X. If the original data has N observations and p explanatory variables, the LONG data set contains Np observations.

/* 1. transpose from wide (Y, X1, ..., X&nCont) to long (varNum VarName Y Value) */
data Long;
set Wide;                       /* <== specify data set name HERE         */
array x [*] x1-x&nCont;         /* <== specify explanatory variables HERE */
do varNum = 1 to dim(x);
   VarName = vname(x[varNum]);  /* variable name in char var */
   Value = x[varNum];           /* value for each variable for each obs */
   output;
end;
drop x:;
run;

2. Sort the data

In order to perform a BY-group analysis in SAS, sort the data by the BY-group variable. You can use the VARNUM variable if you want to preserve the order of the variables in the wide data. Or you can sort by the name of the variable, as done in the following call to PROC SORT:

/* 2. Sort by BY-group variable */
proc sort data=Long;  by VarName;  run;

3. Run the analyses

You can now call a SAS procedure one time to compute all regression models:

/* 3. Call PROC REG and use BY statement to compute all regressions */
proc reg data=Long noprint outest=PE;
by VarName;
model Y = Value;
quit;
 
/* Look at the results */
proc print data=PE(obs=5);
var VarName Intercept Value;
run;

The PE data set contains the parameter estimates for every single-variable regression of Y onto Xi. The table shows the parameter estimates for the first few models. Notice that the models are presented in the order of the BY-group variable, which for this example is the alphabetical order of the name of the explanatory variables.

Conclusions

You can download the complete SAS program that generates example data and runs many regressions. The program computes the regression estimates two ways: by using a macro loop (the SLOW way) and by transforming the data to long form and using BY-group analysis (the BY way).

This technique is applicable when the models all have a similar form. In this example, the models were of the form Y=Xi, but a similar approach works for GLM models such as Y=A|Xi, where A is a fixed classification variable. Of course, you could also use generalized linear models such as logistic regression.
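For example, here is a hedged sketch of the same BY-group trick with PROC LOGISTIC. It assumes the long data set contains a binary 0/1 response named YBin (a hypothetical name); everything else mirrors the PROC REG step:

/* Sketch: one logistic regression per explanatory variable via BY groups.
   YBin is an assumed binary response variable in the Long data set. */
proc logistic data=Long noprint outest=PE_logistic;
   by VarName;
   model YBin(event='1') = Value;
run;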

Can you think of other ways to use this trick? Leave a comment.

tags: Data Analysis, Getting Started, Statistical Programming

The post An easy way to run thousands of regressions in SAS appeared first on The DO Loop.