
Marketers today use varying adaptations of the customer journey to describe a circular, looped decision pathway with four distinct phases.

Mapping the right data to specific stages of the customer journey is all about getting to know your customers and developing initiatives to put that knowledge into action. Applying analytical models across the key customer journey phases uncovers opportunities to cultivate value generating behaviors and extend the customer’s lifetime value.

  • Initial Consideration Set (Research/Discover). Data and analytics in this phase help you gain a deeper understanding of customers and prospects. Segmentation surfaces stated and unmet customer needs and buying motivations. Reach the right prospects with look-alike acquisition models and evaluate prospects with lead scoring techniques.
  • Active Evaluation (Explore/Consider). Data and analytics in this phase help you dynamically adapt marketing efforts to customer response – in real-time. Offer optimization techniques can match the appropriate offer based on historical customer response. Amazon’s recommendation engine is a familiar example. Also, A/B and multivariate testing can assess various marketing variables, such as messaging and content types before you roll out initiatives on a wider scale.
  • Moment of Purchase (Buy/Convert). Data and analytics help you understand how and when customers will purchase. Predictive techniques such as propensity models help marketers predict the likelihood that a customer will respond to a specific offer or message and convert. Expand share of wallet with cross-sell and affinity models, or understand future buying behavior through propensity models.

  • Post-purchase Experience (Use/Maintain/Advocate). Data and analytics in this phase help you uncover patterns of usage behavior and further drive customer engagement. For example, a retail site may tell you the status of your recent order the moment you land on the home page. Churn models such as uplift modeling and survival analysis can provide early warning signs of defection. Preempt customer churn with corrective actions, such as special offers or free upgrades.

Open, unified capabilities needed

Brands that build the most effective customer journeys master three interrelated capabilities: unified customer data platforms, proactive analytics and contextual interactions.

  • Unified customer data platforms: This capability unifies a company's customer data from online and offline channels to extract customer insights and steer customer experience. This includes the ability to cleanse, normalize and aggregate data from disparate systems – within the enterprise and externally – at an individual level.
  • Proactive analytics: Purpose-built data collection and analytics capabilities that incorporate both customer analytics (giving brands the customer insight necessary to provide offers that are anticipated, relevant and timely) and marketing analytics (evaluating marketing performance using metrics such as ROI, channel attribution and overall marketing effectiveness).
  • Contextual interactions: This capability involves using real-time insights about where a customer is in a journey, whether digitally (browsing product reviews) or physically (entering a retail outlet), to draw her forward into subsequent actions the company wants her to pursue.

The results are dramatic when marketers can combine data management, analytics and insights execution into a unified marketing platform.

Consider gourmet gift retailing icon, Harry & David. By combining data-driven marketing with enriched customer insight, the company transformed its catalog heritage into a contemporary, digital retailing powerhouse. In the past three years, customer retention has increased by 14 percent and sales per customer have gone up 7 percent.

The largest retail group in Switzerland, Migros, used data and analytics to further optimize the customer journey.

The upshot: Change perception to reality

“If change is happening on the outside faster than on the inside, the end is in sight.” – Jack Welch

Digitally-empowered prospects and customers are calling the shots, going after what they want when they want it. With a unified view of data and analytics, brands can position themselves in front of their customers’ paths as they navigate the customer journey.

For the brands that can see the world as their customers do – and shape the customer journey accordingly – the reward is higher brand preference, revenue and cost improvements, and a lasting competitive advantage.

Assess your marketing confidence

Take stock of your digital marketing approach with the Marketing Confidence Quotient. This assessment tool quickly identifies and scores your company's strengths and weaknesses across four marketing dimensions: data management, analytics use, process integration and business alignment. It's like having your own personal marketing adviser.

A better approach: Align data and analytics across the customer journey was published on Customer Intelligence Blog.

Distribution of colors of M&M's

Many introductory courses in probability and statistics encourage students to collect and analyze real data. A popular experiment in categorical data analysis is to give students a bag of M&M® candies and ask them to estimate the proportion of colors in the population from the sample data. In some classes, the students are also asked to perform a chi-square analysis to test whether the colors are uniformly distributed or whether the colors match a hypothetical set of proportions.

M&M's® have a long history at SAS. SAS is the world’s largest corporate consumer of M&M's. Every Wednesday a SAS employee visits every breakroom on campus and fills two large containers full of M&M's. This article uses SAS software to analyze the classic "distribution of colors" experiment.

The proportion of colors for M&M's

The "plain" M&M candies (now called "milk chocolate M&M's") are produced by the Mars, Inc. company. The distribution of colors in M&M's has a long and colorful history. The colors and proportions occasionally change, and the distribution is different for peanut and other varieties. A few incidents from my lifetime that made the national news are:

  • 1976: Red M&M's are replaced by orange. This was a big story. "Red dye #2" had been discovered to be a carcinogen. Although Mars did not use this dye in their candies, the company changed colors to alleviate customer concerns.
  • 1986: Red M&M's are brought back. Orange stays.
  • 1995: The tan color is replaced by a more vivid color. In a promotional campaign, the public is asked to vote for the replacement color. Ten million vote; blue wins in a landslide.
  • Late 1990s: The M&M web site lists the distribution of colors. Circa 1997, the color distribution was
    30% brown, 20% yellow, 20% red, 10% orange, 10% green, and 10% blue.
    Statisticians and educators rejoice and publish many papers on the topic.
  • 2008: Mars changes the color distribution to
    24% blue, 20% orange, 16% green, 14% yellow, 13% red, 13% brown.
    Some time later, the proportions were removed from the web site and have not been restored.
  • 2017: What is the current distribution of colors? Read on for an interesting story!

Proportions and chi-square test

The breakroom containers at SAS are filled from two-pound bags. So as to not steal all the M&M's in the breakroom, I conducted this experiment over many weeks in late 2016 and early 2017, taking one scoop of M&M's each week. The following data set contains the cumulative counts for each of the six colors in a sample of size N = 712:

data MandMs;
input Color $7. Count;
datalines;
Red     108
Orange  133
Yellow  103
Green   139
Blue    133
Brown    96
;

A bar chart that shows the observed distribution of colors in M&M's is shown at the top of this article.

To estimate the proportion of colors in the population, simply divide each count by the total sample size, or use the FREQ procedure in SAS. PROC FREQ also enables you to run a chi-square test that compares the sample counts to the expected counts under a specified distribution. The most recent published distribution is from 2008, so let's test those proportions:

proc freq data = MandMs order=data;
   weight Count;
   tables Color / nocum chisq
   /* 2008 proportions:  red orange yellow green blue brown */
                  testp=(0.13  0.20  0.14  0.16  0.24  0.13);
run;
Chi-square test for distribution of colors in M&M's

The observed and expected proportions are shown in the table to the right. The chi-square test rejects the null hypothesis at the α = 0.05 significance level (95% confidence). In other words, the distribution of colors for M&M's in this 2017 sample does NOT appear to be the same as the color distribution from 2008! You can see this visually from the bar chart: the red and green bars are too tall and the blue bar is too short compared with the expected values.

You need a large sample to be confident that this empirical deviation is real. After collecting data for a few weeks, I did a preliminary analysis that analyzed about 300 candies. With that smaller sample, the difference between the observed and expected proportions could be attributed to sampling variability and so the chi-square test did not reject the null hypothesis. However, while running that test I noticed that the green and blue colors accounted for the majority of the difference between the observed and theoretical proportions, so I decided to collect more data.

Simultaneous confidence intervals for the M&M proportions

As I explained in a previous article, you can use the sample proportions to construct simultaneous confidence intervals for the population proportions. The following SAS/IML statements load and call the functions from the previous post:

%include "conint.sas";         /* define the MultCI and MultCIPrint modules */
proc iml; 
load module=(MultCI MultCIPrint);  
use MandMs; read all var {"Color" "Count"}; close;
 
alpha = 0.05;
call MultCIPrint(Color, Count, alpha, 2); /* construct CIs using Goodman's method */
Simultaneous confidence limits for proportions of colors in plain M&M's

The table indicates that the published 2008 proportion for blue (0.24) is far outside the 95% confidence interval, and the proportion for green (0.16) is just barely inside its interval. That by itself does not prove that the 2008 proportions are no longer valid (we might have gotten unlucky during sampling), but combined with the earlier chi-square test, it seems unlikely that the 2008 proportions are applicable to these data.

Calling the experts

The published 2008 proportions for green and blue do not seem to match the sample proportions. For this large sample, the published proportion of blue is too large whereas the published proportion of green is too small.

From reading previous articles, I know that the Customer Care team at M&M/Mars is very friendly and responsive. Apparently they get asked about the distribution of colors quite often, so I sent them a note. The next day they sent a breakdown of the colors for all M&M candies.

Interestingly, plain (and peanut) M&M's are now produced at two different factories in the US, and the factories do not use the same mixture of colors! You need to look on the packaging for the manufacturing code, which is usually stamped inside a rectangle. In the middle of the code will be the letters HKP or CLV. For example, the code might read 632GCLV20.

  • CLV: The Cleveland plant uses the following proportion of colors for plain M&M's:
    Red=0.131, Orange=0.205, Yellow=0.135, Green=0.198, Blue=0.207, and Brown=0.124.
  • HKP: The Hackettstown, NJ, plant uses the following proportion of colors for plain M&M's:
    Red=0.125, Orange=0.25, Yellow=0.125, Green=0.125, Blue=0.25, and Brown=0.125.

Although I did not know about the manufacturing codes when I collected the data, I think it is clear that the bulk of my data came from the CLV plant. You can create a graph that shows the sample proportions, the 95% simultaneous confidence intervals, and vertical hash marks to indicate the CLV population parameters, as follows:
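(The original plotting code is not included in this excerpt; the following is my own minimal sketch of how such a graph could be built, not the author's exact program. It assumes the MultCI results have been written to a data set named CIs with variables Color, Estimate, Lower and Upper, in the same way the psychiatric example later in this document writes its CIs; the CLV proportions are entered by hand and overlaid as I-beam markers to play the role of vertical hash marks.)

data CLV;
   input Color $7. Param;
   datalines;
Red     0.131
Orange  0.205
Yellow  0.135
Green   0.198
Blue    0.207
Brown   0.124
;

data Both;                     /* one-to-one merge; both data sets share the same color order */
   merge CIs CLV;
run;

title "Observed and Expected Proportions of M&M Colors (2017)";
proc sgplot data=Both;
   scatter x=Estimate y=Color / xerrorlower=Lower xerrorupper=Upper
           markerattrs=(size=12 symbol=DiamondFilled);
   scatter x=Param y=Color / markerattrs=(size=12 symbol=IBeam);  /* CLV population parameters */
   xaxis grid label="Proportion";
   yaxis display=(nolabel);
run;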

Observed and Expected proportions of M&M color (2017)

The graph shows that the observed proportions are close to the proportions from the CLV plant. All proportions are well within the 95% simultaneous confidence intervals from the data. If you rerun the PROC FREQ chi-square analysis with the CLV proportions, the test does not reject the null hypothesis.
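For completeness, here is a sketch of that rerun (mine, not reproduced from the original post); the TESTP values are simply the CLV proportions listed above, in the same ORDER=DATA order as before (red, orange, yellow, green, blue, brown):

proc freq data=MandMs order=data;
   weight Count;
   tables Color / nocum chisq
   /* CLV proportions:   red   orange yellow green  blue   brown */
                  testp=(0.131 0.205  0.135  0.198  0.207  0.124);
run;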

Summary

The experimental evidence indicates that the colors of plain M&M's in 2017 do not match the proportions that were published in 2008.

After contacting the M&M/Mars Customer Care team, I was sent a new set of proportions for 2017. The color proportions now depend on where the candies were manufactured. My data matches the proportion of colors from the Cleveland plant (manufacturing code CLV).

If you are running this analysis yourself, be sure to record whether your candies came from the HKP or CLV plant. If you want to see my analysis, you can download the complete SAS program that analyzes these data.

Educators who use M&M's to teach probability and statistics need to record the manufacturing plant, but this is still a fun (and yummy!) experiment. What do you think? Do you prefer the almost-equal orange-blue-green distribution from the CLV plant? Or do you like the orange-blue dominance from the HKP plant? Or do you just enjoy the crunchy shell and melt-in-your-mouth goodness, regardless of what colors the candies are?

The post The distribution of colors for plain M&M candies appeared first on The DO Loop.


We have come a long way since Thomas Edison invented the electric light bulb and the industrial revolution democratized artificial power. Today, most of us take reliable, affordable power for granted. Historically, steady growth in demand for power allowed utilities to invest heavily in generation and delivery infrastructure and to [...]

SAS and Cisco deliver edge-to-enterprise IoT analytics platform was published on SAS Voices.


As a former statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s ability to find the abstraction and the engineer’s ability to find the implementation.

For a typical engineering problem, the steps are usually:
- 1. Abstract the problem with a formula or some pseudocode
- 2. Solve the problem with the formula
- 3. Iterate on the initial solution until it achieves the optimal time complexity and space complexity

I feel that a mathematician would like dynamic programming (DP) questions most, because they are so similar to the typical deduction questions in math. An engineer may find them challenging, since they require imagination and some sense of math.

The formula is the most important part: without it, trial and error or debugging does not help. Once the formula is figured out, the rest becomes a piece of cake. However, sometimes things are not that straightforward. Good mathematics does not always lead to good engineering.

Let’s see one question from Leetcode.

You are given a list of non-negative integers, a1, a2, ..., an, and a target, S. Now you have 2 symbols + and -. For each integer, you should choose one from + and - as its new symbol.

Find out how many ways to assign symbols to make sum of integers equal to target S.

Example 1:
Input: nums is [1, 1, 1, 1, 1], S is 3.
Output: 5
Explanation:

-1+1+1+1+1 = 3
+1-1+1+1+1 = 3
+1+1-1+1+1 = 3
+1+1+1-1+1 = 3
+1+1+1+1-1 = 3

There are 5 ways to assign symbols to make the sum of nums be target 3.

1. The quick solution

For each element of the list, there are two options: plus or minus. So the question asks how many of all possible paths reach a specific number. Of course, if the target is impossible to reach, we just need to return 0.

Sounds exactly like a DP question. If we have a pencil and paper, we can start to explore the relationship between dp(n) and dp(n-1). For example, suppose our goal is to get a sum of 5 and we are given a list of [1, 1, 1, 1, 1]. If the smaller tuple/list (1, 1, 1, 1) has some paths that reach 4, that is exactly what we want, since adding the last 1 makes 5. Similarly, paths that reach 6 are fine as well, since subtracting the last 1 also gives 5. We simply add both counts together, since those are the only two possibilities.

The formula is dp(n, s) = dp(n-1, s-x) + dp(n-1, s+x), where n is the size of the list, s is the target sum and x is the element added to the previous list. OK, the second step is easy.
def findTargetSumWays_1(nums, S):
    """
    :type nums: Tuple[int]
    :type S: int
    :rtype: int
    """
    if not nums:
        if S == 0:
            return 1
        else:
            return 0
    return findTargetSumWays_1(nums[1:], S + nums[0]) + findTargetSumWays_1(nums[1:], S - nums[0])

small_test_nums = (1, 1, 1, 1, 1)
small_test_S = 3

%time findTargetSumWays_1(small_test_nums, small_test_S)
It is theoretically correct and works perfectly with small test cases. But we know that it is going to be a nightmare for an engineering application, because it has a hefty time complexity of O(2^N). So the math part is done, and we have to move to the third step.

2. The third step that is hard

So we need to find a data structure to record all the paths. If this were the Fibonacci number problem, a simple linear data structure like a list would slash O(2^N) to O(N).
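As a quick aside (my illustration, not part of the original post), here is that list-based caching idea applied to Fibonacci, just to make the O(N) claim concrete:

def fib(n):
    """Compute the n-th Fibonacci number with a list as a cache: O(N) instead of O(2^N)."""
    if n < 2:
        return n
    cache = [0, 1]                               # cache[i] holds fib(i)
    for i in range(2, n + 1):
        cache.append(cache[i - 1] + cache[i - 2])
    return cache[n]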

But the hard part is deciding what data structure to use here. Since the get operation on a hash table is O(1), a rolling dictionary can record the previous states. However, a Python dictionary does not support adding or changing keys while it is being iterated over, so we have to build a new one each pass and replace the old one manually. The overall set of paths looks like a tree structure. So the solution will be like this:

def findTargetSumWays_2(nums, S):
    if not nums:
        return 0
    dic = {nums[0]: 1, -nums[0]: 1} if nums[0] != 0 else {0: 2}
    for i in range(1, len(nums)):
        tdic = {}
        for d in dic:
            tdic[d + nums[i]] = tdic.get(d + nums[i], 0) + dic.get(d, 0)
            tdic[d - nums[i]] = tdic.get(d - nums[i], 0) + dic.get(d, 0)
        dic = tdic
    return dic.get(S, 0)

big_test_nums = tuple(range(100))
big_test_S = sum(range(88))
%time findTargetSumWays_2(big_test_nums, big_test_S)

The timing is exactly what we need. However, the code is not elegant and is hard to understand.

CPU times: user 189 ms, sys: 4.77 ms, total: 194 ms
Wall time: 192 ms

3. Finally the easy solution

But what if we don’t want things to get complicated? Here we just want a cache, and Python 3 provides the lru_cache decorator in the functools module. Adding one line to the first solution quickly solves the problem.

from functools import lru_cache

@lru_cache(10000000)
def findTargetSumWays_3(nums, S):
    if not nums:
        if S == 0:
            return 1
        else:
            return 0
    return findTargetSumWays_3(nums[1:], S + nums[0]) + findTargetSumWays_3(nums[1:], S - nums[0])

%time findTargetSumWays_3(big_test_nums, big_test_S)

CPU times: user 658 ms, sys: 19.7 ms, total: 677 ms
Wall time: 680 ms

Conclusion

Good math cannot solve all engineering problems by itself. It has to be combined with the details of the language, the application and the system to avoid a bad engineering implementation.

The Jupyter notebook is on GitHub. If you have any comments, please email me at wm@sasanalysis.com.

I’ve always loved watching crime series or films – from Homeland and True Detective to Sherlock Holmes. But after years of building software products and solutions designed to help security and law enforcement agencies solve crimes, I've noticed there are several clichés that need to be conquered by technology. No. [...]

Modern investigations: It’s not like the movies was published on SAS Voices.


Are your friends passing around clever memes (supposedly) featuring something your favorite actor said, or sharing news articles that you think might be "fake news"? If there's even a hint of data analyst in you, then you probably check the actual data, and confirm or disprove the supposed facts yourself. I [...]

The post Data wrangling - down the rabbit hole, and back again! appeared first on SAS Learning Post.



Nowadays Elasticsearch is more and more popular. Besides its original search functionality, I have found that Elasticsearch can be

  1. used as a logging container. That is what the ELK stack was created for.
  2. utilized as a JSON server with rich APIs, which can be combined with Kibana to serve as a BI server.

This is the data store I see every day:

  • 10 PB of data in stock
  • an average of 30 TB of incoming data every day
  • various data sources, including binary files such as PDFs
  • very complicated SQL queries (fortunately no stored procedures)
  • millions of JSON documents created daily

People want to know what is going on with all this data, so a business intelligence or OLAP system is needed to visualize and aggregate the data and its flow. Since Elasticsearch is so easy to scale out, it beats other big data solutions on the market.

1. Batch Worker

There are many options for implementing a batch worker. Ultimately the decision came down to either Spring Data Batch or writing a library from scratch in Python.

1.1 Spring Data Batch vs. Python

Spring Data Batch
  • Pros:
    • is a full framework that includes rollback, notification and scheduler features
    • provides great design patterns for dependency injection, such as factory and singleton, which help multiple people work together. For example:
@Bean
public Step IndexMySQLJob01() {
    return stepBuilderFactory.get("IndexMySQLJob01")
        .<Data, Data>chunk(10)
        .reader(reader())
        .processor(processor())
        .writer(writer())
        .build();
}
Python
  • Pros:
    • less code; flexible
    • super easy to go from a dictionary to JSON
    • has official/third-party libraries for everything
  • Cons:
    • you have to create your own library/framework
    • if a pattern like Spring Data Batch is adopted, dependencies have to be injected manually, such as:
class IndexMySQLJob01(object):
    def __init__(self, reader, processor, writer, listener):
        self.reader = reader
        self.processor = processor
        self.writer = writer
        self.listener = listener
    ...

Eventually Python was picked, because the overall scenario is more algorithm-bound than language-bound.

1.2 The algorithms

Since the data size is pretty big, time and space always have to be considered. The most direct way to decrease the time complexity is to use hash tables, as long as the memory can hold the data. For example, a join between an N-row table and an M-row table can be optimized from O(M*N) to roughly O(M+N): build a hash table on the smaller table once, then probe it for each row of the larger table.
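A minimal sketch of that hash-table join (my illustration; fact_rows, dim_rows and the key name are hypothetical, not the production code):

def hash_join(fact_rows, dim_rows, key='key'):
    """Join two lists of dicts on `key` in roughly O(M+N) instead of O(M*N)."""
    lookup = {row[key]: row for row in dim_rows}   # O(N): hash table on the smaller table
    for row in fact_rows:                          # O(M): one O(1) probe per row
        match = lookup.get(row[key])
        if match is not None:
            merged = dict(row)
            merged.update(match)
            yield merged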

To save space, a generator chain is used to stream data from the start to the end, instead of materializing sizable objects.

class JsonTask01(object):
    ...
    def get_json(self, generator1, hashtable1):
        for each_dict in generator1:
            key = each_dict.get('key')
            each_dict.update(hashtable1.get(key))
            yield each_dict

1.3 The scheduler

A scheduler is a must: cron is enough for simple tasks, while a bigger system requires a workflow. Airflow is the tool that helps organize and schedule the jobs. It has a web UI and is written in Python, which makes it easy to integrate with the batch worker.
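A minimal sketch of such a scheduled job (my illustration using the Airflow 1.x API of that era; the DAG id, schedule and run_index_job callable are hypothetical, not the production configuration):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_index_job():
    """Placeholder that would kick off the Python batch worker (e.g., IndexMySQLJob01)."""
    pass

dag = DAG('es_batch_indexing',
          start_date=datetime(2017, 1, 1),
          schedule_interval='0 2 * * *')          # run every day at 2 AM

index_task = PythonOperator(task_id='index_mysql_job_01',
                            python_callable=run_index_job,
                            dag=dag)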

1.4 High availability with zero downtime

Indexing a large quantity of data imposes a significant load. For mission-critical indexes that need 100% uptime, a zero-downtime scheme is implemented: we keep two copies of an index for maximum safety, and an alias switches between the two copies once the indexing is finished.

    def add_alias(self, idx):
        LOGGER.warn("The alias {} will point to {}.".format(self.index, idx))
        self.es.indices.put_alias(idx, self.index)

    def delete_alias(self, idx):
        LOGGER.warn("The alias {} will be removed from {}.".format(self.index, idx))
        self.es.indices.delete_alias(idx, self.index)

2. Elasticsearch Cluster

2.1 Three kinds of nodes

An Elasticsearch node can take one of three roles: master node, data node and ingest node. It is common to see a single node act as master, ingest and data all at once. For a large system, however, it is always helpful to assign the three roles to different machines/VMs, so that when a node goes down or comes back up, failover and recovery are quicker.
elasticsearch-head can clearly visualize how data is transferred between the shards once an accident occurs.

As the number of cluster nodes increases, deployment becomes painful. I feel the best tool so far is ansible-elasticsearch. With ansible-playbook -i hosts ./your-playbook.yml -c paramiko, the cluster is up in no time.

2.2 Memory is cheap but heap is expensive

The rules of thumb for Elasticsearch are:

Give (less than) Half Your Memory to Lucene

Don’t Cross 32 GB!

This causes an awkward situation: if you have a machine with more than 64 GB of memory, the additional memory will mean little to a single Elasticsearch instance. It can actually be worthwhile to run two or more Elasticsearch instances side by side to make full use of the hardware. For example, on a machine with 96 GB of memory, we can allocate 31 GB for an ingest node, 31 GB for a data node and leave the rest for the OS. However, two data nodes on a single machine will compete for disk IO, which damages performance, while co-locating a master node and a data node increases the risk of downtime.

The great thing about Elasticsearch is that it provides rich REST APIs, such as http://localhost:9200/_nodes/stats?pretty. We could use X-Pack (paid) or other customized tools to monitor them. I feel that the three most important statistics for the heap, and therefore for performance, are:

The heap usage and the old GC duration

These two statistics are intertwined. High heap usage, such as 75%, will trigger a GC, while a GC under high heap usage will take a longer time. We have to keep both numbers as low as possible.

     "jvm" : {
"mem" : {
"heap_used_percent" : 89,
...
"gc" : {
"collectors" : {
...
"old" : {
"collection_count" : 225835,
"collection_time_in_millis" : 22624857
}
}
}
Thread pools

Each thread pool reports three key statistics: active, queue and rejected. It is useful to visualize how they change in real time. Once there are a lot of queued or rejected threads, it is a good time to think about scaling up or scaling out.

"thread_pool" : {
"bulk" : {
"threads" : 4,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 4,
"completed" : 53680
},
The segments’ number/size

Segments are the in-memory inverted indexes corresponding to the indexes on the hard disk; they stay resident in physical memory and GC has no effect on them. Each segment leaves a footprint on every search thread, so the size of the segments matters: it is effectively multiplied by a factor of the number of threads.

        "segments" : {
"count" : 215,
"memory_in_bytes" : 15084680,
},

The number of shards actually controls the number of segments: as the number of shards increases, the size of each segment decreases and the number of segments increases. So we cannot increase the number of shards as much as we want. If there are many small segments, heap usage climbs much higher. The solution is force merge, which is time-consuming but effective.

3. Kibana as BI dashboard

3.1 Plugins

Kibana integrates Dev Tools (previously the Sense UI) for free. Dev Tools has code assistance and is a powerful tool for debugging. If budget is not an issue, X-Pack is also highly recommended. As for Elasticsearch, since 5.0 ingest-geoip is a plugin, so we have to add it to the Ansible YAML, such as:

es_plugins:
- plugin: ingest-geoip

3.2 Instant Aggregation

There are quite a few KPIs that need system-wide term aggregations. From 5.0 on, the request cache is enabled by default for all requests with size:0.
For example:

POST /big_data_index/data/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "interesting_field"
        }
      }
    }
  }
}

The force merge mentioned above, such as POST /_forcemerge?max_num_segments=1, will combine the segments and dramatically increase aggregation speed.

3.3 Proxy

Nginx is possibly the best proxy to put in front of Kibana. There are two advantages: first, the proxy can cache Kibana's static resources; second, we can always check the Nginx logs to figure out what is causing problems for Kibana.

Conclusion

Elasticsearch and Kibana together provide high availability and high scalability for a large BI system.

If you have any comments, please email me at wm@sasanalysis.com.


Corporate compliance with an increasing number of industry regulations intended to protect personally identifiable information (PII) has made data privacy a frequent and public discussion. An inherent challenge to data privacy is, as Tamara Dull explained, “data, in and of itself, has no country, respects no law, and travels freely across borders. In the […]

The post What does the requirement for data privacy mean for data scientists, business analysts and IT? appeared first on The Data Roundtable.


A categorical response variable can take on k different values. If you have a random sample from a multinomial response, the sample proportions estimate the proportion of each category in the population. This article describes how to construct simultaneous confidence intervals for the proportions as described in the 1997 paper "A SAS macro for constructing simultaneous confidence intervals for multinomial proportions" by Warren May and William Johnson (Computer Methods and Programs in Biomedicine, p. 153–162).

Estimates of multinomial proportions

In their paper, May and Johnson present data for a random sample of 220 psychiatric patients who were categorized as either neurotic, depressed, schizophrenic or having a personality disorder. The observed counts for each diagnosis are as follows:

data Psych;
input Category $21. Count;
datalines;
Neurotic              91
Depressed             49
Schizophrenic         37
Personality disorder  43
;

If you divide each count by the total sample size, you obtain estimates for the proportion of patients in each category in the population. However, the researchers wanted to compute simultaneous confidence intervals (CIs) for the parameters. The next section shows several methods for computing the CIs.
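(As a quick check of the point estimates themselves, here is my own sketch, not part of May and Johnson's macro: PROC FREQ with a WEIGHT statement reports the estimates in its Percent column, just as in the M&M example earlier.)

proc freq data=Psych order=data;
   weight Count;
   tables Category / nocum;
run;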

Methods of computing simultaneous confidence intervals

May and Johnson discussed six different methods for computing simultaneous CIs. In the following, 1–α is the desired overall coverage probability for the confidence intervals, χ²(α, k–1) is the upper 1–α quantile of the χ² distribution with k–1 degrees of freedom, and π₁, π₂, ..., πₖ are the true population parameters. The methods and the references for the methods are:

  1. Quesenberry and Hurst (1964): Find the parameters πᵢ that satisfy
    N (pᵢ – πᵢ)² ≤ χ²(α, k–1) πᵢ(1–πᵢ).
  2. Goodman (1965): Use a Bonferroni adjustment and find the parameters that satisfy
    N (pᵢ – πᵢ)² ≤ χ²(α/k, 1) πᵢ(1–πᵢ)
    (the interval endpoints are derived in the sketch after this list).
  3. Binomial estimate of variance: For a binomial variable, you can bound the variance by using πᵢ(1–πᵢ) ≤ 1/4. You can construct a conservative CI for the multinomial proportions by finding the parameters that satisfy
    N (pᵢ – πᵢ)² ≤ χ²(α, 1) (1/4).
  4. Fitzpatrick and Scott (1987): You can ignore the magnitude of the proportion when bounding the variance to obtain confidence intervals that are all the same length, regardless of the number of categories (k) or the observed proportions. The formula is
    N (pᵢ – πᵢ)² ≤ c²(1/4)
    where the value c² = χ²(α/2, 1) for α ≤ 0.016 and c² = (8/9)χ²(α/3, 1) for 0.016 < α ≤ 0.15.
  5. Q and H with sample variance: You can replace the unknown population variances by the sample variances in the Quesenberry and Hurst formula to get
    N (pᵢ – πᵢ)² ≤ χ²(α, k–1) pᵢ(1–pᵢ).
  6. Goodman with sample variance: You can replace the unknown population variances by the sample variances in the Goodman Bonferroni-adjusted formula to get
    N (pᵢ – πᵢ)² ≤ χ²(α/k, 1) pᵢ(1–pᵢ).
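To make item 2 concrete, here is the standard algebra behind Goodman's intervals (my addition, not text from May and Johnson). Write c = χ²(α/k, 1). The inequality in item 2 is a quadratic in πᵢ,

    (N + c) πᵢ² – (2N pᵢ + c) πᵢ + N pᵢ² ≤ 0,

so the lower and upper limits of the interval for πᵢ are the two roots

    πᵢ = [ (2N pᵢ + c) ± sqrt( c (c + 4N pᵢ (1 – pᵢ)) ) ] / ( 2 (N + c) ).

The same algebra gives the Quesenberry and Hurst intervals in item 1 with c = χ²(α, k–1).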

In a separate paper, May and Johnson used simulations to test the coverage probability of each of these formulas. They conclude that the simple Bonferroni-adjusted formula of Goodman (second in the list) "performs well in most practical situations when the number of categories is greater than 2 and each cell count is greater than 5, provided the number of categories is not too large." In comparison, the methods that use the sample variance (fifth and sixth in the list) are "poor." The remaining methods "perform reasonably well with respect to coverage probability but are often too wide."

A nice feature of the Q&H and Goodman methods (first and second in the list) is that they produce asymmetric intervals that are always within the interval [0,1]. In contrast, the other intervals are symmetric and might not be a subset of [0,1] for extremely small or large sample proportions.

Computing CIs for multinomial proportions

You can download SAS/IML functions that are based on May and Johnson's paper and macro. The original macro used SAS/IML version 6, so I have updated the program to use a more modern syntax. I wrote two "driver" functions:

  • CALL MultCIPrint(Category, Count, alpha, Method) prints a table that shows the point estimates and simultaneous CIs for the counts, where the arguments to the function are as follows:
    • Category is a vector of the k categories.
    • Count is a vector of the k counts.
    • alpha is the significance level for the (1-alpha) CIs. The default value is alpha=0.05.
    • Method is a number 1, 2, ..., 6 that indicates the method for computing confidence intervals. The previous list shows the method number for each of the six methods. The default value is Method=2, which is the Goodman (1965) Bonferroni-adjusted method.
  • MultCI(Count, alpha, Method) is a function that returns a three-column matrix that contains the point estimates, lower limit, and upper limit for the CIs. The arguments are the same as above, except that the function does not use the Category vector.

Let's demonstrate how to call these functions on the psychiatric data. The following program assumes that the function definitions are stored in a file called conint.sas; you might have to specify a complete path. The PROC IML step loads the definitions and the data and calls the MultCIPrint routine and requests the Goodman method (method=2):

%include "conint.sas";   /* specify full path name to file */
proc iml;
load module=(MultCI MultCIPrint);
use Psych; read all var {"Category" "Count"}; close;
 
alpha = 0.05;
call MultCIPrint(Category, Count, alpha, 2); /* Goodman = 2 */
Estimates and 95% simultaneous confidence intervals for multinomial proportions

The table shows the point estimates and 95% simultaneous CIs for this sample of size 220. If the intervals are wider than you expect, remember the goal: for 95% of random samples of size 220 this method should produce a four-dimensional region that contains all four parameters.

You can visualize the width of these intervals by creating a graph. The easiest way is to write the results to a SAS data set. To get the results in a matrix, call the MultCI function, as follows:

CI = MultCI(Count, alpha, 2);  /*or simply CI = MultCI(Count) */
/* write estimates and CIs to data set */
Estimate = CI[,1];  Lower = CI[,2];  Upper = CI[,3];
create CIs var {"Category" "Estimate" "Lower" "Upper"};
append;
close;
quit;
 
ods graphics / width=600px height=240px;
title 'Simultaneous Confidence Intervals for Multinomial Proportions';
title2 'Method of Goodman (1965)';
proc sgplot data=CIs;
  scatter x=Estimate y=Category / xerrorlower=Lower xerrorupper=upper
          markerattrs=(Size=12  symbol=DiamondFilled)
          errorbarattrs=GraphOutlines(thickness=2);
  xaxis grid label="Proportion" values=(0.1 to 0.5 by 0.05);
  yaxis reverse display=(nolabel);
run;
Graph of estimates and simultaneous confidence intervals for multinomial proportions

The graph shows intervals that are likely to enclose all four parameters simultaneously. The neurotic proportion in the population is probably in the range [0.33, 0.50] and at the same time the depressed proportion is in the range [0.16, 0.30] and so forth. Notice that the CIs are not symmetric about the point estimates; this is most noticeable for the smaller proportions such as the schizophrenic category.

Because the cell counts are all relatively large and because the number of categories is relatively small, Goodman's CIs should perform well.

I will mention that if you use the Fitzpatrick and Scott method (method=4), you will get different CIs from those reported in May and Johnson's paper. The original May and Johnson macro contained a bug that was corrected in a later version (personal communication with Warren May, 25FEB2016).

Conclusions

This article presents a set of SAS/IML functions that implement six methods for computing simultaneous confidence intervals for multinomial proportions. The functions are updated versions of the macro %CONINT, which was presented in May and Johnson (1997). You can use the MultCIPrint function to print a table of statistics and CIs, or you can use the MultCI function to retrieve that information into a SAS/IML matrix.

tags: Statistical Programming

The post Simultaneous confidence intervals for multinomial proportions appeared first on The DO Loop.