April 7, 2020
 

As many of us learn to navigate the changing world we live in amid the COVID-19 outbreak, and as we care for our loved ones, friends, and communities, many of us now find ourselves working and studying from home much more than we did before. As an employee at SAS and an instructor at NC State University, I have found myself splitting my time between setting up a temporary home office and an at-home routine, while also trying to help my students feel safe and comfortable as we move to a virtual classroom. But as my commute and previous social time become FaceTime calls and text messages, I've found myself with more downtime than I previously had: time I want to dedicate to the training I've been wanting to do for the past year.

At SAS, we are striving to care for our users during this time—in that spirit, I wanted to share with you some free SAS offerings, as well as coping techniques I am doing from home.

Take care of yourself and your family

First and foremost, make sure you and your family are taking time for self-care, whether that means meditating with a mobile app or YouTube video, or getting some exercise together. I find my daily walk is something I need to relax my mind and refocus on my tasks.

Retrain on skills you haven't touched in a while

Sometimes we all need a reboot on the tools or methods we use every day. I love SAS' Statistics 1 course and SAS Programming 1, both of which are free. They are great refreshers for those who haven't taken a math course in a few years, or for those just starting out with data science using SAS. Understanding the fundamentals and picking up refreshers on SAS language tricks is something I constantly try to work into my routine. I am also preparing to take the SAS Certification Exam at the end of the year, so I plan to use the practice exam often during my study sessions.

Learn a new skill

It is also a great time to learn something you have always wanted to do. I started by taking some free online photography classes. Also on my list is enrolling in the SAS Academy for Data Science, which is free until the end of May 2020. The advanced analytics professional courses are something I have wanted to complete for a long time, so I am excited to get started on learning more about data modeling. The SAS e-books collection is also free until April 30, 2020, so I've downloaded some great additional materials using code FREEBOOKS at checkout.

Be kind to others

Being stuck inside can make you feel like you are spending more time with family than you are used to, or perhaps more time alone. Using this time to connect with those I haven't made time to talk to has been something I am really thankful for. I call my sister, who is a nurse in Florida, to check on her. As this outbreak affects us all differently, this is a great time to come together and connect. It is also a good time to think of others affected by the outbreak who don't have the ability to continue working. There are some great ways to help by getting involved in volunteer work or donating to a worthy cause.

Do fun activities

Though we are stuck at home, this time has been great for enjoying things I often don't get to during my normal busy schedule. Besides taking free training, I've been playing some new video games (#ACNH) and some games I've neglected for far too long. I've also used this time to figure out what brings me joy at home. Making time for reading is bringing great joy to my life right now.

As we all move through this turbulent time, make sure to take care of yourself and others. I hope some of these free tools and training will come in handy as you work towards your personal goals while remaining safe and healthy.

Working remotely? A list of 5 ways to spend your down time was published on SAS Users.

April 6, 2020
 

For Climate Month I thought I’d research something on the positive side of the equation.  My last two climate posts were a bit gloomy, as they focused on greenhouse gasses and ocean acidification.  Instead we’re going to tackle a product adoption issue related to the climate:  When might the United [...]

When will residential solar power reach half the United States? was published on SAS Voices by Leo Sadovy

April 6, 2020
 

I have written several articles about how to work with continuous probability distributions in SAS. I always emphasize that it is important to be able to compute the four essential functions for working with a statistical distribution. Namely, you need to know how to generate random values, how to compute the PDF, how to compute the CDF, and how to compute quantiles. In this article, I describe how to compute each of the four quantities for the geometric distribution, which is a DISCRETE probability distribution. The graphs that visualize a discrete distribution are slightly different from those for continuous distributions. Also, the geometric distribution has two different definitions, and I show how to work with either definition in SAS.

The definition of the geometric distribution

A Bernoulli trial is an experiment that has two results, usually referred to as a "failure" or a "success." The success occurs with probability p and the failure occurs with probability 1-p. "Success" means that a specific event occurred whereas "failure" indicates that the event did not occur. Because the event can be negative (death, recurrence of cancer, ...) sometimes a "success" is not a cause for celebration!

The geometric distribution has two definitions:

  1. The number of trials until the first success in a sequence of independent Bernoulli trials. The possible values are 1, 2, 3, ....
  2. The number of failures before the first success in a sequence of independent Bernoulli trials. The possible values are 0, 1, 2, ....

Obviously, the two definitions are closely related. If X is a geometric random variable according to the first definition, then Y=X-1 is a geometric random variable according to the second definition. For example, define "heads" as the event that you want to monitor. If you toss a coin and it first shows heads on the third toss, then the number of trials until the first success is 3 and the number of failures is 2. Whenever you work with the geometric distribution (or its generalization, the negative binomial distribution), you should check to see which definition is being used.
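The relationship between the two definitions is easy to check numerically. The following is a plain-Python sketch (not part of the SAS program) that compares the two PMF formulas directly:

```python
# Under definition 1, P(X = n) = (1-p)^(n-1) * p for n = 1, 2, 3, ...
# Under definition 2, P(Y = k) = (1-p)^k * p for k = 0, 1, 2, ...
# Because Y = X - 1, the two PMFs must agree when k = n - 1.
p = 1/6  # e.g., the probability of rolling a 6

def pmf_trials(n, p):
    """Definition 1: probability that the first success occurs on trial n."""
    return (1 - p)**(n - 1) * p

def pmf_failures(k, p):
    """Definition 2: probability of k failures before the first success."""
    return (1 - p)**k * p

for n in range(1, 10):
    assert abs(pmf_trials(n, p) - pmf_failures(n - 1, p)) < 1e-12

print(round(pmf_trials(3, p), 4))  # P(first 6 on the third roll) = 0.1157
```

In the coin-toss example above, the probability that heads first appears on the third toss is the same under both definitions; only the argument you pass differs by one.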

The definition of the geometric distribution in SAS software

It is regrettable that SAS was not consistent in choosing a definition. The first definition is used by the RAND function to generate random variates. The second definition is used by the PDF function, the CDF function, and the QUANTILE function. In this article, I will use the "number of trials" definition, which is the first definition. In my experience, this definition is more useful in applications. I will point out how to adjust the syntax of the SAS functions so that they work for either definition.

The probability mass function for the geometric distribution

For a discrete probability distribution, the probability mass function (PMF) gives the probability of each possible value of the random variable. So for the geometric distribution, we want to compute and visualize the probabilities for n=1, 2, 3,... trials.

The probabilities depend on the parameter p, which is the probability of success. I will use three different values to illustrate how the geometric distribution depends on the parameter:

  • p = 1/2 = 0.5: the probability of "heads" when you toss a coin
  • p = 1/6 ≈ 0.166: the probability of rolling a 6 with a six-sided die
  • p = 1/13 ≈ 0.077: the probability of drawing an ace from a shuffled deck of 52 cards

The following SAS DATA step uses the PDF function to compute the probabilities for these three cases. The parameters in the PDF function are the number of failures and the probability of success. If you let n be the number of trials until success, then n-1 is the number of failures before success. Thus, the following statements use n-1 as a parameter. Since I will need the cumulative probabilities in the next section, I use the CDF function to compute them in the same DATA step:

data Geometric;
do p = 0.5, 0.166, 0.077;
   /* We want the probability that the event occurs on the n_th trial.
      This is the same as the probability of n-1 failures before the event. */
   do n = 1 to 16;                     
      pdf = pdf("Geometric", n-1, p);  /* probability that event occurs on the n_th trial */
      cdf = cdf("Geometric", n-1, p);  /* probability that event happens on or before n_th Trial*/
      output;
   end;
end;
run;
 
title "Probability Mass Function for Geom(p)";
proc sgplot data=Geometric;
   scatter x=n y=pdf / group=p markerattrs=(symbol=CircleFilled);
   series  x=n y=pdf / group=p lineattrs=(color=Gray);
   label n="Number of Trials" pdf="Probability of Event";
   xaxis grid integer values=(0 to 16 by 2) valueshint;
run;

If you are visualizing the PMF for one value of p, I suggest that you use a bar chart. Because I am showing multiple PMFs in one graph, I decided to use a scatter plot to indicate the probabilities, and I used gray lines to visually connect the probabilities that belong to the same distribution. This is the same technique that is used on the Wikipedia page for the geometric distribution.

For large probabilities, the PMF probability decreases rapidly. When the probability for success is large, the event usually occurs early; it is rare that you would have to wait for many trials before the event occurs. This geometric rate of decrease is why the distribution is named the "geometric" distribution! For small probabilities, the PMF probability decreases slowly. It is common to wait for many trials before the first success occurs.
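The constant-ratio property that gives the distribution its name can be verified with a quick plain-Python sketch (an illustration, not part of the original program):

```python
# Consecutive PMF values of a geometric distribution satisfy
# P(X = n+1) / P(X = n) = 1 - p: a constant ratio, which is exactly
# what makes the sequence of probabilities "geometric."
def geom_pmf(n, p):
    return (1 - p)**(n - 1) * p   # "number of trials" definition

for p in (0.5, 0.166, 0.077):
    ratios = [geom_pmf(n + 1, p) / geom_pmf(n, p) for n in range(1, 16)]
    assert all(abs(r - (1 - p)) < 1e-12 for r in ratios)
print("constant ratio 1-p confirmed for all three values of p")
```

For p = 0.5 each probability is half the previous one (fast decrease); for p = 0.077 each is about 92% of the previous one (slow decrease), matching the shapes in the graph.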

The cumulative probability function

If you want to know the cumulative probability that the event occurs on or before the nth trial, you can use the CDF function. The CDF curve was computed in the previous section for all three probabilities. The following call to PROC SGPLOT creates a graph. If I were plotting a single distribution, I would use a bar chart. Because I am plotting several distributions, the call uses the STEP statement to create a step plot for the discrete horizontal axis. However, if you want to show the CDF for many trials (maybe 100 or more), then you can use the SERIES statement because at that scale the curve will look smooth to the eye.

title "Cumulative Distribution for Geom(p)";
title2 "Probability That Event Occurs on or before n_th Trial";
proc sgplot data=Geometric;
   step x=n y=cdf / group=p curvelabel markers markerattrs=(symbol=CircleFilled);
   label n="Number of Trials" cdf="Cumulative Probability";
   xaxis grid integer values=(0 to 16 by 2) valueshint;
   yaxis grid;
run;

The cumulative probability increases quickly when the probability of the event is high. When the probability is low, the cumulative probability is initially almost linear. You can prove this by using a Taylor series expansion of the CDF, as follows. The formula for the CDF is F(n) = 1 – (1 – p)^n. When p is much less than 1 (written p ≪ 1), then (1 – p)^n ≈ 1 – n p + .... Consequently, the CDF is nearly linear (with slope p) as a function of n when p ≪ 1.
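You can see the quality of this approximation numerically with a short sketch (plain Python, not from the original post):

```python
# For small p, the geometric CDF F(n) = 1 - (1-p)^n is approximately
# the linear function n*p. Compare the two for p = 0.01:
p = 0.01
for n in (1, 5, 10, 20):
    exact = 1 - (1 - p)**n
    approx = n * p
    print(n, round(exact, 4), round(approx, 4))
```

For p = 0.01 the linear approximation is within about 0.02 of the exact CDF even at n = 20, which is why the low-probability curves in the graph look nearly straight at first.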

Quantiles of the geometric distribution

When the density function (PDF) of a continuous distribution is positive, the CDF is strictly increasing. Consequently, the inverse CDF function is continuous and increasing. For discrete distributions, the CDF function is a step function, and the quantile is the smallest value for which the CDF is greater than or equal to the given probability. Consequently, the quantile function is also a step function. The following DATA step uses the QUANTILE function to compute the quantiles for p=1/6 and for evenly spaced values of the cumulative probability.

data Quantile;
prob = 0.166;
do p = 0.01 to 0.99 by 0.005;
   n = quantile("Geometric", p, prob); * number of failures until event;
   n = n+1;                            * number of trials until event;
   output;
end;
run;
 
title "Quantiles for Geom(0.166)";
proc sgplot data=Quantile;
   scatter x=p y=n / markerattrs=(symbol=CircleFilled);
   label n="Number of Trials" p="Cumulative Probability";
   xaxis grid;
   yaxis grid;
run;
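The discrete quantile definition (the smallest value whose CDF meets or exceeds the given probability) can be sketched directly in plain Python (an illustration, independent of the SAS program):

```python
# The quantile for probability q is the smallest n whose CDF >= q.
# For the "number of trials" definition this matches what
# QUANTILE("Geometric", q, p) + 1 computes in SAS.
def geom_quantile_trials(q, p):
    n = 1
    while 1 - (1 - p)**n < q:   # CDF of Geom(p), "trials" definition
        n += 1
    return n

print(geom_quantile_trials(0.5, 1/6))  # median number of rolls: 4
```

Because the CDF jumps at the integers, many different values of q map to the same quantile, which is why the quantile plot is a step function.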

Random variates

For a discrete distribution, it is common to use a bar chart to show a random sample. For a large sample, you might choose a histogram. However, I think it is useful to plot the individual values in the sample, especially if some of the random variates are extreme values. Imagine 20 students who each roll a six-sided die until it shows a 6. The following DATA step simulates this experiment.

%let numSamples = 20;           *sample of 20 people;
data RandGeom;
call streaminit(12345);         * random seed for covid-19;
p = 0.166;                      * 1/6 ~ 0.166;
do ID = 1 to &numSamples;
   x = rand("Geometric", p);    *number of trials until event occurs;
   /* if you want the number of failures before the event: x = x-1 */
   Zero = 0;
   output;
end;
run;

In the simulated random sample, the values range from 1 (got a 6 on the first roll) to 26 (waited a long time before a 6 appeared). You can create a bar chart that shows these values. You can also add reference lines to the chart to indicate the mean and median of the Geom(1/6) distribution. For a Geom(p) distribution, the expected value is 1/p and the median of the distribution is ceil( -1/log2(1-p) ). When p = 1/6, the expected value is 6 and the median value is 4.
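These two population quantities are quick to verify (a plain-Python sketch of the stated formulas):

```python
import math

# For Geom(p) under the "trials" definition:
#   expected value = 1/p
#   median = ceil(-1 / log2(1-p))
p = 1/6
pop_mean = 1 / p
pop_median = math.ceil(-1 / math.log2(1 - p))
print(round(pop_mean), pop_median)  # 6 4
```

Note that the mean (6) is larger than the median (4): the distribution is right-skewed, so a few students who wait a long time pull the average up.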

When a bar chart contains a few very long bars, you might want to "clip" the bars at some reference value so that the smaller bars are visible. The HIGHLOW statement in PROC SGPLOT supports the CLIPCAP option, which truncates the bar and places an arrow to indicate that the true length of the bar is longer than indicated in the graph. If you add labels to the end of each bar, the reader can see the true value even though the bar is truncated in length, as follows:

proc sort data=RandGeom;
   by x;
run;
/* compute mean and median of the Geom(1/6) distribution */
data Stats;
p = 0.166;
PopMean = 1/p;                    PopMedian = ceil( -1/log2(1-p) );
call symputx("PopMean", PopMean); call symputx("PopMedian", PopMedian);
run;
 
/* use CLIPCAP and MAX= option to prevent long bars */
title "&numSamples Random Rolls of a Six-sided Die";
title2 "Random Sample of Geom(0.166)";
proc sgplot data=RandGeom;
   highlow y=ID low=Zero high=x / type=bar highlabel=x    CLIPCAP; 
   refline &PopMean / axis=x label="Mean" labelpos=min; 
   refline &PopMedian / axis=x label="Population Median" labelpos=min; 
   xaxis grid label="Number of Trials until Event" minor  MAX=10;  
   yaxis type=discrete discreteorder=data reverse valueattrs=(size=7);
run;

This chart shows that four students got "lucky" and observed a 6 on their first roll. Half the students observed a 6 in four or fewer rolls, which is the median value of the Geom(1/6) distribution. Some students rolled many times before seeing a 6. The graph is truncated at 10, and students who required more than 10 rolls are indicated by an arrow.

Conclusions

This article shows how to compute the four essential functions for the geometric distribution. The geometric distribution is a discrete probability distribution. Consequently, some concepts are different than for continuous distributions. SAS provides functions for the PMF, CDF, quantiles, and random variates. However, you need to be careful because there are two common ways to define the geometric distribution. The SAS statements in this article show how to define the geometric distribution as the number of trials until an event occurs. However, with minor modifications, the same program can handle the second definition, which is the number of failures before the event.

The post The geometric distribution in SAS appeared first on The DO Loop.

April 5, 2020
 

COVID-19 has been raging recently, and in our highly globalized world no country or region can escape this virus; it is an enemy of all humanity. On YouTube I came across an interesting animated visualization (Trajectory of COVID-19 confirmed cases; please search for it yourself). One frame from it is shown below:

Figure 1: Trajectory of COVID-19 cases

Both axes of this chart use logarithmic scales. Interestingly, the horizontal axis is the total number of confirmed cases and the vertical axis is the number of new cases. For large global data sets like this, SAS often shows animated bubble charts, which are beautiful and impressive, especially when presented on a big screen with a long pointer. Although bubble charts show each country's trend and allow comparisons, they cannot identify critical turning points, whereas this kind of trajectory chart captures the inflection point very well. After all, being isolated at home every day, unable to go out, is boring and disheartening; seeing the inflection point means seeing hope that the lockdown will end.

This kind of chart is a curve plot: a dynamic graph that changes over time and reflects the underlying trend well. Drawing one involves a few key points:

1. Data source: global data over a long time span; this outbreak's data fits that requirement well. I found a WHO data source online.

2. Use logarithmic coordinates. Viruses, like other microorganisms, grow through a lag phase, an exponential (log) growth phase, a stationary phase, and a decline phase. Textbooks explain this clearly and the related theory and observations are abundant, so I won't repeat them here. Humans understand linear change well but feel a loss of control when facing exponential growth. Presenting the trend as a linear change protects our fragile psyches and helps us make rational judgments.

3. Choosing the variables for the axes is the hard part. Conventionally the horizontal axis is time and the vertical axis is the logarithm of the count; this is typically used to predict growth under constrained conditions. In practice, however, there are too many influencing factors to make predictions, which usually turn out to be hindsight. Still, if we can learn something new from historical curves, we benefit greatly. Besides time and the cumulative total, two other quantities attract a lot of attention: the daily new-case count and the inflection point in time. How can these two be shown on a chart? Good question; the answer is in Figure 1 of this article: total cases, daily new cases, and the inflection point, with only the timestamp missing. From Figure 1 we can see that China and South Korea have already come ashore, while the other countries are still hanging on the line.

4. The chart needs to be attractive. I have complained plenty in user groups about the shortcomings of SAS graphics, so I won't repeat them here. Still, SAS graphs are neat and rigorous, good enough to publish directly in Nature. A little liveliness would make them even better, so I made a small improvement to the axes: they start at 1,000 and, as time passes, the new-case and total-case axes grow by an order of magnitude every so often. The effect is quite good, as shown in the figure below.

5. One more processing step: because reporting lags, and because testing technology and diagnostic criteria keep changing in every country, the reported data are overly concentrated in certain periods, producing false inflection points. Therefore I applied smoothing with a moving average: today's new-case count is replaced by the mean of the past 7 days (including today), so that false inflection points don't deceive the eye.
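The 7-day trailing average described in point 5 can be sketched as follows (plain Python with made-up data for illustration):

```python
# Replace each day's new-case count with the mean of the trailing 7 days
# (including today) to damp reporting spikes that create false inflection
# points. The input numbers below are invented for illustration.
def trailing_mean(series, window=7):
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

new_cases = [10, 12, 11, 90, 13, 12, 14, 15]   # day 4 is a reporting spike
print([round(v, 1) for v in trailing_mean(new_cases)])
```

The one-day spike of 90 is spread across the window instead of producing a sharp bend in the trajectory.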

The chart above is the global one, all neat and tidy, with every country on the line. China and South Korea dropped off the line early, but note that their new-case counts later fluctuated; the fight against the epidemic remains difficult and no one should let their guard down. Now let's focus on a few "model students," whether self-proclaimed or generally acknowledged. The curves tell the story clearly: in the chart below, Japan and Singapore are plainly still struggling on the line. At best they have flattened the curve somewhat ("flatten the curve"; please search for it yourself); their inflection points are not yet clear. Less boasting and more effort is needed; don't be fooled.

Original article: "COVID-19 global case-count trajectory charts." When reposting, please credit: SAS资源资讯列表

Permalink: http://saslist.net/archives/460

April 3, 2020
 

These are incredibly tough times, and fortunately, everyone I know is still healthy, safe and doing everything they can to stay that way. I know I'm lucky -- the worst I've suffered so far under coronavirus stay-at-home orders is sports withdrawal. I'm trying to fill that void in my life [...]

Visualizing shots, goals, players and more with SAS was published on SAS Voices by Frank Silva

April 3, 2020
 

Whether you like it or not, Microsoft Excel is still a big hit in the data analysis world. From small to large customers, we still see it used for daily routines such as filtering, generating plots, calculating items in ad-hoc analyses, or even running statistical models. Whenever I talk to customers, someone always asks: "Can this be exported to Excel?" or "Can we import data from Excel?" Recently, other questions have started to come up more often: "Can we run Python within SAS? How do I allow my team to choose their language of preference? How do I provide an interface that looks like Microsoft Excel but has SAS functionality?"

Well… the good news is that we can answer YES to all of these questions. With the increase in the number of users performing analytics and the number of analytical tools available, it was clear to me that we would end up with lots of disparate processes. For a while this was a problem, but naturally, companies started developing ways to integrate these siloed teams.

At the beginning of the last decade, SAS developed the SAS Add-In for Microsoft Office. The tool allows customers to run and embed SAS analytic capabilities inside Microsoft Office applications. More recently, SAS released a new version of PROC FCMP that allows users to write Python code and call it, as if it were a function, inside SAS programs.

These advancements provide users the ability to run Python inside Excel. When I say inside, I really mean from within Excel's interface.

Before we jump into how we can do it, you may ask yourself: Why is this relevant to me? If I know SAS, I import the data set and work with it in SAS; if I know Python, I open a Jupyter notebook, import the data set, and do my thing. Well… you are kind of right, but let me tell you a story.

The use case

Recently I worked with a customer whose business process was like this: he has a team of data scientists that is highly technical and knowledgeable in Python and SAS, plus a team of analysts with little Python knowledge who are always working with Excel to summarize data, create filters, graphs, etc. His teams need to communicate and collaborate. The normal chain of events follows:

  1. the Python team works on the data, and exports the results to Excel
  2. the analytics team picks up the data set and runs SAS scripts and Excel formulas

This is an inefficiency for the customer. Why can't the data scientist pass his or her code to the analyst, so the analyst can execute it on the same project without having to wait on the Python specialist to run the code?

I know this sounds overly complicated, but as my SAS colleague Mike Zizzi concludes in his post SAS or Python? Why not use both? Using Python functions inside SAS programs, what matters at the end of the day is that you get your work done, no matter which language, software, or IDE you are using. I highly recommend Mike's article if you want a deep dive into what PROC FCMP has to offer.

The process

Let's walk through a data scoring scenario similar to my customer's story. Imagine I am a SAS programmer using Excel to explore data. I am also part of a team that uses Python to create scoring code for analytical models developed in Python. My job is to score and analyze the data in Excel and pass the results to the service representative, so they can forward the response to the customer.

Importing data

The data set we'll work with in this example will help me analyze which customers are more likely to default on a loan. The data and all code used in this article are in the associated GitHub repository. The data dictionary for the data set is located here. First, I open the data set as seen on Sheet1 below in Excel.

Upload data to SAS

Before we jump to the coding part with SAS and Python, I need to send the data to SAS. We'll use the SAS Add-In in Excel to send data to the local server. I cover the steps in detail below.

I start by selecting the cells I want to upload to the library.

Next, I move to the SAS tab and select the Copy to SAS Server task.

A popup shows up where I confirm the selected cells.

After I click OK, I configure column, table, naming and location options.

SAS uploads the table to the requested library. Additionally, a new worksheet named after the library.table displays the results. As you can see in the image below, the sheet created follows the name WORK.IMPORTED_DATA we set up in the previous step. This represents the table in the SAS library memory. Notice, however, that we are still working in Excel.

The next step is to incorporate the code sent from my teammate.

The Python code

The code our colleague sent is pure Python. I don't necessarily have to understand the code details, just what it does. The Python code below imports a model, scores the input data, and returns a score. Note: if you're attempting this in your own environment, make sure to update the hmeq_model.sav file location in the # Import model pickle file section.

def score_predictions(CLAGE, CLNO, DEBTINC,DELINQ, DEROG, LOAN, MORTDUE, NINQ,VALUE, YOJ):
	"Output: scored"
	# Importing libraries
	import pandas as pd
	from sklearn.preprocessing import OneHotEncoder
	from sklearn.compose import ColumnTransformer
	from sklearn.externals import joblib
 
	# Create pandas dataframe with input vars
	dataset = pd.DataFrame({'CLAGE':CLAGE, 'CLNO':CLNO, 'DEBTINC':DEBTINC, 'DELINQ':DELINQ, 'DEROG':DEROG, 'LOAN':LOAN, 'MORTDUE':MORTDUE, 'NINQ':NINQ, 'VALUE':VALUE, 'YOJ':YOJ}, index=[0])
 
	X = dataset.values
 
	# Import model pickle file
	loaded_model = joblib.load("C://assets/hmeq_model.sav")
 
	# Score the input dataframe: threshold the default probability to get 0 or 1
	scored = int(loaded_model.predict_proba(X)[0,1] > 0.5)
 
	# Return scored dataframe
	return scored
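If you want to see the calling pattern without the pickled model file, here is a hypothetical stand-alone sketch: a dummy object stands in for the scikit-learn classifier, and the helper name score_predictions_sketch is made up for illustration.

```python
class DummyModel:
    """Hypothetical stand-in for the pickled scikit-learn classifier."""
    def predict_proba(self, X):
        # Pretend a high debt-to-income ratio (3rd input) drives default risk
        p = 0.9 if X[0][2] > 40 else 0.1
        return [[1 - p, p]]

def score_predictions_sketch(model, CLAGE, CLNO, DEBTINC, DELINQ, DEROG,
                             LOAN, MORTDUE, NINQ, VALUE, YOJ):
    # FCMP passes ten scalar arguments; build one observation and score it
    X = [[CLAGE, CLNO, DEBTINC, DELINQ, DEROG,
          LOAN, MORTDUE, NINQ, VALUE, YOJ]]
    return int(model.predict_proba(X)[0][1] > 0.5)

print(score_predictions_sketch(DummyModel(), 94.4, 9, 45.0, 0, 0,
                               1100, 25860, 1, 39025, 10.5))  # 1 (likely default)
```

The real function works the same way, except the model comes from joblib.load and the row is built with pandas.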

My SAS code calls this Python code from a SAS function defined in the next section.

The SAS code

Turning back to Excel, in the SAS Add-In side of the screen, I click on Programs. This displays a code editor which, as explained in this video, is like any other SAS code editor.

We will use this code editor to write, run and view results from our code.

The code below defines a FCMP function called Score_Python, that imports the Python script from my colleague and calls it from a SAS datastep. The output table, HMEQ_SCORED, is saved on the WORK library in SAS. Note: if you're attempting this in your own environment, make sure to update the script.py file location in the /* Getting Python file */ section.

/* Defining FCMP function */
proc fcmp outlib=work.fcmp.pyfuncs;
	/* Defining name and arguments of the Python function to be called */
 
	function Score_Python(CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
		/* Python object */
		declare object py(python);
 
		/* Getting Python file  */
		rc = py.infile("C:\assets\script.py");
 
		/* Send code to Python interpreter */
		rc = py.publish();
 
		/* Call python function with arguments */
		rc = py.call("score_predictions",CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
 
		/* Pass Python results to SAS variable */
		MyFCMPResult = py.results["scored"];
 
		return(MyFCMPResult);
	endsub;
run;
 
options cmplib=work.fcmp;
 
/* Calling FCMP function from data step */
data work.hmeq_scored;
	set work.imported_data;
	scored_bad = Score_Python(CLAGE, CLNO, DEBTINC, DELINQ, DEROG, LOAN, MORTDUE, NINQ, VALUE, YOJ);
	put scored_bad=;
run;

 

I place my code in the editor.

We're now ready to run the code. Select the 'Running man' icon in the editor.

The output

The result represents the outcome of the Python model scoring code. Once the code completes, the WORK.HMEQ_SCORED worksheet updates with a new column, scored_bad.

The binary value represents whether the customer is likely (a '1') or unlikely (a '0') to default on his or her loan. I could now use any built-in Excel features to filter or further analyze the data. For instance, I could filter all the customers likely to default on their loans and pass a report on to the customer management team.

Final thoughts

In this article we've explored how collaboration between teams with different skills can streamline processes for efficient data analysis. Each team focuses on what they're good at and all of the work is organized and completed in one place. It is a win-win for everyone.

Related resources

Extending Excel with Python and SAS Viya was published on SAS Users.

April 2, 2020
 

In February, SAS was recognized as a Leader in the 2020 Gartner Magic Quadrant for Data Science & Machine Learning Platforms report. SAS is the only vendor to be a leader in this report for all seven years of its existence. According to us, the topic of the research is [...]

SAS Customer Intelligence 360: Analyst viewpoints was published on Customer Intelligence Blog.

April 2, 2020
 

Being a state governor has never been more difficult than right at this moment, because the extreme steps governors are being forced to take in order to protect their citizens’ health are having an equally extreme impact on other critical aspects of our society. In the face of social distancing [...]

Governors - here's how your data dashboard can help you lead through the pandemic was published on SAS Voices by Grant Brooks

April 1, 2020
 

During an epidemic, such as the coronavirus pandemic of 2020, the media often shows graphs of the cumulative numbers of confirmed cases for different countries. Often these graphs use a logarithmic scale for the vertical axis. In these graphs, a straight line indicates that new cases are increasing at an exponential rate. The slope of the line indicates how quickly cases will double, with steep lines indicating a short doubling time. The doubling time is the length of time required to double the number of confirmed cases, assuming nothing changes.

This article shows one way to estimate the doubling time by using the most recent data. The method uses linear regression to estimate the slope (m) of a curve, then estimates the doubling time as log(2) / m.

The data in this article are the cumulative counts of COVID-19 cases in four countries (Italy, the United States, Canada, and South Korea) for the dates 03Mar2020 to 27Mar2020. You can download the data and SAS program for this article.

A log-scale visualization of the cumulative cases

The data set contains four variables:

  • The Region variable specifies the country.
  • The Day variable specifies the number of days since 03Mar2020.
  • The Cumul variable specifies the cumulative counts of confirmed cases of COVID-19.
  • The Log10Cumul variable specifies the base-10 logarithm of the confirmed cases. In SAS, you can use the LOG10 function to compute the base-10 logarithm.

You can use PROC SGPLOT to visualize these data. The following graph plots the total number of cases, but uses the TYPE=LOG and LOGBASE=10 options to specify a base-10 logarithmic axis for the counts:

title "Cumulative Counts (log scale)";
proc sgplot data=Virus;
  where Cumul > 0;
  series x=Day y=Cumul / group=Region curvelabel;
  xaxis grid;
  yaxis type=LOG logbase=10 grid
        values=(100 500 1000 5000 10000 50000 100000) ValuesHint;
run;

This graph is sometimes called a semi-log graph because only one axis is displayed on a log scale. A straight line on a semi-log graph indicates exponential growth. However, not all exponential growth is equal. The slope of the line indicates how quickly the growth is occurring, and the doubling time is one way to measure the growth. A line with a steep slope indicates that the underlying quantity (confirmed cases) will double in a short period of time. A line with a flat slope indicates that the underlying quantity is not growing as quickly and will take a long time to double. For these data, the visualization reveals the following facts:

  • The curves for the United States and Canada have steep slopes.
  • The curve for South Korea is much flatter, which indicates that the number of confirmed cases is growing very slowly in that country.
  • The slope of the curve for Italy looks similar to the US curve for days 0–6, but then the Italy curve begins to flatten. Although the US and Italy had the same number of cases on Day 24, the slope of the Italy curve was less than the slope of the US curve. The interpretation is that (on Day 24) the estimated doubling time for US cases is shorter than for Italy.

An estimate of the slope at the end of each curve

Some researchers fit a linear regression to all values on the curve in order to estimate an average slope. This is usually not a good idea because interventions (such as stay-at-home orders) cause the curves to bend over time. This is clearly seen in the curves for Italy and South Korea.

You can get a better estimate for the current rate of growth if you fit a regression line by using only recent data values. I suggest using data from several previous days, such as the previous 5 or 7 days. You can use the REG procedure in SAS to estimate the slope of each line based on the five most recent observations:

%let MaxDay = 24;
proc reg data=Virus outest=Est noprint;
  where Day >= %eval(&MaxDay-4);           *previous 5 days: Day 20, 21, 22, 23, and 24;
  by Region notsorted;
  model log10Cumul = Day;
quit;

The Est data set contains estimates of the slope (and intercept) for the line that best fits the recent data. The estimates are shown in a subsequent section.

An estimate of the doubling time

You can use these estimated slopes to estimate the doubling time for each curve. If a quantity Y increases from Y0 at time t0 to 2*Y0 at some future time t0 + Δt, the value Δt is the doubling time. The next paragraph shows that the doubling time at t0 is log(2) / m, where m is an estimate of the slope at t0.

The idea is to use the tangent line at t0 to estimate the doubling time. Let log(Y) = m*t + b be the equation of the tangent line at t0 on the semi-log graph. When Y increases from Y0 to 2*Y0, the logarithm increases from log(Y0) to log(2*Y0) = log(2)+log(Y0). Since the slope is "rise over run," the tangent line reaches the doubled value when
m = [(log(2) + log(Y0)) - log(Y0)] / Δt = log(2) / Δt.
Solving for Δt gives
Δt = log(2) / m,
where m is the slope of a regression line for the semi-log curve at t0. This formula estimates the doubling time, which does not depend on the value of Y, only on the slope at t0.
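As a sanity check on the formula Δt = log(2) / m, here is a short plain-Python sketch (not the article's SAS program) that fits a slope to synthetic counts that double exactly every 4 days:

```python
import math

# Synthetic counts that double every 4 days
days = list(range(25))
counts = [100 * 2 ** (d / 4) for d in days]

# Least-squares slope of log10(count) vs. day over the last 5 days,
# mirroring the PROC REG step in the article
xs = days[-5:]
ys = [math.log10(c) for c in counts[-5:]]
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
m = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
     / sum((x - xbar) ** 2 for x in xs))

# Doubling time = log10(2) / m recovers the 4-day period
print(round(math.log10(2) / m, 6))  # 4.0
```

Note that the base of the logarithm cancels: as long as the slope is estimated from log10 counts, you divide log10(2) by it, exactly as the SAS program does.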

Estimate the doubling time from the slope

The following SAS DATA step estimates the doubling time by using the slope estimates at the end of each curve (Day 24):

data DoublingTime;
set Est(rename=(Day=Slope));
label Slope = "Slope of Line" DoublingTime = "Doubling Time";
DoublingTime = log10(2) / Slope;
keep Region Intercept Slope DoublingTime;
run;
 
proc print data=DoublingTime label noobs;
  format Slope DoublingTime 6.3;
  var Region Slope DoublingTime;
run;

For a pandemic, short doubling times are bad and long doubling times are good. Based on the 27Mar2020 data, the table estimates the doubling time for Italy to be 9 days. In contrast, the estimate for the US doubling time is about 3.3 days, and the estimate for Canada is about 2.5 days. The estimate for South Korea is 67 days, but over such a long time period the assumption that "the situation stays the same" is surely not valid.

Visualizing the doubling time

You can visualize the doubling times by adding an arrow to the end of each curve:

  • The base of the arrow is located at the most recent data.
  • The direction of the arrow is determined by the estimated slope of the curve.
  • The horizontal extent of the arrow is the doubling time.
  • The vertical extent of the arrow is twice the current count.

The arrows are shown in the following visualization, which excludes South Korea:

This graph shows how quickly the US and Canadian counts are predicted to double. The tip of each arrow indicates the time at which the number of cases are predicted to double. For the US and Canada, this is only a few days. The arrow for Italy indicates a longer time before the Italian cases double. Again, these calculations assume that the number of cases continues to grow at the estimated rate on Day 24. Because the curve for Italy appears to be flattening, the Italian estimate will (hopefully) be overly pessimistic.

Summary

In summary, you can use the slope of a cumulative curve (on a log scale) to estimate the doubling time for the underlying quantity. To find the slope at the most recent observation, you can fit a linear regression line to recent data. The doubling time is given by log(2)/m, where m is the estimate of the slope of the cumulative curve in a semi-log graph. If you want to visualize the doubling time on the graph, you can add an arrow to the end of each curve.

The post Estimates of doubling time for exponential growth appeared first on The DO Loop.