The little SAS program’s official name was Extract_Transform_Load_0314.sas. But that name was much too formal, way too long, and did not roll off the tongue very easily at all. So everybody simply called her: ETL Pi. ETL Pi was conceived in a 2-hour project strategy meeting in conference [...]
Last week I showed a timeline of living US presidents. The number of living presidents is constant during the time interval between inaugurations and deaths of presidents. The data was taken from a Wikipedia table (shown below) that shows the number of years and days between events. This article shows how you can use the INTCK and INTNX functions in SAS to compute the time between events in this format. In particular, I use two little-known options to these functions that make this task easy.
Intervals between dates
If you are computing the interval between two dates (a start date and an end date) there are two SAS functions that you absolutely must know about.
- The INTCK function returns the number of time units between two dates. For the time unit, you can choose years, months, weeks, days, and more. For example, in my previous article I used the INTCK function to determine the number of days between two dates.
- The INTNX function returns a SAS date that is a specified number of time units away from a specified date. For example, you can use the INTNX function to compute the date that is 308 days in the future from a given date.
These two functions complement each other: one computes the difference between two dates, the other enables you to add time units to a date value.
By default, these functions use the number of "calendar boundaries" between the dates, such as the first day of a year, month, or week. For example, if you choose to measure year intervals, the INTCK function counts how many times 01JAN occurred between the dates, and the INTNX function returns a future 01JAN date. Similarly, if you measure month intervals, the INTCK function counts how many first-of-the-months occur between two dates, and the INTNX function returns a future first-of-the-month date.
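For the 'year' interval, the default boundary-counting rule is simple enough to sketch in a line of Python (an illustration of the counting rule, not SAS code; the function name is my own):

```python
from datetime import date

def intck_year_default(start, end):
    """Count 01JAN boundaries between two dates (INTCK's default 'year' counting)."""
    return end.year - start.year

# 01JAN occurs 8 times between 30APR1789 and 04MAR1797
print(intck_year_default(date(1789, 4, 30), date(1797, 3, 4)))  # 8
```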
Options to compute anniversary dates
Both functions support many options to modify the default behavior. If you want to count full year intervals, instead of the number of times people celebrated New Year's Eve, these functions support options (as of SAS 9.2) to count the number of "anniversaries" between two dates and to compute the date of a future anniversary. You can use the 'CONTINUOUS' option for the INTCK function and the 'SAME' option for the INTNX function, as follows:
- The 'CONTINUOUS' option in the INTCK function enables you to count the number of anniversaries of one date that occur prior to a second date. For example, the statement
Years = intck('year', '30APR1789'd, '04MAR1797'd, 'continuous');
returns the value 7 because there are 7 full years (anniversaries of 30APR) between those two dates. Without the 'CONTINUOUS' option, the function returns 8 because 01JAN occurs 8 times between those dates.
- The statement
Anniv = intnx('year', '30APR1789'd, 7, 'same');
returns the 7th anniversary of the date 30APR1789. In other words, it returns the date value for 30APR1796.
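Outside of SAS, the same anniversary logic can be sketched with Python's standard datetime module (a hedged analogue, not SAS code; the function names are my own):

```python
from datetime import date

def anniversaries_between(start, end):
    """Count full years between two dates, like INTCK('year', ..., 'continuous')."""
    years = end.year - start.year
    # Back off one year if the anniversary has not yet occurred in end's year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years

def kth_anniversary(start, k):
    """Date k years after start, like INTNX('year', ..., k, 'same').
       Maps a 29FEB start to 28FEB when the target year is not a leap year."""
    try:
        return start.replace(year=start.year + k)
    except ValueError:               # start was 29FEB; target year is not a leap year
        return date(start.year + k, 2, 28)

print(anniversaries_between(date(1789, 4, 30), date(1797, 3, 4)))  # 7
print(kth_anniversary(date(1789, 4, 30), 7))                       # 1796-04-30
```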
The beauty of these functions is that they automatically handle leap years! If you request the number of days between two dates, the INTCK function includes leap days in the result. If an event occurs on a leap day, and you ask the INTNX function for the next anniversary of that event, you will get 28FEB of the next year, which is the most common convention for handling anniversaries of a leap day.
An algorithm to compute years and days between events
The following algorithm computes the number of years and days between dates in SAS:
- Use the INTCK function with the 'CONTINUOUS' option to compute the number of complete years between the two dates.
- Use the INTNX function to find a third date (the anniversary date) which is the same month and day as the start date, but occurs less than one year prior to the end date. (The anniversary of a leap day is either 28FEB or 29FEB, depending on whether the anniversary occurs in a leap year.)
- Use the INTCK function to compute the number of days between the anniversary date and the end date.
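For readers who want to check the arithmetic outside of SAS, the three steps translate directly into Python's standard datetime module (a sketch under the same leap-day convention; the function name is my own):

```python
from datetime import date

def years_and_days(start, end):
    # Step 1: complete years between the dates (like INTCK 'year' ... 'continuous')
    years = end.year - start.year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    # Step 2: the most recent anniversary on or before the end date
    # (like INTNX 'year' ... 'same'); 29FEB maps to 28FEB in non-leap years
    try:
        anniv = start.replace(year=start.year + years)
    except ValueError:
        anniv = date(start.year + years, 2, 28)
    # Step 3: days from the anniversary to the end date (like INTCK 'day')
    days = (end - anniv).days
    return years, days

# Washington's inauguration to J Adams's inauguration
print(years_and_days(date(1789, 4, 30), date(1797, 3, 4)))  # (7, 308)
```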
The following DATA step computes the time interval in years and days between the first few US presidential inaugurations and deaths. The resulting Year and Day variables contain the same information as is displayed in the Wikipedia table.
data YearDays;
   format Date prevDate Anniv Date9.;
   input @1 Date anydtdte12. @13 Event $26.;
   prevDate = lag(Date);
   if _N_=1 then do;       /* when _N_=1, lag(Date)=. */
      Years=.; Days=.;     /* set years & days, go to next obs */
      return;
   end;
   Years = intck('year', prevDate, Date, 'continuous');  /* num complete years */
   Anniv = intnx('year', prevDate, Years, 'same');       /* most recent anniv */
   Days  = intck('day', Anniv, Date);                    /* days since anniv */
datalines;
Apr 30, 1789 Washington Inaug
Mar 4, 1797 J Adams Inaug
Dec 14, 1799 Washington Death
Mar 4, 1801 Jefferson Inaug
Mar 4, 1809 Madison Inaug
Mar 4, 1817 Monroe Inaug
Mar 4, 1825 JQ Adams Inaug
Jul 4, 1826 Jefferson Death
Jul 4, 1826 J Adams Death
run;

proc print data=YearDays;
   var Event prevDate Date Anniv Years Days;
run;
Summary and references
In summary, the INTCK and INTNX functions are essential for computing intervals between dates. In this article, I emphasized two little-known options: the 'CONTINUOUS' option in INTCK and the 'SAME' option in INTNX. By using these options, you can compute the number of anniversaries between dates and the most recent anniversary. Thus you can compute the years and days between two dates.
There have been countless articles and papers written about SAS dates and finding intervals between dates. I recommend the following articles:
- For a brief introduction to SAS date and time functions, see Andrew Karp (2003), "Working with SAS Date and Time Functions." Unfortunately, this paper was written before the 'CONTINUOUS' and 'SAME' options were added.
- I learned about the 'SAME' option from a short paper by Bruce Gilsen (2006), "Improve Your Dating: The INTNX Function Alignment Value SAMEDAY."
- A more advanced paper with many examples, including examples of the 'CONTINUOUS' and 'SAME' options, is Derek Morgan (2015), "Demystifying Date and Time Intervals." Derek also wrote the book The Essential Guide to SAS Dates and Times (Second Edition, 2014).
Lastly, do you know what the acronyms INTCK and INTNX stand for? Obviously the 'INT' part refers to INTervals. The general consensus is that 'INTCK' stands for 'Interval Check' and 'INTNX' stands for "Interval Next."
The post INTCK and INTNX: Two essential functions for computing intervals between dates in SAS appeared first on The DO Loop.
An idiom is a group of words established by usage as having a meaning not deducible from those of the individual words. For example, "don't cry over spilled milk," or "the cat is out of the bag." Idioms are fun to use, and fun to hear - don't you agree? And [...]
Ivor Moan explains how SAS Data Management software can help you address GDPR requirements.
The post Tackle GDPR data management requirements using SAS appeared first on The Data Roundtable.
When a busy university analytics team is tasked with creating a new, interactive way to share data with dozens of different constituents, data visualization from SAS is the obvious answer. The University of Idaho’s office of Institutional Effectiveness and Accreditation is the source for comprehensive information, analyses and university statistics. [...]
Multivariate testing (MVT) is another “decision helper” in SAS® Customer Intelligence 360 that is geared toward empowering digital marketers to be smarter in their daily job. MVT is the way to go when you want to understand how multiple different web page elements interact with each other to influence goal conversion rate. A web page is a complex assortment of content, and it is intuitive to expect that the whole is greater than the sum of the parts. So, why is MVT less prominent in the web marketer’s toolkit?
One major reason – cost. In terms of traffic and opportunity cost, there is a combinatoric explosion in unique versions of a page as the number of elements and their associated levels increases. For example, a page with four content spots, each of which has four possible creatives, leads to a total of 256 distinct versions of that page to test.
If you want to be confident in the test results, then you need each combination, or variant, to be shown to a reasonable sample size of visitors. In this case, assume this to be 10,000 visitors per variant, leading to 2.5 million visitors for the entire test. That might take 100 or more days on a reasonably busy site. But by that time, not only will the web marketer have lost interest – the test results will likely be irrelevant.
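The arithmetic behind that estimate is easy to verify (the daily traffic figure is my assumption, chosen to reproduce the rough numbers above):

```python
# Full-factorial cost of a multivariate test: 4 spots, 4 creatives per spot
spots, creatives_per_spot = 4, 4
variants = creatives_per_spot ** spots            # 4**4 = 256 page versions
visitors_per_variant = 10_000                     # assumed adequate sample size
total_visitors = variants * visitors_per_variant  # about 2.5 million visitors
daily_traffic = 25_000                            # assumed visitors per day
print(variants, total_visitors, total_visitors // daily_traffic)  # 256 2560000 102
```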
A/B testing: The current standard
Today, for expedience, web marketers often choose simpler, sequential A/B tests. Because an A/B test can only tell you about the impact of one element and its variations, deciding which elements to test first in a sequence of tests is a matter of intuition.
Running a good A/B test requires consideration of any confounding factors that could bias the results. For example, someone changing another page element during a set of sequential A/B tests can invalidate the results. Changing the underlying conditions can also reduce reliability of one or more of the tests.
The SAS Customer Intelligence 360 approach
The approach SAS has developed is the opposite of this. First, you run an MVT across a set of spots on a page. Each spot has two or more candidate creatives available. Then you look to identify a small number of variants with good performance. These are then used for a subsequent A/B test to determine the true winner. The advantage is that underlying factors are better accounted for and, most importantly, interaction effects are measured.
But, of course, the combinatoric challenge is still there. This is not a new problem – experimental design has a history going back more than 100 years – and various methods were developed to overcome it. Among these, Taguchi designs are the best known. There are others as well, and most of these place strict requirements on the type of design.
SAS Customer Intelligence 360 provides a business-user interface which allows the marketing user to:
- Set up a multivariate test.
- Define exclusion and inclusion rules for specific variants.
- Optimize the design.
- Place it into production.
- Examine the results and take action.
The analytic heavy lifting is done behind the scenes, and the marketer only needs to make choices for business relevant parameters.
MVT made easy
The immediate benefit is that multivariate tests are now feasible. The chart below illustrates the reduction in sample size for a test on a page with four spots. The red line shows the number of variants required for a conventional test, and how this number increases exponentially with the number of content items per spot.
In contrast, the blue line shows the number of variants required for the optimized version of the test. Even with three content items per spot, there is a 50 percent reduction in the number of unique variants, and this percentage grows larger as the number of items increases. We can translate these numbers into test duration by making reasonable assumptions about the required sample size per variant (10,000 visitors) and about the traffic volume for that page (50,000 visitors per day). The result is shown below.
A test that would have taken 50 days will take only 18 days using SAS’ optimized multivariate testing feature. More impressively, a test that would take 120 days to complete can be completed in 25 days.
What about those missing variants?
If only a subset of the combinations is shown, how can the marketer understand what would happen for an untested variant? Simple. SAS Customer Intelligence 360 fits a model using the results for the tested variants and uses this to predict the outcomes for untested combinations. You can simulate the entire multivariate test and draw reliable conclusions in the process.
The Top Variant Performance report in the upper half of the results summary above indicates the lift for the best-performing variants relative to a champion variant (usually the business-as-usual version of the page). The lower half of the results summary (Variant Metrics) represents each variant as a point located according to a measured or predicted conversion rate. Each point also has a confidence interval associated with the measurement. In the above example, it’s easy to see that there is no clear winner for this test. In fact, the top five variants cannot reliably be separated. In this case, the marketer can use the results from this multivariate test to automatically set up an A/B test. Unlike the A/B-first approach, narrowing down the field using an optimized multivariate test homes in on the best candidates while accounting for interaction effects.
Making MVT your go-to option
Until now, multivariate testing has been limited to small experiments for all but the busiest websites. SAS Customer Intelligence 360 brings the power of multivariate testing to more users, without requiring them to have intimate knowledge of design-of-experiments theory. While multivariate testing will always require larger sample sizes than simple A/B testing, the capabilities presented here show how many more practical use cases can be addressed.
In the Tuscan hilltop village of Montepulciano, a tourist can visit the shop of coppersmith Cesare Mazzetti. But Mazzetti is not just running his family’s traditional business, he’s demonstrating all the qualities of a historic craftsman. The visitor may see him work and have him show them photographs of some of [...]
A SAS customer asked how to simulate data from a three-parameter lognormal distribution as specified in the PROC UNIVARIATE documentation. In particular, he wanted to incorporate a threshold parameter into the simulation.
Simulating lognormal data is easy if you remember an important fact: if X is lognormally distributed, then Y=log(X) is normally distributed. The converse is also true: If Y is normally distributed, then X=exp(Y) is lognormally distributed. Consequently, to simulate lognormal data you can simulate Y from the normal distribution and exponentiate it to get X, which is lognormally distributed by definition. If you want, you can add a threshold parameter to ensure that all values are greater than the threshold.
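This fact is easy to check in any language; here is a quick sketch using Python's standard library (an illustration of the definition, not part of the SAS example):

```python
import math
import random

random.seed(1)
zeta, sigma = 2.0, 0.5

# Simulate lognormal data by exponentiating normal draws: X = exp(Y), Y ~ N(zeta, sigma)
xs = [math.exp(random.gauss(zeta, sigma)) for _ in range(100_000)]

# Then log(X) should look like N(zeta, sigma); check the sample mean and std dev
logs = [math.log(x) for x in xs]
m = sum(logs) / len(logs)
s = math.sqrt(sum((v - m) ** 2 for v in logs) / (len(logs) - 1))
print(round(m, 2), round(s, 2))   # sample moments close to 2.0 and 0.5
```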
Simulate a sample from a two-parameter lognormal distribution
To reiterate: if Y ~ N(μ, σ) is normally distributed with location parameter μ and scale parameter σ, then X = exp(Y) is lognormally distributed with log-location parameter μ and log-scale parameter σ. Different authors use different names for the μ and σ parameters. The PROC UNIVARIATE documentation uses the symbol ζ (zeta) instead of μ, and it calls ζ a scale parameter. Hence, I will use the symbol ζ, too. I have previously written about the relationship between the two lognormal parameters and the mean and variance of the lognormal distribution.
Regardless of what name and symbol you use, you can use the definition to simulate lognormal data. The following SAS DATA step simulates one sample of size 1000 from a lognormal distribution with parameters ζ=2 and σ=0.5. PROC UNIVARIATE then fits a two-parameter lognormal distribution to the simulated data. The maximum likelihood estimates for the sample are 2.01 and 0.49, so the estimates from the simulated data are very close to the parameter values:
ods graphics / reset;
%let N = 1000;                   /* sample size */
data LN1;
   call streaminit(98765);
   sigma = 0.5;                  /* shape or log-scale parameter */
   zeta  = 2;                    /* scale or log-location parameter */
   do i = 1 to &N;
      Y = rand("Normal", zeta, sigma);   /* Y ~ N(zeta, sigma) */
      X = exp(Y);                        /* X ~ LogN(zeta, sigma) */
      output;
   end;
   keep X;
run;

proc univariate data=LN1;        /* visualize simulated data and check fit */
   histogram X / lognormal endpoints=(0 to 50 by 5)
                 odstitle="Simulated Lognormal Data (zeta=2, sigma=0.5)";
run;
Simulate many samples from a three-parameter lognormal distribution
You can modify the previous program to simulate from a lognormal distribution that has a threshold parameter. You simply add the threshold value to the exp(Y) value, like this: X = theta + exp(Y). Because exp(Y) is always positive, X is always greater than theta, which is the threshold value.
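In Python terms, the threshold shift is a one-line change (a sketch; the theta, zeta, and sigma values mirror the SAS program in this section):

```python
import math
import random

random.seed(123)
theta, zeta, sigma = 10.0, 2.0, 0.5

# Add the threshold to the exponentiated normal draw: X = theta + exp(Y)
xs = [theta + math.exp(random.gauss(zeta, sigma)) for _ in range(10_000)]

# Because exp(Y) > 0, every simulated value exceeds the threshold
print(min(xs) > theta)   # True
```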
In Monte Carlo simulation studies, you often want to investigate the sampling distribution of the model parameter estimates. That is, you want to generate many samples from the same model and see how the estimates differ across the random samples. The following DATA step simulates 500 random samples from the three-parameter lognormal distribution with threshold value 10. You can analyze all the samples with one call to PROC UNIVARIATE that uses the BY statement to identify each sample. This is the efficient way to perform Monte Carlo simulation studies in SAS.
%let N = 100;               /* sample size */
%let NumSamples = 500;      /* number of samples */
%let Threshold = 10;
data LN;                    /* generate many random samples */
   call streaminit(98765);
   sigma = 0.5;             /* shape or log-scale parameter */
   zeta  = 2;               /* scale or log-location parameter */
   do SampleID = 1 to &NumSamples;
      do i = 1 to &N;
         Y = rand("Normal", zeta, sigma);
         X = &Threshold + exp(Y);
         output;
      end;
   end;
   keep SampleID X;
run;

ods exclude all;            /* do not produce tables during analyses */
proc univariate data=LN;
   by SampleID;             /* analyze the many random samples */
   histogram x / lognormal(theta=&Threshold);   /* 2-param estimation */
   ods output parameterestimates=PE;
run;
ods exclude none;

data Estimates;             /* convert from long to wide for plotting */
   keep SampleID Zeta Sigma;
   merge PE(where=(Parameter="Scale") rename=(Estimate=Zeta))
         PE(where=(Parameter="Shape") rename=(Estimate=Sigma));
   by SampleID;
   label Zeta="zeta: Estimates of Scale (log-location)"
         Sigma="sigma: Estimate of Shape (log-scale)";
run;

title "Approximate Sampling Distribution of Lognormal Estimates";
title2 "Estimates from &NumSamples Random Samples (N=&N)";
proc sgplot data=Estimates;
   scatter x=Zeta y=Sigma;
   refline 2 / axis=x;
   refline 0.5 / axis=y;
run;
The distribution of the 500 estimates appears to be centered on (ζ, σ) = (2, 0.5), which are the parameter values that were used to simulate the data. You can use the usual techniques in Monte Carlo simulation to estimate the standard deviation of the estimates.
A few closing remarks:
- The RAND function does not support location and scale parameters for the lognormal distribution in SAS 9.4m4. However, the RANDGEN function in SAS/IML does support two-parameter lognormal parameters. The RAND function will support lognormal parameters in 9.4m5.
- In this study, the estimates are all two-parameter estimates, which assumes that you know the threshold value in the population. If not, you can use THETA=EST on the HISTOGRAM statement to obtain three-parameter lognormal estimates.
- Because you need to exponentiate the Y variable, random values of Y must be less than the value of CONSTANT('logbig'), which is about 709. To avoid numerical overflows, make sure that ζ + 4*σ is safely less than 709.
- This sort of univariate simulation is discussed in detail in Chapter 7 of the book Simulating Data with SAS, along with a general discussion about how to simulate from location-scale families even for distributions for which the RAND function does not support location or scale parameters.