May 16, 2017
 

Artificial intelligence. Big data. Cognitive computing. These buzzwords are the ABCs of today’s marketplace. In a recent interview at SAS® Global Forum, I discussed the unprecedented pace of change that we’re seeing in the market. It’s creating what I like to call an analytics economy. In this economy, analytics – [...]

How companies can succeed in an analytics economy was published on SAS Voices by Randy Guard

May 16, 2017
 

The forecast is frightening:  Robots will take over all manual labor and self-generating code will automatically spin out the algorithms once developed by statisticians and programmers.  Humans will become obsolete.  What will mere mortals do all day long? Ride captive as our self-driving cars take us on a sentimental journey [...]

Teaching the machine a lesson was published on SAS Voices by Elliot Inman

May 16, 2017
 

I guess a coding dinosaur is someone who uses an old/legacy computer language, or at least a language that isn't en vogue these days. Coding dinosaurs are still around (and probably will be for a while), whereas the real dinosaurs that lived millions of years ago are extinct. What caused [...]

The post What happened to the dinosaurs? (not the coding dinosaurs - the real ones!) appeared first on SAS Learning Post.

May 15, 2017
 

The little SAS program’s official name was Extract_Transform_Load_ 0314.sas.  But, that name was much too formal, way too long, and did not roll off of the tongue very easily at all.  So, everybody simply called her:  ETL Pi. ETL Pi was conceived in a 2-hour project strategy meeting in conference [...]

The post The Little SAS Program’s Big Night Out appeared first on SAS Learning Post.

May 15, 2017
 

Last week I showed a timeline of living US presidents. The number of living presidents is constant during the time interval between inaugurations and deaths of presidents. The data was taken from a Wikipedia table (shown below) that shows the number of years and days between events. This article shows how you can use the INTCK and INTNX functions in SAS to compute the time between events in this format. In particular, I use two little-known options to these functions that make this task easy.

Intervals between dates

If you are computing the interval between two dates (a start date and an end date) there are two SAS functions that you absolutely must know about.

  • The INTCK function returns the number of time units between two dates. For the time unit, you can choose years, months, weeks, days, and more. For example, in my previous article I used the INTCK function to determine the number of days between two dates.
  • The INTNX function returns a SAS date that is a specified number of time units away from a specified date. For example, you can use the INTNX function to compute the date that is 308 days in the future from a given date.

These two functions complement each other: one computes the difference between two dates, the other enables you to add time units to a date value.

By default, these functions use the number of "calendar boundaries" between the dates, such as the first day of a year, month, or week. For example, if you choose to measure year intervals, the INTCK function counts how many times 01JAN occurred between the dates, and the INTNX function returns a future 01JAN date. Similarly, if you measure month intervals, the INTCK function counts how many first-of-the-months occur between two dates, and the INTNX function returns a future first-of-the-month date.
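The boundary-counting rule is easy to check outside SAS. The following Python sketch mirrors the default behavior described above; the function names `year_boundaries` and `month_boundaries` are hypothetical helpers, not part of SAS or of any library, and the sketch covers only the 'year' and 'month' intervals.

```python
from datetime import date

def year_boundaries(start: date, end: date) -> int:
    """Count how many times 01JAN occurs after start and on or before end."""
    return end.year - start.year

def month_boundaries(start: date, end: date) -> int:
    """Count how many first-of-the-month dates occur after start and on or before end."""
    return (end.year - start.year) * 12 + (end.month - start.month)

# Two dates that are one day apart still straddle one year boundary
# and one month boundary.
print(year_boundaries(date(2016, 12, 31), date(2017, 1, 1)))   # 1
print(month_boundaries(date(2016, 12, 31), date(2017, 1, 1)))  # 1

# Washington's inauguration to J. Adams's inauguration: 01JAN occurs 8 times.
print(year_boundaries(date(1789, 4, 30), date(1797, 3, 4)))    # 8
```

The first example shows why the default count can be surprising: it measures boundaries crossed, not elapsed time.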

Options to compute anniversary dates

Both functions support many options to modify the default behavior. If you want to count full year intervals, instead of the number of times people celebrated New Year's Eve, these functions support options (as of SAS 9.2) to count the number of "anniversaries" between two dates and to compute the date of a future anniversary. You can use the 'CONTINUOUS' option for the INTCK function and the 'SAME' option for the INTNX function, as follows:

  • The 'CONTINUOUS' option in the INTCK function enables you to count the number of anniversaries of one date that occur prior to a second date. For example, the statement
    Years = intck('year', '30APR1789'd, '04MAR1797'd, 'continuous');
    returns the value 7 because there are 7 full years (anniversaries of 30APR) between those two dates. Without the 'CONTINUOUS' option, the function returns 8 because 01JAN occurs 8 times between those dates.
  • The statement
    Anniv = intnx('year', '30APR1789'd, 7, 'same');
    returns the 7th anniversary of the date 30APR1789. In other words, it returns the date value for 30APR1796.

The beauty of these functions is that they automatically handle leap years! If you request the number of days between two dates, the INTCK function includes leap days in the result. If an event occurs on a leap day, and you ask the INTNX function for the next anniversary of that event, you will get 28FEB of the next year, which is the most common convention for handling anniversaries of a leap day.

An algorithm to compute years and days between events

The following algorithm computes the number of years and days between dates in SAS:

  • Use the INTCK function with the 'CONTINUOUS' option to compute the number of complete years between the two dates.
  • Use the INTNX function to find a third date (the anniversary date) which is the same month and day as the start date, but occurs less than one year prior to the end date. (The anniversary of a leap day is either 28FEB or 29FEB, depending on whether the anniversary occurs in a leap year.)
  • Use the INTCK function to compute the number of days between the anniversary date and the end date.
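For readers who want to cross-check the three steps without SAS, here is a minimal Python sketch of the same idea. The helpers `add_years` and `years_and_days` are hypothetical names introduced for illustration. The sketch assumes the start date is not after the end date, and it assumes that a 29FEB start date counts a 28FEB anniversary in non-leap years, following the convention described above; it does not claim to reproduce every edge case of INTCK and INTNX.

```python
from datetime import date

def add_years(d: date, n: int) -> date:
    """Same month and day n years later; 29FEB falls back to 28FEB in a non-leap year."""
    try:
        return d.replace(year=d.year + n)
    except ValueError:  # 29FEB requested in a non-leap target year
        return d.replace(year=d.year + n, day=28)

def years_and_days(start: date, end: date) -> tuple:
    """Return (complete years, remaining days) between start and end."""
    # Step 1: number of complete years (anniversaries on or before the end date).
    years = end.year - start.year
    if add_years(start, years) > end:
        years -= 1
    # Step 2: the most recent anniversary of the start date.
    anniv = add_years(start, years)
    # Step 3: days from that anniversary to the end date.
    return years, (end - anniv).days

# Washington's inauguration to J. Adams's inauguration.
print(years_and_days(date(1789, 4, 30), date(1797, 3, 4)))   # (7, 308)
# J. Adams's inauguration to Washington's death.
print(years_and_days(date(1797, 3, 4), date(1799, 12, 14)))  # (2, 285)
```

The first result agrees with the values quoted earlier: 7 complete years, and an end date that is 308 days after the 30APR1796 anniversary.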

The following DATA step computes the time interval in years and days between the first few US presidential inaugurations and deaths. The resulting Year and Day variables contain the same information as is displayed in the Wikipedia table.

data YearDays;
format Date prevDate anniv Date9.;
input @1  Date anydtdte12.
      @13 Event $26.;
prevDate = lag(Date);
if _N_=1 then do;                               /* when _N_=1, lag(Date)=. */
   Years=.; Days=.; return;            /* set years & days, go to next obs */
end;
Years = intck('year', prevDate, Date, 'continuous'); /* num complete years */
Anniv = intnx('year', prevDate, Years, 'same');      /* most recent anniv  */
Days = intck('day', anniv, Date);                    /* days since anniv   */
datalines;
Apr 30, 1789 Washington Inaug
Mar 4, 1797  J Adams Inaug
Dec 14, 1799 Washington Death
Mar 4, 1801  Jefferson Inaug
Mar 4, 1809  Madison Inaug
Mar 4, 1817  Monroe Inaug
Mar 4, 1825  JQ Adams Inaug
Jul 4, 1826  Jefferson Death
Jul 4, 1826  J Adams Death
run;
 
proc print data=YearDays;
var Event prevDate Date Anniv Years Days;
run;

Summary and references

In summary, the INTCK and INTNX functions are essential for computing intervals between dates. In this article, I emphasized two little-known options: the 'CONTINUOUS' option in INTCK and the 'SAME' option in INTNX. By using these options, you can compute the number of anniversaries between dates and the most recent anniversary. Thus you can compute the years and days between two dates.

There have been countless articles and papers written about SAS dates and finding intervals between dates.

Lastly, do you know what the acronyms INTCK and INTNX stand for? Obviously the 'INT' part refers to INTervals. The general consensus is that 'INTCK' stands for 'Interval Check' and 'INTNX' stands for "Interval Next."

The post INTCK and INTNX: Two essential functions for computing intervals between dates in SAS appeared first on The DO Loop.

May 12, 2017
 

An idiom is a group of words established by usage as having a meaning not deducible from those of the individual words. For example, "don't cry over spilled milk,"  or "the cat is out of the bag." Idioms are fun to use, and fun to hear - don't you agree? And [...]

The post Map of idioms, from around the world appeared first on SAS Learning Post.

May 12, 2017
 
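A recent real-world project required building a complex network. I had never done hands-on work in this area; until now I had mainly read papers, especially on graph computation models for big data. Working with Giraph, the Hadoop-based graph computation framework (used in production at Facebook), deepened my understanding of Pregel through practice, including implementing a heat-conduction algorithm and more.
Learning and practicing Hadoop graph frameworks   giraph http://giraph.apache.org/ , http://grafos.ml/  http://arabesque.io/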
Arabesque: A System for Distributed Graph Mining http://sigops.org/sosp/sosp15/current/2015-Monterey/printable/093-teixeira.pdf
tensorflow https://github.com/skcript/tensorflow-resources
spark https://github.com/endymecy/spark-ml-source-analysis
spark: turning off runtime logging http://stackoverflow.com/questions/25193488/how-to-turn-off-info-logging-in-pyspark
Architecture of Giants: Data Stacks at Facebook, Netflix, Airbnb, and Pinterest https://blog.keen.io/architecture-of-giants-data-stacks-at-facebook-netflix-airbnb-and-pinterest-9b7cd881af54?imm_mid=0f1550
TensorFlow template application for deep learning https://github.com/tobegit3hub/deep_recommend_system
Top 20 Recent Research Papers on Machine Learning and Deep Learning http://www.kdnuggets.com/2017/04/top-20-papers-machine-learning.html
jblas http://jblas.org/
spark machine learning https://book.douban.com/subject/26350074/
machine learning dataset http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf
CTR predict
Y. W. Chang, C. J. Hsieh, K. W. Chang, M. Ringgaard, and C.-J. Lin, “Training and testing low-degree polynomial data mappings via linear SVM,” Journal of Machine Learning Research, vol. 11, pp. 1471–1490, 2010.
T. Kudo and Y. Matsumoto, “Fast methods for kernel-based text analysis,” in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), 2003.
S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
B. McMahan, G. Holt, D. Sculley, “Ad Click Prediction: a View from the Trenches”
J. Pan, O. Jin, T. Xu, “Practical Lessons from Predicting Clicks on Ads at Facebook”
Y. Juan, Y. Zhuang, W. Chin, “Field-aware Factorization Machines for CTR Prediction”
G. James, D. Witten, T. Hastie, R. Tibshirani, “An Introduction to Statistical Learning”, 2013.
Neural Models for Information Retrieval https://arxiv.org/pdf/1705.01509.pdf
https://kowshik.github.io/JPregel/pregel_paper.pdf Pregel: A System for Large-Scale Graph Processing
Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction https://arxiv.org/abs/1704.05194
High Performance Linear Algebra OOP https://github.com/fommil/matrix-toolkits-java
Scraping JD.com review data https://github.com/awolfly9 

 
 Posted by at 8:32 PM
May 11, 2017
 

When a busy university analytics team is tasked with creating a new, interactive way to share data with dozens of different constituents, data visualization from SAS is the obvious answer. The University of Idaho’s office of Institutional Effectiveness and Accreditation is the source for comprehensive information, analyses and university statistics. [...]

University of Idaho replaces legacy BI system with SAS Visual Analytics was published on SAS Voices by Georgia Mariani

May 11, 2017
 

Multivariate testing (MVT) is another “decision helper” in SAS® Customer Intelligence 360 that is geared toward empowering digital marketers to be smarter in their daily jobs. MVT is the way to go when you want to understand how multiple different web page elements interact with each other to influence goal conversion rate. A web page is a complex assortment of content and it is intuitive to expect that the whole is greater than the sum of the parts. So, why is MVT less prominent in the web marketer’s toolkit?

One major reason – cost. In terms of traffic and opportunity cost, there is a combinatoric explosion in unique versions of a page as the number of elements and their associated levels increases. For example, a page with four content spots, each of which has four possible creatives, leads to a total of 4 × 4 × 4 × 4 = 256 distinct versions of that page to test.

If you want to be confident in the test results, then you need each combination, or variant, to be shown to a reasonable sample size of visitors. In this case, assume this to be 10,000 visitors per variant, leading to about 2.56 million visitors for the entire test. That might take 100 or more days on a reasonably busy site. But by that time, not only will the web marketer have lost interest – the test results will likely be irrelevant.
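The arithmetic behind these numbers can be sketched in a few lines of Python. The spot count, creative count, and per-variant sample size come from the example above; the daily traffic of 25,000 visitors is a hypothetical figure, chosen only to illustrate how a test of this size can exceed 100 days.

```python
from itertools import product

spots = 4                      # content spots on the page
creatives_per_spot = 4         # candidate creatives for each spot
visitors_per_variant = 10_000  # assumed sample size per variant

# Every variant is one choice of creative for each spot.
variants = list(product(range(creatives_per_spot), repeat=spots))
total_visitors = len(variants) * visitors_per_variant

def days_needed(total: int, visitors_per_day: int) -> float:
    """Days required to show the test to `total` visitors at a given daily traffic."""
    return total / visitors_per_day

print(len(variants))                        # 256
print(total_visitors)                       # 2560000
print(days_needed(total_visitors, 25_000))  # 102.4 (hypothetical daily traffic)
```

Adding a fifth creative to each spot raises the variant count to 625, which is why the cost escalates so quickly.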

A/B testing: The current standard

Today, for expedience, web marketers often choose simpler, sequential A/B tests. Because an A/B test can only tell you about the impact of one element and its variations, it is a matter of intuition when deciding which elements to start with when running sequential tests.

Running a good A/B test requires consideration of any confounding factors that could bias the results. For example, someone changing another page element during a set of sequential A/B tests can invalidate the results. Changing the underlying conditions can also reduce reliability of one or more of the tests.

The SAS Customer Intelligence 360 approach

The approach SAS has developed is the opposite of this. First, you run an MVT across a set of spots on a page. Each spot has two or more candidate creatives available. Then you look to identify a small number of variants with good performance. These are then used for a subsequent A/B test to determine the true winner. The advantage is that underlying factors are better accounted for and, most importantly, interaction effects are measured.

But, of course, the combinatoric challenge is still there. This is not a new problem – experimental design has a history going back more than 100 years – and various methods were developed to overcome it. Among these, Taguchi designs are the best known. There are others as well, and most of these have strict requirements on the type of design.

SAS Customer Intelligence 360 provides a business-user interface which allows the marketing user to:

  • Set up a multivariate test.
  • Define exclusion and inclusion rules for specific variants.
  • Optimize the design.
  • Place it into production.
  • Examine the results and take action.

The analytic heavy lifting is done behind the scenes, and the marketer only needs to make choices for business relevant parameters.

MVT made easy

The immediate benefit is that multivariate tests are now feasible. The chart below illustrates the reduction in sample size for a test on a page with four spots. The red line shows the number of variants required for a conventional test, and how this increases steeply (as the fourth power, for four spots) with the number of content items per spot.


In contrast, the blue line shows the number of variants required for the optimized version of the test. Even with three content items per spot, there is a 50 percent reduction in the number of unique variants, and this percentage grows larger as the number of items increases. We can translate these numbers into test duration by making reasonable assumptions about the required sample size per variant (10,000 visitors) and about the traffic volume for that page (50,000 visitors per day). The result is shown below.

A test that would have taken 50 days will only take 18 days using SAS’ optimized multivariate testing feature. More impressively, a test that would take 120 days to complete can be completed in 25 days.
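As a rough check on these durations, the Python sketch below applies the stated assumptions (four spots, 10,000 visitors per variant, 50,000 visitors per day) to a conventional full-factorial test. The function names are hypothetical. The variant counts behind the optimized tests are not given in the text, so the last two lines only back out what the quoted 18-day and 25-day durations would imply under the same assumptions.

```python
SPOTS = 4
VISITORS_PER_VARIANT = 10_000
VISITORS_PER_DAY = 50_000

def full_factorial_variants(items_per_spot: int, spots: int = SPOTS) -> int:
    """Number of variants when every combination of items is tested."""
    return items_per_spot ** spots

def duration_days(variants: int) -> float:
    """Days needed to reach the per-variant sample size for all variants."""
    return variants * VISITORS_PER_VARIANT / VISITORS_PER_DAY

def variants_for_days(days: float) -> float:
    """Variant count implied by a test duration (the inverse of duration_days)."""
    return days * VISITORS_PER_DAY / VISITORS_PER_VARIANT

for k in range(2, 6):
    n = full_factorial_variants(k)
    print(k, n, duration_days(n))
# 2 16 3.2
# 3 81 16.2
# 4 256 51.2
# 5 625 125.0

print(variants_for_days(18))  # 90.0
print(variants_for_days(25))  # 125.0
```

Under these assumptions, the 50-day and 120-day figures are roughly consistent with four and five content items per spot, respectively.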

What about those missing variants?

If only a subset of the combinations is being shown, how can the marketer understand what would happen for an untested variant? Simple. SAS Customer Intelligence 360 fits a model using the results for the tested variants and uses this to predict the outcomes for untested combinations. You can simulate the entire multivariate test and draw reliable conclusions in the process.

The Top Variant Performance report in the upper half of the results summary above indicates the lift for the best-performing variants relative to a champion variant (usually the business-as-usual version of the page). The lower half of the results summary (Variant Metrics) represents each variant as a point located according to a measured or predicted conversion rate. Each point also has a confidence interval associated with the measurement. In the above example, it’s easy to see that there is no clear winner for this test. In fact, the top five variants cannot reliably be separated. In this case, the marketer can use the results from this multivariate test to automatically set up an A/B test. Unlike the A/B-first approach, narrowing down the field using an optimized multivariate test homes in on the best candidates while accounting for interaction effects.
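To see why variants with similar conversion rates cannot be separated, consider the Python sketch below. It uses a normal-approximation confidence interval for a proportion, which is a common textbook choice and not necessarily the method used by SAS Customer Intelligence 360. The conversion counts are hypothetical, and overlapping intervals are only a rough heuristic, not a formal significance test.

```python
from math import sqrt

def conversion_interval(conversions: int, visitors: int, z: float = 1.96) -> tuple:
    """Approximate 95% confidence interval for a conversion rate."""
    p = conversions / visitors
    half_width = z * sqrt(p * (1 - p) / visitors)
    return p - half_width, p + half_width

def overlaps(a: tuple, b: tuple) -> bool:
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical counts for three variants, 10,000 visitors each.
variant_a = conversion_interval(520, 10_000)  # 5.2% measured rate
variant_b = conversion_interval(500, 10_000)  # 5.0% measured rate
variant_c = conversion_interval(700, 10_000)  # 7.0% measured rate

print(overlaps(variant_a, variant_b))  # True: no clear winner between A and B
print(overlaps(variant_c, variant_b))  # False: C is clearly separated from B
```

Variants such as A and B in this sketch are the kind of close candidates that a follow-up A/B test is meant to resolve.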

Making MVT your go-to option

Until now, multivariate testing has been limited to small experiments for all but the busiest websites. SAS Customer Intelligence 360 brings the power of multivariate testing to more users, without requiring them to have intimate knowledge of design-of-experiments theory. While multivariate testing will always require larger sample sizes than simple A/B testing, the capabilities presented here show how many more practical use cases can be addressed.

Multivariate Testing: Test more in less time was published on Customer Intelligence Blog.