David Loshin discusses big data identity resolution in a programming and execution environment.
The post Understanding big data identity resolution appeared first on The Data Roundtable.
I'm sure I'm not the only one who has read and contributed to threads on the internet about all the different languages used for data mining. But one aspect that's been left out of most of these comparisons is that SAS is more than a 4th generation programming language (4GL). [...]
SAS is an analytical platform, not just a language was published on SAS Voices by David Pope
In my last blog, we examined the data pane in SAS Visual Analytics 8.1. That blog discussed how to have the data pane display the data items of your active data source, and how to perform tasks such as viewing measure details, changing data item properties, and creating geographic data items, hierarchies, and custom categories. In this blog, we’ll look at creating new calculated data items and calculated aggregations.
If you recall, you display the Data pane in the Visual Analytics interface by clicking the Data icon on the left menu.
A calculated data item is a new data item created from existing data by using an expression.
You can create a derived calculation from a category or measure data item by right-clicking on the data item and selecting Create calculation from data item.
For a category data item, you can create a distinct count, count, or number missing. Creating a derived calculation from a category data item:
For a measure data item, you can create a percent of total, or a periodic calculation based on one of your date data items. Creating a derived calculation from a measure data item:
Notice that in both cases, the new data item is an aggregation, so the new item will appear under the Aggregated Measure category in the data pane.
Note: In order to use the periodic calculation types, your selected data item must include the year.
You can also edit these new data items by right-clicking on the data item and selecting Edit. Editing a derived calculation:
There is now a single interface for creating calculated data items of type Numeric, Character, Date, or Datetime, as well as aggregated measures.
Creating a calculated data item or aggregated measure:
Specifying the calculation result type and format:
Some notes for using operators in calculations and aggregations:
In the same interface, you have access to simple and advanced numeric operators, simple and advanced text operators, along with boolean, date and time, and comparison operators for your calculations.
You also have access to periodic operators and simple and advanced aggregated operators for calculation aggregations.
The most important point to remember in using this interface is to think ahead as to whether you are creating a calculation (operating on each row) or an aggregation (operating across rows) and specify the data type and format before you begin to drag and drop data items and operators. The default data type is Numeric, but if you add an aggregation operator, the type will automatically switch to Aggregated Measure.
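To make the calculation-versus-aggregation distinction concrete, here is a minimal Python sketch. It is not SAS Visual Analytics code: the rows, the column names (`region`, `revenue`, `cost`), and the `profit` helper are hypothetical and exist only to show that a calculation yields one value per row, while an aggregation yields one value across rows.

```python
# Hypothetical rows standing in for a data source; all names are illustrative.
rows = [
    {"region": "East", "revenue": 120.0, "cost": 80.0},
    {"region": "West", "revenue": 200.0, "cost": 150.0},
    {"region": "East", "revenue": 90.0, "cost": None},  # cost is missing
]


def profit(row):
    """Calculation: evaluated once per row; returns None if cost is missing."""
    if row["cost"] is None:
        return None
    return row["revenue"] - row["cost"]


# One result per row.
row_profit = [profit(r) for r in rows]

# Aggregations: each one is evaluated across all rows.
total_revenue = sum(r["revenue"] for r in rows)
distinct_regions = len({r["region"] for r in rows})       # distinct count
missing_cost = sum(1 for r in rows if r["cost"] is None)  # number missing

# Percent of total combines a row value with an aggregate.
pct_of_total = [r["revenue"] / total_revenue for r in rows]

print(row_profit)
print(total_revenue, distinct_regions, missing_cost)
```

The last three aggregations loosely mirror the derived items mentioned earlier (distinct count, number missing, percent of total), under the assumption that a missing value is represented as `None`.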
Remember that you can also create calculated items of character, date, and datetime data types, and you can choose from a list of date and datetime formats for those data types.
The SAS Visual Analytics 8.1 Data Pane: Creating Calculations and Aggregations was published on SAS Users.
One of the big benefits of the SAS Viya platform is how approachable it is for programmers of other languages. You don't have to learn SAS in order to become productive quickly. We've seen a lot of interest from people who code in Python, maybe because that language has become known for its application in machine learning. SAS has a new product called SAS Visual Data Mining and Machine Learning. And these days, you can't offer such a product without also offering something special to those Python enthusiasts.
And so, SAS has published the Python SWAT project (where "SWAT" stands for the SAS scripting wrapper for analytical transfer). The project is a Python code library that SAS released using an open source model. That means that you can download it for free, make changes locally, and even contribute those changes back to the community (as some developers have already done!). You'll find it at github.com/sassoftware/python-swat.
SAS developer Kevin Smith is the main contributor on Python SWAT, and he's a big fan of Python. He's also an expert in SAS and in many programming languages. If you're a SAS user, you probably run Kevin's code every day; he was an original developer on the SAS Output Delivery System (ODS). Now he's a member of the cloud analytics team in SAS R&D. (He's also the author of more than a few conference papers and SAS books.)
Kevin enjoys the dynamic, fluid style that a scripting language like Python affords - versus the more formal "code-compile-build-execute" model of a compiled language. Watch this video (about 14 minutes) in which Kevin talks about what he likes in Python, and shows off how Python SWAT can drive SAS' machine learning capabilities.
The analytics engine behind the SAS Viya platform is called CAS, or SAS Cloud Analytic Services. You'll want to learn that term, because "CAS" is used throughout the SAS documentation and APIs. And while CAS might be new to you, the Python approach to CAS should feel very familiar for users of Python libraries, especially users of pandas, the Python Data Analysis Library.
CAS and SAS' Python SWAT extends these concepts to provide intuitive, high-performance analytics from SAS Viya in your favorite Python environment, whether that's a Jupyter notebook or a simple console. Watch the video to see Kevin's demo and discussion about how to get started. You'll learn:
There are plenty of helpful resources to help you learn about using Python with SAS Viya:
And finally, what if you don't have SAS Viya yet, but you're interested in using Python with SAS 9.4? Check out the SASPy project, which allows you to access your traditional SAS features from a Jupyter notebook or Python console. It's another popular open source project from SAS R&D.
The post Using Python to work with SAS Viya and CAS appeared first on The SAS Dummy.
Throughout my time as a student at NC State University, I have been involved with undergraduate research. Last semester I was part of a project to collect various biometric sensor data. Our goal was to use this data to measure stress and learn how certain tasks affect participants. My task was [...]
The post The Challenges of Biometric Data Collection appeared first on SAS Analytics U Blog.
Suppose you roll six identical six-sided dice. Chances are that you will see at least one repeated number. The probability that you will see six unique numbers is very small: only 6! / 6^6 ≈ 0.015.
This example can be generalized. If you draw a random sample with replacement from a set of n items, duplicate values occur much more frequently than most people think. This is the source of the famous "birthday matching problem" in probability, which shows that the probability of duplicate birthdays among a small group of people is higher than most people would guess. (Among 23 people, the probability of a duplicate birthday is more than 50%.)
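Both numbers quoted above can be checked with a few lines of Python. This is a small sketch, not part of the original post; the helper name `prob_all_distinct` is made up for illustration, and the birthday calculation assumes 365 equally likely birthdays.

```python
from math import factorial


def prob_all_distinct(k, n):
    """Probability that k draws with replacement from n equally
    likely items are all different."""
    p = 1.0
    for i in range(k):
        p *= (n - i) / n
    return p


# Six dice, six faces: same value as 6! / 6**6.
dice = prob_all_distinct(6, 6)

# Birthday problem: probability of at least one shared birthday among 23 people.
birthday = 1 - prob_all_distinct(23, 365)

print(f"P(six unique faces)      = {dice:.4f}")
print(f"P(duplicate among 23)    = {birthday:.4f}")
```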
This fact has implications for bootstrap resampling. Recall that if a sample has n observations, then a bootstrap sample is obtained by sampling n times with replacement from the data. Since most bootstrap samples contain a duplicate of at least one observation, it is also true that most samples omit at least one observation. That raises the question: On average, how many of the original observations are not present in a bootstrap sample?
I'll give you the answer: an average bootstrap sample contains 63.2% of the original observations and omits 36.8%. The book by Chernick and LaBudde (2011, p. 199) states the following result about bootstrap resamples: "If the sample size is large and we [generate many] bootstrap samples, we will find that on average, approximately 36.8% of the original observations will be missing from the individual bootstrap samples. Another way to look at this is that for any particular observation, approximately 36.8% of the bootstrap samples will not contain it." (Emphasis added.)
You can use elementary probability to derive this result. Suppose the original data contains n observations. A bootstrap sample is generated by sampling with replacement from the data. The probability that a particular observation is not chosen in a single draw from a set of n observations is 1 - 1/n, so the probability that the observation is not chosen in any of the n draws is (1 - 1/n)^n. This is the probability that the observation does not appear in a bootstrap sample.
You might remember from calculus that the limit as n → ∞ of (1 - 1/n)^n is 1/e. Therefore, when n is large, the probability that an observation is not chosen is approximately 1/e ≈ 0.368.
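To see how quickly (1 - 1/n)^n approaches 1/e, you can evaluate the expression for a few values of n. The Python sketch below is not part of the original post; the function name `prob_omitted` is hypothetical, and the values of n were chosen to include the two sample sizes used in the simulation (6 and 75).

```python
import math


def prob_omitted(n):
    """Exact probability that a given observation is absent from a
    bootstrap sample of size n drawn from n observations."""
    return (1 - 1 / n) ** n


limit = 1 / math.e

for n in (6, 75, 1000, 100000):
    p = prob_omitted(n)
    print(f"n = {n:>6}: (1 - 1/n)^n = {p:.4f}, gap to 1/e = {limit - p:.4f}")
```

The sequence increases toward 1/e from below, so for small n the exact omission probability is somewhat less than 0.368.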
If you prefer simulation to calculation, you can simulate many bootstrap samples and count the proportion of samples that do not contain observation 1, observation 2, and so forth. The average of those proportions should be close to 1/e.
The theoretical result is only valid for large values of n, but let's start with n=6 so that we can print out intermediate results during the simulation. The following SAS/IML program uses the SAMPLE function to simulate rolling six dice a total of 10 times. Each column of the matrix M contains the result of one trial:
    options linesize=128;
    proc iml;
    call randseed(54321);
    n = 6;
    NumSamples = 10;
    x = T(1:n);                     /* the original sample {1,2,...,n} */
    M = sample(x, NumSamples//n);   /* each column is a draw with replacement */
The table shows that each sample (column) contains duplicates. The first column does not contain the values {4,5,6}. The second column does not contain the value 6.
Given these samples, what proportion of samples does not contain 1? What proportion does not contain 2, and so forth? For this small simulation, we can answer these questions by visual inspection. The number 1 does not appear in the sample S5, so it does not appear in 0.1 of the samples. The number 2 does not appear in S3, S5, S8, S9, or S10, so it does not appear in 0.5 of the samples. The following SAS/IML statements count the proportion of columns that do not contain each number and then take the average of those proportions:
    /* how many samples do not contain x[1], x[2], etc */
    cnt = j(n, 1, .);        /* allocate space for results */
    do i = 1 to n;
       Y = (M=x[i]);         /* binary matrix */
       s = Y[+, ];           /* count for each sample (column) */
       cnt[i] = sum(s=0);    /* number of samples that do not contain x[i] */
    end;
    prop = cnt / NumSamples;
    avg_prop = mean(prop);
    print avg_prop;
The result says that, on the average, a sample does not contain 0.35 of the data values. Let's increase the sample size and the number of bootstrap samples. Change the parameters to the following and rerun the program:
    n = 75;
    NumSamples = 10000;
The new estimate is 0.365, which is close to the theoretical value of 1/e. If you like, you can plot a histogram of the n proportions that make up the average:
    title "Proportion of Samples That Do Not Contain an Observation";
    title2 "n=75; NumSamples=10000";
    call histogram(prop) label="Proportion"
         other="refline "+char(avg_prop)+"/axis=x;";
Elementary statistical theory tells us that the proportions are approximately normally distributed with mean p=1/e and standard deviation sqrt(p(1-p)/NumSamples) ≈ 0.00482. The mean and standard deviation of the simulated proportions are very close to the theoretical values.
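For readers who work in Python, the same experiment can be sketched with the standard library alone. This is an illustrative re-creation, not a translation of the SAS/IML program: the function name `omission_proportions` is made up, the random number stream differs from SAS even with the same seed, and the number of bootstrap samples is reduced to 2000 (an arbitrary choice) to keep the run short.

```python
import random
import statistics


def omission_proportions(n, num_samples, seed=54321):
    """For each of n observations, return the proportion of bootstrap
    samples (n draws with replacement) that do not contain it."""
    rng = random.Random(seed)
    missing = [0] * n
    for _ in range(num_samples):
        # Set of observation indices that appear in this bootstrap sample.
        present = {rng.randrange(n) for _ in range(n)}
        for i in range(n):
            if i not in present:
                missing[i] += 1
    return [m / num_samples for m in missing]


n, num_samples = 75, 2000
prop = omission_proportions(n, num_samples)
avg_prop = statistics.mean(prop)

exact = (1 - 1 / n) ** n                          # exact omission probability
se = (exact * (1 - exact) / num_samples) ** 0.5   # binomial standard deviation

print(f"average proportion omitted: {avg_prop:.4f}")
print(f"exact (1 - 1/n)^n:          {exact:.4f}")
print(f"std dev of proportions:     {statistics.stdev(prop):.5f}")
print(f"sqrt(p(1-p)/NumSamples):    {se:.5f}")
```

With fewer bootstrap samples, the spread of the individual proportions is wider than the 0.00482 quoted above, because the standard deviation scales with 1/sqrt(NumSamples).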
In conclusion, when you draw n items with replacement from a large sample of size n, on average the sample contains 63.2% of the original observations and omits 36.8%. In other words, the average bootstrap sample omits 36.8% of the original data.
The post The average bootstrap sample omits 36.8% of the data appeared first on The DO Loop.
Summer is here, which means vacations and time at the pool with a good book. If expanding your knowledge is a goal of yours this summer, SAS has a shelf full of new titles becoming available over the next few months. From new editions of classics – such as SAS® for Forecasting [...]
The post 8 new summer reads for SAS users appeared first on SAS Learning Post.
Let me start by posing a question: "Are you forecasting at the edge to anticipate what consumers want or need before they know it?" Not just forecasting based on past demand behavior, but using real-time information as it is streaming in from connected devices on the Internet of Things (IoT). [...]
Forecasting at the edge for real-time demand execution was published on SAS Voices by Charlie Chase
In a previous blog, I demonstrated a program and macro that could identify all numeric variables set to a specific value, such as 999. This blog discusses an immensely useful technique that allows you to perform an operation on all numeric or all character variables in a SAS data set. [...]
The post How to perform an operation on all numeric or all character variables in a SAS data set appeared first on SAS Learning Post.
Matthew Magne describes how SAS Data Quality can help you build a trusted data foundation, one stone at a time.
The post Move "mountains" and build a trusted data foundation with SAS Data Quality appeared first on The Data Roundtable.