We hear a lot about data science nowadays, but do you ever wonder how it’s being used to help solve real-world problems? In my first post of this blog series, we heard why two students chose to pursue a STEM field and what appealed to them about data science. Today, we'll hear [...]
This article shows how to use SAS to fit a growth curve to data. Growth curves model the evolution of a quantity over time. Examples include population growth, the height of a child, and the growth of a tumor cell. This article focuses on using PROC NLIN to estimate the parameters in a nonlinear least squares model. PROC NLIN is my first choice for fitting nonlinear parametric models to data. Other ways to model growth curves include using splines, mixed models (PROC MIXED or NLMIXED), and nonparametric methods such as loess.
The SAS DATA step specifies the mean height (in centimeters) of 58 sunflowers at 7, 14, ..., 84 days after planting. The American naturalist H. S. Reed studied these sunflowers in 1919 and used the mean height values to formulate a model for growth. Unfortunately, I do not have access to the original data for the 58 plants, but using the mean values will demonstrate the main ideas of fitting a growth model to data.
/* Mean heights of 58 sunflowers: Reed, H. S. and Holland, R. H. (1919), "Growth of sunflower seeds" Proceedings of the National Academy of Sciences, volume 5, p. 140. http://www.pnas.org/content/pnas/5/4/135.full.pdf */ data Sunflower; input Time Height; label Time = "Time (days)" Height="Sunflower Height (cm)"; datalines; 7 17.93 14 36.36 21 67.76 28 98.1 35 131 42 169.5 49 205.5 56 228.3 63 247.1 70 250.5 77 253.8 84 254.5 ;
Fit a logistic growth model to data
A simple mathematical model for population growth that is constrained by resources is the logistic growth model, which is also known as the Verhulst growth model. (This should not be confused with logistic regression, which predicts the probability of a binary event.) The Verhulst equation can be parameterized in several ways, but a popular parameterization is
Y(t) = K / (1 + exp(-r*(t – b)))
- K is the theoretical upper limit for the quantity Y. It is called the carrying capacity in population dynamics.
- r is the rate of maximum growth.
- b is a time offset. The time t = b is the time at which the quantity is half of its maximum value.
The model has three parameters, K, r, and b. When you use PROC NLIN in SAS to fit a model, you need to specify the parametric form of the model and provide an initial guess for the parameters, as shown below:
proc nlin data=Sunflower list noitprint; parms K 250 r 1 b 40; /* initial guess */ model Height = K / (1 + exp(-r*(Time - b))); /* model to fit; Height and Time are variables in data */ output out=ModelOut predicted=Pred lclm=Lower95 uclm=Upper95; estimate 'Dt' log(81) / r; /* optional: estimate function of parameters */ run; title "Logistic Growth Curve Model of a Sunflower"; proc sgplot data=ModelOut noautolegend; band x=Time lower=Lower95 upper=Upper95; /* confidence band for mean */ scatter x=Time y=Height; /* raw observations */ series x=Time y=Pred; /* fitted model curve */ inset ('K' = '261' 'r' = '0.088' 'b' = '34.3') / border opaque; /* parameter estimates */ xaxis grid; yaxis grid; run;
The output from PROC NLIN includes an ANOVA table (not shown), a parameter estimates table (shown below), and an estimate for the correlation of the parameters. The parameter estimates include estimates, standard errors, and 95% confidence intervals for the parameters. The OUTPUT statement creates a SAS data set that contains the original data, the predicted values from the model, and a confidence interval for the predicted mean. The output data is used to create a "fit plot" that shows the model and the original data.
Articles about the Verhulst model often mention that the "maximum growth rate" parameter, r, is sometimes replaced by a parameter that specifies the time required for the population to grow from 10% to 90% of the carrying capacity, K. This time period is called the "characteristic duration" and denoted as Δt. You can show that Δt = log(81)r. The ESTIMATE statement in PROC NLIN produces a table (shown below) that estimates the characteristic duration.
The value 50.1 tells you that, on average, it takes about 50 days for a sunflower to grow from 10 percent of its maximum height to 90 percent of its maximum height. By looking at the graph, you can see that most growth occurs during the 50 days between Day 10 and Day 60.
This use of the ESTIMATE statement can be very useful. Some models have more than one popular parameterization. You can often fit the model for one parameterization and use the ESTIMATE statement to estimate the parameters for a different parameterization.
Deep learning (DL) is a subset of neural networks, which have been around since the 1960’s. Computing resources and the need for a lot of data during training were the crippling factor for neural networks. But with the growing availability of computing resources such as multi-core machines, graphics processing units (GPUs) accelerators and hardware specialized, DL is becoming much more practical for business problems.
Financial institutions use a large number of computations to evaluate portfolios, price securities, and financial derivatives. For example, every cell in a spreadsheet potentially implements a different formula. Time is also usually of the essence so having the fastest possible technology to perform financial calculations with acceptable accuracy is paramount.
In this blog, we talk to Henry Bequet, Director of High-Performance Computing and Machine Learning in the Finance Risk division of SAS, about how he uses DL as a technology to maximize performance.
Henry discusses how the performance of numerical applications can be greatly improved by using DL. Once a DL network is trained to compute analytics, using that DL network becomes drastically faster than more classic methodologies like Monte Carlo simulations.
We asked him to explain deep learning for numerical analysis (DL4NA) and the most common questions he gets asked.
Can you describe the deep learning methodology proposed in DL4NA?
Yes, it starts with writing your analytics in a transparent and scalable way. All content that is released as a solution by the SAS financial risk division uses the "many task computing" (MTC) paradigm. Simply put, when writing your analytics using the many task computing paradigm, you organize code in SAS programs that define task inputs and outputs. A job flow is a set of tasks that will run in parallel, and the job flow will also handle synchronization.
The job flow in Figure 1.1 visually gives you a hint that the two tasks can be executed in parallel. The addition of the task into the job flow is what defines the potential parallelism, not the task itself. The task designer or implementer doesn’t need to know that the task is being executed at the same time as other tasks. It is not uncommon to have hundreds of tasks in a job flow.
Using that information, the SAS platform, and the Infrastructure for Risk Management (IRM) is able to automatically infer the parallelization in your analytics. This allows your analytics to run on tens or hundreds of cores. (Most SAS customers run out of cores before they run out of tasks to run in parallel.) By running SAS code in parallel, on a single machine or on a grid, you gain orders of magnitude of performance improvements.
This methodology also has the benefit of expressing your analytics in the form of Y= f(x), which is precisely what you feed a deep neural network (DNN) to learn. That organization of your analytics allows you to train a DNN to reproduce the results of your analytics originally written in SAS. Once you have the trained DNN, you can use it to score tremendously faster than the original SAS code. You can also use your DNN to push your analytics to the edge. I believe that this is a powerful methodology that offers a wide spectrum of applicability. It is also a good example of deep learning helping data scientists build better and faster models.
The number of neurons of the input layer is driven by the number of features. The number of neurons of the output layer is driven by the number of classes that we want to recognize, in this case, three. The number of neurons in the hidden layers as well as the number of hidden layers is up to us: those two parameters are model hyper-parameters.
How do I run my SAS program faster using deep learning?
In the financial risk division, I work with banks and insurance companies all over the world that are faced with increasing regulatory requirements like CCAR and IFRS17. Those problems are particularly challenging because they involve big data and big compute.
The good news is that new hardware architectures are emerging with the rise of hybrid computing. Computers are increasing built as a combination of traditional CPUs and innovative devices like GPUs, TPUs, FPGAs, ASICs. Those hybrid machines can run significantly faster than legacy computers.
The bad news is that hybrid computers are hard to program and each of them is specific: you write code for GPU, it won’t run on an FPGA, it won’t even run on different generations of the same device. Consequently, software developers and software vendors are reluctant to jump into the fray and data scientist and statisticians are left out of the performance gains. So there is a gap, a big gap in fact.
To fill that gap is the raison d’être of my new book, Deep Learning for Numerical Applications with SAS. Check it out and visit the SAS Risk Management Community to share your thoughts and concerns on this cross-industry topic.
Focus on data governance, quality and storage if you want to do data management for analytics right.
The post Data management for analytics: Best practices and examples appeared first on The Data Roundtable.
As part of my research for a different article, I recently collected data about my driving commute home via an accelerometer recorder app on my phone. The app generates a simple TSV file. (A TSV file is like a CSV file, but instead of a comma separator, it uses a TAB character to separate the values.) The raw data looks like this:Related from Analytics Experience 2018: Using your smartphone accelerometer to build a safe driving profile
With SAS, it's simple to import the file into a data set. Here's my DATA step code that uses the INFILE statement to identify the file and how to read it. Note that the DLM= option references the hexadecimal value for the TAB character in ASCII (09x), the delimiter for fields in this data.
data drive; infile "/home/chris.hemedinger/tsv/drivehome.tsv" dlm='09'x; length counter 8 timestamp 8 x 8 y 8 z 8 filename $ 25; input counter timestamp x y z filename; run;
In my research, I didn't stop with just my drive home. In addition to my commute, I collected data about 4 other activities, and thus accumulated a collection of TSV files. Here's my file directory in my SAS OnDemand for Academics account:
To import each of these data files into SAS, I could simply copy and paste my code 4 times and then replace the name of the file for each case that I collected. After all, copy-and-paste is a tried and true method for writing large volumes of code. But as the number of code lines grows, so does the maintenance work. If I want to add any additional logic into my DATA step, that change would need to be applied 5 times. And if I later come back and add more files to my TSV collection, I'll need to copy-and-paste the same code blocks for my additional cases.
Using a wildcard on the INFILE statement
I can read all of my TSV files in a single step by *.tsv, which tells SAS to match on all of the TSV files in the folder and process each of them in turn. I also changed the name of the data set from "drive" to the more generic "accel".
data accel; infile "/home/chris.hemedinger/tsv/*.tsv" dlm='09'x; length counter 8 timestamp 8 x 8 y 8 z 8 filename $ 25; input counter timestamp x y z filename; run;
The SAS log shows which files have been processed and added into my data set.
With a single data set that has all of my accelerometer readings, I can easily segment these with a WHERE clause in later processing. It's convenient that my accelerometer app also captured the name of each TSV file so that I can keep these cases distinct. A quick PROC FREQ shows the allocation of records for each case that I collected.
Add the filename into the data set
filename tsvs "/home/chris.hemedinger/tsv/*.tsv"; data accel; length casefile $ 100 /* to write to data set */ counter 8 timestamp 8 x 8 y 8 z 8 filename $ 25 tsvfile $ 100 /* to hold the value */ ; /* store the name of the current infile */ infile tsvs filename=tsvfile dlm='09'x ; casefile=tsvfile; input counter timestamp x y z filename; run;
In the output, you'll notice that we now have the fully qualified file name that SAS processed using INFILE.
Managing data files: fewer files is better
Because we started this task with 5 distinct input files, it might be tempting to store the records in separate tables: one for each accelerometer case. While there might be good reasons to do that for some types of data, I believe that we have more flexibility when we keep all of these records together in a single data set. (But if you must split a single data set into many, here's a method to do it.)
In this single data set, we still have the information that keeps the records distinct (the name of the original files), so we haven't lost anything. SAS procedures support CLASS and BY statements that allow us to simplify our code when reporting across different groups of data. We'll have fewer blocks of repetitive code, and we can accomplish more across all of these cases before we have to resort to SAS macro logic to repeat operations for each file.
As a simple example, I can create a simple visualization with a single PROC SGPANEL step.
ods graphics / width=1600 height=400; proc sgpanel data=accel; panelby filename / columns=5 noheader; series x=counter y=x; series x=counter y=y; series x=counter y=z; colaxis display=none minor; rowaxis label="m/s**2" grid; where counter<11000; run;
Take a look at these 5 series plots. Using just what you know of the file names and these plots, can you guess which panel represents which accelerometer case?
Leave your guess in the comments section. I'll explore these data further in a future blog post!