August 21, 2019
 

An important application of nonlinear optimization is finding parameters of a model that fit data. For some models, the parameters are constrained by the data. A canonical example is the maximum likelihood estimation of a so-called "threshold parameter" for the three-parameter lognormal distribution. For this distribution, the objective function is the log-likelihood function, which includes a term that looks like log(xi - θ), where xi is the i-th data value and θ is the threshold parameter. Notice that you cannot evaluate the objective function when θ is greater than or equal to the smallest data value, because the argument of the logarithm is not positive. Similar situations occur for other distributions (such as the Weibull) that contain threshold parameters. The domain of the objective function is restricted by the data.

In SAS, you can fit these three-parameter distributions by using PROC UNIVARIATE. However, it is instructive to fit the lognormal distribution "manually" because it illustrates the general problem, which is how to handle an optimization in which the feasible values of the parameters depend on the data. This article provides two tips. First, you can use boundary constraints to ensure that the optimization enforces the condition θ < min(x). Second, you can define the objective function so that it does not attempt to evaluate expressions like log(xi - θ) if the argument to the LOG function is not positive.

Fit the three-parameter lognormal distribution in SAS

Before discussing the details of optimization, let's look at some data and the MLE parameter estimates for the three-parameter lognormal distribution. The following example is from the PROC UNIVARIATE documentation. The data are measurements of 50 diameters of steel rods that were produced in a factory. The company wants to model the distribution of rod diameters.

data Measures;
   input Diameter @@;
   label Diameter = 'Diameter (mm)';
   datalines;
5.501  5.251  5.404  5.366  5.445  5.576  5.607  5.200  5.977  5.177
5.332  5.399  5.661  5.512  5.252  5.404  5.739  5.525  5.160  5.410
5.823  5.376  5.202  5.470  5.410  5.394  5.146  5.244  5.309  5.480
5.388  5.399  5.360  5.368  5.394  5.248  5.409  5.304  6.239  5.781
5.247  5.907  5.208  5.143  5.304  5.603  5.164  5.209  5.475  5.223
;
 
title 'Lognormal Fit for Diameters';
proc univariate data=Measures noprint;
   histogram Diameter / lognormal(scale=est shape=est threshold=est) odstitle=title;
   ods select ParameterEstimates Histogram;
run;

The UNIVARIATE procedure overlays the MLE fit on a histogram of the data. The ParameterEstimates table shows the estimates. The threshold parameter (θ) is the troublesome parameter when you fit models like this. The other two parameters ("Zeta," which I will call μ in the subsequent sections, and σ) are easier to fit.

An important thing to note is that the estimate for the threshold parameter is 5.069. If you look at the data (or use PROC MEANS), you will see that the minimum data value is 5.143. To be feasible, the value of the threshold parameter must always be less than 5.143 during the optimization process. The log-likelihood function is undefined if θ ≥ min(x).
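
If you want to verify the minimum, a quick PROC MEANS step on the Measures data reports it directly:

proc means data=Measures min;
   var Diameter;
run;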

Tip 1: Constrain the domain of the objective function

Although PROC UNIVARIATE automatically fits the three-parameter lognormal distribution, it is instructive to explicitly write the log-likelihood function and solve the optimization problem "manually." Given N observations {x1, x2, ..., xN}, you can use maximum likelihood estimation (MLE) to fit a three-parameter lognormal distribution to the data. I've previously written about how to use the optimization functions in SAS/IML software to carry out maximum likelihood estimation. The following SAS/IML statements define the log-likelihood function:

proc iml;
start LN_LL1(parm) global (x);
   mu = parm[1]; sigma = parm[2]; theta = parm[3];
   n = nrow(x);
   LL1 = -n*log(sigma) - n*log(sqrt(2*constant('pi')));
   LL2 = -sum( log(x-theta) );                         /* not defined if theta >= min(x) */
   LL3 = -sum( (log(x-theta)-mu)##2 ) / (2*sigma##2);
   return LL1 + LL2 + LL3;
finish;
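
For reference, the module implements the three-parameter lognormal log-likelihood

\[
\ell(\mu,\sigma,\theta) = -n\log\sigma \;-\; n\log\sqrt{2\pi} \;-\; \sum_{i=1}^{n}\log(x_i-\theta) \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(\log(x_i-\theta)-\mu\bigr)^2,
\]

where the groups of terms correspond to LL1, LL2, and LL3 in the code.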

Recall that for MLE problems, the data are constant values that are known before the optimization begins. The previous function uses the GLOBAL statement to access the data vector, x. The log-likelihood function contains three terms, two of which (LL2 and LL3) are undefined for θ ≥ min(x). Most optimization software enables you to specify constraints on the parameters. In the SAS/IML matrix language, you can define a two-row matrix whose first row specifies lower bounds on the parameters and whose second row specifies upper bounds. For this problem, the upper bound for the θ parameter is set to min(x) – ε, where ε is a small number, so that the quantity (xi - θ) is always bounded away from 0:

use Measures; read all var "Diameter" into x; close;  /* read data */
/* Parameters of the lognormal(mu, sigma, theta):
      mu   sigma>0  theta<min(x)  */   
con = {.    1e-6      .,           /* lower bounds: sigma > 0 */
       .    .         .};          /* upper bounds */
con[2,3] = min(x) - 1e-6;          /* replace upper bound for theta */
 
/* optimize the log-likelihood function */
x0 = {0 0.5 4};           /* initial guess */
opt = {1,                 /* find maximum of function   */
       0};                /* do not display information about optimization */
call nlpnra(rc, result, "LN_LL1", x0, opt) BLC=con;   /* Newton's method with boundary constraints */
print result[c={'mu' 'sigma' 'theta'} L="Optimum"];

The results are very close to the estimates that were produced by PROC UNIVARIATE. Because of the boundary constraints in the CON matrix, the optimization process never evaluates the objective function outside of the feasible domain.

Tip 2: Trap domain errors in the objective function

Ideally, the log-likelihood function should be robust, meaning that it should be able to handle inputs that are outside of the domain of the function. This enables you to use the objective function in many ways, including visualizing the feasible region and solving problems that have nonlinear constraints. When you have nonlinear constraints, some algorithms might evaluate the function in the infeasible region, so it is important that the objective function does not experience a domain error.

The best way to make the log-likelihood function robust is to trap domain errors. But what value should the software return if an input value is outside the domain of the function?

  • If your optimization routine can handle missing values, return a missing value. This is the best mathematical choice because the function cannot be evaluated.
  • If your optimization routine does not handle missing values, return a very large value. If you are maximizing the function, return a large negative value such as -1E20. If you are minimizing the function, return a large positive value, such as 1E20. Essentially, this adds a penalty term to the objective function. The penalty is zero when the input is inside the domain and has a large magnitude when the input is outside the domain.

The SAS/IML optimization routines support missing values, so the following module redefines the objective function to return a missing value when θ ≥ min(x):

/* Better: Trap out-of-domain errors to ensure that the objective function NEVER fails */
start LN_LL(parm) global (x);
   mu = parm[1]; sigma = parm[2]; theta = parm[3];
   n = nrow(x);
 
   if theta >= min(x) then return .;    /* Trap out-of-domain errors */
 
   LL1 = -n*log(sigma) - n*log(sqrt(2*constant('pi')));
   LL2 = -sum( log(x-theta) );
   LL3 = -sum( (log(x-theta)-mu)##2 ) / (2*sigma##2);
   return LL1 + LL2 + LL3;
finish;

This modified function can be used for many purposes, including optimization with nonlinear constraints, visualization, and so forth.
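
If your optimization routine does not support missing return values, you can implement the second option from the earlier list and return a large penalty instead. The following module is a minimal sketch (not part of the original program) of that idea for this maximization problem:

/* Sketch: for optimizers that cannot handle missing values, return a large
   negative penalty (this is a maximization problem) for out-of-domain inputs */
start LN_LL_Penalty(parm) global (x);
   mu = parm[1]; sigma = parm[2]; theta = parm[3];
   n = nrow(x);
   if theta >= min(x) then return( -1E20 );   /* large-magnitude penalty instead of a missing value */
   LL1 = -n*log(sigma) - n*log(sqrt(2*constant('pi')));
   LL2 = -sum( log(x-theta) );
   LL3 = -sum( (log(x-theta)-mu)##2 ) / (2*sigma##2);
   return( LL1 + LL2 + LL3 );
finish;

Inside the domain, the module returns the ordinary log-likelihood; outside the domain, the huge negative value steers the optimizer back toward the feasible region.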

In summary, this article provides two tips for optimizing a function that has a restricted domain. Often this situation occurs when one of the parameters depends on data, such as the threshold parameter in the lognormal and Weibull distributions. Because the data are constant values that are known before the optimization begins, you can write constraints that depend on the data. You can also access the data during the optimization process to ensure that the objective function does not ever evaluate input values that are outside its domain. If your optimization routine supports missing values, the objective function can return a missing value for invalid input values. Otherwise, the function can return a very large (positive or negative) value for out-of-domain inputs.

The post Two tips for optimizing a function that has a restricted domain appeared first on The DO Loop.

August 20, 2019
 

'Quality' means many things to many people. It's subjective and depends on the industry and product being made, but the fundamental objective is to provide the best product to the right standard of fit, form and function. And cost and required profit margin must also be taken into account. [...]

How analytics can help meet the quality standard in manufacturing was published on SAS Voices by Tim Clark

August 20, 2019
 

You can now easily embed a Python script inside a SAS decision within SAS Intelligent Decisioning. If you want to execute in SAS Micro Analytic Service (MAS), you no longer need to wrap it in DS2 code. The new Python code node does it for you. Here is how you can achieve it in less than 5 minutes:

Ready? Steady? Go!

The Python Script

If you want to run the following in MAS:

X1=1
X2=2
if X1 == None:
   X1 = 0
if X2 == None:
   X2 = 0
Y = 0.55 + 1 * X1 + 2 * X2 
print(Y)

Convert it to a Python function to meet the PyMAS requirements:

def execute(X1, X2):
    "Output: Y"
    if X1 == None:
        X1 = 0
    if X2 == None:
        X2 = 0
    Y = 0.55 + 1 * X1 + 2 * X2
    return Y
 
X1=1
X2=2
print(execute(X1,X2))

In a Jupyter Notebook, it will look like this:

Create an input data set to test the results

In SAS Studio V:

cas mysession sessopts=(metrics=true);
caslib _all_ assign;
options dscas;
 
data CASUSER.X1X2 (promote=yes);
length X1 8 X2 8;
X1=1; X2=1; output;
X1=1; X2=2; output;
X1=1; X2=3; output;
X1=1; X2=4; output;
run;
cas mysession terminate;

Create a decision in SAS Intelligent Decisioning 5.3

Choose New Python code file and call it python_logic. Copy the code from the Jupyter Notebook: from def until return Y. Watch out for your indentation!

Save and Close. Go to Variables:

Click on variables X1, X2, Y and change their type to Decimal.

Save the Decision.

Publish the decision to MAS

Test the publishing destination

Click on the published validation. Choose the data set you created:

Run. The code is executed.

Check the execution results

Y is the output of the Python function. For the second row in the X1X2 data set, where X1 = 1 and X2 = 2, we get the result 5.55 (0.55 + 1*1 + 2*2), just as in the Jupyter Notebook.

Concepts

About Decisions in SAS

Put simply, there are three main components to a decision in SAS: inputs, logic, and outputs.

Inputs: the decision needs input variables. These can come from a CAS data set, a REST API or manual inputs.

Logic: a decision is defined by business rules, conditions, analytic models, custom code (DS2), etc. The new version allows execution of Python code in PyMAS (see below).

Outputs: a decision computes an output based on inputs and logic.

About SAS Micro Analytic Service (MAS)

A picture is worth a thousand words; here is a simplified diagram of the MAS architecture (thanks to Michael Goddard):

MAS Architecture: Execution engine

You can apply or publish a decision using MAS. The SAS Micro Analytic Service provides the capability to publish a decision into operational environments.

When deployed as part of SAS Decision Manager, MAS is called as a web application with a REST interface by both SAS Decision Manager and by other client applications. MAS provides hosting for DS2 and Python programs and supports a "compile-once, execute-many-times" usage pattern.

The REST interface provides easy integration with client applications and adds persistence and clustering for scalability and high availability.
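
To make the "compile-once, execute-many-times" idea concrete, here is a minimal sketch (not part of the original article) of how a client could score the published module through the MAS REST interface with PROC HTTP. The host name is a placeholder, and the module ID (python_logic) and step ID (execute) are assumptions; after publishing, you can list the modules with a GET request on /microanalyticScore/modules to find the exact IDs in your deployment.

filename req temp;
filename resp temp;
 
data _null_;                       /* JSON payload with the two decision inputs */
   file req;
   put '{"inputs":[{"name":"X1","value":1},{"name":"X2","value":2}]}';
run;
 
proc http
   method="POST"
   /* placeholder host; verify the module and step IDs for your deployment */
   url="https://myviya.example.com/microanalyticScore/modules/python_logic/steps/execute"
   in=req
   out=resp
   ct="application/json"
   oauth_bearer=sas_services;      /* valid when submitted from SAS Studio on the same Viya deployment */
run;

The JSON response written to the resp fileref should contain the computed output value Y.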

Prerequisites for Python decisions

You need SAS Intelligent Decisioning 5.3 on SAS Viya 3.4. SAS Intelligent Decisioning 5.3 is the successor to SAS Decision Manager 5.2. You do not need a specific Python version in your environment, but if you use certain libraries (e.g., numpy, scipy), they might require a particular Python version.

Debugging your Python-based decisions

If you cannot replicate the example, it might be useful to consult the MAS logs. Log in to your server with MobaXterm (or the tool of your choice) and browse to the log directory of the relevant microservice; for MAS, that is microanalyticservice:

cd /opt/sas/viya/config/var/log/microanalyticservice/default/

Connect to the node using SFTP and open the log files. Check for errors, such as:

2019-06-27T21:31:12,251 [00000007] ERROR App.tk.MAS – Module ‘python1_0’ failed to compile in user context ‘provider’.

Resolve the Python error, per the messages you find in the log.

Solution for some errors

When you've made changes in your environment and have trouble getting your Python decisions to work, try to restart the following services:

  • decisionmanager
  • compsrv
  • launcher
  • runlauncher
  • microanalyticservice

Acknowledgements

Thanks to Marilyn Tomasic for finding the solution on what to do if you do not get the expected results. Thanks to Yi Jian Ching for sharing his knowledge and material.

References

Execute Python inside a SAS Decision: Learn how in less than 5 minutes was published on SAS Users.

August 19, 2019
 

I'm old enough to remember when USA Today began publication in the early 1980s. As a teenager who was not particularly interested in current events, I remember scanning each edition for the USA Today Snapshots, a mini infographic feature that presented some statistic in a fun and interesting way. Back then, I felt that these stats made me a little bit smarter for the day. I had no reason to question the numbers I saw, nor did I have the tools, skill or data access to check their work.

Today I still enjoy the USA Today Snapshots feature, but for a different reason. An interesting infographic will spark curiosity. And provided that I have time and interest, I can use the tools of data science (SAS, in my case) and public data to pursue more answers.

In the August 7, 2019 issue, USA Today published this graphic about marijuana use in Colorado. Before reading on, I encourage you to study the graphic for a moment and see what questions arise for you.

Source: USA Today Snapshot from Aug 7 2019

I have some notes

For me, as I studied this graphic, several questions came to mind immediately.

  • Why did they publish this graphic? USA Today Snapshots are usually offered without explanation or context -- that's sort of their thing. So why did the editors choose to share these survey results about marijuana use in Colorado? As readers, we must supply our own context. Most of us know that Colorado recently legalized marijuana for recreational use. The graphic seems to answer the question, "Has marijuana use among certain age groups increased since the law changed?" And a much greater leap: "Has marijuana use increased because of the legal change?"
  • Just Colorado? We see trend lines here for Colorado, but there are other states that have legalized marijuana. How does this compare to Maine or Alaska or California? And what about those states where it's not yet legal, like North Carolina?
  • People '26 and older' are also '18 and older'. The reported age categories overlap: '18 and older' includes both '18 to 25' and '26 and older'. I believe that the editors added this combined category by aggregating the other two. Why did they do that?
  • Isn't '26 and older' a wide category? '12 to 17' is a 6-year span, and '18 to 25' is an 8-year span. But '26 and older' covers what? 60-plus years?
  • "Coloradoans?" Is that really how people from Colorado refer to themselves? Turns out that's a matter of style preference.

The vagaries of survey results

To its credit, the infographic cites the source for the original data: the National Survey on Drug Use and Health (NSDUH). The organization that conducts the annual survey is the Substance Abuse and Mental Health Services Administration (SAMHSA), which is under the US Department of Health and Human Services. From the survey description: The data provides estimates of substance use and mental illness at the national, state, and sub-state levels. NSDUH data also help to identify the extent of substance use and mental illness among different sub-groups, estimate trends over time, and determine the need for treatment services.

This provides some insight into the purpose of the survey: to help policy makers plan for mental health and substance abuse services. "How many more people are using marijuana for fun?" -- the question I've inferred from the infographic choices -- is perhaps tangential to that charter.

Due to privacy concerns, SAMHSA does not provide the raw survey responses for us to analyze. The survey collects details about the respondent's drug use and mental health treatment, as well as demographic information about gender, age, income level, education level, and place of residence. For a deep dive into the questions and survey flow, you can review the 2019 questionnaire here. SAMHSA uses the survey responses to extrapolate to the overall population, producing weighted counts for each response across recoded categories, and imputing counts and percentages for each aspect of substance use (which drugs, how often, how recent).

SAMHSA provides these survey data in two channels: the Public-use Data Analysis System and the Restricted-use Data Analysis System. The "Public-use" data provides annualized statistics about the substance use, mental health, and demographics responses across the entire country. If you want data that includes locale information (such as the US state of residence), then you have to settle for the "Restricted-use" system -- which does not provide annual data, but instead provides data summarized across multi-year study periods. In short, if you want more detail about one aspect of the survey responses, you must sacrifice detail across other facets of the data.

My version of the infographic

I spent hours reviewing the available survey reports and data, and here's what I learned: I am an amateur when it comes to understanding health survey reports. However, I believe that I successfully reverse-engineered the USA Today Snapshot data source so that I could produce my own version of the chart. I used the "Restricted-use" version of the survey reports, which allowed access to imputed data values across two-year study periods. My version shows the same data points, but with these formatting changes:

  • I set the Y axis range as 0% to 100%, which provides a less-exaggerated slope of the trend lines.
  • I did not compute the "18 and over" data point.
  • I added reference lines (dashed blue) to indicate the end of each two-year study period for which I have data points.

Here's one additional data point that's not in the survey or in the USA Today graphic. Colorado legalized marijuana for recreational use in 2012. In my chart, you can see that marijuana use was on the rise (especially among 18-to-25-year-olds) well before that, at least since 2009. Medical use was already permitted then (see Robert Allison's chart of the timeline), and we can presume that Coloradoans (!) were warming up to the idea of recreational use before the law was passed. But the health survey measures only reported use; it does not measure the user's purpose (recreational, medical, or otherwise) or attitudes toward the substance.

Limitations of the survey data and this chart

Like the USA Today version, my graph has some limitations.

  • My chart shows the same three broad age categories as the original. These are recoded age values from the study data. For some of the studies it was possible to get more granular age categories (5 or 6 bins instead of 3), but I could not get this for all years. Again, when you push for more detail on one aspect, the "Restricted-use" version of the data pushes back.
  • The "used in the past 12 months" indicators is computed. The survey report doesn't offer this as a binary value. Instead it offers "Used in the past 30 days" and "Used more than 30 days ago but less than 12 months." So, I added those columns together, and I assume that the USA Today editors did the same.
  • I'm not showing the confidence intervals for the imputed survey responses. Since this is survey data, the data values are not absolute but instead are estimates accompanied by a percent-confidence that the true values fall in this certain range. The editors probably decided that this is too complex to convey in your standard USA Today Snapshot -- and it might blunt the potential drama of the graphic. Here's what it would look like for the "Used marijuana in past 30 days" response, with the colored band indicating the 95% confidence interval.

Beyond Colorado: what about other states?

Having done the work to fetch the survey data for Colorado, it was simple to gather and plot the same data for other states. Here's the same graph with data from North Carolina (where marijuana use is illegal) and Maine and California.

While I was limited to the two-year study reports for data at the state level, I was able to get the corresponding data points for every year for the country as a whole:

I noticed that the reported use among those 12-17 years old declined slightly across most states, as well as across the entire country. I don't know what the logistics are for administering such a comprehensive survey to young people, but this made me wonder if something about the survey process had changed over time.

The survey data also provides results for other drugs, like alcohol, tobacco, cocaine, and more. Alcohol has been legal for much longer and is certainly widely used. Here are the results for Alcohol use (imputed 12 months recency) in Colorado. Again I see a decline in the self-reported use among those 12-17 years old. Are fewer young people using alcohol? If true, we don't usually hear about that. Or has something changed in the survey methods with regard to minors?

SAS programs to access NSDUH survey data

On my GitHub repo, you can find my SAS programs to fetch and chart the NSDUH data. The website offers a point-and-click method to select your dimensions: a row, column, and control variable (like a BY group).

I used the interactive report tool to navigate to the data I wanted. After some experimentation, I settled on the "Imputed Marijuana Use Recency" (IRMJRC) value for the report column -- I think that's what USA Today used. Also, I found other public reports that referenced it for similar purposes. The report tool generates a crosstab report and an optional chart, but it also then offers a download option for the CSV version of the data.

I was able to capture that download directive as a URL, and then used PROC HTTP to download the data for each study period. This made it possible to write SAS code to automate the process -- much less tedious than clicking through reports for each study year.

%macro fetchStudy(state=,year=);
  filename study "&workloc./&state._&year..csv";
 
  proc http
   method="GET"
   url="https://rdas.samhsa.gov/api/surveys/NSDUH-&year.-RD02YR/crosstab.csv/?" ||
       "row=CATAG2%str(&)column=IRMJRC%str(&)control=STNAME%str(&)weight=DASWT_1" ||
        "%str(&)run_chisq=false%str(&)filter=STNAME%nrstr(%3D)&state."
   out=study;
  run;
%mend;
 
%let state=COLORADO;
/* Download data for each 2-year study period */
%fetchStudy(state=&state., year=2016-2017);
%fetchStudy(state=&state., year=2015-2016);
%fetchStudy(state=&state., year=2014-2015);
%fetchStudy(state=&state., year=2012-2013);
%fetchStudy(state=&state., year=2010-2011);
%fetchStudy(state=&state., year=2008-2009);
%fetchStudy(state=&state., year=2006-2007);

Each data file represents one two-year study period. To combine these into a single SAS data set, I use the INFILE-with-a-wildcard technique that I've shared here.

 INFILE "&workloc./&state._*.csv"
    filename=fname
    LRECL=32767 FIRSTOBS=2 ENCODING="UTF-8" DLM='2c'x
    MISSOVER DSD;
  INPUT
    state_name   
    recency 
    age_cat 
    /* and so on */

The complete programs are in GitHub -- one version for the state-level two-year study data, and one version for the annual data for the entire country. These programs should work as-is within SAS Enterprise Guide or SAS Studio, including in SAS University Edition. Grab the code and change the STATE macro variable to find the results for your favorite US state.
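
For example, to pull the same two-year study periods for Maine, you would change the macro variable and repeat the calls. (MAINE is shown here as an assumed STNAME value; verify the exact spelling that the report tool uses in the download URL you captured.)

%let state=MAINE;   /* assumed value; check the STNAME filter in the report tool */
%fetchStudy(state=&state., year=2016-2017);
%fetchStudy(state=&state., year=2015-2016);
/* ...and so on for the earlier study periods, as above */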

Conclusion: maintain healthy skepticism

News articles and editorial pieces often use simplified statistics to convey a message or support an argument. There is just something about including numbers that lends credibility to reporting and arguments. Citing statistics is a time-honored and effective method to inform the public and persuade an audience. Responsible journalists will always cite their data sources, so that those with time and interest can fact-check and find additional context beyond what the media might share.

I enjoy features like the USA Today Snapshot, even when they send me down a rabbit hole as this one did. As I tell my children often (and they are weary of hearing it), statistics in the media should not be accepted at face value. But if they make you curious about a topic so that you want to learn more, then I think the editors should be proud of a job well done. It's on the rest of us to follow through to find the deeper answers.

The post A skeptic's guide to statistics in the media appeared first on The SAS Dummy.

August 19, 2019
 

In parts one and two of this blog series, we introduced the automation of AI (i.e., artificial intelligence) and natural language explanations applied to segmentation and marketing. Following this, we began marching down the path of practitioner-oriented examples, making the case for why we need it and where it applies. [...]

SAS Customer Intelligence 360: Automated AI and segmentation [Part 3] was published on Customer Intelligence Blog.

August 19, 2019
 

One of my friends likes to remind me that "there is no such thing as a free lunch," which he abbreviates as "TINSTAAFL" (or TANSTAAFL). The TINSTAAFL principle applies to computer programming because you often end up paying a cost (in performance) when you call a convenience function that simplifies your program.

I was thinking about TINSTAAFL recently when I was calling a Base SAS function from the SAS/IML matrix language. The SAS/IML language supports hundreds of built-in functions that operate on vectors and matrices. However, you can also call hundreds of functions in Base SAS and pass in vectors for the parameters. It is awesome and convenient to be able to call the virtual smorgasbord of functions in Base SAS, such as probability functions, string-matching functions, trig functions, financial functions, and more. Of course, there is no such thing as a free lunch, so I wondered about the overhead costs associated with calling a Base SAS function from SAS/IML. Base SAS functions typically are designed to operate on scalar values, so the IML language has to call the underlying function many times, once for each value of the parameter vector. It is more expensive to call a function a million times (each time passing in a scalar parameter) than it is to call a function one time and pass in a vector that contains a million parameters.

To determine the overhead costs, I decided to test the cost of calling the MISSING function in Base SAS. The IML language has a built-in syntax (b = (X=.)) for creating a binary variable that indicates which elements of a vector are missing. The call to the MISSING function (b = missing(X)) is equivalent, but it requires calling a Base SAS function many times, once for each element of X. The native SAS/IML syntax will be faster than calling a Base SAS function (TINSTAAFL!), but how much faster?

The following program incorporates many of my tips for measuring the performance of a SAS computation. The test is run on large vectors of various sizes. Each computation (which is very fast, even on large vectors) is repeated 50 times. The results are presented in a graph. The following program measures the performance for a character vector that contains all missing values.

/* Compare performance of IML syntax
   b = (X = " ");
   to performance of calling Base SAS MISSING function 
   b = missing(X);
*/
proc iml;
numRep = 50;                            /* repeat each computation 50 times */
sizes = {1E4, 2.5E4, 5E4, 10E4, 20E4};  /* number of elements in vector */
labl = {"Size" "T_IML" "T_Missing"};
Results = j(nrow(sizes), 3);
Results[,1] = sizes;
 
/* measure performance for character data */
do i = 1 to nrow(sizes);
   A = j(sizes[i], 1, " ");            /* every element is missing */
   t0 = time();
   do k = 1 to numRep;
      b = (A = " ");                   /* use built-in IML syntax */
   end;
   Results[i, 2] = (time() - t0) / numRep;
 
   t0 = time();
   do k = 1 to numRep;
      b = missing(A);                  /* call Base SAS function */
   end;
   Results[i, 3] = (time() - t0) / numRep;
end;
 
title "Timing Results for (X=' ') vs missing(X) in SAS/IML";
title2 "Character Data";
long = (sizes // sizes) || (Results[,2] // Results[,3]);   /* convert from wide to long for graphing */
Group = j(nrow(sizes), 1, "T_IML") // j(nrow(sizes), 1, "T_Missing"); 
call series(long[,1], long[,2]) group=Group grid={x y} label={"Size" "Time (s)"} 
            option="markers curvelabel" other="format X comma8.;";

The graph shows that the absolute times for creating a binary indicator variable are very fast for both methods. Even for 200,000 observations, creating a binary indicator variable takes less than five milliseconds. However, on a relative scale, the built-in SAS/IML syntax is more than twice as fast as calling the Base SAS MISSING function.

You can run a similar test for numeric values. For numeric values, the SAS/IML syntax is about 10-20 times faster than the call to the MISSING function, but, again, the absolute times are less than five milliseconds.
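
A minimal sketch of the numeric-data version of the timing loop, which reuses numRep and sizes and can be appended to the same PROC IML session as the program above, might look like this:

/* measure performance for numeric data (sketch; run in the same PROC IML session) */
ResultsNum = j(nrow(sizes), 3);
ResultsNum[,1] = sizes;
do i = 1 to nrow(sizes);
   A = j(sizes[i], 1, .);             /* every element is a numeric missing value */
   t0 = time();
   do k = 1 to numRep;
      b = (A = .);                    /* use built-in IML syntax */
   end;
   ResultsNum[i, 2] = (time() - t0) / numRep;
 
   t0 = time();
   do k = 1 to numRep;
      b = missing(A);                 /* call the Base SAS MISSING function */
   end;
   ResultsNum[i, 3] = (time() - t0) / numRep;
end;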

So, what's the cost of calling a Base SAS function from SAS/IML? It's not free, but it's very cheap in absolute terms! Of course, the cost depends on the number of elements that you are sending to the Base SAS function. However, in general, there is hardly any cost associated with calling a Base SAS function from SAS/IML. So enjoy the lunch buffet! Not only is it convenient and plentiful, but it's also very cheap!

The post Timing performance in SAS/IML: Built-in functions versus Base SAS functions appeared first on The DO Loop.

August 19, 2019
 

As you will have read in my last blog, businesses are demanding better outcomes, and through IoT initiatives big data is only getting bigger. This presents a clear opportunity for organisations to start thinking seriously about how to leverage analytics with their other investments. Demands on supply chains have also [...]

Can the artificial intelligence of things make the supply chain intelligent? was published on SAS Voices by Tim Clark

August 18, 2019
 

Have you ever thought of selling sand on the beach? Neither have I. To most people the mere idea is preposterous. But isn’t it how all great discoveries and inventions are made? Someone comes up with an outwardly crazy, outlandish idea, and despite all the skepticism, criticism, ostracism, ridicule and [...]

Selling sand at the beach was published on SAS Voices by Leonid Batkhan