March 3, 2018
 

You’ve probably heard the stats about the number of internet-connected devices, which make up the Internet of Things (IoT). Most likely, you’re part of the narrative. We're all connected in new ways and through more devices than ever before. Sometimes IoT impacts our daily lives – safety sensors on cars, [...]

Three energy trends that are actually IoT opportunities was published on SAS Voices by Alyssa Farrell

March 1, 2018
 

Let’s say that you are administering a SAS 9.4 environment that is working just fine. You’ve checked that your full backups are indeed happening and you’ve even tried restoring from one of your backups. You are prepared for anything, right? Well, I’d like to propose a scenario to you. You probably have users responsible for creating reports, maybe even very important reports. What if something happened to one of these reports? Perhaps the user wants to revert to an earlier version. Perhaps the report was accidentally deleted or even corrupted. What then? Restoring a full backup in this situation might help this one user but would likely inconvenience most other users. With a little more preparation, you could “magically” restore a single report if needed. Here’s what you need to do: create a backup of only these critical reports using the promotion tools.

The promotion tools include:

  • the Export SAS Package Wizard and the Import SAS Package Wizard available in SAS Management Console, SAS Data Integration Studio, and SAS OLAP Cube Studio.
  • the batch export tool and the batch import tool.

Note: Starting with the third maintenance release of SAS 9.4, you can use the -disableX11 option to run the batch import and batch export tools on UNIX without setting the DISPLAY variable.

You can use the promotion tools on almost anything found in the SAS Folders tree, especially if you use SAS Management Console. The wizards in SAS Data Integration Studio and SAS OLAP Cube Studio only let you access and export or import the objects that pertain to those applications, a subset of what is available in SAS Management Console.

You may be thinking that using an interactive wizard is not really the answer you are looking for and you may be right. The batch tools are a great solution if you want to schedule the exporting of some objects on a regular basis. If you are unfamiliar with the promotion tools, I would suggest you start with the interactive wizards. You will find that the log produced by the wizard includes the equivalent command line you would use. It’s a nice way to explore how to invoke the batch tools.
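
As a rough sketch only (the metadata server name, port, credentials, and paths below are placeholders, and you should take the exact options from the command line echoed in your own wizard log), a batch export on UNIX might look something like this:

cd <SASHOME>/SASPlatformObjectFramework/9.4
./ExportPackage -host meta.example.com -port 8561 \
   -user sasadm@saspw -password ******** \
   -package /backups/MarketingReports.spk \
   -objects "/Shared Data/Marketing(Folder)" \
   -log /backups/MarketingReports_export.log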

Creating the Export Package

How to invoke the Export SAS Package Wizard:

1.  Right-click on a folder or object in the SAS Folders tree and select Export SAS Package.

Selectively backing up metadata

2.  Enter the location and name of the package file to be created and set options as appropriate.

You can opt to Include dependent objects when retrieving initial collection of objects here or you can select specific dependent objects on the next screen.

Filtering offers some very interesting ways of selecting objects including:

  • By object name
  • By object type
  • By when objects were created
  • By when objects were last modified

3.  Select the objects to export. If you started the process with a folder, you will be presented with the folder and all of its contents selected by default. You can deselect specific objects as you like.

In this example, we only want the Marketing folder and its contents. Deselect the other folders. You want to be careful to not create a package file that is too big.

You can click on individual objects and explore the object's dependencies, which other metadata objects use it, and its options and properties.

In this example, the Marketing Unit Report is dependent on the MEGACORP table whose metadata is found in the /Shared Data/LASR Data folder. When you import this report, you will need to associate the report with the same or similar table in order for the report to be fully functional.

If you had selected Include dependent objects when retrieving initial collection of objects on the previous screen, all of the dependent objects would be listed and be selected for export by default.

Bonus things you get by default in the export package include:

  • Permissions set directly on the objects
  • For most object types, the export tools include both metadata and the associated physical content. For example, with reports you get both the metadata and associated report XML. For a complete list of physical content promoted with metadata objects, refer to:

5.  When the export process is complete (hopefully without errors), review the log.

    At the top of the log, you can see the location of the log file in case you want to refer to it later.

    If you scroll to the end of the log, you’ll find the command line to invoke the batch export tool to create the same package.

    Considerations for Exporting

    Importing to the Rescue

    Let’s talk about what happens if and when you actually need to import some or all of the objects in a package file.
    Let’s take a look at what we would need to do to replace an accidentally deleted report, Marketing Unit Report.

    How to invoke the Import SAS Package Wizard:

5.  Right-click on the same folder from which you started the export, the SAS Folders folder in our example, and select Import SAS Package. It is important to initiate the import from the same folder where you started the export if you want to end up with the same folder structure.

    6.  If needed, use the Browse functionality to locate the correct package file.

    Include access controls

    By default, Include access controls is not selected. This option will import permission settings directly applied to the objects in the package. It will not import any permissions if there were only inherited permissions on the object in the source environment.

    Since we are bringing the report back into the folder it originally came from, it makes sense to also include direct permissions, if there were any.

If you do not check the Include access controls box and there are in fact some direct permissions on objects being imported, you will get this warning later in the wizard:

    Select objects to import

    If you’re not sure whether to select to import All objects or New objects only, you can always start with all objects. You can use the Back buttons in the wizard to go back to previous prompts and change selections, at least before you kick off the actual import process.

7.  If you selected to import All objects on the first screen, you will see a listing of all objects. Each object has an icon indicating whether it already exists in the location where you are doing the import. A red exclamation mark indicates that the object currently exists and that importing it will overwrite the current object with the copy from the package. An asterisk indicates that the object does not currently exist and will be created by the import process.

    In our example, the Marketing Unit Report does not currently exist in the Marketing folder but is in the package file so it is labeled with an asterisk. The other two reports are both in the folder and the package file so they are labeled with red exclamation marks.

    You’ll want to make the appropriate selections here. If you want all of the contents of the package to be written to the Marketing folder, overwriting the first two reports and adding the Marketing Unit Report, leave all objects selected. If one of the reports had become corrupted, you could use this method to overwrite the current copy with the version stored in the package file.

    If you just want to replace the missing Marketing Unit Report, make sure only that object is selected as below:

    By default, objects are imported into the same folder structure they were in when the export package was created.

    8.  Part of the import process is to establish associations between the objects you are importing and metadata not included in the package. You are first presented with a list of the metadata values you will need to select.

    9.  Set the target value(s) as needed.

    In our example, we definitely want the report to use the same table it used originally.
If you were moving objects to a new folder or a new environment, you might want to associate the report with a different table.

    If you use the batch import tool, changing these associations would be done in a substitution properties file.
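
As a rough sketch (again, the server, credentials, paths, and even the exact option names should be verified against your own wizard log and the promotion tools documentation), a batch import that applies a substitution properties file might look like this:

./ImportPackage -host meta.example.com -port 8561 \
   -user sasadm@saspw -password ******** \
   -package /backups/MarketingReports.spk \
   -target "/" \
   -subprop /backups/MarketingReports.subprop \
   -log /backups/MarketingReports_import.log

The substitution properties file is where you would map the MEGACORP table reference in the package to the table the report should use in the target environment.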

    10.  Review the import summary and initiate the import process.

    11.  Hopefully, the process completes without errors and you can review the log.

    12.  Finish things off by testing the content you imported. In this case, we would log in to SAS Visual Analytics and view the Marketing Unit Report.

    Considerations for Importing

• If you initiated the export from the SAS Folders folder and try to import the package from another folder, Marketing for example, the wizard will recreate everything in the package, including a new Marketing subfolder, which is probably not what you intended.

    Notice the new Marketing folder inside the current Marketing folder. In addition, all three reports are considered new since the new Marketing subfolder does not currently exist.

• The account you use to do the import needs sufficient access to both metadata and the operating system.

    Next Steps

    • Decide what you want to export, how often, and how long you want to keep a specific package file.
• Once you’ve gotten comfortable with the wizards and you want to schedule an export (or several), try out the batch export and import tools. When you name the export package, consider including the date in the package name so that you do not overwrite the same package file each time (see the sketch below).
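
For example, a minimal UNIX-style scheduling sketch (the paths are placeholders; adapt the elided options to your own batch export command) could stamp the package name with the run date:

#!/bin/sh
# build a date-stamped package name, e.g. MarketingReports_20180301.spk
stamp=`date +%Y%m%d`
./ExportPackage ... -package /backups/MarketingReports_${stamp}.spk \
   -log /backups/MarketingReports_${stamp}.log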

Review the documentation on both the wizards and the batch tools.

Admin Notebook: Making the case for selectively backing up metadata was published on SAS Users.

February 28, 2018
 
Solve nonlinear equations in SAS

This article shows how to use SAS to solve a system of nonlinear equations. When there are n unknowns and n equations, this problem is equivalent to finding a multivariate root of a vector-valued function F(x) = 0 because you can always write the system as
f1(x1, x2, ..., xn) = 0
f2(x1, x2, ..., xn) = 0
. . .
fn(x1, x2, ..., xn) = 0
Here the fi are the nonlinear component functions, F is the vector (f1, f2, ..., fn), and x is the vector (x1, x2, ..., xn).

In two dimensions, the solution can be visualized as the intersection of two planar curves. An example for n = 2 is shown at the right. The two curves meet at the solution (x, y) = (1, 2).

There are several ways to solve a system of nonlinear equations in SAS, including:

  • In SAS/IML software, you can use the NLPLM or NLPHQN methods to solve the corresponding least-squares problem. Namely, find the value of x that minimizes || F(x) ||.
  • In SAS/ETS software, you can use the SOLVE statement in PROC MODEL to solve the system.
  • In SAS/STAT software, you can use the NLIN procedure to solve the system.
  • In SAS/OR software, you can use PROC OPTMODEL to solve the system.

When n = 1, the problem is one-dimensional. You can use the FROOT function in SAS/IML software to find the root of a one-dimensional function. You can also use the SOLVE function in conjunction with PROC FCMP.
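
For the one-dimensional case, a minimal SAS/IML sketch (the function and search interval here are arbitrary illustrations, not taken from the example that follows) looks like this:

proc iml;
start Func1D(x);
   return ( exp(x) - 2 );           /* root at x = log(2) */
finish;
z = froot("Func1D", {0 2});         /* search the interval [0, 2] */
print z;                            /* prints approximately 0.6931 */
quit;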

This article shows how to find a root for the following system of three equations:
f1(x, y, z) = log(x) + exp(-x*y) - exp(-2)
f2(x, y, z) = exp(x) - sqrt(z)/x - exp(1) + 2
f3(x, y, z) = x + y - y*z + 5
You can verify that the value (x, y, z)=(1, 2, 4) is an exact root of this system.

Solve a system of nonlinear equations in SAS/IML

You can use the NLPLM or NLPHQN methods in SAS/IML to solve nonlinear equations. You need to define a function that returns the value of the function as a row vector. This is very important: the function must return a row vector! If the domain of any component of the function is restricted (for example, because of LOG or SQRT functions), you can define a linear constraint matrix. You then supply an initial guess and call the NLPLM or NLPHQN routine to solve the least-squares problem that minimizes 1/2 (f1^2 + ... + fn^2). Obviously the minimum occurs when each component is zero, that is, when (x,y,z) is a root of the vector-valued function. You can solve for the root as follows:

proc iml;
start Fun(var);
   x = var[1]; y = var[2]; z = var[3];
   f = j(1, 3, .);         /* return a ROW VECTOR */
   f[1] = log(x) + exp(-x*y) - exp(-2);
   f[2] = exp(x) - sqrt(z)/x - exp(1) + 2;
   f[3] = x + y - y*z  + 5;
return (f);
finish;
 
/*     x[1] x[2] x[3] constraints. Lower bounds in 1st row; upper bounds in 2nd row */
con = {1e-6  .  1e-6,   /* x[1] > 0 and x[3] > 0; no bounds on y */
         .   .    .};
x0 = {1 1 1};           /* initial guess */
optn = {3               /* solve least square problem that has 3 components */
        1};             /* amount of printing */
call nlphqn(rc, Soln, "Fun", x0, optn) blc=con;  /* or use NLPLM */
print Soln;
quit;
Solve nonlinear system of equations in SAS

The NLPHQN routine converges to the solution (1, 2, 4). Notice that the first element of the optn vector must contain n, the number of equations in the system.

Solve a system of nonlinear equations with PROC MODEL

If you have access to SAS/ETS software, PROC MODEL provides a way to solve simultaneous equations. You first create a SAS data set that contains an initial guess for the solution. You then define the equations in PROC MODEL and use the SOLVE statement to solve the system, as follows:

data InitialGuess;
   x=1; y=1; z=1;    /* initial guess for Newton's method */
run;
 
proc model data=InitialGuess;
   bounds 0 < x z;
   eq.one = log(x) + exp(-x*y) - exp(-2);
   eq.two = exp(x) - sqrt(z)/x - exp(1) + 2;
   eq.three = x + y - y*z  + 5;
   solve x y z / solveprint out=solved outpredict;
run;quit;
 
title "Solution from PROC MODEL in SAS/ETS";
proc print data=solved noobs;
   var x y z;
run;
Solve system of simultaneous equations with SAS

A nice feature of PROC MODEL is that it automatically generates symbolic derivatives and uses them in the solution of the simultaneous equations. If you want to use derivatives in PROC IML, you must specify them yourself. Otherwise, the NLP routines use numerical finite-difference approximations.
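
If you do want to supply analytical derivatives in SAS/IML, the Jacobian of this particular system is easy to write as a module. The following sketch shows the partial derivatives; the SAS/IML documentation for the NLPLM and NLPHQN subroutines describes how to pass such a module through the gradient/Jacobian argument:

start Jac(var);
   x = var[1]; y = var[2]; z = var[3];
   J = j(3, 3, 0);                      /* 3 x 3 Jacobian: J[i,j] = d f_i / d var_j */
   J[1,1] = 1/x - y*exp(-x*y);   J[1,2] = -x*exp(-x*y);      /* J[1,3] = 0 */
   J[2,1] = exp(x) + sqrt(z)/x##2;                           /* J[2,2] = 0 */
   J[2,3] = -1/(2*x*sqrt(z));
   J[3,1] = 1;   J[3,2] = 1 - z;   J[3,3] = -y;
   return ( J );
finish;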

Solve a system of nonlinear equations with PROC NLIN

You can solve a system of equations by using only SAS/STAT software, but you need to know a trick. My colleague who supports PROC NLIN says he has "seen this trick before" but does not know who first thought of it. I saw it in a 2000 paper by Nam, Cho, and Shim (in Korean).

Because PROC NLIN is designed to solve regression problems, you need to recast the problem in terms of a response variable, explanatory variables, and parameters. Recall that ordinary least squares regression enables you to solve a linear system such as
0 = C1*v1 + C2*v2 + C3*v3
where the left-hand side is a response vector (the zero vector), the C_i are regression coefficients, and the v_i are explanatory variables. (You need three or more observations to solve this regression problem.) PROC NLIN enables you to solve more complex regression problems. In particular, the coefficients can be nonlinear functions of parameters. For example, if the parameters are (x,y,z), you can solve the following system:
0 = C1(x,y,z)*v1 + C2(x,y,z)*v2 + C3(x,y,z)*v3.

To solve this nonlinear system of equations, you can choose the explanatory variables to be coordinate basis functions: v1=(1,0,0), v2=(0,1,0), and v3=(0,0,1). These three observations define three equations for three unknown parameters. In general, if you have n equations in n unknowns, you can specify n coordinate basis functions.

To accommodate an arbitrary number of equations, the following data step generates n basis vectors, where n is given by the value of the macro variable numEqns. The BasisVectors data set contains a column of zeros (the LHS variable):

%let numEqns = 3;
data BasisVectors;
LHS = 0;
array v[&numEqns];
do i = 1 to dim(v);
   do j = 1 to dim(v);
      v[j] = (i=j);   /* 1 when i=j; 0 otherwise */
   end;
   output;
end;
drop i j;
run;
 
title "Solution from PROC NLIN in SAS/STAT";
proc nlin data=BasisVectors;
   parms x 1 y 1 z 1;                  /* initial guess */
   bounds 0 < x z;                     /* linear constraints */
   eq1 = log(x) + exp(-x*y) - exp(-2);
   eq2 = exp(x) - sqrt(z)/x - exp(1) + 2;
   eq3 = x + y - y*z  + 5;
   model LHS = eq1*v1 + eq2*v2 + eq3*v3;
   ods select EstSummary ParameterEstimates;
run;
Use SAS to solve nonlinear system of equations

The problem contains three parameters and the data contains three observations. Consequently, the standard errors and confidence intervals are not meaningful. The parameter estimates are the solution to the nonlinear simultaneous equations.

Solve a system of nonlinear equations with PROC OPTMODEL

With PROC OPTMODEL in SAS/OR software, you can express the system in a natural syntax. You can either minimize the objective function F = 0.5 * (f1**2 + f2**2 + f3**2) or solve the system directly by specifying constraints but not an objective function, as follows:

title "Solution from PROC OPTMODEL in SAS/OR";
proc optmodel;
   var x init 1, y init 1, z init 1;
   /*  -or-  var x >= 1e-6 init 1, y init 1, z >= 0 init 1;  to specify bounds */
   con c1: log(x) + exp(-x*y) = exp(-2);
   con c2: exp(x) - sqrt(z)/x = exp(1) - 2;
   con c3: x + y - y*z = -5;
   solve noobjective;
   print x y z;
quit;

The solution is (x,y,z)=(1,2,4) and is not shown.

Summary

In summary, there are multiple ways to solve systems of nonlinear equations in SAS. My favorite ways are the NLPHQN function in SAS/IML and the SOLVE statement in PROC MODEL in SAS/ETS. However, you can also use PROC NLIN in SAS/STAT software or PROC OPTMODEL in SAS/OR. When you need to solve a system of simultaneous nonlinear equations in SAS, you can choose whichever method is most convenient for you.

The post Solve a system of nonlinear equations with SAS appeared first on The DO Loop.

February 28, 2018
 

Goutam Chakraborty is a busy man. In addition to serving as SAS professor of marketing analytics at Oklahoma State University, Dr. Chakraborty is the director and founder of the SAS and Oklahoma State University MS in Business Analytics and an award-winning author and professor. He teaches courses in such areas as business analytics, marketing analytics, data mining, marketing research, and web strategy, and has been preparing students to enter the workforce with advanced skills in marketing and analytics for more than 20 years. Throw in the regular consulting engagements he has with some of the world's top companies, and it makes you wonder if Dr. Chakraborty has time to add anything else to his already full plate. Well, this year at least, you can add SAS Global Forum 2018 Chair to the list - likely at the expense of a good night's sleep.

As the largest gathering of SAS users in the world, SAS Global Forum will attract more than 5,000 SAS professionals for several days of learning and networking. Recently, I sat down with Dr. Chakraborty to talk with him a bit about this year's conference, which takes place April 8-11, 2018 in Denver. I left excited about SAS Global Forum 2018 and, at the expense of losing credibility as a fair and balanced reporter, convinced that Dr. Chakraborty is one of the nicest individuals you'll ever meet.

Larry LaRusso: I know you've been preparing to chair SAS Global Forum 2018 for more than three years, but now that the event is only a few weeks away, how excited are you to kick this thing off?
Goutam Chakraborty: More excited than you know Larry. I've participated in many SAS Global Forums, but serving as chair gives you the ability to influence every aspect of the event, from speaker and content selection to charity-related events and networking opportunities. It's been a wonderful opportunity to give back to the SAS user community, one I'll never forget.

LL: What excites you most about this year's event?
GC: There are so many new things about this year's conference, all geared toward providing an enriching experience for all SAS users. I'll mention three that immediately come to mind.

One thing we've tried to do well this year is connect industry with academics. While we'll have a full program of events and talks specifically geared toward students and professors, this year we'll emphasize partnerships with industries in a new way. I might be most excited about Sunday's Talent Connection. This event brings students and SAS professionals together to network, discuss career opportunities and share knowledge, research and partnership opportunities that might exist with each other. I anticipate it being a great success for both students and industry looking to connect with young analytical talent.

Another strong focus for us is career development and learning for SAS users at all levels. We'll have a full menu of traditional training and certification opportunities for data scientists, business and data analysts and SAS programmers, but we're also providing opportunities to build on soft-skills development, such as networking, analytical story-telling and much more. We'll also have an on-site Learning Lab, available for several hours each day, where users can explore more than 25 e-learning courses for free.

Finally, I'll mention our volunteer opportunities. We'll have several ways for users to give back, but I'm particularly excited about our STEM-related charity event. During meals and evening networking receptions, both Monday and Tuesday, attendees will have the opportunity to work with RAFT Colorado (Resource Area For Teaching), and build STEM-inspired teaching kits for local teachers to use in their classrooms. Each kit will repurpose educational items RAFT has collected and make them available to teachers as creative tools for teaching STEM – inspiring the next generation of thinkers, innovators, problem-solvers and creators. It's an extraordinary opportunity to impact local area children.

LL: Speaking of extraordinary, this year's conference theme is "Inspire the Extraordinary." What does that theme mean to you?
GC: It means never accept "good enough." I always tell my students to push for something above and beyond what's expected of them, to be extra-ordinary. We expect the same for this year's SAS Global Forum. Knowing the event like I do, I feel confident we're going to deliver a SAS Global Forum that surprises and delights our users in a way they didn't expect.

LL: We all know that one of the best things about SAS Global Forum is its incredible content. What can you tell us about the content you’re putting together for this year’s event?
GC: Thanks to tons of hard work and research from a lot of SAS users, we've selected fantastic content from renowned speakers from across the world. Perhaps the best part of our content planning this year is the variety. Topics range from deep hard-core programming to high-level strategic thinking about data and analytics. From sessions that will help you to develop yourself personally as a better human-being to learning about optimizing Monday night NFL schedule for best viewership to thinking strategically about data as a currency – there is something of value for everyone.

LL: SAS Global Forum is likely to attract more than 5,000 data scientists, analytics professionals and business leaders. Every year it amazes me how many of those users are attending SAS Global Forum for the first time. What advice would you give first-timers?
GC: First piece of advice: Have a plan and build a personalized agenda so you don’t get overwhelmed by the large number of available sessions. Second, take every opportunity to engage and network with other attendees. One of the best things about this conference is how willing veteran SAS users (regulars at this conference) are to help and welcome newcomers. So, take advantage of it. If you are sitting down for breakfast or lunch, take the time to introduce yourself to people around you. You may be surprised where it could lead. I'd also encourage attendees to take time to visit the Quad. The Quad is a casual and interactive space where attendees can network with other SAS enthusiasts, view demos and visit with experts from SAS and our sponsors. And, last but not least, have some fun! Attend the social events we have planned, especially the Kick Back Party at Mile High Stadium on Tuesday evening.

LL: As an academician, I know you’re passionate about learning. What additional learning opportunities, beyond the session talks, are available to attendees?
GC: There are so many learning opportunities at SAS Global Forum that it is mind-numbing. Of course, the 20 and 50 minute session talks are the main modes of content delivery, but there are also e-posters, table talks and super demos in the Quad. We'll also have dozens of pre-conference tutorials, post-conference training, and all the activity in the Learning Labs, including hands-on workshops and the ability to take individual e-learning courses.

LL: Given your personal interests, I know one of your goals for this year’s conference is to increase participation in the event for students and professors. Can you tell me a little more about the special events you have planned for this audience?
GC: For starters, SAS Global Forum is completely “free” for students! As long as you are a full-time enrolled student of an accredited, degree-granting academic institution you can attend free of charge. There are credit hour minimums that must be reached to be eligible, so I'd encourage students to visit the website for complete details.

Programmatically, we have the Sunday afternoon sessions entirely dedicated to academics. We have a fantastic academic keynote speaker, Temple Grandin from Colorado State University, and special training sessions for professors interested in teaching analytics at their universities. For students, we offer a number of opportunities to network and special courses, such as how to best use social media for networking while looking for a job, to help them make a successful transition from student to working professional. We also encourage students, and really anyone who has an interest, to attend the presentations students make as winners of the SAS Global Forum Student Symposium. Though closed now, the Symposium provides an opportunity for teams of two to four students and a faculty adviser to showcase their skills and compete with other teams in the application of SAS Analytics in solving a big data problem. This year, more than 60 teams entered; the top eight will present 20-minute talks during the event.

LL: Dr. Chakraborty, I've taken a lot of your time, but is there anything else you'd like to share with our readers?
GC: Actually, I'd like to thank the many volunteers who have helped put this conference together. From serving on our SAS Global Users Group Executive Board to helping evaluate and select talks, to serving in our Presenter Mentor Program, hundreds of users have invested their time to make this conference the best one yet. SAS Global Forum is truly a user's conference and we depend on the user community to plan, promote and execute so many tasks and activities related to the event. Though I can't call them out by name, I would be remiss if I didn't mention their contributions and take a minute to thank them.

LL: Well let's hope they're reading! Dr. Chakraborty, I want to thank you again for your time. I look forward to seeing you in Denver in April.

Visit the SAS Global Forum 2018 website for more information and to register. Conference Proceedings will be available shortly before the event begins.

Continue the conversation: Join our live Tweetchat, Wednesday, March 7, 2018

How are you inspiring the extraordinary?

The next analytics extraordinary use case is just waiting to be discovered. We believe that in the hands of lifelong learners, the future of data is unlimited, especially when education and business join forces. That is why we are warming up to SAS Global Forum 2018 in Denver with a tweetchat on Wednesday 7th March (simply search #SASchat or #SASGF). We kick off at 6pm CET, 5pm UK, noon ET and 9am Pacific. Will you join us? The discussion will kick off with the following questions, posed to our expert panel:

  • Why is there more interest in lifelong learning now?
  • How does lifelong learning contribute to the analytics economy?
  • What are your favorite examples of analytics in the not-for-profit sector?
  • How is the education sector influencing the development of citizen data scientists?
  • What trends do you see in the consumption of analytics?

A conversation with SAS Global Forum 2018 Chair Goutam Chakraborty was published on SAS Users.

February 28, 2018
 

Chess has been intertwined with Computer Science since, well ... forever. Long before supercomputer Deep Blue, loaded with human-crafted chess algorithms, beat reigning world champion Garry Kasparov in 1997. Well before AlphaZero AI defeated all of human chess knowledge with just a few hours of study. Way before computers even [...]

The Inspiration of Things: technology shines at youth chess tournament was published on SAS Voices by Michael Thomas

February 27, 2018
 

“Quick response forecasting (QRF) techniques are forecasting processes that can incorporate information quickly enough to act upon by agile supply chains” explained Dr. Larry Lapide, in a recent Journal of Business Forecasting column. The concept of QRF is based on updating demand forecasts to reflect real and rapid changes in demand, both [...]

Is quick response forecasting a reality or just another buzzword? was published on SAS Voices by Charlie Chase

February 26, 2018
 

My article about the difference between CLASS variables and BY variables in SAS focused on SAS analytical procedures. However, the BY statement is also useful in the SAS DATA step where it is used to merge data sets and to analyze data at the group level. When you use the BY statement in the DATA step, the DATA step creates two temporary indicator variables for each variable in the BY statement. The names of these variables are FIRST.variable and LAST.variable, where variable is the name of a variable in the BY statement. For example, if you use the statement BY Sex, then the names of the indicator variables are FIRST.Sex and LAST.Sex.

This article gives several examples of using the FIRST.variable and LAST.variable indicator variables for BY-group analysis in the SAS DATA step. The first example shows how to compute counts and cumulative amounts for each BY group. The second example shows how to compute the time between the first and last visit of a patient to a clinic, as well as the change in a measured quantity between the first and last visit. BY-group processing in the DATA step is a fundamental operation that belongs in every SAS programmer's tool box.

Use FIRST. and LAST. variables to count the size of groups

The first example uses data from the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. The data are distributed with SAS. The following DATA step extracts the Smoking_Status and Weight variables and sorts the data by the Smoking_Status variable:

proc sort data=Sashelp.Heart(keep=Smoking_Status Weight)
          out=Heart;
   by Smoking_Status;
run;

Because the data are sorted by the Smoking_Status variable, you can use the FIRST.Smoking_Status and LAST.Smoking_Status temporary variables to count the number of observations in each level of the Smoking_Status variable. (PROC FREQ computes the same information, but does not require sorted data.) When you use the BY Smoking_Status statement, the DATA step automatically creates the FIRST.Smoking_Status and LAST.Smoking_Status indicator variables. As its name implies, the FIRST.Smoking_Status variable has the value 1 for the first observation in each BY group and the value 0 otherwise. (More correctly, the value is 1 for the first record and for records for which the Smoking_Status variable is different than it was for the previous record.) Similarly, the LAST.Smoking_Status indicator variable has the value 1 for the last observation in each BY group and 0 otherwise.

The following DATA step defines a variable named Count and initializes Count=0 at the beginning of each BY group. For every observation in the BY group, the Count variable is incremented by 1. When the last record in each BY group is read, that record is written to the Count data set.

data Count;
   set Heart;                 /* data are sorted by Smoking_Status */
   BY Smoking_Status;         /* automatically creates indicator vars */
   if FIRST.Smoking_Status then
      Count = 0;              /* initialize Count at beginning of each BY group */
   Count + 1;                 /* increment Count for each record */
   if LAST.Smoking_Status;    /* output only the last record of each BY group */
run;
 
proc print data=Count noobs; 
   format Count comma10.;
   var Smoking_Status Count;
run;
Use FIRST.variable and LAST.variable to count the size of groups

The same technique enables you to accumulate values of a variable within a group. For example, you can accumulate the total weight of all patients in each smoking group by using the following statements:

if FIRST.Smoking_Status then
   cumWt = 0;
cumWt + Weight;

This same technique can be used to accumulate revenue from various sources, such as departments, stores, or regions.
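
For example, here is a minimal sketch that totals revenue within each region, assuming a hypothetical Sales data set that contains Region and Revenue variables:

proc sort data=Sales out=SalesSorted;    /* hypothetical input data */
   by Region;
run;
 
data RegionTotals;
   set SalesSorted;
   by Region;                    /* creates FIRST.Region and LAST.Region */
   if first.Region then
      TotalRevenue = 0;          /* reset the accumulator for each region */
   TotalRevenue + Revenue;       /* sum statement retains TotalRevenue */
   if last.Region;               /* output only the last record per region */
   keep Region TotalRevenue;
run;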

Use FIRST. and LAST. variables to compute duration of treatment

Another common use of the FIRST.variable and LAST.variable indicator variables is to determine the length of time between a patient's first visit and his last visit. Consider the following DATA step, which defines the dates and weights for four male patients who visited a clinic as part of a weight-loss program:

data Patients;
informat Date date7.;
format Date date7. PatientID Z4.;
input PatientID Date Weight @@;
datalines;
1021 04Jan16  302  1042 06Jan16  285
1053 07Jan16  325  1063 11Jan16  291
1053 01Feb16  299  1021 01Feb16  288
1063 09Feb16  283  1042 16Feb16  279
1021 07Mar16  280  1063 09Mar16  272
1042 28Mar16  272  1021 04Apr16  273
1063 20Apr16  270  1053 28Apr16  289
1053 13May16  295  1063 31May16  269
;

For these data, you can sort by the patient ID and by the date of visit. After sorting, the first record for each patient contains the first visit to the clinic and the last record contains the last visit. You can subtract the patient's weight for these dates to determine how much the patient gained or lost during the trial. You can also use the INTCK function to compute the elapsed time between visits. If you want to measure time in days, you can simply subtract the dates, but the INTCK function enables you to compute duration in terms of years, months, weeks, and other time units.

proc sort data=Patients;
   by PatientID Date;
run;
 
data weightLoss;
   set Patients;
   BY PatientID;
   retain startDate startWeight;                 /* RETAIN the starting values */
   if FIRST.PatientID then do;
      startDate = Date; startWeight = Weight;    /* remember the initial values */
   end;
   if LAST.PatientID then do;
      endDate = Date; endWeight = Weight;
      elapsedDays = intck('day', startDate, endDate); /* elapsed time (in days) */
      weightLoss = startWeight - endWeight;           /* weight loss */
      AvgWeightLoss = weightLoss / elapsedDays;       /* average weight loss per day */
      output;                                         /* output only the last record in each group */
   end;
run;
 
proc print noobs; 
   var PatientID elapsedDays startWeight endWeight weightLoss AvgWeightLoss;
run;
Use FIRST.variable and LAST.variable to compute elapsed time and average quantities in a group

The output data set summarizes each patient's activities at the clinic, including his average weight loss and the duration of his treatment.

Some programmers think that the FIRST.variable and LAST.variable indicator variables require that the data be sorted, but that is not true. The temporary variables are created whenever you use a BY statement in a DATA step. You can use the NOTSORTED option on the BY statement to process records regardless of the sort order.
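
For example, the following sketch uses the NOTSORTED option on the unsorted Sashelp.Heart data. Here FIRST.Smoking_Status and LAST.Smoking_Status mark the start and end of each run of consecutive identical values rather than of a sorted BY group:

data Runs;
   set Sashelp.Heart(keep=Smoking_Status);  /* data are NOT sorted by Smoking_Status */
   by Smoking_Status notsorted;             /* groups are runs of consecutive equal values */
   if first.Smoking_Status then RunLength = 0;
   RunLength + 1;
   if last.Smoking_Status;                  /* one output record per run */
run;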

Summary

In summary, the BY statement in the DATA step automatically creates two indicator variables. You can use the variables to determine the first and last record in each BY group. Typically the FIRST.variable indicator is used to initialize summary statistics and to remember the initial values of measurement. The LAST.variable indicator is used to output the result of the computations, which often includes simple descriptive statistics such as a sum, difference, maximum, minimum, or average values.

BY-group processing in the DATA step is a common topic that is presented at SAS conferences. Some authors use FIRST.BY and LAST.BY as the name of the indicator variables. For further reading, I recommend the paper "The Power of the BY Statement" (Choate and Dunn, 2007). SAS also provides several samples about BY-group processing in the SAS DATA step, including the following:

The post How to use FIRST.variable and LAST.variable in a BY-group analysis in SAS appeared first on The DO Loop.

February 21, 2018
 

This article describes and implements a fast algorithm that estimates a median for very large samples. The traditional median estimate sorts a sample of size N and returns the middle value (when N is odd). The algorithm in this article uses Monte Carlo techniques to estimate the median much faster.

Of course, there's no such thing as a free lunch. The Monte Carlo estimate is an approximation. It is useful when a quick-and-dirty estimate is more useful than a more precise value that takes longer to compute. For example, if you want to find the median value of 10 billion credit card transactions, it might not matter whether the median $16.74 or $16.75. Either would be an adequate estimate for many business decisions.

Although the traditional sample median is commonly used, it is merely an estimate. All sample quantiles are estimates of a population quantity, and there are many ways to estimate quantiles in statistical software. To estimate these quantities faster, researchers have developed approximate methods such as the piecewise-parabolic algorithm in PROC MEANS or "probabilistic methods" such as the ones in this article.
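
For example, the following sketch (the data set name is a placeholder) requests the piecewise-parabolic approximation in PROC MEANS by using the QMETHOD= option:

proc means data=MyLargeData median qmethod=p2;  /* P2 = piecewise-parabolic estimation */
   var x;
run;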

Example data: A sample from a mixture distribution

Sample from mixture distribution showing sample median

Consider the following data set, which simulates 10 million observations from a mixture of an exponential and a normal distribution. You can compute the exact median for the distribution, which is 8.065. A histogram of the data and the population median are shown to the right.

Because the sample size is large, PROC MEANS takes a few seconds to compute the sample median by using the traditional algorithm which sorts the data (an O(N log N) operation) and then returns the middle value. The sample median is 8.065.

/* simulate from a mixture distribution. See
   https://blogs.sas.com/content/iml/2011/09/21/generate-a-random-sample-from-a-mixture-distribution.html */
%let N = 1e7;
data Have;
call streaminit(12345);
do i = 1 to &N;
   d = rand("Bernoulli", 0.6);
   if d = 0 then
      x = rand("Exponential", 0.5);
   else
      x = rand("Normal", 10, 2);
   output;
end;
keep x;
run;
 
/* quantile estimate */
proc means data=Have N Median;
   var x;
run;

A naive resampling algorithm

The basic idea of resampling methods is to extract a random subset of the data and compute statistics on the subset. Although the inferential statistics (standard errors and confidence interval) on the subset are not valid for the original data, the point estimates are typically close, especially when the original data and the subsample are large.

The simplest form of a resampling algorithm is to generate a random subsample of the data and compute the median of the subsample. The following example uses PROC SURVEYSELECT to resample (with replacement) from the data. The subsample is one tenth as large as the original sample. The subsequent call to PROC MEANS computes the median of the subsample, which is 8.063.

proc surveyselect data=Have seed=1234567 NOPRINT 
     out=SubSample
     method=urs samprate=0.1;                 /* 10% sampling with replacement */
run;
 
title "Estimate from 10% of the data";
proc means data=SubSample N Median;
   freq NumberHits;
   var x;
run;

It takes less time to extract a 10% subset and compute the median than it takes to compute the median of the full data. This naive resampling is simple to implement and often works well, as in this case. The estimate on the resampled data is close to the true population median, which is 8.065. (However, the standard error of the traditional estimate is smaller.)

You might worry, however, that by discarding 90% of the data you might discard too much information. In fact, there is about a 90% chance that we discarded the middle data value! (In N/10 draws with replacement, the chance that any particular observation is never chosen is (1 - 1/N)^(N/10) ≈ exp(-0.1) ≈ 0.90.) Thus it is wise to construct the subsample more carefully. The next section constructs a subsample that excludes data from the tails of the distribution and keeps data from the middle of the distribution.

A probabilistic algorithm for the median

This section presents a more sophisticated approximate algorithm for the median. My presentation of this Monte Carlo algorithm is based on the lecture notes of Prof. H-K Hon and a Web page for computer scientists. Unfortunately, neither source credits the original author of this algorithm. If you know the original reference, please post it in the comments.

Begin with a set S that contains N observations. The following algorithm returns an approximate median with high probability:

  1. Choose a random sample (with replacement) of size k = N^(3/4) from the data. Call the sample R.
  2. Choose lower and upper "sentinels," L and U. The sentinels are the lower and upper quantiles of R that correspond to the ranks k/2 ± sqrt(N). These sentinels define a symmetric interval about the median of R.
  3. (Optional) Verify that the interval [L, U] contains the median value of the original data.
  4. Let C be the subset of the original observations that are within the interval [L, U]. This is a small subset.
  5. Return the median of C.

The idea is to use the random sample only to find an interval that probably contains the median. The width of the interval depends on the size of the data: as a fraction of the subsample size, the offset from the median rank is sqrt(N)/k = N^(-1/4). If N=10,000, the algorithm uses the 40th and 60th percentiles of the random subset R. For N=1E8, it uses the 49th and 51st percentiles.

The following SAS/IML program implements the Monte Carlo estimate for the median:

proc iml;
/* Choose a set R of k = N##(3/4) elements in x, chosen  at random with replacement. 
   Choose quantiles of R to form an interval [L,U].
   Return the median of data that are within [L,U]. */
start MedianRand(x);
   N = nrow(x);
   k = ceil(N##0.75);    /* smaller sample size */
   R = T(sample(x, k));  /* bootstrap sample of size k (with replacement) */
   call sort(R);         /* Sort this subset of data */
   L = R[ floor(k/2 - sqrt(N)) ]; /* lower sentinel */
   U = R[  ceil(k/2 + sqrt(N)) ]; /* upper sentinel */
   UTail = (x > L);       /* indicator for large values */
   LTail = (x < U);       /* indicator for small values */
   if sum(UTail) > N/2 & sum(LTail) > N/2 then do;
      C = x[loc(UTail & LTail)]; /* extract central portion of data */
      return ( median(C) );      
   end;
   else do;
      print "Median not between sentinels!";
      return (.);
   end;
finish;
 
/* run the algorithm on example data */
use Have; read all var {"x"}; close;
call randseed(456);
 
t1 = time();
   approxMed = MedianRand(x);  /* compute approximate median; time it */
tApproxMed = time()-t1;
 
t0 = time();
   med = median(x);            /* compute traditional estimate; time it */
tMedian = time()-t0;
 
results = (approxMed || med) // (tApproxMed|| tMedian);
print results[colname={"Approx" "Traditional"}
              rowname={"Estimate" "Time"} F=Best6.];

The output indicates that the Monte Carlo algorithm gives an estimate that is close to the traditional estimate, but produces it three times faster. If you repeat this experiment for 1E8 observations, the Monte Carlo algorithm computes an estimate in 2.4 seconds versus 11.4 seconds for the traditional estimate, which is almost five times faster. As the data size grows, the speed advantage of the Monte Carlo algorithm increases. See Prof. Hon's notes for more details.

In summary, this article shows how to implement a probabilistic algorithm for estimating the median for large data. The Monte Carlo algorithm uses a bootstrap subsample to estimate a symmetric interval that probably contains the true median, then uses the interval to strategically choose and analyze only part of the original data. For large data, this Monte Carlo estimate is much faster than the traditional estimate.

The post A Monte Carlo algorithm to estimate a median appeared first on The DO Loop.

February 21, 2018
 

Data protection with GDPR

Here are some new tips for masking. The new EU General Data Protection Regulation (GDPR) requires your company to implement (quote) all necessary technical and organizational measures and to take into consideration the available technology at the time of the processing and technological developments. So, how can you comply with this requirement in the real world? In Part 1, we anonymized field content or replaced it with aliases. That can be sufficient, but it doesn’t have to be. That’s why we’ll cover beta functions in this article (the ideal solution for pseudonymization), personal data that has slipped through the cracks, and the exciting question of ...

Read part 1 of this series: Pseudonymagical: masking data to get up to speed with GDPR

How random can your birth be?

The exact date of your birth is important to you, naturally. The analytics experts working with your data, on the other hand, aren’t looking to send you birthday wishes anyway (missing opt-in?!). What they’re interested in is your approximate age, maybe even just the decade. The SQL code from Part 1 moves the date of birth randomly plus or minus five days. Someone who knows your birth date would therefore be unable to locate your records within a stolen database. Privacy risk abated!
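
This is not the Part 1 code itself, but the idea looks roughly like the following DATA step sketch (table and variable names are illustrative):

data masked;
   set customers;                                  /* hypothetical input table */
   shift = floor(rand('uniform') * 11) - 5;        /* random integer between -5 and +5 */
   birth_date = intnx('day', birth_date, shift);   /* move the birth date by up to 5 days */
   drop shift;
run;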

But even that should be verified… with respect to providing proof of “appropriate measures,” in other words, cluster size. In our example of around 5,000 VIP customers, there is only one who is in their 20’s and has a postal code beginning with the numeral 1. The time required to indirectly identify the individual (Recital 21, GDPR) could be rather low here. In the worst case scenario, legally too low.

Enter the beta function: the ideal solution for pseudonymization

Luckily, Recital 29 of the General Data Protection Regulation tells us how to handle this problem. The information required to pinpoint an individual is simply stored separately. That can be accomplished using a key or a mathematical function, in other words a macro with a secret key that I can use without knowing the math hidden behind it. The law doesn’t tell us how tricky this logic has to be, though. This so-called beta function should satisfy two additional conditions from an analytical standpoint:

  • It must be invertible (a hash is not, for instance).
  • The result of the masking should be monotonic, which means: high original value = high new value (encryption doesn’t do this).

Why? Well, we don’t want to affect the analytic modelling too much - ideally, the function would output something linear or slightly exponential… Here is a √2 example I’ve kept simple:

CODE1

Mathematically, this is a coordinate transformation - or you can also think of it in terms of Star Trek: people are being beamed to an unfamiliar planet. There is a different gravity field than the earth there (a different coordinate system), but it applies to everyone equally — which means that lightweight visitors on the planet can still jump higher there than their heavyweight colleagues. The same applies accordingly to age etc.

CODE2

When using the birth date or the age, I, as an analytics expert, have no knowledge of how this beaming works technically, but I trust that when I’m developing models (and later scoring them), nothing about the behavior has changed. By the way, the computer and the correlations don’t care anyway - neither has any concept of age. (It just feels a bit strange for humans.)

We don’t lose the “true” age. It can be re-calculated using another beta function, known as the inverse, which is available only to authorized employees - for instance, to fraud or legal staff during data protection lawsuits. In these cases, your customer can safely be beamed back to earth, so to speak.
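
As an illustration only (the author's actual functions are not shown in this article), a pair of macros along these lines, with a sqrt(2) scaling and a secret offset as assumed parameters, might look like this:

/* illustrative sketch of an invertible, monotonic beta function       */
/* in practice the secret key lives in a protected central repository  */
%let beta_key = 4242;                        /* assumed secret offset */
 
%macro beta(x);
   ( sqrt(2) * (&x) + &beta_key )            /* monotonic: high input gives high output */
%mend;
 
%macro beta_inverse(y);
   ( ((&y) - &beta_key) / sqrt(2) )          /* for authorized staff only */
%mend;
 
data masked;
   set customers;                            /* hypothetical input table */
   masked_age = %beta(age);
   drop age;
run;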

A complaint from my office mate

“But how do I explain to the boss my model behavior for these 300-year-olds?!” ... Well in this era of machine learning, neural networks are gaining in popularity and are as selective as they are indescribable. On our side, the math behind it is at least deterministic and explainable; good to know that this key code is no longer stored on your PC, not glued to its data source and target, but remote and safe – because of modern data protection to protect you and the data. And that’s a good thing.

Final aspect: the data for relevant columns has now been subjected to smart masking, the logic is in a central repository, and it’s working in secret. But what about those seemingly harmless fields way in the back, mostly empty and irrelevant, which then in the form of a sales memo or notice suddenly reveal the name of the wife, the second email address, or the former employer? The author who created them thought it was extremely practical, since they didn’t find anywhere else in the contract template where they could enter and save the information.

CODE

SAS Data Quality has pre-configured, transparent sets of rules that you can tweak as necessary to detect many of these types of cases using heuristics. That’s indispensable because if I don’t know about it, I can’t protect against it. (If I forget about the tiny basement window when installing the security system, I can be sure that the robbers won’t cooperate by breaking down the front door).

That is a prerequisite for an inventory of the data warehouse, the estimate of the GDPR implementation expense — and here an additional safeguard. Because in the code above, a firewall filter is applied to the data: if the name of a human being slips through the cracks, then only asterisks are displayed when it is output. The field “Note” is always replaced by the description of the category, such as “This is where a telephone number is hidden. After approval by the data protection officer, you may read it – but not for now.”

Are you ready for the GDPR? Learn how your peers are preparing in this global survey report.

Disclaimer: The author of this blog is not an attorney. None of the statements in this article can be construed as legal advice nor can they serve as a substitute for professional legal consultation. All code samples are for illustrative purposes only.

Beam your customers into invisibility: a data protection masked ball to get you up to speed with the GDPR was published on SAS Users.