2月 262018
 

My article about the difference between CLASS variables and BY variables in SAS focused on SAS analytical procedures. However, the BY statement is also useful in the SAS DATA step where it is used to merge data sets and to analyze data at the group level. When you use the BY statement in the DATA step, the DATA step creates two temporary indicator variables for each variable in the BY statement. The names of these variables are FIRST.variable and LAST.variable, where variable is the name of a variable in the BY statement. For example, if you use the statement BY Sex, then the names of the indicator variables are FIRST.Sex and LAST.Sex.

This article gives several examples of using the FIRST.variable and LAST.variable indicator variables for BY-group analysis in the SAS DATA step. The first example shows how to compute counts and cumulative amounts for each BY group. The second example shows how to compute the time between the first and last visit of a patient to a clinic, as well as the change in a measured quantity between the first and last visit. BY-group processing in the DATA step is a fundamental operation that belongs in every SAS programmer's tool box.

Use FIRST. and LAST. variables to find count the size of groups

The first example uses data from the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. The data are distributed with SAS. The following DATA step extracts the Smoking_Status and Weight variables and sorts the data by the Smoking_Status variable:

proc sort data=Sashelp.Heart(keep=Smoking_Status Weight)
          out=Heart;
   by Smoking_Status;
run;

Because the data are sorted by the Smoking_Status variable, you can use the FIRST.Smoking_Status and LAST.Smoking_Status temporary variables to count the number of observations in each level of the Smoking_Status variable. (PROC FREQ computes the same information, but does not require sorted data.) When you use the BY Smoking_Status statement, the DATA step automatically creates the FIRST.Smoking_Status and LAST.Smoking_Status indicator variables. As its name implies, the FIRST.Smoking_Status variable has the value 1 for the first observation in each BY group and the value 0 otherwise. (More correctly, the value is 1 for the first record and for records for which the Smoking_Status variable is different than it was for the previous record.) Similarly, the LAST.Smoking_Status indicator variable has the value 1 for the last observation in each BY group and 0 otherwise.

The following DATA step defines a variable named Count and initializes Count=0 at the beginning of each BY group. For every observation in the BY group, the Count variable is incremented by 1. When the last record in each BY group is read, that record is written to the Count data set.

data Count;
   set Heart;                 /* data are sorted by Smoking_Status */
   BY Smoking_Status;         /* automatically creates indicator vars */
   if FIRST.Smoking_Status then
      Count = 0;              /* initialize Count at beginning of each BY group */
   Count + 1;                 /* increment Count for each record */
   if LAST.Smoking_Status;    /* output only the last record of each BY group */
run;
 
proc print data=Count noobs; 
   format Count comma10.;
   var Smoking_Status Count;
run;
Use FIRST.variable and LAST.variable to count the size of groups

The same technique enables you to accumulate values of a variable within a group. For example, you can accumulate the total weight of all patients in each smoking group by using the following statements:

if FIRST.Smoking_Status then
   cumWt = 0;
cumWt + Weight;

This same technique can be used to accumulate revenue from various sources, such as departments, stores, or regions.

Use FIRST. and LAST. variables to compute duration of treatment

Another common use of the FIRST.variable and LAST.variable indicator variables is to determine the length of time between a patient's first visit and his last visit. Consider the following DATA step, which defines the dates and weights for four male patients who visited a clinic as part of a weight-loss program:

data Patients;
informat Date date7.;
format Date date7. PatientID Z4.;
input PatientID Date Weight @@;
datalines;
1021 04Jan16  302  1042 06Jan16  285
1053 07Jan16  325  1063 11Jan16  291
1053 01Feb16  299  1021 01Feb16  288
1063 09Feb16  283  1042 16Feb16  279
1021 07Mar16  280  1063 09Mar16  272
1042 28Mar16  272  1021 04Apr16  273
1063 20Apr16  270  1053 28Apr16  289
1053 13May16  295  1063 31May16  269
;

For these data, you can sort by the patient ID and by the date of visit. After sorting, the first record for each patient contains the first visit to the clinic and the last record contains the last visit. You can subtract the patient's weight for these dates to determine how much the patient gained or lost during the trial. You can also use the INTCK function to compute the elapsed time between visits. If you want to measure time in days, you can simply subtract the dates, but the INTCK function enables you to compute duration in terms of years, months, weeks, and other time units.

proc sort data=Patients;
   by PatientID Date;
run;
 
data weightLoss;
   set Patients;
   BY PatientID;
   retain startDate startWeight;                 /* RETAIN the starting values */
   if FIRST.PatientID then do;
      startDate = Date; startWeight = Weight;    /* remember the initial values */
   end;
   if LAST.PatientID then do;
      endDate = Date; endWeight = Weight;
      elapsedDays = intck('day', startDate, endDate); /* elapsed time (in days) */
      weightLoss = startWeight - endWeight;           /* weight loss */
      AvgWeightLoss = weightLoss / elapsedDays;       /* average weight loss per day */
      output;                                         /* output only the last record in each group */
   end;
run;
 
proc print noobs; 
   var PatientID elapsedDays startWeight endWeight weightLoss AvgWeightLoss;
run;
Use FIRST.variable and LAST.variable to compute elapsed time and average quantities in a group

The output data set summarizes each patient's activities at the clinic, including his average weight loss and the duration of his treatment.

Some programmers think that the FIRST.variable and LAST.variable indicator variables require that the data be sorted, but that is not true. The temporary variables are created whenever you use a BY statement in a DATA step. You can use the NOTSORTED option on the BY statement to process records regardless of the sort order.

Summary

In summary, the BY statement in the DATA step automatically creates two indicator variables. You can use the variables to determine the first and last record in each BY group. Typically the FIRST.variable indicator is used to initialize summary statistics and to remember the initial values of measurement. The LAST.variable indicator is used to output the result of the computations, which often includes simple descriptive statistics such as a sum, difference, maximum, minimum, or average values.

BY-group processing in the DATA step is a common topic that is presented at SAS conferences. Some authors use FIRST.BY and LAST.BY as the name of the indicator variables. For further reading, I recommend the paper "The Power of the BY Statement" (Choate and Dunn, 2007). SAS also provides several samples about BY-group processing in the SAS DATA step, including the following:

The post How to use FIRST.variable and LAST.variable in a BY-group analysis in SAS appeared first on The DO Loop.

2月 212018
 

This article describes and implements a fast algorithm that estimates a median for very large samples. The traditional median estimate sorts a sample of size N and returns the middle value (when N is odd). The algorithm in this article uses Monte Carlo techniques to estimate the median much faster.

Of course, there's no such thing as a free lunch. The Monte Carlo estimate is an approximation. It is useful when a quick-and-dirty estimate is more useful than a more precise value that takes longer to compute. For example, if you want to find the median value of 10 billion credit card transactions, it might not matter whether the median $16.74 or $16.75. Either would be an adequate estimate for many business decisions.

Although the traditional sample median is commonly used, it is merely an estimate. All sample quantiles are estimates of a population quantity, and there are many ways to estimate quantiles in statistical software. To estimate these quantities faster, researchers have developed approximate methods such as the piecewise-parabolic algorithm in PROC MEANS or "probabilistic methods" such as the ones in this article.

Example data: A sample from a mixture distribution

Sample from mixture distribution showing sample median

Consider the following data set, which simulates 10 million observations from a mixture of an exponential and a normal distribution. You can compute the exact median for the distribution, which is 8.065. A histogram of the data and the population median are shown to the right.

Because the sample size is large, PROC MEANS takes a few seconds to compute the sample median by using the traditional algorithm which sorts the data (an O(N log N) operation) and then returns the middle value. The sample median is 8.065.

/* simulate from a mixture distribution. See
   https://blogs.sas.com/content/iml/2011/09/21/generate-a-random-sample-from-a-mixture-distribution.html */
%let N = 1e7;
data Have;
call streaminit(12345);
do i = 1 to &N;
   d = rand("Bernoulli", 0.6);
   if d = 0 then
      x = rand("Exponential", 0.5);
   else
      x = rand("Normal", 10, 2);
   output;
end;
keep x;
run;
 
/* quantile estimate */
proc means data=Have N Median;
   var x;
run;

A naive resampling algorithm

The basic idea of resampling methods is to extract a random subset of the data and compute statistics on the subset. Although the inferential statistics (standard errors and confidence interval) on the subset are not valid for the original data, the point estimates are typically close, especially when the original data and the subsample are large.

The simplest form of a resampling algorithm is to generate a random subsample of the data and compute the median of the subsample. The following example uses PROC SURVEYSELECT to resample (with replacement) from the data. The subsample is one tenth as large as the original sample. The subsequent call to PROC MEANS computes the median of the subsample, which is 8.063.

proc surveyselect data=Have seed=1234567 NOPRINT 
     out=SubSample
     method=urs samprate=0.1;                 /* 10% sampling with replacement */
run;
 
title "Estimate from 10% of the data";
proc means data=SubSample N Median;
   freq NumberHits;
   var x;
run;

It takes less time to extract a 10% subset and compute the median than it takes to compute the median of the full data. This naive resampling is simple to implement and often works well, as in this case. The estimate on the resampled data is close to the true population median, which is 8.065. (However, the standard error of the traditional estimate is smaller.)

You might worry, however, that by discarding 90% of the data you might discard too much information. In fact, there is a 90% chance that we discarded the middle data value! Thus it is wise to construct the subsample more carefully. The next section constructs a subsample that excludes data from the tails of the distribution and keeps data from the middle of the distribution.

A probabilistic algorithm for the median

This section presents a more sophisticated approximate algorithm for the median. My presentation of this Monte Carlo algorithm is based on the lecture notes of Prof. H-K Hon and a Web page for computer scientists. Unfortunately, neither source credits the original author of this algorithm. If you know the original reference, please post it in the comments.

Begin with a set S that contains N observations. The following algorithm returns an approximate median with high probability:

  1. Choose a random sample (with replacement) of size k = N3/4 from the data. Call the sample R.
  2. Choose lower and upper "sentinels," L and U. The sentinels are the lower and upper quantiles of R that correspond to the ranks k/2 ± sqrt(N). These sentinels define a symmetric interval about the median of R.
  3. (Optional) Verify that the interval [L, U] contains the median value of the original data.
  4. Let C be the subset of the original observations that are within the interval [L, U]. This is a small subset.
  5. Return the median of C.

The idea is to use the random sample only to find an interval that probably contains the median. The width of the interval depends on the size of the data. If N=10,000, the algorithm uses the 40th and 60th percentiles of the random subset R. For N=1E8, it uses the 49th and 51th percentiles.

The following SAS/IML program implements the Monte Carlo estimate for the median:

proc iml;
/* Choose a set R of k = N##(3/4) elements in x, chosen  at random with replacement. 
   Choose quantiles of R to form an interval [L,U].
   Return the median of data that are within [L,U]. */
start MedianRand(x);
   N = nrow(x);
   k = ceil(N##0.75);    /* smaller sample size */
   R = T(sample(x, k));  /* bootstrap sample of size k (with replacement) */
   call sort(R);         /* Sort this subset of data */
   L = R[ floor(k/2 - sqrt(N)) ]; /* lower sentinel */
   U = R[  ceil(k/2 + sqrt(N)) ]; /* upper sentinel */
   UTail = (x > L);       /* indicator for large values */
   LTail = (x < U);       /* indicator for small values */
   if sum(UTail) > N/2 & sum(LTail) > N/2 then do;
      C = x[loc(UTail & LTail)]; /* extract central portion of data */
      return ( median(C) );      
   end;
   else do;
      print "Median not between sentinels!";
      return (.);
   end;
finish;
 
/* run the algorithm on example data */
use Have; read all var {"x"}; close;
call randseed(456);
 
t1 = time();
   approxMed = MedianRand(x);  /* compute approximate median; time it */
tApproxMed = time()-t1;
 
t0 = time();
   med = median(x);            /* compute traditional estimate; time it */
tMedian = time()-t0;
 
results = (approxMed || med) // (tApproxMed|| tMedian);
print results[colname={"Approx" "Traditional"}
              rowname={"Estimate" "Time"} F=Best6.];

The output indicates that the Monte Carlo algorithm gives an estimate that is close to the traditional estimate, but produces it three times faster. If you repeat this experiment for 1E8 observations, the Monte Carlo algorithm computes an estimate in 2.4 seconds versus 11.4 seconds for the traditional estimate, which is almost five times faster. As the data size grows, the speed advantage of the Monte Carlo algorithm increases. See Prof. Hon's notes for more details.

In summary, this article shows how to implement a probabilistic algorithm for estimating the median for large data. The Monte Carlo algorithm uses a bootstrap subsample to estimate a symmetric interval that probably contains the true median, then uses the interval to strategically choose and analyze only part of the original data. For large data, this Monte Carlo estimate is much faster than the traditional estimate.

The post A Monte Carlo algorithm to estimate a median appeared first on The DO Loop.

2月 212018
 

Data protection with GDPRHere are some new tips for masking. The new EU General Data Protection Regulation (GDPR) requires your company to implement (quote) all necessary technical and organizational measures and to take into consideration the available technology at the time of the processing and technological developments. So, how can you comply with this requirement in the real world? In Part 1, we anonymized field content or replaced it with aliases. That can be sufficient, but it doesn’t have to be. That’s why we’ll cover beta functions in this article (the ideal solution for pseudonymization), personal data that has slipped through the cracks, and the exciting question of ...

Read part 1 of this series: Pseudonymagical: masking data to get up to speed with GDPR

How random can your birth be?

The exact date of your birth is important to you, naturally. The analytics experts working with your data, on the other hand, aren’t looking to send you birthday wishes anyway (missing opt-in?!). What they’re interested in is your approximate age, maybe even just the decade. The SQL code from Part 1 moves the date of birth randomly plus or minus five days. Someone who knows your birth date would therefore be unable to locate your records within a stolen database. Privacy risk abated!

But even that should be verified… with respect to providing proof of “appropriate measures,” in other words, cluster size. In our example of around 5,000 VIP customers, there is only one who is in their 20’s and has a postal code beginning with the numeral 1. The time required to indirectly identify the individual (Recital 21, GDPR) could be rather low here. In the worst case scenario, legally too low.

Enter the beta function: the ideal solution for pseudonymization

Luckily, Recital 29 of the General Data Protection Regulation tells us how to handle this problem. The information required to pinpoint an individual is simply stored separately. That can be accomplished using a key or a mathematical function, in other words a macro, with a secret key that I only use – but don’t know about the math hidden behind it. The law doesn’t tell us how tricky this logic has to be, though. This so-called beta function should satisfy two additional conditions from an analytical standpoint:

  • It must be invertible (a hash is not, for instance).
  • The result of the masking should be monotonic, which means: high original value = high new value (encryption doesn’t do this).

Why? Well, we don’t want to affect the analytic modelling too much - ideally, the function would output something linear or slightly exponential… Here is a √2 example I’ve kept simple:

CODE1

Mathematically, this is a coordinate transformation - or you can also think of it in terms of Star Trek: people are being beamed to an unfamiliar planet. There is a different gravity field than the earth there (a different coordinate system), but it applies to everyone equally — which means that lightweight visitors on the planet can still jump higher there than their heavyweight colleagues. The same applies accordingly to age etc.

CODE2

When using the birth date or the age, I, as an analytics expert, have no knowledge of how this beaming works technically, but trust that when I’m developing models (and later when scoring) them, that nothing about the behavior has changed. By the way, the computer and correlation don’t care anyway - neither have any concept of age. (It just feels a bit strange for humans.)

We don’t lose the “true” age. It can be re-calculated using another beta function. With what is known as the inverse, but it’s available only to authorized employees - for instance to fraud or legal people during data protection lawsuits. In these cases, your customer can safely be beamed back to earth, so to speak.

A complaint from my office mate

“But how do I explain to the boss my model behavior for these 300-year-olds?!” ... Well in this era of machine learning, neural networks are gaining in popularity and are as selective as they are indescribable. On our side, the math behind it is at least deterministic and explainable; good to know that this key code is no longer stored on your PC, not glued to its data source and target, but remote and safe – because of modern data protection to protect you and the data. And that’s a good thing.

Final aspect: the data for relevant columns has now been subjected to smart masking, the logic is in a central repository, and it’s working in secret. But what about those seemingly harmless fields way in the back, mostly empty and irrelevant, which then in the form of a sales memo or notice suddenly reveal the name of the wife, the second email address, or the former employer? The author who created them thought it was extremely practical, since they didn’t find anywhere else in the contract template where they could enter and save the information.

CODE

SAS Data Quality has pre-configured, transparent sets of rules that you can tweak as necessary to detect many of these types of cases using heuristics. That’s indispensable because if I don’t know about it, I can’t protect against it. (If I forget about the tiny basement window when installing the security system, I can be sure that the robbers won’t cooperate by breaking down the front door).

That is a prerequisite for an inventory of the data warehouse, the estimate of the GDPR implementation expense — and here an additional safeguard. Because in the code above, a firewall filter is applied to the data: if the name of a human being slips through the cracks, then only asterisks are displayed when it is output. The field “Note” is always replaced by the description of the category, such as “This is where a telephone number is hidden. After approval by the data protection officer, you may read it – but not for now.”

Are you ready for the GDPR? Learn how your peers are preparing in this global survey report.

Disclaimer: The author of this blog is not an attorney. None of the statements in this article can be construed as legal advice nor can they serve as a substitute for professional legal consultation. All code samples are for illustrative purposes only.

Beam your customers into invisibility: a data protection masked ball to get you up to speed with the GDPR was published on SAS Users.

2月 212018
 

Masking data for GDPRYou work with data. Data about your customers. It's likely that your customers' identity could be determined from the data you’ve collected. Starting in May 2018, a new data protection law will be in effect. This means you’re going to have to document which technical measures you’ve implemented to prevent your customers from being identified. That’s tricky, but isn’t everything already pseudonymized or anonymous or something? Won’t the IT department just take care of it? No. They won’t. This blog post gives you a detailed breakdown of what’s possible and where standard software can help with masking data for GDPR.

Read part 2 of this series: Beam your customers into invisibility: a data protection masked ball to get you up to speed with the GDPR

In the past, compliance has been relatively easy. The data protection officer was sure to take care of everything, documenting the steps taken together with IT and the compliance team. It felt like the powerful data protection law was there to protect people in charge - like you – from the data’s owners out there. Too abstract for you? Here’s an analogy:

Let’s say my car’s inspection is up in May. I have placed my trust in a copy of the traffic regulations that rests in my glove box. I tell the inspector, this is how I’ve always done things! And my perfectly maintained car has been running great for years. Wait a minute, the new mechanic seems to have missed something last time... and now you want my documentation?! Huh? What documentation?

Data protection by design and by default (Art. 25 GDPR)

But let’s go back to (your) data, which you obtained from the data warehouse (DWH), received from generous colleagues’ collections, or harvested fresh from the data lake ... you know, the stuff that self-service big data analytics is made of. Mix it all together, and the computer is able to guess the customer’s behavior. Lawyers are less inclined to wax poetic here, referring instead to “profiling” (GDPR Article 4), which can quickly extend beyond the initial purpose of the processing. The bad part? If the customer doesn’t reasonably expect this kind of processing of their information, they can submit a complaint and even demand their data be deleted at once (Article 22 and Recital 71 GDPR).

What to do? Well, get rid of personally identifiable information right from the start! So, just select fewer data points? Yes. But if you need them for your analytics, write down in advance exactly how the data will be processed ... that’ll give your data scientist a headache. Even fragments with very minimal data could still be traced back to single out the “wife of a dentist, aged 30 to 40, living in Smallsville.” So it’s better to properly “pseudonymize” or anonymize from the start.

In other words, you have to replace all dates with asterisks or a random number or encrypt them or hash them - then, in the best-case scenario, the record is anonymous. But this procedure, unfortunately, also makes it useless for analysis in most cases. Or even worse, completely wrong with respect to segmentation, model development, or scoring. More about that later. Here’s an example of “common” customer data:

Let’s say we have a file that contains a great deal of personally identifiable information. It could even be lurking in comment fields or notes - without ever being listed in a data dictionary or the records of processing activities. But it’s hiding in there, in unassuming properties, selected out of curiosity or by mistake, and thus stored. And simply storing it brings the GDPR into play. Oops!

Totally anonymous: ensure compliance by making data useless?

So how exactly do you make data unidentifiable without completely destroying all of its value? When masking, you should deploy standard functions that use quality-controlled logic in such a way that users who processes the data are unable to view the algorithmn. In SQL, it may look something like this:

CREATE VIEW pdp_de_demo.Team_Alpha.CRM_CUSTOMERBASE_GDPR_VIEW AS SELECT
     SYSCAT.DM.MASK ( ‘ENCRYPT’, PUT(A.CustomerNo , 8.), ‘alg’, ‘AES’, ‘key’, ‘12345)
          AS CustomerNr_encrypt,
     SYSCAT.DM.MASK ( ‘HASH’, A.IBAN , ‘alg’, ‘SHA256’, ‘key’, ‘12345)
          AS IBAN_hash,
     SYSCAT.DM.MASK ( ‘TRANC’,A.”IBAN” , ‘FROM’, ‘1234567890’, ‘TO’, ‘XXXXXXXXXX’,
               ‘START’, 10 , ‘LENGTH’, 9 )
          AS IBAN_tranc,
     PUT(SYSCAT.DM.MASK ( ‘RANDATE’, A.Bithdate, ‘VARY’, 5, ‘UNITS’, ‘DAY), DDMMYYP10.)
          AS Birthdate,
     SYSCAT.DM.MASK ( ‘RANDOM’, A.Balance, ‘VARY’, 100 )
          AS Balance,
     (CASE WHEN ( SYSPROC.DQ.DQUALITY.DQEXTRACT (
                               A.COMMENTFIELD, ‘PDP - Personal Data (Core)’, ‘Individual’,’DEDEU’ ) ne “)
               THEN* * *’ ELSE A.COMMENTFIELD END)
          AS Commentfield_without_name,
     SYSPROC.DQ.DQUALITY.DQIDENTIFY ( A.ANNOTATION, ‘PDP - Personal Data (Core)’, ‘DEDEU’ )
          AS ANNOTATION_IDENTIFY
FROM pdp_de_demo.data.CRM_CUSTOMERBASE AS A

The results look appealing.

But the following methodological problem quickly arises. Who in the corporate data flow carries out this step, and where should they store the results? Depending on the type of processing, the technician or analytics expert may not be permitted to even view the data in the first place, making it impossible for them to apply masking on top! How about IT (or the application developers) doing a generic anonymization of the entire DWH in one go? Well, that’s a sure way to secure the data, and a sure way to make it useless for a lot of other things, too!

Who’s masking who?

The solution: no one touches those official tables (see note at bottom). The user just no longer (physically) accesses them directly, but rather via dynamic views, which only read specified data. These views handle masking “on-the-fly,” in real time or from the cache – without destroying the original data. In the simplest cases (as in our example code above), the content of the respective fields is modified “for all selectors.” Or depending on a control table, they’re flipped off like a switch the instant a user communicates their opt-out.

The SAS Federation Server hosts these types of views, gently inserting itself between business users, analytics experts, and internal consumers of reports. Once logged in, the server knows the user’s groups, roles, and permissions. This can be elegantly leveraged in the design of the views. The very same view can allow that team with the VIP flag to also see the balance, whereas the fraud team sees everything, while the rest get harmless asterisks in place of sensitive information. All from the same physical data, controlled through a central platform.

New on the inside, proven on the outside: controlled protection for switching on and off

Access to these types of centrally administered views is accomplished via ODBC / JDBC / API. Or, for SAS users, via the trusty libname statement. There, instead of “ORACLE,” “FEDSVR” is your keyword. Everything else looks just like it always has. The tables (some of which are not actual tables) appear, the code is ready to run. Such libref can, of course, also be pre-assigned in the metadata context.

LIBNAME mydwh FEDSVR DSN=dwh1 SERVER=”demo.sas.com” PORT=24141 SCHEMA=Team_Alpha;

A double benefit: the IT team can take its time rebuilding the physical layer (DBMS portings, modifications to the DDL, switching to a different DB, etc.) without the phones ringing off the hook with important departments calling to complain. Plus, every access can be logged. The SAS Federation Server can optionally record queries as well, regardless of whatever sneaky macro data step was used - everything is processed and logged. And users cannot get around it by deleting their SAS logs.

This activity should not be misconstrued as surveillance. The GDPR is what demands proof of who, how, where, and when ... and for data breaches, within 72 hours. Like when someone at the company burns a CD with sensitive tax information or commits other internal fraud. And as with the deletion approach, it’s already a big step in the right direction to be able to say, “yes, that’s how it would work — should we turn it on?”

Summary of the first part

The General Data Protection Regulation requires your company to implement the “appropriate technical and organisational measures” and “taking into consideration the available technology at the time of the processing and technological developments.” This article has given you some suggestions for things you can discuss with colleagues. And what’s more, the software is already out there - there’s no need to patch together a solution yourself. Instead, you can use that valuable time to generate the documentation for compliance instead.

P.S. There’s no time to lose to comply with the GDPR. Learn how to get ready in this global survey report.

Attachment: Legal texts

Profiling GDPR Art. 4, Par. 4:
‘profiling’ means any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular to analyze or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behavior, location or movements;

The GDPR is more specific with anonymous pseudonymization (Recital 26 and 29):
To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments. The principles of data protection should therefore not apply to anonymous information, namely (...) to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.

In order to create incentives to apply pseudonymization when processing personal data, measures of pseudonymization should, whilst allowing general analysis, be possible within the same controller when that controller has taken technical and organisational measures necessary to ensure, for the processing concerned, that this Regulation is implemented, and that additional information for attributing the personal data to a specific data subject is kept separately. The controller processing the personal data should indicate the authorised persons within the same controller.

A final note: To ensure the separate “right to be forgotten” in the data warehouse (and countless source systems) as per Art. 17 GDPR, special measures must be taken. But this is not what we’re talking about above — deleted records are “gone,” and can no longer be processed or pseudonymized by means of profiling. Note: In the event that they are not intended (or permitted) to disappear from the relevant database immediately, approaches such as the one described could offer a temporary solution.

Disclaimer: The author of this blog is not an attorney. None of the statements in this article can be construed as legal advice nor can they serve as a substitute for professional legal consultation. All code samples are for illustrative purposes only.

Pseudonymagical: masking data to get up to speed with GDPR was published on SAS Users.

2月 202018
 

Unless you live under a rock, you've probably seen news reports that Russian trolls have been posting on social media to allegedly conduct what they called "information warfare against the United States, with the stated goal of spreading distrust towards the candidates and the political system in general." NBC recently [...]

The post Examining 200k+ Russian troll tweets appeared first on SAS Learning Post.

2月 202018
 

Start off the New Year by brushing up your SAS programming skills! Begin your goal to become SAS certified or explore these New SAS books and other SAS Press titles, many of which will be making their bookshelf debut at SAS® Global Forum 2018 in Denver, CO! Want to be notified when a new book [...]

The post Six new SAS books for 2018 appeared first on SAS Learning Post.

2月 202018
 

When using conventional methods to access and analyze data sets from Teradata tables, SAS brings all the rows from a Teradata table to SAS Workspace Server. As the number of rows in the table grows over time, it adds to the network latency to fetch the data from a database management system (DBMS) and move it to SAS Workspace Server. Considering big data, SAS Workspace Server may not have enough capacity to hold all the rows from a Teradata table.

SAS In-Database processing can help solve the problem of returning too much data from the database. SAS In-Database processing allows you to perform data operations inside the DBMS and use the distributed processing over multiple Access Module Processors (AMPs). Select SAS procedures take advantage of Teradata SQL functionality, and in some cases leverage SAS functions deployed inside the DBMS. The goal of in-database processing is to reduce the I/O required to transfer the data from Teradata to SAS.

SAS® In-Database Processing in Teradata

Using SAS In-Database processing, you can run scoring models, some SAS procedures, DS2 threaded programs, and formatted SQL queries inside the Teradata database.

The list of SAS In-Database features supported for Teradata include:

  • Format publishing and SAS_PUT()function
  • Scoring Models
  • Select BASE SAS® Procedures ( FREQ, RANK, REPORT, SORT, SUMMARY/MEAN , TABULATE)
  • Select SAS/STAT® Procedures (CORR, CANCORR, DMDB, DMINE, DMREG, FACTOR, PRINCOMP,
  • REG, SCORE, TIMESERIES, VARCLUS )
  • DS2 Threaded programs
  • Data quality operations
  • Extract and transform data

SAS In-Database Deployment Package for Teradata

The in-database deployment package for Teradata includes the following:

  • The SAS formats library, accelterafmt-######.rpm, installs a SAS formats library on the Teradata server. By having a SAS formats library on your Teradata system, you can publish SAS formats in Teradata, which enables you to process SAS statements with SAS formats in the Teradata database. This also enables you to publish SAS PUT functions to Teradata as a SAS_PUT() function. This software can be found in your SAS Install folder under /SAS-install-directory/SASFormatsLibraryforTeradata/3.1/TeradataonLinux/.
  • The SAS® Embedded Process package, sepcoretera-######.rpm, installs SAS Embedded Process in the Teradata database. This is the core package of in-database components. This software can be found in your software depot under folder /depot/standalone_installs/SAS_Core_Embedded_Process_Package_for_Teradata/13_0/Teradata_on_Linux.
  • The SASEPFUNC package, sasepfunc-#####.x86_64.tar.gz, installs SAS Embedded Process support functions on Teradata. SAS Embedded Process support functions are Teradata stored procedures that generate SQL to interface with SAS Embedded Process. The script from the package creates a Teradata database named SAS_SYSFNLIB with a list of tables, views, functions, and procedures to support SAS Embedded Process. The same script also adds a list of functions in the TD_SYSFNLIB database. The package can be obtained from the Teradata support group.

The following figure shows the list of objects from the SAS_SYSFNLIB database to support SAS Embedded Process:

The following shows the list of objects from the TD_SYSFNLIB database to support SAS Embedded Process:

  • The SAS® Quality Knowledge Base package, sasqkb_ci-27.#####.noarch.rpm, installs SAS Quality Knowledge Base on the Teradata server. This is an optional package to SAS Embedded Process. This package is needed along with SAS® Quality Accelerator, if you are planning to run data cleansing operations in the Teradata database. The package can be downloaded from the SAS support site.
  • The SAS Quality Accelerator package. There are two scripts (dq_install.sh and dq_grant.sh) located under SAS-Install-directory to install the data quality accelerator at Teradata. This is an optional package to SAS Embedded Process, and needed only if you are planning to run data cleansing operations in Teradata. The software install files can be found in the folder /SAS-install-directory/SASDataQualityAcceleratorforTeradata/9.4/dqacctera/sasmisc/. As a part of script execution, it adds a list of objects (procedures, functions) to the SAS_SYSFNLIB database.

Sample list of data quality related objects from the SAS_SYSFNLIB database.

Examples of running DS2 Code to perform data quality, data extract, and transform operations in Teradata:

The following example describes the execution of DS2 code by using SAS Data Quality Accelerator and SAS Quality Knowledge Base to match and extract a data set from the Teradata database. The log shows that both Threads program and Data program ran in the Teradata database as in-database program execution.

CODE

Stay tuned in for the next part of the SAS In-Database Processing in Teradata blog series. Coming up is one about publishing SAS format in Teradata.

SAS In-Database Processing in Teradata DBMS was published on SAS Users.

2月 192018
 

Your statistical software probably provides a function that computes quantiles of common probability distributions such as the normal, exponential, and beta distributions. Because there are infinitely many probability distributions, you might encounter a distribution for which a built-in quantile function is not implemented. No problem! This article shows how to numerically compute the quantiles of any probability distribution from the definition of the cumulative distribution (CDF).

In SAS, the QUANTILE function computes the quantiles for about 25 distributions. This article shows how you can use numerical root-finding methods (and possibly numerical integration) in SAS/IML software to compute the quantile function for ANY continuous distribution. I have previously written about related topics and particular examples, such as the following:

The quantile is the root of an integral equation

Quantiles are the solutions to the equation CDF(x)-p=0, where p is a probability

Computing a quantile would make a good final exam question for an undergraduate class in numerical analysis. Although some distributions have an explicit CDF, many distributions are defined only by a probability density function (the PDF, f(x)) and numerical integration must be used to compute the cumulative distribution (the CDF, F(x)). A canonical example is the normal distribution. I've previously shown how to use numerical integration to compute a CDF from a PDF by using the definition F(x) = ∫ f(t) dt, where the lower limit of the integral is –∞ and the upper limit is x.

Whether the CDF is defined analytically or through numerical integration, the quantile for p is found implicitly as the solution to the equation F(x) = p, where p is a probability in the interval (0,1). This is illustrated by the graph at the right.

Equivalently, you can define G(x; p) = F(x) – p so that the quantile is the root of the equation G(x; p) = 0. For well-behaved densities that occur in practice, a numerical root is easily found because the CDF is monotonically increasing. (If you like pathological functions, see the Cantor staircase distribution.)

Example: Create a custom distribution

SAS/IML software provides the QUAD subroutine, which provides numerical integration, and the FROOT function, which solves for roots. Thus SAS/IML is an ideal computational environment for computing quantiles for custom distributions.

As an example, consider a distribution that is a mixture of an exponential and a normal distribution:
F(x) = 0.4 Fexp(x; 0.5) + 0.6 Φ(x; 10, 2),
where Fexp(x; 0.5) is the exponential distribution with scale parameter 0.5 and Φ(x; 10, 2) is the normal CDF with mean 20 and standard deviation 2. In this case, you do not need to use numerical integration to compute the CDF. You can compute the CDF as a linear combination of the exponential and normal CDFs, as shown in the following SAS/IML function:

/* program to numerically find quantiles for a custom distribution */
proc iml;
/* Define the cumulative distribution function here. */
start CustomCDF(x); 
   F = 0.4*cdf("Exponential", x, 0.5) +
       0.6*cdf("Normal", x, 10, 2);
   return F;
finish;

The previous section shows the graph of the CDF on the interval [0, 16]. The vertical and horizontal lines correspond to the first, second and third quartiles of the distribution. The quartiles are close to the values Q1 ≈ 0.5, Q2 ≈ 8, and Q3 ≈ 10.5. The next section shows how to compute the quantiles.

Compute quantiles for an arbitrary distribution

As long as you can define a function that evaluates the CDF, you can find quantiles. For unbounded distributions, it is usually helpful to plot the CDF so that you can visually estimate an interval that contains the quantile. (For bounded distributions, the support of the distribution contains all quantiles.) For the mixture distribution in the previous section, it is clear that the quantiles are in the interval [0, 16].

The following program finds arbitrary quantiles for whichever CDF is evaluated by the CustomCDF function. To find quantiles for a different function, you can modify the CustomCDF and change the interval on which to find the quantiles. You do not need to modify the RootFunc or CustomQuantile functions.

/* Express CDF(x)=p as the root for the function CDF(x)-p. */
start RootFunc(x) global(gProb);
   return CustomCDF(x) - gProb;  /* quantile for p is root of CDF(x)-p */
finish;
 
/* You need to provide an interval on which to search for the quantiles. */
start CustomQuantile(p, Interval) global(gProb);
   q = j(nrow(p), ncol(p), .);              /* allocate result matrix */
   do i = 1 to nrow(p)*ncol(p);             /* for each element of p... */
      gProb = p[i];                         /*    set global variable   */
      q[i] = froot("RootFunc", Interval);   /*    find root (quantile)  */ 
   end;
   return q;
finish;
 
/* Example: look for quartiles in interval [0, 16] */
probs = {0.25 0.5 0.75};         /* Q1, Q2, Q3 */
intvl = {0 16};                  /* interval on which to search for quantiles */
quartiles = CustomQuantile(probs, intvl);
print quartiles[colname={Q1 Q2 Q3}];
Quartiles for mixture of exponential and normal distributions

In summary, you can compute an arbitrary quantile of an arbitrary continuous distribution if you can (1) evaluate the CDF at any point and (2) numerically solve for the root of the equation CDF(x)-p for a probability value, p. Because the support of the distribution is arbitrary, the implementation requires that you provide an interval [a,b] that contains the quantile.

The computation should be robust and accurate for non-pathological distributions provided that the density is not tiny or zero at the value of the quantile. Although this example is illustrated in SAS, the same method will work in other software.

The post Compute the quantiles of any distribution appeared first on The DO Loop.

2月 172018
 

Do a quick search on “data scientist” on any of the popular job boards and there’s no denying the global shortage of data scientists is a real one. And, whether you’re looking for the salary commensurate with the prestigious title, a fast pass to the C-suite, or simply want to [...]

The post SAS Academy for Data Science creates top-rate analytical professionals appeared first on SAS Learning Post.