4月 162018
 

A colleague and I recently discussed how to generate random permutations without encountering duplicates. Given a set of n items, there are n! permutations My colleague wants to generate k unique permutations at random from among the total of n!. Said differently, he wants to sample without replacement from the set of all possible permutations.

In SAS, you can use the RANPERM function to generate ranom permutations. For example, the statement p = ranperm(4) generates a random permutation of the elements in the set {1, 2, 3, 4}. However, the function does not ensure that the permutations it generates will be unique.

I can think of two ways to generate random permutations without replacement. The first is to generate all permutations and then choose a random subset of size k without replacement from the set of n! possibilities. This technique is efficient when n is small or when k is a substantial proportion of n!. The second technique is to randomly generate permutations until you get a total of k that are unique. This technique is efficient when n is large and k is small relative to n!.

Generate all permutations, then sample without replacement

For small sets (say, n ≤ 8), an efficient way to generate random permutations is to generate all permutations and then extract a random sample without replacement. In the SAS/IML language you can use the ALLPERM function to generate all permutations of a set of n items, where n ≤ 18. The ALLPERM function returns a matrix that has n! rows and n columns. Each row is a permutation. You can then use the SAMPLE function to sample without replacement from among the rows, as follows:

proc iml;
call randseed(123);
n = 4;                       /* permutations of n items */
k = 3;                       /* want k unique permutation */
 
p = allperm(n);               /* matrix has n! rows; each row is a permutation */
rows = sample(1:n, k, "WOR"); /* random rows w/o replacement */
ranPermWOR = p[rows, ];       /* extract rows */
print ranPermWOR;

Generate random permutations, then check for uniqueness

The matrix of all permutations has n! rows, which gets big fast. If you want only a small number of permutations from among a huge set of possible permutations, it is more efficient to use the RANPERM function to generate permutations, then discard duplicates. Last week I showed how to eliminate duplicate rows from a numeric matrix so that the remaining rows are unique. The following SAS/IML statements use the UniqueRows function, which is defined in the previous post. You must define or load the module before you can use it.

/* <define or load the DupRows and UniqueRows modules HERE> */
 
n = 10;                      /* permutations of n items */
k = 5;                       /* want k unique permutation; k << n! */
p = ranperm(n, 2*k);         /* 2x more permutations than necessary */
U = UniqueRows(p);           /* get unique rows of 2k x n matrix */
if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
else do;
   U = UniqueRows( U // ranperm(n, 10*k) ); /* generate more... repeat as necessary */
   if nrow(U) >= k then U = U[1:k, ];  /* extract the first k rows */
end;
print U;

Notice in the previous statements that the call to RANPERM requests 2*k random permutations, even though we only want k. You should ask for more permutations than you need because some of them might be duplicates. The factor of 2 is ad hoc; it is not based on any statistical consideration.

If k is much smaller than n!, then you might think that the probability of generating duplicate a permutation is small. However, the Birthday Problem teaches us that duplicates arise much more frequently than we might expect, so it is best to expect some duplicates and generate more permutations than necessary. When k is moderately large relative to n!, you might want to use a factor of 5, 10, or even 100. I have not tried to compute the number of permutations that would generate k unique permutations with 95% confidence, but that would be a fun exercise. In my program, if the initial attempt generates fewer than k unique rows, it generates an additional 10*k permutations. You could repeat this process, if necessary.

In summary, if you want to generate random permutations without any duplicates, you have at least two choices. You can generate all permutations and then extract a random sample of k rows. This method works well for small values of n, such as n ≤ 8. For larger values of n, you might want to generate random permutation (more than k) and use the first k unique rows of the matrix of permutations.

The post Random permutations without duplicates appeared first on The DO Loop.

4月 142018
 

SAS Visual Forecasting 8.2 effectively models and forecasts time series in large scale. It is built on SAS Viya and powered by SAS Cloud Analytic Services (CAS).

In this blog post, I will build a Visual Forecasting (VF) Pipeline, which is a process flow diagram whose nodes represent tasks in the VF Process.  The objective is to show how to perform the full analytics life cycle with large volumes of data: from accessing data and assigning variable roles accurately, to building forecasting models, to select a champion model and overriding the system generated forecast. In this blog post I will use 1,337 time series related to a chemical company and will illustrate the main steps you would use for your own applications and datasets.

In future posts, I will work in the Programming application, a collection of SAS procedures and CAS actions for direct coding or access through tasks in SAS Studio, and will develop and assess VF models via Python code.

In a VF pipeline, teams can easily save forecast components to the Toolbox to later share in this collaborative environment.

Forecasting Node in Visual Analytics

This section briefly describes what is available in SAS Visual Analytics, the rest of the blog discusses SAS Visual Forecasting 8.2.

In SAS Visual Analytics the Forecast object will select the model that best fits the data out of these models: ARIMA, Damped trend exponential smoothing, Linear exponential smoothing, Seasonal exponential smoothing, Simple exponential smoothing, Winters method (additive), or Winters method (multiplicative). Currently there are no diagnostic statistics (MAPE, RMSE) for the model selected.

You can do “what-if analysis” using the Scenario Analysis and Goal Seeking functionalities. Scenario Analysis enables you to forecast hypothetical scenarios by specifying the future values for one or more underlying factors that contribute to the forecast. For example, if you forecast the profit of a company, and material cost is an underlying factor, then you might use scenario analysis to determine how the forecasted profit would change if the material cost increased by 10%. Goal Seeking enables you to specify a target value for your forecast measure, and then determine the values of underlying factors that would be required to achieve the target value. For example, if you forecast the profit of a company, and material cost is an underlying factor, then you might use Goal Seeking to determine what value for material cost would be required to achieve a 10% increase in profit.

Another neat feature in SAS Visual Analytics is that one can apply different filters to the final forecast. Filters are underlying factors or different levels of the hierarchy, and the resulting plot incorporates those filters.

Data Requirements for Forecasting

There are specific data requirements when working in a forecasting project, a time series dataset that contains at least two variables: 1) the variable that you want to forecast which is known as the target of your analysis, for example, Revenue and 2) a time ID variable that contains the time stamps of the target variable. The intervals of this variable are regularly spaced. Your time series table can contain other time-varying variables. When your time series table contains more than one individual series, you might have classification variables, as shown in the photo below. Distribution Center and Product Key are classification variables. Optionally, you can designate numerical variables (ex: Discount) as independent variables in your models.

You also have the option of adding a table of attributes to the time series table. Attributes are categorical variables that define qualities of the time series. Attribute variables are similar to BY variables, but are not used to identify the series that you want to forecast. In this post, the data I am using includes Distribution Center, Supplier Name, Product Type, Venue and Product Category. Notice that the attributes are time invariant, and that the attribute table is much smaller than the time series table.

The two data sets (SkinProduct and SkinProductAttributes) used in this blog contain 1,337 time series related to a chemical company. This picture shows a few rows of the two data sets used in this post, note that DATE intervals are regularly spaced weeks. The SkinProduct dataset is referred as the Time Series table in the SkinProductAttributes dataset as the Attribute Data.

Introduction SAS Visual Forecasting 8.2

Developing Models in SAS Visual Forecasting 8.2

Step One: Create a Forecasting Project and Assign Variables

From the SAS Home menu select the action Build Models that will take you to SAS Model Studio, where you select New Project, and enter 1) the Name of your project, 2) the Type of project, make sure you enter “Forecasting” and 3) the data source.

Introduction SAS Visual Forecasting 8.2

In the Data tab,  assign the variables roles by using the icon in the upper right corner.

The roles assigned in this post are typical of role assignments in forecasting projects. Notice these variables are in the Time Series table. Also, notice that the classification variables are ordered from highest to lowest hierarchy:

Time Variable: Date
Dependent Variable: Revenue
Classification Variables: Distribution Center and Product Key

In the Time Series table, you might have additional variables you’d like to assign the role “independent” that should be considered for model generation. Independent variables are the explanatory, input, predictor, or causal variables that can be used to model and forecast the dependent variable. In this post, the variable “Discount” is assigned the role “independent”. To do this assignment: right click on the variable, and select Edit Variables.

To bring in the second dataset with the Attribute variables, follow the steps in this photo:

 

Step two: Automated Modeling with Pipelines

The objective in this step is to select a champion model.  Working in the Pipelines tab, one explores the time series plots, uses the code editor to modify the default model, adds a 2nd model to the pipeline, compares the models and selects a champion model.

After step one is completed, select the Pipelines Tab

The first node in the VF pipelines is the Data node. After right-clicking and running this node, one can see the time series by selecting Explore Time Series. Notice that one can filter by the attribute variables, and that the table shows the exact historical data values.

Auto-forecasting is the next node in the default pipeline. Remember that we are modeling 1,332 time series. For each time series, the Auto-forecasting node automatically diagnoses the statistical characteristics of the time series, generates a list of appropriate time series models, automatically selects the model, and generates forecasts. This node evaluates and selects for each time series the ARIMAX and exponential smoothing models.

One can customize and modify the forecasting models (except the Hierarchical Forecasting model) by editing the model’s code. For example, to add the class of models for intermittent demand to the auto- forecasting node, one could open the code editor for that node and replace these lines
rc = diagspec.setESM();
rc = diagspec.setARIMAX();
with:
rc = diagspec.setIDM();

To open the code editor see photo below. After changes, save code and close the editor.

At this point, you can run the Auto Forecasting node, and after looking at its results, save it to the toolbox, so the editing changes are saved and later reused or shared with team members.

By expanding the Nodes pane and the Forecasting Modeling pane on the left, you can select from several models and add a 2nd modeling node to the pipeline

The next photo shows a pipeline with the Naïve Forecast as the second model. It was added to the pipeline by dropping its node into the parent node (data node). This is the resulting pipeline:

After running the Model Comparison node, compare the WMAE (Weighted Mean Absolute Error) and WMAPE (Weighted Mean Absolute Percent Error) and select a champion model.

You can build several pipelines using different model strategies. In order to select a champion model from all the models developed in the pipelines one uses the Pipeline Comparison tab.

Before you work on any overrides for your forecasting project, you need to make sure that you are working with the best pipeline and modeling node for your data. SAS Visual Forecasting selects the best fit model in each pipeline. After each pipeline is run, the champion pipeline is selected based on the statistics of fit that you chose for the selection criteria. If necessary, you can change the selected champion pipeline.

Step Three: Overrides

The Overrides tab is used to manually adjust the forecasts in the future. For example, if you want to account for some promotions that your company and its competitors are running and that are not captured by the models.

The Overrides module allows users to select subsets of time series at the aggregate level by selecting attribute values in the attribute table that you defined in the data tab. The filters based on the attributes are highly customizable and do not restrict you to use the hierarchy that was used for the modeling. The section of a filter using attribute is often referred to as faceted search. Whenever you create a new filter based on a selection of values of the attributes (also known as facets), the aggregate for all series that match the facets will be displayed on your main panel.

There is a wealth of information in the Overrides overview: 1) a list of the BY variables, as well as attribute variables, available to use as filters.

2) a Plot of the Time Series Aggregation and Overrides displaying historical and forecast data, and

3) a Forecast and Overrides table which can be used to create, edit and submit, override values for a time series based on external factors that are not included in the forecast models

Conclusion

Using SAS Visual Forecasting 8.2 you can effectively model and forecast time series in large scale. The Visual Forecasting Pipeline greatly facilitates the automatic forecasting of large volumes of data, and provides a structured and robust method with efficient and flexible processes.

References

An introduction to SAS Visual Forecasting 8.2 was published on SAS Users.

4月 132018
 

The release of SAS Viya 3.3 has brought some nice data quality features. In addition to the visual applications like Data Studio or Data Explorer that are part of the Data Preparation offering, one can leverage data quality capabilities from a programming perspective.

For the time being, SAS Viya provides two ways to programmatically perform data quality processing on CAS data:

  • The Data Step Data Quality functions.
  • The profile CAS action.

To use Data Quality programming capabilities in CAS, a Data Quality license is required (or a Data Preparation license which includes Data Quality).

Data Step Data Quality functions

The list of the Data Quality functions currently supported in CAS are listed here and below:

SAS Data Quality 3.3 programming capabilities

They cover casing, parsing, field extraction, gender analysis, identification analysis, match codes and standardize capabilities.

As for now, they are only available in the CAS Data Step. You can’t use them in DS2 or in FedSQL.

To run in CAS certain conditions must be met. These include:

  • Both the input and output data must be CAS tables.
  • All language elements must be supported in the CAS Data Step.
  • Others.

Let’s look at an example:

cas mysession sessopts=(caslib="casuser") ;
 
libname casuser cas caslib="casuser" ;
 
data casuser.baseball2 ;
   length gender $1 mcName parsedValue tokenNames lastName firstName varchar(100) ;
   set casuser.baseball ;
   gender=dqGender(name,'NAME','ENUSA') ;
   mcName=dqMatch(name,'NAME',95,'ENUSA') ;   
   parsedValue=dqParse(name,'NAME','ENUSA') ;
   tokenNames=dqParseInfoGet('NAME','ENUSA') ;
   if _n_=1 then put tokenNames= ;
   lastName=dqParseTokenGet(parsedValue,'Family Name','NAME','ENUSA') ;
   firstName=dqParseTokenGet(parsedValue,'Given Name','NAME','ENUSA') ;
run ;

Here, my input and output tables are CAS tables, and I’m using CAS-enabled statements and functions. So, this will run in CAS, in multiple threads, in massively parallel mode across all my CAS workers on in-memory data. You can confirm this by looking for the following message in the log:

NOTE: Running DATA step in Cloud Analytic Services.
NOTE: The DATA step will run in multiple threads.

I’m doing simple data quality processing here:

  • Determine the gender of an individual based on his(her) name, with the dqGender function.
  • Create a match code for the name for a later deduplication, with the dqMatch function.
  • Parse the name using the dqParse function.
  • Identify the name of the tokens produced by the parsing function, with the dqParseInfoGet function.
  • Write the token names in the log, the tokens for this definition are:
    Prefix,Given Name,Middle Name,Family Name,Suffix,Title/Additional Info
  • Extract the “Family Name” token from the parsed value, using dqParseTokenGet.
  • Extract the “Given Name” token from the parsed value, again using dqParseTokenGet.

I get the following table as a result:

Performing this kind of data quality processing on huge tables in memory and in parallel is simply awesome!

The dataDiscovery.profile CAS action

This CAS action enables you to profile a CAS table:

  • It offers 2 algorithms, one is faster but uses more memory.
  • It offers multiple options to control your profiling job:
    • Columns to be profiled.
    • Number of distinct values to be profiled (high-cardinality columns).
    • Number of distinct values/outliers to report.
  • It provides identity analysis using RegEx expressions.
  • It outputs the results to another CAS table.

The resulting table is a transposed table of all the metrics for all the columns. This table requires some post-processing to be analyzed properly.

Example:

proc cas; 
   dataDiscovery.profile /
      algorithm="PRIMARY"
      table={caslib="casuser" name="product_dim"}
      columns={"ProductBrand","ProductLine","Product","ProductDescription","ProductQuality"}
      cutoff=20
      frequencies=10
      outliers=5
      casOut={caslib="casuser" name="product_dim_profiled" replace=true}
   ;
quit ;

In this example, you can see:

  • How to specify the profiling algorithm (quite simple: PRIMARY=best performance, SECONDARY=less memory).
  • How to specify the input table and the columns you want to profile.
  • How to reduce the number of distinct values to process using the cutoff option (it prevents excessive memory use for high-cardinality columns, but might show incomplete results).
  • How to reduce the number of distinct values reported using the frequencies option.
  • How to specify where to store the results (casout).

So, the result is not a report but a table.

The RowId column needs to be matched with

A few comments/cautions on this results table:

  • DoubleValue, DecSextValue, or IntegerValue fields can appear on the output table if numeric fields have been profiled.
  • DecSextValue can contain the mean (metric #1008), median (#1009), standard deviation (#1022) and standard error (#1023) if a numeric column was profiled.
  • It can also contain frequency distributions, maximum, minimum, and mode if the source column is of DecSext data type which is not possible yet.
  • DecSext is a 192-bit fixed-decimal data type that is not supported yet in CAS, and consequently is converted into a double most of the time. Also, SAS Studio cannot render correctly new CAS data types. As of today, those metrics might not be very reliable.
  • Also, some percentage calculations might be rounded due to the use of integers in the Count field.
  • The legend for metric 1001 is not documented. Here it is:

1: CHAR
2: VARCHAR
3: DATE
4: DATETIME
5: DECQUAD
6: DECSEXT
7: DOUBLE
8: INT32
9: INT64
10: TIME

A last word on the profile CAS action. It can help you to perform some identity analysis using patterns defined as RegEx expressions (this does not use the QKB).

Here is an example:

proc cas; 
   dataDiscovery.profile /
      table={caslib="casuser" name="customers"}
      identities={
         {pattern="PAT=</span>?999[<span style=" />-]? ?999[- ]9999",type="USPHONE"}, 
         {pattern= "PAT=^99999[- ]9999$",type="ZIP4"}, 
         {pattern= "PAT=^99999$",type="ZIP"}, 
         {pattern= "[^ @]+@[^ @]+\.[A-Z]{2,4}",type="EMAIL"}, 
         {pattern= "^(?i:A[LKZR]|C[AOT]|DE|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])$",type="STATE"}
      }
      casOut={caslib="casuser" name="customers_profiled" replace="true"}
   ;
quit ;

In this example that comes from

I hope this post has been helpful.

Thanks for reading.

An overview of SAS Data Quality 3.3 programming capabilities was published on SAS Users.

4月 132018
 

As a follow on from my previous blog post, where we looked at the different use cases for using Kerberos in SAS Viya 3.3, in this post will delve into more details on the requirements for use case 4, where we use Kerberos authentication through-out both the SAS 9.4 and SAS Viya 3.3 environments. We won’t cover the configuration of this setup as that is a topic too broad for a single blog post.

As a reminder the use case we are considering is shown here:

SAS Viya 3.3 Kerberos Delegation

Here the SAS 9.4 Workspace Server is launched with Kerberos credentials, the Service Principal for the SAS 9.4 Object Spawner will need to be trusted for delegation. This means that a Kerberos credential for the end-user is available to the SAS 9.4 Workspace Server. The SAS 9.4 Workspace Server can use this end-user Kerberos credential to request a Service Ticket for the connection to SAS Cloud Analytic Services. While SAS Cloud Analytic Services is provided with a Kerberos keytab and principal it can use to validate this Service Ticket. Validating the Service Ticket authenticates the SAS 9.4 end-user to SAS Cloud Analytic Services. The principal for SAS Cloud Analytic Services must also be trusted for delegation. We need the SAS Cloud Analytic Services session to have access to the Kerberos credentials of the SAS 9.4 end-user.

These Kerberos credentials made available to the SAS Cloud Analytic Services are used for two purposes. First, they are used to make a Kerberized connection to the SAS Viya Logon Manager, this is to obtain the SAS Viya internal OAuth token. As a result, the SAS Viya Logon Manager must be configured to accept Kerberos connections. Secondly, the Kerberos credentials of the SAS 9.4 end-user are used to connect to the Secure Hadoop environment.

In this case, since all the various principals are trusted for delegation, our SAS 9.4 end-user can perform multiple authentication hops using Kerberos with each component. This means that through the use of Kerberos authentication the SAS 9.4 end-user is authenticated into SAS Cloud Analytic Services and out to the Secure Hadoop environment.

Reasons for doing it…

To start with, why would we look to use this use case? From all the use cases we considered in the previous blog post this provides the strongest authentication between SAS 9.4 Maintenance 5 and SAS Viya 3.3. At no point do we have a username/password combination passing between the SAS 9.4 environment and the SAS Viya 3.3. In fact, the only credential set (username/password) sent over the network in the whole environment is the credential set used by the Identities microservice to fetch user and group information for SAS Viya 3.3. Something we could also eliminate if the LDAP provider supported anonymous binds for fetching user details.

Also, this use case provides true single sign-on from SAS 9.4 Maintenance 5 to SAS Viya 3.3 and all the way out to the Secured Hadoop environment. Each operating system run-time process will be launched as the end-user and no cached or stored username/password combination is required.

High-Level Requirements

At a high-level, we need to have both sides configured for Kerberos delegated authentication. This means both the SAS 9.4 Maintenance 5 and the SAS Viya 3.3 environments must be configured for Kerberos authentication.

The following SAS components and tiers need to be configured:

  • SAS 9.4 Middle-Tier
  • SAS 9.4 Metadata Tier
  • SAS 9.4 Compute Tier
  • SAS Viya 3.3 SAS Logon Manager
  • SAS Viya 3.3 SAS Cloud Analytic Services

Detailed Requirements

First let’s talk about Service Principal Names. We need to have a Service Principal Name (SPN) registered for each of the components/tiers in our list above. Specifically, we need a SPN registered for:

  • HTTP/<HOSTNAME> for the SAS 9.4 Middle-Tier
  • SAS/<HOSTNAME> for the SAS 9.4 Metadata Tier
  • SAS/<HOSTNAME> for the SAS 9.4 Compute Tier
  • HTTP/<HOSTNAME> for the SAS Viya 3.3 SAS Logon Manager
  • sascas/<HOSTNAME> for the SAS Viya 3.3 SAS Cloud Analytic Services

Where the <HOSTNAME> part should be the fully qualified hostname of the machines where the component is running. This means that some of these might be combined, for example if the SAS 9.4 Metadata Tier and Compute Tier are running on the same host we will only have one SPN for both. Conversely, we might require more SPNs, if for example, we are running a SAS 9.4 Metadata Cluster.

The SPN needs to be registered against something. Since our aim is to support single sign-on from the end-user’s desktop we’ll probably be registering the SPNs in Active Directory. In Active Directory we can register against either a user or computer object. For both the SAS 9.4 Metadata and Compute Tier the registration can be performed automatically if the processes run as the local system account on a Microsoft Windows host and will be against the computer object. Otherwise, and for the other tiers and components, the SPN must be registered manually. We recommend, that while you can register multiple SPNs against a single object, that you register each SPN against a separate object.

Since the entire aim of this configuration is to delegate the Kerberos authentication from one tier/component onto the next we need to ensure the objects, namely users or computer objects, are trusted for delegation. The SAS 9.4 Middle-Tier will only support un-constrained delegation, whereas the other tiers and components support Microsoft’s constrained delegation. If you choose to go down the path of constrained delegation you need to specify each and every Kerberos service the object is trusted to delegate authentication to.

Finally, we need to provide a Kerberos keytab for the majority of the tiers/components. The Kerberos keytab will contain the long-term keys for the object the SPN is registered against. The only exceptions being the SAS 9.4 Metadata and Compute Tiers if these are running on Windows hosts.

Conclusion

You can now enable Kerberos delegation across the SAS Platform, using a single strong authentication mechanism across that single platform. As always with configuring Kerberos authentication the prerequisites, in terms of Service Principal Names, service accounts, delegation settings, and keytabs are important for success.

SAS Viya 3.3 Kerberos Delegation from SAS 9.4M5 was published on SAS Users.

4月 122018
 

The financial services industry has witnessed considerable hype around artificial intelligence (AI) in recent months. We’re all seeing a slew of articles in the media, at conference keynote presentations and think-tanks tasked with leading the revolution. AI indeed appears to be the new gold rush for large organisations and FinTech [...]

AI for fraud detection: beyond the hype was published on SAS Voices by Sundeep Tengur

4月 112018
 

Sometimes it is important to ensure that a matrix has unique rows. When the data are all numeric, there is an easy way to detect (and delete!) duplicate rows in a matrix.

The main idea is to subtract one row from another. Start with the first row and subtract it from every row beneath it. If any row of the difference matrix is identically zero, then you have found a row that is identical to the first row. Then do the same thing for the second row: subtract it from the third and higher rows and see if you obtain a row of zeros. Repeat this process for the third and higher rows.

The following SAS/IML program implements this algorithm for an arbitrary numeric matrix. It returns a binary indicator variable (a column vector). The i_th row of the returned vector is 1 for rows that are duplicates of some previous row:

proc iml;
/* return binary vector that indicates which rows in a 
   numerical matrix are duplicates */
start DupRows(A);
   N = nrow(A);
   dupRows = j(N, 1, 0);         /* binary indicator matrix */
   do i = 1 to N-1;
      if dupRows[i] = 0 then do; /* skip rows that are known to be duplicates */
         r = i+1:N;              /* remaining rows */ 
         M = A[r, ]-A[i, ];      /* subtract current row from remaining rows */
         b = M[ ,##];            /* sum of squares = 0 iff duplicate row */
         idx = loc( b=0 );       /* any duplicate rows for current row? */
         if ncol(idx) > 0 then
            dupRows[r[idx]] = 1; /* if so, flag them */
      end;      
   end;
   return dupRows;
finish;

To test the function, consider the following 6 x 3 matrix. You can see by inspection that the matrix has three duplicate rows: the third, fifth, and sixth rows. You can call the DupRows function and print the matrix adjacent to the binary vector that indicates the duplicate rows:

A = {1 1 1,
     1 2 3,
     1 1 1,  /* duplicate row */
     3 2 1,
     3 2 1,  /* duplicate row */
     1 1 1};  /* duplicate row */
DupRows = DupRows(A);
print A DupRows;

You can use the DupRows function to write a function that excludes the duplicate rows in a matrix and returns only the unique rows, as follows:

/* return the unique rows in a numerical matrix */
start UniqueRows(A);
   uniqRows = loc(DupRows(A)=0);
   return(A[uniqRows, ]);   /* return rows that are NOT duplicates */
finish;
 
U = UniqueRows(A);
print U;

I like this algorithm because it uses subtraction, which is a very fast operation. However, This algorithm has a complexity of O(n(n-1)/2), where n is the number of rows in the matrix, so it is best for small to moderate values of n.

For long and skinny matrices (or for character data), it might be more efficient to sort the matrix as I showed in a previous article about how to find (and count) the unique rows of a matrix. However, the sorting algorithm requires that you sort the data by ALL columns, which can be inefficient for very wide and short data tables. For small and medium-sized data, you can use either method to find the unique rows in a matrix.

This article discusses how to remove duplicates records for numerical matrices in SAS/IML. In Base SAS, you can use PROC SORT or PROC SQL to remove duplicate records from a SAS DATA set, as shown in Lafler (2017) and other references.

The post Find the unique rows of a numeric matrix appeared first on The DO Loop.

4月 102018
 

SAS has never been able to sit idle and watch from the sidelines as crises ensue. Why? Because we know that good can be made possible if data is put to work. SAS Global Forum opening session spotlighted four real-life examples of how SAS is showing up and improving the [...]

What can be achieved when passion, curiosity and analytics come together? was published on SAS Voices by Anjelica Cummings

4月 092018
 

Bright lights, a packed house, loud music and high energy – these are the elements that come to life every year in the  SAS Global Forum opening session. But it’s not just about the entertainment and energy. Each year, the Sunday night opening session sets the stage for conference attendees. [...]

Ask the question, and watch new discoveries unfold was published on SAS Voices by Anjelica Cummings