Data cardinality is an important topic for for data analysis. In mathematical terms, cardinality is simply the number of elements in a set. But in data applications, *cardinality* signifies the number of unique values for a given field of data. Related terms include *number of levels* (thank you, PROC FREQ and NLEVELS), *percent unique*, as well as *discrete data* -- data with a finite number of different values. (That's not to be confused with *discreet* data, which is when you're careful not to share too much personal information. For example, we can count the Kardashians in **discrete** increments. But that doesn't mean that they behave **discreetly**.)

#### But first, this

Since I've been working with cardinality recently, I've developed this persistent earworm that I just can't shake. I decided to write it down and present it to you. (You're welcome.)

**(I've got) Cardinality** (Apologies to Lloyd Price)

>> Click here for suggested tune.

Oh-oh-over and over

I'll iterate through each row

Over and over, tell me why it's so

Over and over, each value that's unique

But oh-oh-over and over

I wish I ran my PROC FREQ

*'Cause I've got - (cardinality)
High - (cardinality) N - (cardinality)
Discrete - (cardinality) HUGE - (cardinality)
Unique - (cardinality)
'Cause I got a big data set
So over ... and over
There's a high percent unique
Well, well, well over and over
I wish my data were discrete
*

#### Cardinal importance of cardinality

SAS Visual Analytics -- which I've been using for recent projects -- makes a big deal out of cardinality. The displayed list of variables shows the cardinality values front-and-center (see the inset screenshot image). Knowing how many different values a column can take on should influence the type of visualization you design. For example, it might make sense to group a set of charts by GENDER, which typically has a cardinality of 2 (or at most 3, if "unknown" is permitted). However, it would make less sense to group your charts by CUSTOMER_ID when that field has a cardinality in the thousands.

Many SAS users already know how to calculate cardinality. **sasCommunity.org** hosts a useful reference on "Cardinality Ratio" (more-or-less curated by Ron Fehd), complete with sample programs, SAS macros, and explanations.

I have a variation of my own, developed mostly by my colleague Hans-Joachim Edert. It produces a report like this:

What can you learn from this report? In this example, we can see that **Origin** (where the model of car comes from) has just 3 possible values, making it less than 1% unique across the 428 records. That might make it a good grouping variable. **Model**, on the other hand, has a unique value for *almost* every record. The fact that it's so close to 100% unique might indicate a data quality problem. (Do we expect all Model names to be unique?) And the numeric variable **Cylinders** has only 8 different values, which might indicate that it's more of a category than a measure. (Would you want to SUM or AVERAGE the **Cylinders** across these records?)

Here's a link to the program code if you want to try it yourself. Here's a sample call to the macro in the program:

%getcardinality(SASHELP.CARS, WORK.CARDS, 1);

#### Coming up...

In a future article I'll present a SAS Enterprise Guide custom task that makes the "check cardinality" process easy -- no programming. (Here's a sneak peak.) But you do not have to wait to learn more about your data: you can already glean much of this (and more) by using the Characterize Data task.