Next generation business intelligence and visualisation tools such as SAS Visual Analytics are revolutionising insight discovery by offering a truly self service platform powered by sophisticated visualisations and embedded analytics. It has never been easier to get hold of vast amounts of data, visualise that data, uncover valuable insights and make important business decisions, all in a single day’s work.
On the flip side, the speed and ease of getting access to data, and then uncovering and delivering insights via powerful charts and graphs have also exasperated the issue around data quality. It is all well and good when the data being used by analysts is clean and pristine. More often than not, when the data being visualisation is of poor quality, the output and results can be telling and dramatic, but in a bad way.
Let me give you an example from a recent customer discussion to illustrate the point (I have, of course synthesised the data here to protect the innocent!).
Our business analyst in ACME bank has been tasked with the job of analysing customer deposits to identify geographically oriented patterns, as well as identifying the top 20 customers in terms of total deposit amount. These are simple but classic questions that are perfectly suited for a data visualisation tool such as SAS Visual Analytics.
We will start with a simple cross-tab visualisation to display the aggregated deposit amount across the different Australian states:
Oops, the problem around non-standardised state values means that this simple crosstab view is basically unusable. The fact that New South Wales (a state in Australia) is represented nine different ways in our STATE field presents a major problem whenever the state field is used for the purpose of aggregating a measure.
In addition, the fact that the source data only contain a full address field (FULL_ADDR) means that we are also unable to build the next level of geographical aggregation using city as it is embedded into the FULL_ADDR free form text field.
It would be ideal if the FULL_ADDR was parsed out and street number, street name and city are all individual, standardised fields that can used as additional fields in a visualisation.
How about our top 20 customers list table?
Whilst a list table sorted by deposit amount should easily give us what we need, a closer inspection of the list table reveals troubling signs that we have duplicated customers (with names and addresses typed slightly differently) in our customer table. A major problem that will prevent us from building a true top 20 customers list table unless we can match up all the duplicated customers confidently and work out what their true total deposits are with the bank.
All in all, you probably don’t want to share these visualisations with key executives using the dataset you were given by IT. The scariest thing is that these are the data quality issues that are very obvious to the analyst. Without a thorough data profiling process, other surprises may just be around the corner.
One of two things typically happens from here on. Organisations might find it too difficult and give up on the dataset, the report or the data visualisation tool all together. The second option typically involves investing significant cost and effort in hiring an army of programmers and data analysts in order to code their way out of their data quality problems. Something that is often done without detailed understanding of the true cost involved in building a scalable and maintainable data quality process.
There is however, a third and better way. In contrast to other niche visualisation vendors, SAS has always believed in the importance of high quality data in analytics and data visualisation. SAS offers mature and integrated Data Quality solutions within its comprehensive Data Management portfolio that can automate data cleansing routines, minimise the costs involved in delivering quality data and ultimately unleash the true power of visualised data.
There is however, a third and better way.
Whilst incredibly powerful and flexible, our Data Quality Solution is also extremely easy to pick up by business users with minimum training and detailed knowledge around data cleansing techniques. Without the need to code or program, powerful data cleansing routines can be built and deployed in minutes.
I built a simple data quality process using our solution to illustrate how easy it is to identify and resolve data quality issues described in this example.
Here is the basic data quality routine I built using the SAS Data Management Studio. The data cleansing routine essentially involves a series of data quality nodes that resolve each of the data quality issues we identified above via pre-built data quality rules and a simple drag and drop user interface.
For example, here is the configuration for the "Address standardisation" data quality node. All I had to do was define which locale to use (English Australia in this case), which input fields I want to standard (STATE, DQ_City) ), which data quality definitions to use (City - City/State and City) and what the output fields should be called (DQ_State_Std and DQ_City_Std). The other nodes take a similar approach to automatically parse the full address field, and match similar customers using their name and address to create a new cluster ID field called DQ_CL_ID (we’ll get to this in a minute)
I then loaded the newly cleansed data into SAS Visual Analytics to try tackle the questions that I was tasked to answer in the first place.
The cross-tab now looks much better and I now know (for sure), the best performing state from a deposit amount point of view is New South Wales (now standardised as NSW), followed by Victoria and Queensland.
As a bonus for getting clean, high quality address data, I am also now able to easily visualise the geo based measures on a map, down to the city level since we now have access to the parsed out, standardised city field! Interestingly, our customers are spread out quite evenly across the state of NSW, something I wasn’t expecting in the first place.
As for the top 20 customer list table, I can now use the newly created cluster field called DQ_CL_ID to group similar customers together and add their total deposit to work out who my top 20 customers really are. As it turns out, a number of our customers have multiple deposit accounts with us and go straight to the top of the list when their various accounts are combined.
I can now clearly see that Mr. Alan Davies is our number one customer with a combined deposit amount of $1,621,768 followed Mr. Philip McBride, both of which will get the special treatment they deserve whenever they are targeted for marketing campaigns.
All in all, I can now comfortably share my insights and visualisations with business stakeholders with the knowledge that any decision made are using sound, high quality data. And I was able to do all this with minimum support and all in a single day’s work!
Is your poor quality data holding you back in data visualisation projects? Interested in finding out more about SAS Data Quality solutions? Come join us at the Data Quality hands on workshop and discover how you can easily tame your data and unleash its true potential.