Data Management has been the foundational building block supporting major business analytics initiatives from day one. Not only is it highly relevant, it is absolutely critical to the success of all business analytics projects.
Emerging big data platforms such as Hadoop and in-memory databases are disrupting traditional data architecture in the way organisations store and manage data. Furthermore, new techniques such as schema-on-read and persistent in-memory data stores are changing how organisations deliver data and drive the analytical life cycle.
This brings us to the question of how relevant data management is in the era of big data. At SAS, we believe that data management will continue to be the critical link between traditional data sources, big data platforms and powerful analytics. There is no doubt that WHERE and HOW big data is stored will change and evolve over time. However, that doesn’t affect the need for big data to be subject to the same quality and control requirements as traditional data sources.
Fundamentally, big data cannot be used effectively without proper data management
Data has always been more valuable and powerful when it is integrated, and this will remain true in the era of big data.
It is a well-known fact that whilst Hadoop is being used as a powerful data storage repository for high volume, unstructured or semi-structured information, most corporate data are still locked in traditional RDBMSs or data warehouse appliances. The true value of weblog traffic or meter data stored in Hadoop can only be unleashed when they are linked and integrated with customer profile and transaction data that are stored in existing applications. The integration of high volume, semi-structured big data with legacy transaction data will provide powerful business insights that can be game changing.
Big data platforms provide an alternative source of data within an organisation’s enterprise data architecture today, and therefore must be part of an organisation’s integration capability.
Just because data lives in and comes from a new data source and platform doesn’t mean high levels of quality and accuracy can be assumed. In fact, Hadoop data is known to be notoriously poor in terms of its quality and structure simply because of the lack of control over, and the ease of, getting data into a Hadoop environment.
Just like traditional data sources, before raw Hadoop data can be used, it needs to be profiled and analysed. Often issues such as non-standardised fields and missing data become glaringly obvious when analysts try to tap into Hadoop data sources. Automated data cleansing and enrichment capabilities within the big data environment are critical to make the data more relevant, valuable and most importantly, trustworthy.
As Hadoop gains momentum as a general purpose data repository, there will be increasing pressure to adopt traditional data quality processes and best practices.
It should come as no surprise that policies and practices around data governance will need to be applied to new big data sources and platforms. The requirements of storing and managing metadata, understanding lineage and implementing data stewardship do not go away simply because the data storage mechanism has changed.
Furthermore, the unique nature of Hadoop as a highly agile and flexible data repository also brings new privacy and security challenges in how data needs to be managed, protected and shared. Data Governance will play an increasingly important role in the era of big data as the need to better align IT and business increases.
Whilst the technology underpinning how organisations store their data is going through tremendous change, the need to integrate, govern and manage the data itself has not changed. If anything, the changes to the data landscape and the increase in types and forms of data repositories will make the tasks around data management more challenging than ever.
SAS recognises the challenge faced by our customers and has continued to invest in our extensive Data Management product portfolio by embracing big data platforms from leading vendors such as Cloudera and Hortonworks as well as supporting new data architecture and data management approaches.
As this recent NY Times article appropriately called out, a robust and automated data management platform within a big data environment is critical to empower data scientists and analysts so that they can be freed from doing “Data Janitor” work and focus on the high value activities.
I have been on a whirlwind tour locally here in Australia visiting existing SAS customers, where the focus of discussions has centred on SAS and Hadoop. I am happy to report that during these discussions, customers have been consistently surprised and excited about what we are doing around SAS on Hadoop! Three things in particular stood out and have resonated well with our wonderful SAS user community, so I thought I would share them here for the benefit of the broader community.
1. All SAS products are Hadoop enabled today
Whilst some of our newer products such as Visual Analytics and In-Memory Statistics for Hadoop were built from day one with Hadoop in mind, you might not be aware that in fact all of our current SAS products have been Hadoop enabled and can take advantage of Hadoop today.
Our mature and robust SAS/Access interface to Hadoop technology allows SAS users today to easily connect to Hadoop data sources using any SAS application. A key point here is being able to do this without having to understand any of the underlying technology or write a single line of MapReduce code. Furthermore, the SAS/Access interface to Hadoop has been optimised and can push SAS procedures into Hadoop for execution, thereby allowing developers to tap into the power of Hadoop and improve the performance of basic SAS operations.
2. SAS does Analytics in Hadoop
The SAS R&D team have worked extremely hard with our Hadoop distribution partners to take full advantage of the powerful technologies within the Hadoop ecosystem. We are driving integration deep into the heart of the Hadoop ecosystem with technologies such as HDFS, Hive, MapReduce, Pig and YARN.
The SAS users I have been speaking to have been pleasantly surprised by the depth of our integration with Hadoop and excited about what it means for them as end users. Whether it’s running analytics in our high performance in-memory servers within a Hadoop cluster or pushing analytics workload deep into the Hadoop environment, SAS is giving users the power and flexibility in deciding where and how they want to run their SAS workloads.
“Integrating SAS HPA and LASR with Apache Hadoop YARN provides tremendous benefits to customers using SAS products and Hadoop. It is a great example of the tremendous openness and vision shown by SAS”
3. Organisations are benefiting from SAS on Hadoop today
With Hadoop being THE new kid on the block, you might be wondering if there are any customers that are already taking advantage of SAS and Hadoop now. One such customer is Rogers Media – They’ve been doing some pretty cool stuff with SAS and Hadoop to drive real business value and outcomes!
In a chat with Dr. Goodnight during SAS Global Forum this year, Chris Dingle from Rogers Media shared how they are using SAS and Hadoop to better understand their audience. I was fortunate enough to be there in person myself, and I must say the keynote session on Hadoop and Rogers Media was a highlight for many people there and definitely got the masses thinking what they should be doing around SAS and Hadoop. For those of you who are interested in more details, here is a recap of the presentation explaining the SAS/Hortonworks integration as well as more details on the Rogers Media case study.
We are working with a number of organisations around the world on exciting SAS on Hadoop projects so watch this space!
All in all, it’s a great time to be a SAS user and it has never been easier to take advantage of the power of Hadoop as a SAS user. I encourage you to find out more, reach out to us or leave comments here, as we would love to hear about how you plan to leverage the power of SAS and Hadoop!
As a Data Management expert, I am increasingly being called upon to talk to risk and compliance teams about their specific and unique data management challenges. It’s no secret that high quality data has always been critical to effective risk management and SAS’ market leading Data Management capabilities have long been an integrated component of our comprehensive Risk Management product portfolio. Having said that, the amount of interest, project funding and inquiries around data management for risk have reached new heights in the last twelve months and are driving a lot of our conversation with customers.
It seems that not only are organisations getting serious about data management, governments and regulators are also getting into the act in terms of enforcing good data management practices in order to promote stability of the global financial system and to avoid future crises.
As a customer of these financial institutions, I am happy knowing that these regulations will make these organisations more robust and stronger in the event of future crises by instilling strong governance and best practices around how data is used and managed.
On the other hand, as a technology and solution provider to these financial institutions, I can sympathise with their pain and trepidation as they prepare and modernise their infrastructure in order to support their day to day operations and at the same time be compliant with these new regulations.
Globally, regulatory frameworks such as BCBS 239 are putting the focus and attention squarely on how quality data needs to be managed and used in support of key risk aggregation and reporting.
Locally in Australia, APRA's CPG-235, in which the regulator provides principles-based guidance, outlines the types of roles, internal processes and data architectures needed to build a robust data risk management environment and to manage data risk effectively.
Now I must say as a long time data management professional, this latest development is extremely exciting to me and long overdue. Speaking to some of our customers in the risk and compliance departments, the same enthusiasm is definitely not shared by those charged with implementing these new processes and capabilities.
Whilst the overall level of effort involved in terms of process, people and technology cannot be underestimated in these compliance related projects, there are things that organisations can do to accelerate their effort in order to get ahead of the regulators. One piece of good news is that a large portion of the compliance related data management requirements map well with traditional data governance capabilities. Most traditional data governance projects have focused around the following key deliverables:
• Monitoring of key data quality dimensions
• Data lineage reporting and auditing
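To make "monitoring of key data quality dimensions" a little more concrete, here is a minimal Python sketch of two common dimensions, completeness and validity. It is only an illustration: the sample records, the field names and the `VALID_STATES` reference set are assumptions of mine, not part of any SAS product or regulatory definition.

```python
from typing import Dict, List, Optional, Set

# Hypothetical record shape: field name -> value (None when missing).
Record = Dict[str, Optional[str]]

# Assumed reference set of acceptable state codes, for illustration only.
VALID_STATES: Set[str] = {"NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"}


def completeness(records: List[Record], field: str) -> float:
    """Share of records where the field is present and not blank."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if (r.get(field) or "").strip())
    return filled / len(records)


def validity(records: List[Record], field: str, allowed: Set[str]) -> float:
    """Share of non-blank values that belong to the allowed set."""
    values = [(r.get(field) or "").strip() for r in records]
    values = [v for v in values if v]
    if not values:
        return 0.0
    return sum(1 for v in values if v in allowed) / len(values)


# Made-up sample data.
sample: List[Record] = [
    {"customer": "A", "state": "NSW"},
    {"customer": "B", "state": "N.S.W."},
    {"customer": "C", "state": None},
    {"customer": "D", "state": "VIC"},
]

print(f"completeness(state) = {completeness(sample, 'state'):.2f}")
print(f"validity(state)     = {validity(sample, 'state', VALID_STATES):.2f}")
```

Tracking figures like these over time, per field and per source system, is the kind of metric a data steward can act on.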
These are also the very items that the regulators are asking organisations to deliver today. SAS’ mature and proven data governance capabilities have been helping organisations with data governance projects and initiatives over the years and are now helping financial institutions tackle risk and compliance related data management requirements quickly and cost effectively.
Incidentally, our strong data governance capabilities along with our market leading data quality capabilities were cited as the main reasons SAS was selected as a category leader in Chartis Research’s first Data Management and Business Intelligence for Risk report.
The combination of our risk expertise and proven data management capabilities means we are in a prime position to help our customers with these emerging data management challenges. Check out the following white papers to get a better understanding of how SAS can help you on this journey.
Next generation business intelligence and visualisation tools such as SAS Visual Analytics are revolutionising insight discovery by offering a truly self service platform powered by sophisticated visualisations and embedded analytics. It has never been easier to get hold of vast amounts of data, visualise that data, uncover valuable insights and make important business decisions, all in a single day’s work.
On the flip side, the speed and ease of getting access to data, and then uncovering and delivering insights via powerful charts and graphs, have also exacerbated the issue of data quality. It is all well and good when the data being used by analysts is clean and pristine. More often than not, when the data being visualised is of poor quality, the output and results can be telling and dramatic, but in a bad way.
Let me give you an example from a recent customer discussion to illustrate the point (I have, of course, synthesised the data here to protect the innocent!).
Our business analyst in ACME bank has been tasked with the job of analysing customer deposits to identify geographically oriented patterns, as well as identifying the top 20 customers in terms of total deposit amount. These are simple but classic questions that are perfectly suited for a data visualisation tool such as SAS Visual Analytics.
We will start with a simple cross-tab visualisation to display the aggregated deposit amount across the different Australian states:
Oops, the problem around non-standardised state values means that this simple crosstab view is basically unusable. The fact that New South Wales (a state in Australia) is represented nine different ways in our STATE field presents a major problem whenever the state field is used for the purpose of aggregating a measure.
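To show why inconsistent values break the aggregation, here is a rough Python sketch that totals deposits by state after mapping spelling variants to a single code. The variant table and the sample rows are made up for illustration, and a hand-written lookup like this is only a stand-in for a proper standardisation rule set.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical variant table: keys are upper-cased with everything
# except letters removed, so "N.S.W." and "nsw " share one key.
STATE_SYNONYMS: Dict[str, str] = {
    "NSW": "NSW",
    "NEWSOUTHWALES": "NSW",
    "VIC": "VIC",
    "VICTORIA": "VIC",
    "QLD": "QLD",
    "QUEENSLAND": "QLD",
}


def standardise_state(raw: str) -> str:
    """Map a free-form state value to a standard code, or UNKNOWN."""
    key = re.sub(r"[^A-Z]", "", raw.upper())
    return STATE_SYNONYMS.get(key, "UNKNOWN")


def total_by_state(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum deposit amounts per standardised state code."""
    totals: Dict[str, float] = defaultdict(float)
    for state, amount in rows:
        totals[standardise_state(state)] += amount
    return dict(totals)


# Made-up sample rows: (STATE as typed, deposit amount).
rows = [
    ("NSW", 100.0),
    ("N.S.W.", 50.0),
    ("new south wales", 25.0),
    ("Vic", 10.0),
]

print(total_by_state(rows))
```

Without the standardisation step, the three New South Wales rows would land in three separate cross-tab cells.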
In addition, the fact that the source data only contain a full address field (FULL_ADDR) means that we are also unable to build the next level of geographical aggregation using city as it is embedded into the FULL_ADDR free form text field.
It would be ideal if FULL_ADDR were parsed out so that street number, street name and city become individual, standardised fields that can be used as additional fields in a visualisation.
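As a rough idea of what parsing involves, the Python sketch below splits an address with a single regular expression. It assumes one specific layout ("number street, city STATE postcode"), which is my assumption for illustration; real free-form addresses vary far more, and anything that does not fit simply returns `None`.

```python
import re
from typing import Dict, Optional

# Assumed layout: "<number> <street>, <city> <state> <4-digit postcode>".
ADDRESS_PATTERN = re.compile(
    r"(?P<number>\d+[A-Za-z]?)\s+"
    r"(?P<street>.+?),\s*"
    r"(?P<city>[^,]+?)\s+"
    r"(?P<state>[A-Za-z.]+)\s+"
    r"(?P<postcode>\d{4})"
)


def parse_address(full_addr: str) -> Optional[Dict[str, str]]:
    """Split a free-form address into fields, or return None if it
    does not follow the assumed layout."""
    match = ADDRESS_PATTERN.fullmatch(full_addr.strip())
    return match.groupdict() if match else None


# Hypothetical example addresses.
print(parse_address("12 George St, Sydney NSW 2000"))
print(parse_address("7 Miller St, North Sydney NSW 2060"))
print(parse_address("somewhere in Sydney"))
```

The number of layouts such a pattern fails on is a good reminder of why hand-rolled parsing rarely scales.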
How about our top 20 customers list table?
Whilst a list table sorted by deposit amount should easily give us what we need, a closer inspection of the list table reveals troubling signs that we have duplicated customers (with names and addresses typed slightly differently) in our customer table. This is a major problem that will prevent us from building a true top 20 customers list table unless we can confidently match up all the duplicated customers and work out what their true total deposits are with the bank.
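To give a feel for what matching duplicates involves, here is a small Python sketch that groups records whose name and address look alike, using the standard library's `difflib.SequenceMatcher`. This is a generic string-similarity technique I have swapped in for illustration; the 0.85 threshold, the greedy grouping and the sample records are all assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def normalise(text: str) -> str:
    """Lower-case, drop full stops and commas, collapse whitespace."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when the two normalised strings are close enough."""
    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return ratio >= threshold


def assign_clusters(
    records: List[Tuple[str, str]], threshold: float = 0.85
) -> List[int]:
    """Greedily assign each (name, address) record to the first
    cluster whose representative it resembles, else start a new one."""
    representatives: List[str] = []
    cluster_ids: List[int] = []
    for name, address in records:
        key = f"{name} {address}"
        for cluster_id, representative in enumerate(representatives):
            if similar(key, representative, threshold):
                cluster_ids.append(cluster_id)
                break
        else:
            representatives.append(key)
            cluster_ids.append(len(representatives) - 1)
    return cluster_ids


# Made-up records with one near-duplicate pair.
records = [
    ("Alan Davies", "12 George St, Sydney"),
    ("Allan Davies", "12 George Street Sydney"),
    ("Philip McBride", "5 Queen St, Brisbane"),
]

print(assign_clusters(records))
```

Comparing every record against every cluster gets slow quickly on large tables, and picking a threshold that neither over-merges nor under-merges customers takes careful tuning.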
All in all, you probably don’t want to share these visualisations with key executives using the dataset you were given by IT. The scariest thing is that these are the data quality issues that are very obvious to the analyst. Without a thorough data profiling process, other surprises may just be around the corner.
One of two things typically happens from here on. Organisations might find it too difficult and give up on the dataset, the report or the data visualisation tool altogether. The second option typically involves investing significant cost and effort in hiring an army of programmers and data analysts in order to code their way out of their data quality problems, something that is often done without a detailed understanding of the true cost involved in building a scalable and maintainable data quality process.
There is, however, a third and better way. In contrast to other niche visualisation vendors, SAS has always believed in the importance of high quality data in analytics and data visualisation. SAS offers mature and integrated Data Quality solutions within its comprehensive Data Management portfolio that can automate data cleansing routines, minimise the costs involved in delivering quality data and ultimately unleash the true power of visualised data.
Whilst incredibly powerful and flexible, our Data Quality Solution is also extremely easy for business users to pick up, with minimal training and without detailed knowledge of data cleansing techniques. Without the need to code or program, powerful data cleansing routines can be built and deployed in minutes.
I built a simple data quality process using our solution to illustrate how easy it is to identify and resolve data quality issues described in this example.
Here is the basic data quality routine I built using the SAS Data Management Studio. The data cleansing routine essentially involves a series of data quality nodes that resolve each of the data quality issues we identified above via pre-built data quality rules and a simple drag and drop user interface.
For example, here is the configuration for the "Address standardisation" data quality node. All I had to do was define which locale to use (English Australia in this case), which input fields I want to standardise (STATE, DQ_City), which data quality definitions to use (City - City/State and City) and what the output fields should be called (DQ_State_Std and DQ_City_Std). The other nodes take a similar approach to automatically parse the full address field, and match similar customers using their name and address to create a new cluster ID field called DQ_CL_ID (we’ll get to this in a minute).
I then loaded the newly cleansed data into SAS Visual Analytics to try to tackle the questions that I was tasked to answer in the first place.
The cross-tab now looks much better and I now know for sure that the best performing state from a deposit amount point of view is New South Wales (now standardised as NSW), followed by Victoria and Queensland.
As a bonus for getting clean, high quality address data, I am also now able to easily visualise the geo based measures on a map, down to the city level since we now have access to the parsed out, standardised city field! Interestingly, our customers are spread out quite evenly across the state of NSW, something I wasn’t expecting in the first place.
As for the top 20 customer list table, I can now use the newly created cluster field called DQ_CL_ID to group similar customers together and add their total deposit to work out who my top 20 customers really are. As it turns out, a number of our customers have multiple deposit accounts with us and go straight to the top of the list when their various accounts are combined.
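The grouping step itself is simple once a cluster ID exists. Here is a minimal Python sketch of it, assuming each account row carries a cluster ID like DQ_CL_ID; the row layout, the names and the amounts are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical row layout: (cluster ID, customer name as typed, deposit).
Account = Tuple[int, str, float]


def top_customers(accounts: List[Account], n: int = 20) -> List[Tuple[str, float]]:
    """Sum deposits per cluster ID and return the n largest totals,
    labelled with the first name seen for each cluster."""
    totals: Dict[int, float] = defaultdict(float)
    names: Dict[int, str] = {}
    for cluster_id, name, deposit in accounts:
        totals[cluster_id] += deposit
        names.setdefault(cluster_id, name)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(names[cluster_id], total) for cluster_id, total in ranked[:n]]


# Made-up accounts: cluster 1 holds two accounts for the same customer.
accounts: List[Account] = [
    (1, "Customer A", 500.0),
    (2, "Customer B", 800.0),
    (1, "Cust. A", 400.0),
    (3, "Customer C", 100.0),
]

print(top_customers(accounts, n=2))
```

Ranked by individual account, Customer B would come first; ranked by cluster, Customer A's two accounts combine and move to the top, which is exactly the effect described above.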
I can now clearly see that Mr. Alan Davies is our number one customer with a combined deposit amount of $1,621,768, followed by Mr. Philip McBride, both of whom will get the special treatment they deserve whenever they are targeted for marketing campaigns.
All in all, I can now comfortably share my insights and visualisations with business stakeholders with the knowledge that any decisions made are based on sound, high quality data. And I was able to do all this with minimum support and all in a single day’s work!
Is your poor quality data holding you back in data visualisation projects? Interested in finding out more about SAS Data Quality solutions? Come join us at the Data Quality hands on workshop and discover how you can easily tame your data and unleash its true potential.
In using social media, I must admit I'm actually a late-comer. I've always wondered what can actually be said in 140 characters on Twitter. However, as I started using social media and embracing it, I've come to realise the power of social media as a communication platform and what the future holds as the technology becomes more mainstream.
As I slowly move up the social media learning curve (and it can be a steep one!) and move beyond tweeting from Twitter and updating my status from Facebook, I've also realised the flexibility such open platforms offer. The huge amount of innovation around social media means that there are now countless ways of working and interacting with social media platforms. In addition, new (and sometimes crazy!) ways of using social media platforms are being discovered and invented every day.
An example of the flexibility and options I am talking about played out as I was trying to send a social media message to my HUGE group of loyal followers across all the different social media platforms a few days ago. I was amazed (and troubled) by the number of considerations I had to make:
- Who should I share it with? All, work related, family and friends, customers, specific subset of people, should I exclude certain people?
- Do I make it public or private? Do I want the whole world to know about my message?
- What platform should I use? Twitter, LinkedIn, Google+, Facebook?
- What tags (or hashtags) should I use with the message? Which keywords, if any, do I want to emphasise in my message?
- What account should I use to share it? Should I tweet as myself or use our corporate Twitter account?
- What tools should I use to send the message? At last count, I have at least 20 tools/apps (across the different devices) I can use to send a tweet as I sit at my desk.
- When should I share it? Let's face it, power tweeters don't stay up 24 hours tweeting to you! There are tonnes of tools to help you schedule or buffer your social media message to get your message across throughout the day!
- Should I geo-tag the message? Do I want people to know where I am sending the message from?
Granted, not everyone goes through this many considerations when they are trying to tell the world what they had for breakfast, but as more people use social media platforms for content sharing and exerting influence, I think it will be the norm to go through many of the considerations I went through.
So what does that mean for organisations or brands who are trying to use Social Media to better understand their customers or prospects?
Simply put, every single one of the above factors is an important variable that reveals more about my intentions, influence and level of engagement beyond the 140 text characters in the message itself. If there is one thing I learnt about statistics, it's that it’s always good to capture as much information as you can! The ability to mine and better understand your customers increases as you take into consideration more of these signals.
Next time you read another 140 character Twitter or Facebook message, see what else you can work out about the author or the message itself!
For the many years that I have been involved in the area of enterprise information management, I have seen organisations struggle with the issue of data quality over and over again. I have seen the IT departments struggling with the delivery of so called “Data Quality” projects, and I have also seen businesses struggling and complaining about not being able to get access to “Quality Data”.
Seeing that Data Quality technologies have matured over the years and SI’s have become reasonably good at delivering data quality projects, what exactly is the problem?
Among many different factors, I believe the two main reasons that organisations are still struggling with trusted data today are:
“Data Quality technology is not the only component needed to build and deliver trusted data at an enterprise level.
In order to gain trusted data, Business needs to be more involved in the Process along with IT”
That’s where Data Governance comes to the rescue. What data governance provides organisations is a more holistic view and framework in how they manage, control and leverage their data assets so their value can be maximised. It is the missing layer that links the necessary underlying data quality technology to the ultimate goal of trusted data. Specifically the layer that data governance inserts includes the people and process aspects that have been missing in the IT driven, pure data quality projects of the past.
What organisations have come to realise is that trusted data depends on having a robust data governance framework and that a robust data governance framework will need a flexible, proven set of data quality tools to enforce the processes and rules. You cannot have one without the other as they are intrinsically linked to each other.
There is no question that the detail of such undertakings and initiatives can be complex and extensive. If we just focus on the people aspects, the things that organisations need to come to grips with include:
- The right level of executive/board level support
- The right organisational structure to support the initiatives
- The identification and assignments of data stewards
As a starting point for anyone in charge of delivering any data governance initiatives, the people element is perhaps the most critical and important one to even get the projects off the ground. Here are a couple of whitepapers that go into more detail to help you get started.
- Enterprise Data Governance: The Human Element
- Advancing the Data Agenda: Roles and Responsibilities for Middle Managers
I believe that the shift from data quality to data governance is a positive one. It has elevated the discussion to the executive level and is allowing organisations to think about important elements that were missing in previous discussions or projects.
With the right foundational components and the involvement of business through the appropriate process, I believe organisations will be one step closer to delivering trusted data throughout the enterprise.
According to the latest Gartner Magic Quadrant for Data Quality, DataFlux (a wholly owned subsidiary of SAS) continues to be the market leader when it comes to Data Quality. I say “Continue” because DataFlux has been in that same leadership position for the last 3 years!
Why is this important and what does it mean for our customers? Well, to say that Data Quality is the foundation of everything BI or Analytics related is more or less a given and old news these days. The challenges around delivering trusted data to downstream reporting or analytical processes have always been there and are becoming more critical as the need for more accurate reporting and timely insight intensifies. What’s more, many organisations are also undertaking new initiatives that all need to be built on the foundation of sound Data Quality processes and capabilities.
1. Data Governance
Organisations that recognise the importance of data as a strategic asset and a key competitive advantage almost inevitably kick off “Data Governance” projects or initiatives. Sometimes it's off the back of good, proactive executive leadership, sometimes off the back of looming regulatory and compliance needs. While a sound data governance program should involve the combination of people, process, methodology, and technology, Data Quality should form the foundational component of any technology discussion. A good, robust data governance framework is critical for overall long-term success but should also deliver at least some key outcomes in the form of specific Data Quality metrics that data stewards can work with.
2. Customer Centricity
As industries and markets mature, the ability to win, keep and profit from your existing customers becomes more important than ever. Customer centricity is often quoted as the number one strategic initiative by CIOs today. It doesn’t take a genius to work out that customer centricity requires an organisation to obtain a single consolidated view of their customer across multiple business and product lines. Are some organisations there? Yes, but dare I say most organisations are not. Data Quality and the ability to consolidate entities are the foundation of customer centricity, a key reason why they often form the foundation of any Master Data Management solution.
3. Application migration/consolidation
This one is often forced upon the IT team, but unfortunately it is more than likely happening in your organisation as we speak. The fact that these projects often have an absolute deadline that cannot be moved means they're often high risk. All too often, the systems that need to be moved or consolidated have little documentation (that is, if there are still people who know the system at all!). Managing risk requires the organisation to understand what information is actually in these systems. Data Quality offers a way to automate the process of cleansing the data to be loaded, so that only correct data gets loaded and projects can still be delivered on time.
At SAS, we believe that Data Quality is critical in helping our customers solve complex problems and challenges such as the ones mentioned above. It's one of the reasons why it's a foundation piece of our extensive data management portfolio. It's great that Gartner agrees that we have the best tool in the market, but it's about more than that. We also have the right people with the right know-how, something that's critical in helping customers solve their most complex business problems.