If a data lake isn’t a data warehouse, as I proposed in my last post, then it behooves us to understand more about this “new” data lake structure. In this fifth and final post in the series, Big Data Cheat Sheet on Hadoop, we’ll highlight some of the pros and cons of a data lake using a SWOT analysis.
Question 5: What are some of the pros and cons of a data lake?
This discussion comes from an online debate I had earlier this year with my colleague, Anne Buff, where we discussed the pros and cons of a data lake in the context of this resolution: The data lake is essential for any organization that wants to take full advantage of its data. I took the Pro stance, while Anne took the Con stance.
Even though our online debate was focused on the data lake, it forced us to address the larger discussion of managing growing volumes of data in a big data world. With the onslaught of big data technologies in recent years (the most popular being the open source project Apache Hadoop), organizations are having to look once again at the underlying technologies supporting their data collection, processing, storage, and analysis activities.
The Hadoop-based data lake happens to be a popular option right now. The SWOT summary below identifies some of the key factors to weigh when considering a data lake. Keep in mind that this is just a quick snapshot (with brief explanations for each item), not a comprehensive list:
- Strength: Lower costs. A Hadoop-based data lake relies largely on open source software and is designed to run on low-cost commodity hardware. So from both a software and a hardware standpoint, there are substantial cost savings that cannot be ignored.
- Strength: One-stop data shopping. Hadoop is no respecter of data: it will store and process it all, structured, semi-structured, and unstructured, at a fraction of the cost and time of your existing, traditional systems. There’s much to be gained from having all (or most) of your data in one place, mixing and matching data sets like never before.
- Weakness: Data management. We can get hung up talking about the volume, variety, and velocity of (big) data, but equally important is being able to govern and manage all of it, regardless of the underlying technologies. For a Hadoop-based data lake, both open source projects and vendor products continue to mature to meet this growing demand. We’re moving in the right direction, and rapidly, but we’re not quite there yet.
- Weakness: Security. Security has long been an issue for Hadoop, but the open source community and vendors are making significant progress in supporting organizations’ security and privacy requirements. While it’s easy to wag a finger at this particular weakness, it’s important to recognize that the near-daily reports of data breaches are primarily attacks on existing traditional systems, not on these newer big data systems.
- Opportunity: Discovery. A data lake lets users discover the “unknown unknowns.” Unlike a data warehouse, where users are limited in the questions they can ask and the answers they can get, a Hadoop-based data lake imposes no such ceiling. A user can go to the data lake with the same questions she had for the data warehouse and get the same, or even better, answers. But she can also discover previously unknown questions, driving her to more answers and, ideally, better insights.
- Opportunity: Advanced analytics. Plenty of software applications include descriptive analytics, showing users attractive visuals of what has already happened; we’ve had that capability for decades. With big data, however, organizations need advanced analytics, such as diagnostic, predictive, and prescriptive, to really get ahead of the game (one could even argue, to stay in the game). A Hadoop-based data lake provides that opportunity.
- Threat: Status quo. This is not a new threat, especially for software vendors, but it’s a very real one. The cost and time required to migrate to these newer big data technologies are not insignificant; this is not a case of hot-swapping technologies while no one is looking. Done right, the migration will also change the people, processes, and culture in your organization.
- Threat: Skills. There is no question that a skills shortage exists for these big data technologies. Even though the shortage can be viewed as a threat to Hadoop adoption, it shouldn’t be seen purely as a negative. These technologies are new and evolving, and there’s a lot of experimentation going on to figure out what’s needed, what’s not, what should stick, and what shouldn’t. It should be no surprise that as the technologies evolve, so will the skills required. We have an opportunity to take what we have and know to a new level and to help prepare the next generation to excel in our data-saturated society.
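To make the “one-stop data shopping” idea concrete, here is a minimal, purely illustrative Python sketch of the schema-on-read pattern behind a data lake. The `DataLake` class is a toy of my own invention, not a real Hadoop or HDFS API: it lands structured, semi-structured, and unstructured records side by side, and imposes structure only at query time.

```python
import csv
import io
import json

class DataLake:
    """Toy stand-in for a data lake: keep every record raw, and apply
    structure only when reading it back (schema-on-read)."""

    def __init__(self):
        self.raw = []  # records of any shape, tagged with their source format

    def ingest(self, fmt, payload):
        # No upfront schema: everything is accepted as-is.
        self.raw.append({"fmt": fmt, "payload": payload})

    def read(self):
        """Parse each record on the way out, not on the way in."""
        for rec in self.raw:
            if rec["fmt"] == "csv":
                for row in csv.DictReader(io.StringIO(rec["payload"])):
                    yield dict(row)
            elif rec["fmt"] == "json":
                yield json.loads(rec["payload"])
            else:  # unstructured text: surface it as a single field
                yield {"text": rec["payload"]}

lake = DataLake()
lake.ingest("csv", "customer,spend\nacme,120\nglobex,80")            # structured
lake.ingest("json", '{"customer": "acme", "open_tickets": 3}')       # semi-structured
lake.ingest("text", "Call notes: acme asked about renewal pricing.") # unstructured

# Mix and match: pull every record that mentions "acme", whatever its shape.
acme_records = [r for r in lake.read()
                if any("acme" in str(v) for v in r.values())]
```

In a real Hadoop-based data lake the same pattern plays out at scale: files land in HDFS untouched, and tools such as Hive or Spark apply the schema when the data is read.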
The bottom line. There are well-known weaknesses and threats associated with a data lake, some of which I have highlighted here, and we cannot ignore them. But there are also significant strengths and opportunities to explore. I believe an organization can take full advantage of its data if it can bring that data together without breaking the bank. A data lake can help make this dream a reality.
This is the final post in a 5-part series, "Big Data Cheat Sheet on Hadoop." This spin-off series for marketers was inspired by a popular big data presentation I delivered to executives and senior management at a recent SAS Global Forum Executive Conference.
If you did not read the previous posts in this series, I encourage you to do so. My goal here has been to enable you to have an informed view of how this area of technology can support your marketing strategy. Armed with these perspectives, hopefully you can partner even more closely with IT and operations to deliver the best possible customer experience.
Once you're comfortable with Hadoop and want to delve deeper into analytically driven marketing solutions, start with our Customer Intelligence home page at www.sas.com/customerjourney.
And as always, thank you for following!