Growing Into Big Data: The Exploration of Data Lakes

Data Lake. Hadoop. Descriptive Analytics. Machine Learning. We’ve all heard these terms before. They sound exciting, fresh, and new. Like a crisp new $100 bill, you want to spend them right away.

James Dixon coined the term data lake on October 14, 2010, in one of his many blog posts on Pentaho, Hadoop, and data lakes. The concept is not new, yet the recent buzz would suggest otherwise.

Debunking the Myth Behind Data Lakes

It’s a bit unclear what exactly a data lake is and how it actually adds business value, right? You’ve likely heard the term, seen it placed in a seemingly thoughtful email, or watched it air-dropped into a meeting and touted as the solution to the problem at hand. The myriad of definitions adds to the confusion and is one reason clarity over the term remains elusive, so let’s clear it up.

Let’s start by identifying the common industry definitions you’ll hear day-to-day.

Common Definitions:

  • Data Lake – A data lake is a massive, easily accessible, centralized repository of large volumes of structured and unstructured data (Techopedia)
  • Data Lake – The data lake architecture is a store-everything approach to big data. Data is left unclassified, unprocessed, and ungoverned when it is stored in the repository, as the value of the data is not clear at the outset (Techopedia)
  • Data Lake – A massive, easily accessible data repository built on (relatively) inexpensive computer hardware for storing “big data”. Unlike data marts, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially when you do not yet know what the scope of the data or its use will be (Wiktionary)
  • Data Lake – A place where all of the data in the enterprise can be dropped, so every system, every partner, and every external source flows into a single repository ready for use. There is no restriction on schema or plan at this stage; it is data in its raw and native form (Capgemini)
  • Data Lake – A repository where data is not pre-categorized at the point of entry, which therefore does not dictate how it will be analyzed

A data lake supports all data types, all users, all data sources, and all changes. Sound good? It’s great, in theory. Who wants data that merely trickles into a stream, a river, or even an estuary when an exciting data lake is available? There is a reason the concept sounds mythical: because, well, it’s a myth. This is what sells. A fantasy lake where all data funnels in and all users are satisfied.

What if the next time you wanted a glass of water, I said, “we don’t have bottled water here, but 5 miles away we have a lake full of fresh water”? Wouldn’t that just be delightful? Refreshing, even. No, it wouldn’t. This, unfortunately, is the realization most companies experience after attempting to build a lake.

Accepted definition: a large data repository, typically holding unfiltered structured and unstructured data

Reality Check

Here’s what you’ll hear: no integration costs, one location for all data consumers, no data conversion required, no upfront ingestion costs (such as transformation), and data that is immediately available. If your data is in silos today, tomorrow it can all be accessible; this is the old problem data lakes attempt to solve, according to Gartner. The future problem data lakes attempt to solve, again according to Gartner, is that big data projects receive a lot of information of unknown value; if it isn’t being consumed immediately, a data lake can offer a repository to store it until your organization figures out what to do with it.

What you won’t hear: you need highly skilled business users with solid analytical skills, metadata won’t be consistent enough for sharing with vendors, and you’ve just delayed data governance.

If you’re a business leader who owns an information technology organization, you won’t be suggesting the introduction of a data lake any time soon. Data lakes do, however, immediately benefit IT. Why? There is limited integration work, much like a dumping site: all data is dumped there unorganized, raw, in multiple formats, and with virtually no transformation. The kicker is that no business problem is being solved. The real problem is pushed off until, well, tomorrow, or maybe months later. Data lakes have no data governance.

As a CIO, does that make you feel secure? Are you able to deliver that message effectively to your board? Doubtful, if you plan to be in the role for years and not months. Ungoverned, free-flowing repositories quickly fail to meet fundamental business needs: welcome the data governance that is soon to arrive. The demand for governance will start with standardization, a helpless grasp for consistent data and stable reporting. Tools won’t work the same against general-purpose data stores. Data will be accessible, but not at the speed of optimized, fit-for-purpose infrastructure. Slow report generation is a visible problem that winds up back on the front door of the information technology organization.

Masked Benefits

All said, data lakes do have some clear benefits. First, they can serve as a good staging location for a conventional data warehouse. Second, if you’re not concerned about data filtering (specific data for specific user groups), data generalists or data scientists can easily tap streaming data for real-time analytics. Third, multiple consumers can use the lake for discovery and ideation. Fourth, a data lake preserves data lineage because it accepts data from various sources in various formats.

However, by far the biggest advantage of a data lake is massive scalability: low-cost storage of data files, regardless of format, that can leverage cluster computing technologies (on or off the cloud).
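To make that advantage concrete, below is a minimal sketch of the schema-on-read pattern that cluster engines enable over a lake. It assumes Apache Spark (PySpark); the s3a:// paths, file formats, and column names are hypothetical placeholders, not a recommended layout.

    # Minimal schema-on-read sketch (PySpark). The paths, formats, and
    # column names are illustrative assumptions, not a real layout.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake-schema-on-read").getOrCreate()

    # Raw files sit in low-cost object storage in whatever format the
    # source systems emitted; no upfront transformation was performed.
    clicks = spark.read.json("s3a://example-lake/raw/clickstream/")   # semi-structured
    orders = spark.read.parquet("s3a://example-lake/raw/orders/")     # columnar

    # Structure is imposed only at read time, once the question is known,
    # and the cluster parallelizes the join and aggregation across nodes.
    daily = clicks.join(orders, "customer_id").groupBy("order_date").count()
    daily.show()

The same job runs unchanged on premises or in the cloud, which is where the low-cost scalability claim comes from.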

Benefits depend on your purpose. The list below highlights six additional benefits that are achievable when built on the right architecture (a brief sketch of the fourth follows the list).

  1. Store Massive Data Sets
  2. Mix Disparate Data Sources
  3. Ingest Bulk Data / Ingest High Velocity Data
  4. Apply Structure to Unstructured/Semi-Structured Data
  5. Make Data Available for Fast Processing
  6. Achieve Data Integration
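As an illustration of the fourth benefit, here is a hedged sketch of applying structure to semi-structured data at read time, again assuming PySpark; the schema fields and path are assumptions made purely for illustration.

    # Illustrative sketch of benefit 4, applying structure to semi-structured
    # data; the schema, field names, and path are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("lake-apply-structure").getOrCreate()

    # Declare the expected shape instead of letting Spark infer it, so
    # malformed records surface as nulls rather than silently skewing results.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("event", StringType()),
        StructField("ts", TimestampType()),
    ])

    social = spark.read.schema(schema).json("s3a://example-lake/raw/social/")
    social.createOrReplaceTempView("social_events")
    spark.sql("SELECT event, COUNT(*) AS n FROM social_events GROUP BY event").show()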

Ask the Right Questions

‘Data lake’ is an overused term that really doesn’t say much, very similar to the term ‘data analytics.’ It’s what you’ll hear when the business problem is not defined.

Separate buzzwords from strategic direction. Below are key questions rarely addressed when the data lake concept is introduced:

  • How is governance handled?
    • Data accountability, lineage, data definitions, and change management
  • Is there an information lifecycle management process?
    • How data is stored, and how changes are managed and addressed over time
  • How is security addressed?
    • Think about the risk of data loss (in storage and in transmission)
  • Structured vs. unstructured data – do customer records and Twitter replies hold the same value?
    • Are we concerned with transaction and reference data swimming around in the same lake? 
  • How are data sources managed?
    • Internal vs. 3rd party, fit-for-use, limited metadata repository 
  • Who’s consuming the data?
    • Do they know it’s not in a usable form?

Words such as transform, disrupt, create, high-quality, right-size data, remove distractions, multi-faceted support, buy-in, buckets, straw man, real meaning, and directionally correct all add zero value to the discussion. They are also a sign that the ‘what’ we’re doing is not supported by a ‘why’ we are doing it.

The next time you’re on a conference call or in a meeting and hear the term ‘data lake’ – ask some questions.

Transparency, trust and discipline are essential for implementing a data strategy and for building business credibility.  These three organizational characteristics are not going to be found swimming naked in a data lake.

——

References

Curran, C., Pearl, M., & Morrison, A. (2014). Rethinking integration: Data lakes and the promise of unsiloed data. PwC. Retrieved January 25, 2016, from http://www.pwc.com/us/en/technology-forecast/2014/cloud-computing/features/data-lakes.html

Gartner. (2014). Gartner says beware of the data lake fallacy. Retrieved January 26, 2016, from http://www.gartner.com/newsroom/id/2809117

Gartner. (2016). Information life cycle management (ILM) – Gartner IT Glossary. Retrieved January 25, 2016, from http://www.gartner.com/it-glossary/information-life-cycle-management-ilm

——

Peter Nichol empowers organizations to think different for different results. You can follow Peter on Twitter or on his blog. Peter can be reached at pnichol [dot] spamarrest.com.
