Recharge your knowledge of the modern data warehouse

Data warehousing is evolving from centralized repositories to logical data warehouses leveraging data virtualization and distributed processing. Make sure you’re not using old terminology to explain new initiatives.

Are you comfortable describing source systems that feed ETL processes into operational data stores, or master reference data flowing through an enterprise service bus, with product, supply chain and business operational reports dumped into a presentation layer of soft analytics, dashboards, alerts and scorecards? That was yesterday.

Don’t get caught explaining your new data warehouse initiative with old terminology.

Traditional vs. modern data warehouses

Data warehouses are not designed for transaction processing; modern data warehouses are structured for analysis. The architecture has evolved in stages:

  1. Version 1.0: a traditional transactional database was funneled into a database that was provided to sales.
  2. Version 1.1: a second analytical database was added before data went to sales, with massively parallel processing and a shared-nothing architecture. The trade-off was slow writes (but fast reads).
  3. Version 2.0: the transactional database populated a second database, which flowed into a third analytical database connected to the presentation layer (business intelligence).
  4. Version 2.1: multiple transactional databases fed a core database that provided information downstream to data stores (sales, marketing, finance) connected to a business intelligence engine.

At this point, traditional database structures end and modern structures begin: data architecture Version 3.0.

The two examples below highlight the difference between a traditional data warehouse and a modern data warehouse (using Hadoop in this example).

Traditional data warehouse:

  1. Operational systems: CRM, ERP, financial, billing.
  2. ETL: decision analysis model and data.
  3. Enterprise data warehouse: operational, customers and IT data marts.
  4. BI platform: KPI summary, monthly and quarterly reporting and daily summaries.
  5. Automatic customer value analysis: interactive data queries, static data analysis, and OLAP.
  6. BI collaboration portal: wholesale, OEM, sales, employees and external.

Modern data warehouse:

  1. HDFS: the Hadoop Distributed File System.
  2. HCatalog: a table and storage management layer that exposes shared metadata.
  3. HBase: a distributed, column-oriented key-value database.
  4. MapReduce: a flexible parallel data processing framework for large data sets.
  5. Oozie: a workflow scheduler for Hadoop jobs.
  6. ZooKeeper: a distributed hierarchical key-value store enabling synchronization across a cluster.
  7. Hadoop: the open-source software framework supporting data-intensive distributed applications (storage and processing of big data sets).
  8. Hive: a high-level, SQL-like language built on top of MapReduce for analyzing large data sets.
  9. Pig: enables the analysis of large data sets using Pig Latin, a high-level language compiled into MapReduce for parallel data processing.
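Hive and Pig both compile down to the same map/shuffle/reduce model that MapReduce provides. The following is a minimal, single-process Python sketch of that model using word count, the canonical example; a real job would run distributed across HDFS blocks, but the three phases are the same.

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in each input record.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework would
    # before handing them to the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

records = ["modern data warehouse", "data lake and data warehouse"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts["data"])  # 3
```

A Hive query such as a GROUP BY, or a Pig Latin GROUP/FOREACH script, is translated into exactly this kind of pipeline behind the scenes.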

Most database designs cover four functions: 1) data sources, 2) infrastructure, 3) applications and 4) analytics. This design principle applies to both traditional data warehouses and modern architectures. The design thinking, however, is different. In a modern data warehouse, there are four core functions: 1) object storage, 2) table storage, 3) computation and processing, and 4) programming languages.
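To make the four core functions concrete, here is one illustrative pairing (not the only possible one) of those functions with the Hadoop components listed above:

```python
# An illustrative mapping of the four modern warehouse functions
# onto the Hadoop components discussed earlier in this article.
core_functions = {
    "object storage": ["HDFS"],
    "table storage": ["HBase", "HCatalog"],
    "computation and processing": ["MapReduce"],
    "programming languages": ["Hive", "Pig"],
}

for function, components in core_functions.items():
    print(f"{function}: {', '.join(components)}")
```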

Re-establish the structure for success

The lack of data governance, inadequately trained staff, weak security and non-existent business cases each factor into why data warehouse or business intelligence initiatives fail to achieve the desired outcomes. Keep your data warehouse program on track.

Start by strengthening your framework for business intelligence. If it’s been more than six months since you looked at your end-to-end operational state, it’s a good idea to revisit the original thinking and revalidate assumptions.

The three-tier structure outlined here can help guide your discussions and the assessment.

Primary functions

  1. Program management: portfolio, process, quality, change and services management.
  2. Business requirements: business management, information services, communities, capabilities, service levels and value.
  3. Development: data warehouse and database, services, data integration, systems, monitoring systems and business analytics.
  4. Business value: culture; data quality, analytics and data utilization; data evolution; and value measurements.

Secondary functions

  1. Intelligence architecture: data, integration, information, technology and organizational.
  2. Data governance: accountabilities, roles, people, processes, resources and outcomes.
  3. Operations: data center operations (SaaS, DaaS, PaaS, IaaS), technology operations, application support and services delivery.
  4. Intelligence applications: strategic intelligence, customer intelligence, financial intelligence, risk intelligence, operations intelligence and workforce intelligence.

Tertiary functions

  1. Data integration: data consolidation, data quality and master data management.
  2. Informational resources: data management, informational access and metadata management.
  3. Informational delivery: query and reporting, monitoring and management, and business analytics.

Shrinking budgets, pressure to deliver and expanding data sources all push us as CIOs to accelerate progress. Much of that acceleration comes at the cost of thinking time. The modern data warehouse is designed differently, which means we as leaders need a block of time to think: to update our understanding of how modern data warehouses are planned, refresh the core elements of the progressive data ecosystem and upgrade our terminology.

Evaluating your current data warehouse initiative

Start by asking the following questions to determine if you’re running a modern data warehouse.

  1. Does our environment quickly handle diverse data sources and a variety of subject areas?
  2. Can we handle massive volumes of data (social, sensor, transactional, operational, analytical)?
  3. Are we using structures such as data lakes, Hadoop and NoSQL databases, or are we running relational data mart structures?
  4. Do we support a multiplatform architecture to maximize scalability and performance?
  5. Do we utilize Lambda architecture (more about data processing than data storage) for near real-time analysis of high-velocity data?
  6. Have we leveraged new capabilities like data virtualization (cloud services) in addition to data integration?
  7. Has the organization applied data warehouse automated orchestration for improved agility, consistency and speed through the release life cycle?
  8. Is our organization running a bimodal business intelligence environment?
  9. If we asked our primary business sponsors, would they know where to find the data catalog that documents business terminology?
  10. Are the BI development tools decoupled from the agile deployment models?
  11. Have we clearly defined how we certify enterprise BI and analytical environments?
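Question 5 above mentions the Lambda architecture. Its core idea is to merge a precomputed batch view with a real-time speed layer at query time; a minimal sketch might look like this (the metric names and counts here are hypothetical):

```python
# Hypothetical event counts. The batch view is recomputed periodically,
# while the speed layer covers events that arrived since the last batch run.
batch_view = {"page_views": 10_000, "signups": 120}   # batch layer (complete, but stale)
speed_view = {"page_views": 350, "signups": 4}        # speed layer (fresh, incremental)

def query(metric):
    # Serving layer: merge the batch view with the real-time delta,
    # giving near real-time answers over high-velocity data.
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(query("page_views"))  # 10350
```

When the next batch run completes, the speed layer's counts are folded into the batch view and reset, which is why the pattern is more about data processing than data storage.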

Modern data warehouses are composed of multiple platforms that remain invisible to users. Polyglot persistence encourages choosing the most suitable storage technology for each kind of data. This “best-fit engineering” lands multi-structured data in data lakes and considers NoSQL solutions for XML or JSON formats. A polyglot persistence strategy benefits from virtualization and takes advantage of the diverse infrastructure.
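As a toy sketch of that “best-fit” routing decision, the rule might be expressed as follows (the store names and routing rules here are hypothetical, for illustration only):

```python
import json

def choose_store(record):
    # Hypothetical polyglot-persistence routing: pick a storage engine
    # by the shape of the incoming data.
    if isinstance(record, (bytes, bytearray)):
        return "data lake (object storage)"      # raw / multi-structured data
    try:
        parsed = json.loads(record) if isinstance(record, str) else record
    except (ValueError, TypeError):
        return "data lake (object storage)"      # unparseable text lands in the lake
    if isinstance(parsed, dict):
        return "NoSQL document store"            # JSON-shaped, semi-structured data
    return "relational warehouse"                # tabular rows

print(choose_store('{"customer": 42}'))          # NoSQL document store
print(choose_store([("id", 1), ("id", 2)]))      # relational warehouse
```

In practice this decision is made at design time by architects rather than per record, but the principle is the same: match the storage technology to the data, not the reverse.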

If you’re well into the modern data warehouse journey but have not seen the benefits you initially forecasted, don’t fear: there is still hope. Allow me to share a few tips to uncover the underlying challenges preventing successful adoption.

  1. Define all the data storage and compression formats in use today. There are many options, and each offers benefits depending on the type of applications your organization is running.
  2. Look at the degree of multi-tenancy supported in your BI environment. Using a single instance of software to serve multiple customers reduces cost, eases upgrades and simplifies customizations.
  3. Review the schema (or schema-less) nature of your databases and the data you’re storing. Understanding how data is loaded, processed and analyzed helps determine how to optimize the schemas of objects stored in your systems.
  4. Don’t overlook metadata management; it can be as important as the data itself.

Upgrading your team’s understanding of data warehouses will move your organization toward agile deliveries, measured in weeks not months. 
