I was making the circuit this week through some of my favorite data management sites, and was catching up on some articles on Wayne Eckerson’s blog. Two articles there caught my eye: The Demise of the Data Warehouse and a rebuttal, The Data Warehouse is Still Alive. Stephen Smith argued that technology has enabled Data Lakes to replace the traditional data warehouse. Dave Wells’ counterpoint was that the need for a data warehouse hasn’t gone away, and that organizations should modernize their data warehouse rather than kick it to the curb.
What struck me wasn’t so much the “Data warehousing is/not dead” argument (Bill Inmon gave the best answer back in 2013 with his Architecture vs Technology article), as it was the point that a Data Lake strategy rests heavily on effective metadata management.
Rather than bringing together disparate raw data and physically integrating it in a single location, a data lake keeps a large pool of co-located, non-integrated raw data that gets cleansed, transformed, and integrated through downstream semantic layers or user-specific reporting systems. Advances in storage and in processing technology (massively parallel processing, or MPP, systems) allow the magic to happen. Instead of ETL code and batch ETL jobs, a series of keys and pointers brings the data together for consumption at the point when it’s needed.
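To make the "integrate at the point of consumption" idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two raw files, their field names, and the helper functions `read_raw`, `join_on_read`, and `total_by_customer` are hypothetical and do not come from either article. A real data lake would do this on an MPP platform over far larger files, but the principle is the same: the raw data stays as it landed, and a shared key ties it together only when someone asks a question.

```python
import csv
import io

# Hypothetical raw files as they might sit in the lake, untouched by ETL.
RAW_CUSTOMERS = "customer_id,name\n1,Acme\n2,Globex\n"
RAW_ORDERS = "order_id,customer_id,amount\n10,1,250.00\n11,2,75.50\n12,1,40.00\n"


def read_raw(text):
    """Parse raw CSV text into a list of dicts; every value stays a string."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_read(left, right, key):
    """Inner-join two raw datasets on a shared key at query time."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def total_by_customer():
    """Answer one question (order total per customer) straight from raw data."""
    customers = read_raw(RAW_CUSTOMERS)
    orders = read_raw(RAW_ORDERS)
    totals = {}
    for row in join_on_read(customers, orders, "customer_id"):
        # The type conversion happens here, at consumption, not at load time.
        totals[row["name"]] = totals.get(row["name"], 0.0) + float(row["amount"])
    return totals
```

Notice what the sketch quietly depends on: someone had to know that `customer_id` is the key linking the two files and that `amount` is a number. That knowledge is metadata.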
MPP technology platforms aren’t the only magic that’s needed for an effective data lake strategy. To make downstream transformation and integration feasible, the raw data still needs to be described in detail. Keys and relationships must be identified and maintained. Each file in the lake still needs to be defined, and every data element still needs a format and a description. In other words, the strategy needs metadata in order to work. And because transformation and integration can be handled by many downstream teams (technical and business), the need for complete and accurate metadata that is accessible to a very wide audience is perhaps greater than it’s ever been.
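As a rough illustration of what "described in detail" could mean in practice, the sketch below models a metadata record for one file in the lake and checks it for gaps. The class names, fields, file paths, and the `metadata_gaps` function are all hypothetical, chosen for this example rather than taken from any particular metadata tool.

```python
from dataclasses import dataclass, field


@dataclass
class ElementMetadata:
    """One data element: it needs a format and a description."""
    name: str
    data_format: str = ""
    description: str = ""


@dataclass
class FileMetadata:
    """One file in the lake: definition, keys, relationships, elements."""
    path: str
    description: str = ""
    keys: list = field(default_factory=list)
    # Maps an element name to the element it points at in another file.
    relationships: dict = field(default_factory=dict)
    elements: list = field(default_factory=list)


def metadata_gaps(file_meta):
    """Return a list of human-readable gaps in a file's metadata."""
    gaps = []
    if not file_meta.description:
        gaps.append(f"{file_meta.path}: file has no description")
    if not file_meta.keys:
        gaps.append(f"{file_meta.path}: no key identified")
    known = {element.name for element in file_meta.elements}
    for key in file_meta.keys:
        if key not in known:
            gaps.append(f"{file_meta.path}: key {key} is not a described element")
    for element in file_meta.elements:
        if not element.data_format:
            gaps.append(f"{file_meta.path}: {element.name} has no format")
        if not element.description:
            gaps.append(f"{file_meta.path}: {element.name} has no description")
    return gaps


# Hypothetical record for a raw orders file, with two deliberate gaps.
ORDERS = FileMetadata(
    path="raw/orders.csv",
    description="One row per customer order",
    keys=["order_id"],
    relationships={"customer_id": "raw/customers.csv:customer_id"},
    elements=[
        ElementMetadata("order_id", "integer", "Unique order identifier"),
        ElementMetadata("customer_id", "integer", ""),
        ElementMetadata("amount", "", "Order total"),
    ],
)
```

Calling `metadata_gaps(ORDERS)` reports that `customer_id` lacks a description and `amount` lacks a format. A check along these lines is one way to keep a widely shared catalog from drifting into the partial metadata that siloed teams used to get by with.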
When data is in silos (even the data warehouse silo), teams can get by with partial metadata solutions. Field names and formats with business definitions may meet their needs. When someone needs to see data lineage, she pulls up the ETL code. Report writers keep metadata in the reporting tools. Project teams keep their documentation on SharePoint and get their work done.
As technology becomes more capable and cheaper, more organizations will migrate from traditional data warehouses to data lake architectures. The data lake can manage more data and make that data accessible to a much wider audience, both technical and business, who can extract even more value from the data. But the key to making the data lake strategy work is a solid solution for metadata management: business glossary, formats and relationships, technical and business definitions, reference data (valid values), lineage and impact analysis. And that metadata foundation must be owned, curated, and accessible to a wide cast of data users across the organization.