How to build the next generation of data warehousing is currently being planned by many, either as cloud services or layered on top of conventional on-premise technologies. In this article, I aim to use my hands-on knowledge to provide advice for successfully developing a new data warehouse.
Modern data warehouse, data lake, data platform and data hub. These are some of the names used for a data system that collects a wide range of data from different sources into one location for harmonisation, storage and redistribution. The applications for such a system include advanced analytics, conventional reporting and integrations.
New cloud technologies, in particular, make it possible to work more flexibly, as well as store and process enormous masses of data at a more reasonable cost. At Solita, we have discovered that the challenges of data warehouses founded on new technology are largely similar to those of conventional data warehouses. Data warehousing is decades old at this point and many goodbest practices have been developed.
1. Functional management governance model
Documentation is a common pain threshold pitfall in older data warehouses. Meticulous documentation is required to control the ecosystem of data and integrations that forms around a data warehouse.
The changing of key personnel is usually the final blow that buries the design decisions made over a system’s lifespan, if they have not been documented.
The data warehouse may be described in a Word document spanning dozens of pages that was created during the implementation project and has not been updated in a while. This documentation is not trusted, and its use is avoided, meaning that things must be inferred by interpreting the source code of applications. This is a very slow and arduous process.
The building of a new data warehouse should be approached with improved discipline to ensure that agreements and decisions are documented from the start.
- All information data should include metadata to the agreed level. For example, every database table should include comments directly in the table and on the cells columns. Descriptions also need to be made for locations outside the databases, such as AWS S3 or Azure Blob Storage. With working descriptions, these components can be easily turned into a data catalogue, which can be used to communicate the solution’s potential uses.
- Incoming and outgoing data stream pipelines should be described at an appropriately high level during implementation and the descriptions updated as necessary. This information allows system dependencies to be checked at any time, and it gives an overview of the position of the data warehouse in relation to other systems. The data stream pipeline descriptions should focus on establishing a general view and leave the details of frequently changing factors, such as table and field level definitions and timing scheduling’s, to metadata.
- An annual clock approach should be used to ensure sufficiently regular reviews of the documentation, e.g. at least once every quarter. Other important aspects, like database users, should be reviewed in conjunction with these checks to quickly address any problems.
The management governance model must not be so heavy as to be a burden to the organisation. However, it must be robust enough to allow both current and new requirements for data to be fulfilled – the new EU General Data Protection Regulation is an excellent example. With sufficient documentation, it becomes rather straightforward to work out where personal data is processed and stored in a solution.
2. High-quality data streams pipelines ready for production
The troubleshooting of data warehouse data stream pipelines is a major part of normal maintenance, as the changes and outages of source systems will easily affect the data warehouse. The modern data platform poses the same old challenges and may even add new ones: instead of servers located next to each other in a data centre, future data may be created in one cloud and transferred over a public network onto another cloud, introducing many points of failure.
The conventional approach does not pay a great deal of attention to the automatic recovery of data streams. The capabilities of tools have been limited, and the implementation models chosen may require manual effort like the restarting of data streams.
This should change in the implementation of a new platform – the interruption of data streams is no longer acceptable.
Modern implementations can recover from a variety of faults on their own, leaving their administrators to monitor trivial faults and rectify major ones. On the other hand, the mechanisms must know when human intervention is needed: data corruption due to excessive automation is also unacceptable.
3. Sufficient data modelling
Once data moves to a raw layer in the cloud, the temptation arises to load the raw data into a relational database as-is and then read the data into an analytics tool directly. This presents a number of risks.
Dimensional information in particular, e.g. clients, agreements and so on, are subject to a host of business rules that exist in a conventional data warehouse. For example, the concept of a customer used in a CRM system may be very different from what is desired in the analytics layer, and the data of a single customer may be scattered among a number of CRM systems and further spread into multiple customer accounts.
Sufficient data modelling can be used to ensure working interpretation, giving the analytics tools the same business rules.
If interpretation is carried out on demand, the various analytics applications will require individual adjustment as the rules change. On the other hand, the solution must allow for a culture of experimentation and prototyping, where it is acceptable to load raw data for experimental number-crunching. After the prototype stage, the application is brought into production and made to read shared data models.
As the data content is created in other systems, the data platform is completely at the mercy of any changes in the IT architecture. If a system must be replaced, this changes the data content. A sufficiently flexible method, such as data vault modelling, allows IT system changes to be concealed from the analytics and reporting layers. This allows data integration changes to be made once, and the analytics applications may become insensitive to changes in the IT architecture.
Cloud technologies are a breath of fresh air in the world of data warehousing, but modern technical solutions still require data warehouse management. Administration and development work still call for people, who
- know the intricacies of data warehousing and data management
- respond quickly when the data warehouse malfunctions
- support analytics work through their expert knowledge of data contents
- are able to develop the data warehouse into an analytics multitool based on business requirements.