Traditionally, data warehouses were used mainly for business management and regulatory reporting, which require a high level of quality and accuracy. They are built on ‘relational databases’ with an organised, hierarchical structure, usually as database tables, and include features such as linkages to other tables (called foreign key references), granular data encryption, and detailed user access management.

Because of the sequential steps required to load batches of raw data from multiple data sources, then process and store it, the production cycle often takes one or more days to complete.

So much unstructured and semi-structured data (often called ‘big data’) is now available, and is being used in much more dynamic ways such as real-time monitoring, decisioning and customer interactions, that data warehouses can suffer access and data-extraction bottlenecks. These bottlenecks cause problems for statutory and regulatory reporting, risk and fraud analysis, customer experience, and credit scoring.

Data transformation in the Extract, Transform and Load (ETL) stage includes cleaning, normalising or standardising the raw data, joining datasets, and storing the result so that it can be accessed by users or systems. It is not limited to transforming data for storage in the data warehouse; it also involves governance and management of the process and of the quality of the data being loaded into the warehouse. Because legacy source systems often supply inconsistent and incorrect data, ETL must be able to deal with data that arrives in the wrong format.
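A minimal sketch of the transformation step described above, using hypothetical records and field names (not from any particular system): it cleans and standardises raw values and copes with dates arriving in more than one format, returning `None` where a value cannot be parsed rather than failing the whole batch.

```python
from datetime import datetime

# Hypothetical raw records from a legacy source system: stray whitespace,
# inconsistent name casing, mixed date formats, and a missing value.
raw_records = [
    {"customer_id": " 101", "name": "ALICE SMITH", "signup": "2023-01-15"},
    {"customer_id": "102 ", "name": "bob jones",   "signup": "15/01/2023"},
    {"customer_id": "103",  "name": "Carol Tan",   "signup": None},
]

# Date formats this sketch knows how to handle.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def parse_date(value):
    """Try each known format; return None if the value cannot be parsed."""
    if not value:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None

def transform(record):
    """Clean and standardise one raw record ready for loading."""
    return {
        "customer_id": int(record["customer_id"].strip()),
        "name": record["name"].strip().title(),
        "signup": parse_date(record["signup"]),
    }

cleaned = [transform(r) for r in raw_records]
```

In practice this logic would also log or quarantine records that fail validation, so that data-quality problems are managed rather than silently dropped.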

The following diagram shows a simple flow in which data is extracted from several data sources, transformed, and loaded into the data warehouse. Data is extracted from source systems using automated software extraction tools or API interfaces (particularly if it is data from an external data source).
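The extraction step over an API might look like the sketch below. The payload shape and field names are purely illustrative assumptions; a real extraction would make an HTTP call to the provider rather than reading a string.

```python
import json

# Hypothetical JSON payload as an external data provider's API might
# return it; in practice this string would come from an HTTP response.
api_response = '{"results": [{"id": 1, "amount": "12.50"}, {"id": 2, "amount": "7.25"}]}'

def extract(payload):
    """Parse the API payload and return the list of raw records."""
    return json.loads(payload)["results"]

records = extract(api_response)
```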

Loading the data warehouse