A simple definition of a Data Warehouse would call it a collection of databases. This oversimplified version, while easy to use to explain to a layman, is not entirely correct. While a Data Warehouse will contain data from a multitude of databases, it does not exactly stand to the definition of collection of databases.
Definition of a Data Warehouse:
A Data Warehouse is a historical collection of data made available at a single location, to enable easy analysis. It is usually generated by integrating data from one or more different sources. It is a central data repository created so that the higher management and decision makers can analyse trends. The source could be Operational Data Stores or it could be from disparate Transactional Data Sources (OLTP Databases).
The hierarchy of a Data Warehouse implementation is usually as below:
A data warehouse has already been introduced at the beginning. Let me explain the rest of the terms used above.
This is the database where all the transactions actually take place. Assuming the case of an online store, this will have the data related to customers who have registered, the purchases that have been made, the available inventory etc. All these data is updated frequently as per customers joining and leaving, making purchases, inventory being updated etc. These databases are usually in a highly normalized form. The normalization reduces the space used, and also speeds up the updation process. At the same time, normalization causes the process of fetching data to be slower, specially for analysis purposes which would require a lot of aggregations etc.
Operational Data Store (ODS):
An operational data store is a database which is used to collate information from various disparate OLTP or transactional sources. One of the uses of such a system is to integrate information from various sources to enable operational analysis. The data could be either in the same structure as in OLTP, but with some cleanups to remove duplication and redundancy or in a slightly modified structure for ease of reporting and updation. This data is usually real-time or near-real-time. It seldom holds historic data. One of the applications of and ODS is for companies with multiple stores or warehouses, where there are individual databases and then an ODS system that collates all this data and makes it available centrally. Another use of this is as a layer between an OLTP system and a Data Warehouse. While it is not mandatory that an ODS has a data warehouse to feed, this is usally the case.
Having an ODS is benefitial for the Data Warehouse in that the process only needs to run on this single system, rather than all the disparate OLTP sources that will be the basic source.
A staging layer is a database where data is dumped from all the disparate systems that form the source for the data warehouse. Since the data is often in different formats and also possibly obtained as a flat file, a staging layer helps by bringing this all together for easy processing. From here, the transformations are carried out and the data is loaded into a data warehouse. Often, a staging layer is used to load data into the ODS as well.
A Data Warehouse usually contains all the data for organization. While having it together might be really good at the CxO level, most other managers have their regions/department was requirements. Rather than loading the Data Warehouse with all these queries, we form subsets of the data from the Data Warehouse and split it based on conditions like department, region etc. These reduced data are stored in what is called a Data Mart. I most cases the Data Mart is structured similar to Data Warehouse, with occasional deviations (mostly additions) to enable the sub level analysis. Sometimes each department is given complete ownership of their respective Data Marts, in which case they can alter or modify it to any extent without impacting the other marts or the warehouse. In other scenarios, there are restrictions in modifications where a confirmed dimension for shared information come into play.
Most of the analysis is usually carried out at the data mart layer while the data warehouse acts as the central repository. The data mart and data warehouse do not usually contain real-time data, and concentrate on historic data. How much back in history the data goes depends on the requirement of the organization and the storage limitations. Also, the historic data stored in the marts can be lower than that of the warehouse.