We continue our discussion of the data warehouse. Here we look into the design of the interface from operational systems. The design is heuristic: the requirements for a warehouse are not fully known until it is partially populated and in use, so a phase-by-phase approach is used instead. One portion of data is populated, then used and scrutinized by the DSS analyst. Feedback from the user leads to data modifications, and then another portion of the data warehouse is built. This feedback loop continues throughout the life of the warehouse.
On the other hand, anticipating requirements is still important, so there is a trade-off.
Design usually begins with operational data, the source for the warehouse. However, it is not a simple extract-and-load operation; several transformations are involved. Operational data comes from different applications that are often not integrated. As data makes its way into the warehouse, it has to be integrated across applications. Integration means standardizing conventions across systems, keeping syntax and semantics consistent when merging, and correctly mapping source fields to their destinations.
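As a rough sketch of what this integration looks like, consider two hypothetical source applications that key and encode the same customer data differently; all field names and encodings below are made up for illustration, not taken from any real system.

def integrate_app_a(record):
    # App A (hypothetical) keys customers on "cust_no" and encodes gender as "m"/"f".
    return {
        "customer_id": record["cust_no"],
        "gender": {"m": "M", "f": "F"}[record["gender"]],
        "balance_usd": float(record["balance"]),         # already in dollars
    }

def integrate_app_b(record):
    # App B (hypothetical) keys customers on "customer", encodes gender as 1/0,
    # and stores the balance in cents rather than dollars.
    return {
        "customer_id": record["customer"],
        "gender": {1: "M", 0: "F"}[record["sex_code"]],
        "balance_usd": record["balance_cents"] / 100.0,  # convert cents to dollars
    }

# After transformation both records follow one convention and can be merged.
warehouse_rows = [
    integrate_app_a({"cust_no": "A-100", "gender": "f", "balance": "12.50"}),
    integrate_app_b({"customer": "B-200", "sex_code": 1, "balance_cents": 9900}),
]
print(warehouse_rows)

The details of the mapping do not matter; the point is that every source convention is resolved to a single warehouse convention before the data is merged.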
Three types of loads are made into the data warehouse from the operational environments:
archival data
data currently in the operational environment
periodic refreshes from updates to the data in the operational environment
Loading archival data is a one-time effort and is often skipped altogether.
Loading current data is not disruptive because the data is loaded only once and usually via downloaded files.
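A minimal sketch of such a one-time load, assuming a hypothetical customer extract file downloaded from the operational side and using sqlite3 only as a stand-in for the warehouse DBMS; the point is that the load reads a flat file rather than touching the live operational database.

import csv
import sqlite3

# Warehouse side (sqlite3 stands in for the warehouse DBMS here).
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS customer (customer_id TEXT, name TEXT, balance REAL)"
)

# The extract file was downloaded from the operational environment beforehand,
# so the operational system is not disrupted during the load itself.
with open("customer_extract.csv", newline="") as f:
    rows = [(r["customer_id"], r["name"], float(r["balance"]))
            for r in csv.DictReader(f)]

warehouse.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
warehouse.commit()
warehouse.close()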
Loading periodic refreshes, on the other hand, is a major concern because we have to find which data has already been loaded into the warehouse and which has not. Some common techniques used to limit the scanning of data are as follows:
1) scan time-stamped data in the source and exclude records that are outside the refresh window (a sketch of this technique follows the list); however, the source may not have timestamps
2) scan a delta file, which contains only the changes made to the application data as a result of transactions
3) scan the log file or the audit file that is generated as a by-product of transaction processing
4) modify the application code to manage the amount of data scanned; however, application code is often old, legacy, and fragile
5) the last option is literally a last resort: take before and after snapshot images of the operational file and compare them to find the changes, which dedicates a large amount of resources to the task
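To make technique 1) concrete, here is a minimal sketch of a timestamp-limited refresh, assuming a hypothetical operational table with a last_updated column and again using sqlite3 as a stand-in for the operational system; only rows changed since the previous refresh are carried into the load.

import sqlite3
from datetime import datetime, timedelta

# Time of the previous refresh; in practice this would come from load metadata.
last_refresh = datetime.now() - timedelta(days=1)

source = sqlite3.connect("operational.db")   # stand-in for the operational system
changed_rows = source.execute(
    "SELECT customer_id, name, balance, last_updated "
    "FROM customer "
    "WHERE last_updated > ?",
    (last_refresh.isoformat(),),
).fetchall()
source.close()

# Only changed_rows are carried forward into the warehouse refresh;
# rows whose timestamp falls outside the range never enter the load.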