Thursday, December 26, 2013

We return to the book on building the data warehouse:
We will now look at external data for the data warehouse. Up to now, all the data we have discussed comes from internal systems, i.e. systems internal to the corporation, and it has already been processed into a regularly occurring format. External data raises several issues, but first let's look at why the data warehouse is the right place to store it. External data should not invade the system, and it should not lose the information about its source; both are best achieved in a single centralized data warehouse. Further, if external data does not come to the warehouse, it could enter through formats and channels that are not managed. Lastly, we want to track external data, which a warehouse facilitates.
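To make the point about keeping the source and tracking external data concrete, here is a minimal sketch of a registry that records where each external document came from. The names `ExternalDataRecord`, `ExternalDataRegistry` and all of their fields are hypothetical and chosen only for illustration; the book does not prescribe this layout.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExternalDataRecord:
    """Hypothetical metadata kept for each piece of external data."""
    document_id: str
    source: str        # who produced the data
    received_on: date  # when it entered the warehouse
    description: str


class ExternalDataRegistry:
    """Single place that tracks every external document and its source."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # Refuse duplicates so a document's source is never overwritten.
        if record.document_id in self._records:
            raise ValueError(f"already registered: {record.document_id}")
        self._records[record.document_id] = record

    def source_of(self, document_id):
        return self._records[document_id].source

    def received_from(self, source):
        # All document ids that came from one source, in sorted order.
        return sorted(
            r.document_id for r in self._records.values() if r.source == source
        )
```

Because every document passes through `register`, the question "where did this come from?" always has one answer, which is the benefit the paragraph above attributes to a centralized warehouse.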
On the other hand, external data is difficult to deal with for the following reasons:
External data has no set frequency of availability. It is not within the corporation's control, so it may even call for constant monitoring and alerts.
The second problem is that external data is totally undisciplined. It may need considerable reformatting and restructuring, and that processing breaks every so often.
The third factor that makes the external data hard to capture is its unpredictability. External data may come from practically any source at any time.
External data can be of the following types:
Records of external data collected by some source
External data from random reports, articles and other sources
There are several ways to capture and store external data. Near-line storage can keep the external data accessible without costing huge amounts of money to store. Indexes can be created on the external data and kept on disk to reduce the traffic to the external data itself.
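The index idea can be sketched roughly as follows: a small keyword index lives on fast disk, and the slower near-line store is touched only when a document is actually needed. `NearLineStore` and `DiskIndex` are hypothetical in-memory stand-ins for illustration, not real storage APIs, and the keyword scheme is an assumption.

```python
class NearLineStore:
    """Stand-in for slower bulk storage; counts how often it is read."""

    def __init__(self):
        self._documents = {}
        self.reads = 0

    def put(self, doc_id, text):
        self._documents[doc_id] = text

    def fetch(self, doc_id):
        self.reads += 1
        return self._documents[doc_id]


class DiskIndex:
    """Small keyword index kept on fast storage."""

    def __init__(self):
        self._postings = {}

    def add(self, doc_id, keywords):
        for keyword in keywords:
            self._postings.setdefault(keyword.lower(), set()).add(doc_id)

    def lookup(self, keyword):
        # Answered entirely from the index; the near-line store is not read.
        return sorted(self._postings.get(keyword.lower(), set()))


store = NearLineStore()
index = DiskIndex()

store.put("rpt-1", "Quarterly retail prices by region")
index.add("rpt-1", ["retail", "prices"])
store.put("rpt-2", "Energy prices outlook")
index.add("rpt-2", ["energy", "prices"])

matches = index.lookup("prices")   # served by the index alone
document = store.fetch(matches[0])  # one near-line read, only when needed
```

Queries that only ask "which documents mention X?" never reach the near-line store, which is how the index reduces traffic to the external data.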
Another technique for handling external data is to create two stores - one to store all the data and another to store only a subset.
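A minimal sketch of the two-store technique, assuming the subset is chosen by a caller-supplied predicate. The source does not say how the subset is selected, so `keep_in_subset` and the class name `TwoTierExternalStore` are assumptions made for illustration.

```python
class TwoTierExternalStore:
    """Keeps every record in `full` and a selected subset in `subset`."""

    def __init__(self, keep_in_subset):
        # keep_in_subset: hypothetical predicate deciding what goes in the subset.
        self._keep = keep_in_subset
        self.full = {}
        self.subset = {}

    def add(self, doc_id, record):
        self.full[doc_id] = record  # the full store receives everything
        if self._keep(record):
            self.subset[doc_id] = record

    def get(self, doc_id):
        # Try the smaller store first, then fall back to the full store.
        if doc_id in self.subset:
            return self.subset[doc_id]
        return self.full[doc_id]


# Example: keep only records from 2013 onward in the subset.
stores = TwoTierExternalStore(lambda record: record["year"] >= 2013)
stores.add("a", {"year": 2009, "title": "Old market survey"})
stores.add("b", {"year": 2013, "title": "Current market survey"})
```

Here `stores.full` holds both records while `stores.subset` holds only `"b"`, so routine analysis can work against the smaller store without losing anything from the complete one.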
