Saturday, December 21, 2013

We now look at granularity in the data warehouse. We choose the level of granularity based on what matters to the environments that depend on the warehouse data, as well as what is efficient for the system given the volume of data. By setting the granularity at the right level and spilling the excess volume of data to overflow storage, the other aspects of the design and implementation can flow smoothly.
To determine the volume of data, we estimate the number of rows. If there are only a few thousand rows, any granularity will do. If there are millions of rows, a low level of granularity can be chosen. If there are billions of rows, a higher level of granularity is required.
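As a rough illustration, the decision can be expressed as a simple lookup on the estimated row count. The thresholds below are my own placeholders, not figures from the book; the point is only that the row count drives the choice.

```python
def suggested_granularity(estimated_rows):
    """Map an estimated row count to a granularity recommendation.

    The cut-off values are illustrative placeholders, not prescriptions
    from the book.
    """
    if estimated_rows < 100_000:
        return "any granularity will do"
    elif estimated_rows < 1_000_000_000:
        return "a low (fine) level of granularity can be chosen"
    else:
        return "a higher level of granularity plus overflow storage is required"


print(suggested_granularity(50_000))          # any granularity will do
print(suggested_granularity(250_000_000))     # low (fine) granularity
print(suggested_granularity(3_000_000_000))   # higher granularity + overflow
```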
The rule of thumb cited in the book suggests the following:
1) For each known table:
- estimate how big a row can be, keeping both the biggest and the smallest estimates
- for the 1-year horizon, estimate the maximum and minimum number of rows possible
- for the 5-year horizon, estimate the maximum and minimum number of rows possible
- for each key of the table, find its size in bytes
- with the row counts from above, calculate the max 1-year space as the biggest row size times the number of 1-year max rows, and the min 1-year space as the smallest row size times the number of 1-year min rows plus index space
2) We repeat this for all the known tables (a sketch of this calculation follows the list).
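Here is a minimal sketch of the per-table calculation and the roll-up across tables. The table names, row sizes, row counts and key sizes are made-up placeholders, and index space is approximated as one key-sized entry per row; only the arithmetic mirrors the steps above.

```python
def one_year_space(biggest_row, smallest_row, max_rows, min_rows, key_bytes):
    """Rule-of-thumb 1-year space estimate for one table.

    max space = biggest row size  * max 1-year rows
    min space = smallest row size * min 1-year rows + index space,
    where index space is taken as one key-sized entry per row.
    """
    index_space = min_rows * key_bytes
    return biggest_row * max_rows, smallest_row * min_rows + index_space


# Hypothetical tables with placeholder estimates (sizes in bytes).
tables = {
    "orders":    dict(biggest_row=400, smallest_row=250,
                      max_rows=50_000_000, min_rows=20_000_000, key_bytes=12),
    "shipments": dict(biggest_row=300, smallest_row=180,
                      max_rows=10_000_000, min_rows=4_000_000, key_bytes=8),
}

total_max = total_min = 0
for name, est in tables.items():
    hi, lo = one_year_space(**est)
    total_max += hi
    total_min += lo
    print(f"{name}: {lo:,} to {hi:,} bytes (1-year)")

print(f"total 1-year estimate: {total_min:,} to {total_max:,} bytes")
```

The same calculation is then repeated for the 5-year horizon using the 5-year row estimates.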
This way we can estimate the volume of data to manage. The author adds some interesting caveats here: first, projections of the size of the database are usually low, and second, the warehouse usually grows faster than the projection.
These row estimates, together with the number of direct-access storage devices required, help with the planning process. Note that the number of records matters more than the size of the records. In fact, as long as all rows are reasonably sized, the calculation of index entries depends entirely on the number of rows, and it is these row counts that drive the choice of granularity and the use of overflow storage.
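To see why the row count dominates, here is a small back-of-the-envelope check with a placeholder 12-byte index entry: the index estimate grows with the number of rows and is unaffected by how wide each row is.

```python
index_entry_bytes = 12  # placeholder key/entry size

for rows in (1_000_000, 100_000_000, 1_000_000_000):
    # One index entry per row, regardless of record width.
    index_bytes = rows * index_entry_bytes
    print(f"{rows:>13,} rows -> ~{index_bytes / 10**9:.2f} GB of index entries")
```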
