Tuesday, December 31, 2013

We now look at advanced topics of the data warehouse. Some of these don't fit well with the others, so they are covered here. We start with the requirements. Where do these come from? They come from the common data model. The end user requirements influence the model, but their role is indirect. The model has different levels - high, mid and low. The high level consists of the subject areas. The mid level identifies the keys, attributes, relationships and other details of the data warehouse. The low level is where the physical database design is done, including partitioning, foreign key relationships and indexes. When the relational database is ready, the end user requirements are used to show how the data is to be reshaped.
Resource contention in the data warehouse often occurs due to statistical processing. If it occurs only periodically, the statistical processing can be interspersed with the regular workload. If there is no room for the statistical processing, it is better to build out a special warehouse, called an exploration warehouse, for it. The purpose of an exploration warehouse is to provide a foundation for heavy statistical analysis.
Another reason for separating the exploration warehouse from the main warehouse is that statistical processing requires a very different environment from regular analytic processing. Another reason is database design. For example, a convenience field is often added to streamline the statistical analysis, and the data is recast for it; the exploration warehouse is seldom a direct copy of the data found in the data warehouse.
There is a difference between the exploration warehouse and the data mining warehouse. They are similar, but the exploration warehouse is used to create assertions, hypotheses and observations, while the data mining warehouse is used to prove the strength or truth of a hypothesis. The exploration warehouse must provide breadth of information; the data mining warehouse needs depth.
Unlike the data warehouse, the exploration warehouse is not refreshed with the most current detailed data, because the processing on the exploration warehouse requires the data to be frozen.
One other difference between the main data warehouse and the exploration warehouse is that external data fits easily into the exploration warehouse, and external and internal results often provide interesting comparisons. On the other hand, external data is difficult to integrate with the finely detailed structure of the main warehouse.
Another topic is whether data marts and the data warehouse should share the same processor. There are several reasons why they should not: first, large processors are expensive, and separating the data marts onto smaller machines brings costs down; second, the workloads are different and are more manageable apart; third, different departments can take ownership of their data marts.
There is a life cycle for data as it enters the corporation and is used: it is first captured, then edited, then loaded into the application; the online data is integrated and passed into the warehouse, where the frequency of usage drops; it is then sent to near-line storage and subsequently retired to archival storage.
We also mention testing and the data warehouse. A test environment is a good staging ground: data changes can be made and verified there before being made visible to the public. The amount of data in the test environment need only be enough to test the changes.
Tracing the flow of data through the warehouse is another good practice. The data is traced as it flows through initial capture, ETL enterprise integration, data mart transformation, the exploration warehouse or data mining, and near-line storage. Tracing gives information about the data, such as what revenue means when there are multiple definitions or an unclear origin or lineage. One common issue with tracing is a name change or value change. Another, less common, issue is the recalculation of a unit of data. Both of these data and metadata changes need to be considered when tracing.
Another factor related to tracing is velocity. The velocity of data refers to the speed at which it passes through the warehouse from initial capture to the point of use by an analyst. Integration may hinder velocity.
Data may also be pushed from the operational environment into the data warehouse, or pulled by data marts from the warehouse. Both affect its velocity.
An important application of the data warehouse is eBusiness. From the start, eBusiness systems were kept separate from the operational environment. With the data warehouse and the ODS, it is easier to interface the web with corporate information.


Monday, December 30, 2013

We talk about the differences between the relational and the star join models as the basis for data warehouse design. The first difference is flexibility versus performance. The relational model is highly flexible but is not optimized for performance. The multidimensional model targets a specific user community and hence performs better for it. When new requirements are added to a star join model, its performance optimizations become moot. The relational model carries no particular optimization for performance one way or the other, but the data is stored at the lowest level of granularity. If performance is required, it is better to pull data out of the relational model and then optimize access for performance.
The relational model is shaped by a pure data model, while the multidimensional model is shaped by processing requirements, and this leads to several consequences. The first is serviceability: as discussed, the relational model is more flexible and works well when data is accessed indirectly, while the multidimensional model is good for direct access of data. The second is the level of abstraction and origins: the relational model organizes data at a higher level of abstraction and can support many different usages, while the dimensional model is built around specific processes and suits only some workflows.
The relational model supports shaping and reshaping into many different models of data. For example, merging is easy because data is stored at the most granular, normalized level. With this granularity, it is also easy to meet future unknown needs, because the atomic data can be combined into more meaningful representations once the need becomes known. Yet another advantage of the relational model is the ability to change gracefully: it is designed to be used in an indirect fashion, which means there is a lag between changes to the data and the direct uses of that data.
The multidimensional model is aligned with what is referred to as the independent data mart approach. A data mart is a data structure dedicated to serving the analytical needs of one group of people. With independent data marts, it is easier and less expensive to build systems as the organization's data grows. A dependent data mart, on the other hand, is built from data coming from the warehouse. The dependent data mart does not depend on legacy or operational data for its source; it depends only on the data warehouse. The dependent data mart requires forethought and investment.
However, independent data marts, as they crop up, tend to have redundancies between systems. With the progression of time, this data may become unincorporated, uncorrelated and, even worse, unsynchronized, and the redundancy in data grows exponentially. Once the momentum for independent data marts is fully established, it affects the long term perspective and maintenance. Consider the warehouse on the other hand: the data is more reusable, there are a limited number of interface programs, and the data is reconcilable. Further, the architectural problems of independent data marts would not have arisen.
Hence the recommendation is against independent data marts and in favor of a data warehouse, despite the challenges encountered.

Sunday, December 29, 2013

In this post, we will discuss the really large data warehouse. As we have seen, data warehouses grow at a rapid rate - from a few gigabytes to terabytes in a few years. This is primarily due to the following contributors:
1) history - nowhere else in the system do we keep historical data as we do in the warehouse. Application developers are quick to remove historical data because it affects performance.
2) granularity - the data warehouse collects data at the most granular level. This lets developers look at data like no one could before.
3) integration - the data warehouse brings together data from the different operational systems. It is the database for the entire enterprise, with a common data model, common keys, common reference tables and common definitions.
These are referred to in the book as:
Historical data × Diverse data × Detailed data = Lots of data
The impact of this data is that there are 1) associated costs, 2) value in terms of usefulness of the data and 3) maintenance in terms of the rules of data management.
Basic data management activities include loading, creating an index etc.
The real costs of storage include the costs for the processor (DBMS + OS) software, communications controller, the disk controller and the data store.
When data grows, the percentage of data that is actually used drops. For example, at 50 GB almost all the data is used; at 500 GB only 75% is used; at 2 TB, 35%; and at 10 TB, only 5%. This is important to understand because it enables efficient planning of storage technology.
To calculate this usage of data, we first determine the size of the data access. We calculate this as
Number of users × Queries per day × Bytes per query × 200 days
We can also divide by an overlap factor that accounts for the same data being pulled more than once.
Typical numbers are 250 GB of data serving 600 users.
Next, the usage of data is calculated as a ratio:
Usage_ratio = Actual bytes used / Total data warehouse bytes
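A minimal sketch of these two estimates in C#, with purely hypothetical figures plugged in for the variables above (the 250 GB and 600 users come from the text; the rest are made up):

using System;

class UsageEstimate
{
    // Rough yearly data access, per the formula above:
    // users x queries per day x bytes per query x 200 working days, divided by an overlap factor.
    static double BytesAccessedPerYear(int users, int queriesPerDay, long bytesPerQuery,
                                       int workingDays = 200, double overlapFactor = 1.0)
    {
        return users * (double)queriesPerDay * bytesPerQuery * workingDays / overlapFactor;
    }

    static void Main()
    {
        // Hypothetical figures: 600 users, 5 queries a day, 100 KB pulled per query,
        // and an overlap factor of 2 (the same data pulled twice on average).
        double bytesUsed = BytesAccessedPerYear(600, 5, 100_000, 200, 2.0);

        double totalWarehouseBytes = 250e9;   // a 250 GB warehouse, as in the text
        double usageRatio = bytesUsed / totalWarehouseBytes;

        Console.WriteLine($"Bytes accessed per year: {bytesUsed:N0}");
        Console.WriteLine($"Usage ratio: {usageRatio:P1}");
    }
}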
Note that the separation of frequently used and infrequently used data differs across systems. In an OLTP environment, there is random access and the two are interspersed. In a DSS environment, there is a much sharper division between frequently and infrequently used data.


Saturday, December 28, 2013

We now look at unstructured data and the warehouse. Combining structured data and unstructured data is like integrating two different worlds. CRM is an example: in CRM, we have structured data corresponding to the customer demographics and unstructured data corresponding to documents and communications from the customer.
Text is common to both worlds, and with it we can match the communications to the customer. However, text is susceptible to misspellings, mismatched context, identical names, nicknames, incomplete names and word stems.
Using stop words and stemming, we can do a probabilistic match where all the data that intersects is used to score a match. The strength of the match can be depicted on a numeric scale.
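A minimal sketch of such a probabilistic match in C#, assuming a trivial suffix-stripping stemmer and a tiny stop-word list (a real implementation would use something like a Porter stemmer); the strength of the match is the overlap of stemmed, non-stop-word terms on a 0-to-1 scale:

using System;
using System.Collections.Generic;
using System.Linq;

class TextMatch
{
    static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "a", "an", "and", "of", "to", "for" };

    // Naive suffix-stripping stemmer, standing in for a real one such as Porter's.
    static string Stem(string word)
    {
        foreach (var suffix in new[] { "ing", "ed", "es", "s" })
            if (word.Length > suffix.Length + 2 && word.EndsWith(suffix))
                return word.Substring(0, word.Length - suffix.Length);
        return word;
    }

    static HashSet<string> Terms(string text)
    {
        var tokens = text.ToLowerInvariant()
            .Split(new[] { ' ', ',', '.', ';', ':' }, StringSplitOptions.RemoveEmptyEntries);
        return new HashSet<string>(tokens.Where(w => !StopWords.Contains(w)).Select(Stem));
    }

    // Strength of the match on a 0..1 scale: shared terms over the smaller term set.
    static double MatchStrength(string structuredText, string unstructuredText)
    {
        var a = Terms(structuredText);
        var b = Terms(unstructuredText);
        if (a.Count == 0 || b.Count == 0) return 0.0;
        return (double)a.Intersect(b).Count() / Math.Min(a.Count, b.Count);
    }

    static void Main()
    {
        double strength = MatchStrength("Jon Smith, Account Manager",
                                        "Called Jon Smith regarding the overdue account");
        Console.WriteLine($"Match strength: {strength:F2}");
    }
}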
A themed match is another way to make a match, where the unstructured data is organized by themes such as sales, marketing, human resources, engineering, accounting and distribution. Once a collection of industrially recognized themes has been gathered, the unstructured data is passed against all the themes for a tally.
Linkages between themes and theme words can be done via a raw match of the data.
Another way to link two documents is based on the metadata in the structured environment.
A two-tiered data warehouse is typically used when unstructured data is brought into the data warehouse: one tier holds the structured data and the other tier the unstructured data.
Unstructured data can be further subdivided, for example into communications and documents.
The relationship between communications and unstructured data is formed on the basis of identifiers.
Visualization for unstructured data is very different from that for structured data, which is more business intelligence oriented. For unstructured data, we create a self-organizing map in which clusters are depicted.
Volume of data is an issue with every warehouse, whether it holds unstructured or structured data.
The two environments are fit together based on identifiers. In the structured environment we keep both metadata and data pertaining to the unstructured environment. The metadata has information about the repository. The records have information about the data, the identifier and the close identifier. Close identifiers are those where there is a good indication that a match has been made.

When discussing the migration of data to the architected environment during the build phase of warehouse, it is important to discuss the feedback loop. First the data warehouse is populated from existing environments. Then the DSS analyst uses it for analysis and gives new requirements to the architect. The architect works on the existing systems to meet the requirements from the warehouse. This is the feedback loop.
The loop requires some data to be available in the warehouse for the DSS analyst to come up with new requirements. The shorter the loop, the more successful the warehouse; the larger the volume, the longer the loop.
The architect owns the data model. All feedback is translated as changes to this data model which is then used together with the existing system of record to improve the warehouse.
The architect also has to keep track of the agents of change, viz. the aging of systems, the aging of technology, organizational upheaval and drastically changed requirements.
These agents of change let the architect coordinate the efforts to rebuild the operational systems. This works even better than changing the data model.
The steps taken to restructure the operational environment are as follows:
identify how the data model differs from the existing systems
study the impact of each delta item
estimate how much it will cost to fix each delta item
and prepare a report to management on what needs to be fixed, the estimated resources, the order of work, and the disruption analysis.
The methodology used for the data warehouse is called the spiral development methodology. It differs from the migration path described above: the migration path is about general activities, while the methodology describes the specific activities, their deliverables and the order of the activities. Together, the methodology and the migration path cover all that is required for building the warehouse.
A word of caution about methodologies: there is an industry track record of even the most rationally appealing methodologies failing during implementation.
This happens for reasons such as these: they prescribe a set of activities that often ignores iteration; they describe how to do something but not what to do; and they ignore the scope or the detail of the work involved. Other reasons include mixing project management with development and design, not differentiating operational from DSS systems, and not including checkpoints and stops on failure.
In this sense the spiral methodology is different from the others in that it proceeds down the same path repeatedly, as opposed to, say, the waterfall methodology. Further, it is said to be data driven.

Friday, December 27, 2013

Since we have discussed building the data warehouse, we will now discuss the migration to this environment and the ease of doing so. We mentioned earlier that this is a step-by-step activity. Most successful implementations have been built one iteration at a time. The size and speed of the iterative development are important, and the results must be delivered quickly.
But migration is different from iterative development. Many migrations have failed, and hence the book describes a generic migration plan that has worked in many cases.
This migration plan begins with a corporate data model. This data model includes major subjects, their definitions, relationships, and groupings.
Note that this model excludes the derived data or DSS data because they are subject to change and are temporal.
After the corporate data model and the mid-level models are in place, the next activity is defining the system of record. The system of record is defined in terms of the corporation's existing systems. It represents the best data in terms of completeness, timeliness and accuracy.
Next the challenges in bringing the data into the data warehouse are identified.
These could include the following:
if the data arrives in large volumes, the arrival rate has to be accounted for when handling the influx of data
voluminous data may also need to be summarized, aggregated or partitioned
data may have to be extracted through interfaces, and special transformations including integration, alteration of the time basis, condensation and scanning may need to be done.
If the data modeling activity has been done properly, the design of the data warehouse is fairly simple. When the design is complete, the warehouse is organized by subject area.
Once the interfaces have been built, the first subject area is populated. The data is captured and transported into the warehouse environment. Sometimes only a fraction of data is required.
End users work by discovering the interfaces and then using them as the data is populated.
Often the first population of the data is incorrect, but it yields many of the corrections needed to populate it correctly going forward.
The population and the feedback processes continue afterwards.
By adding a lag time, the data warehouse developer serves the DSS needs of the company. This lag time is usually 24 hours.
Sometimes there is a call for rapid population of the warehouse, referred to as active data warehousing, since it supports a small amount of online access processing.
We continue our discussion on external data and the data warehouse. Today we look at the role of metadata for external data. The metadata includes document ID, date of entry into the warehouse, description of the document, source, source date, classification, index words, purge date, physical location, length, and references. Associated with external data is another type of data called notification data.
This is merely a convenience that lets users subscribe to alerts when certain types of external data become available.
External data is not directly stored in the warehouse. Its entries are made in the metadata of the data warehouse and the external data resides elsewhere.
One of the important design considerations of external data is that it often contains many different components some of which are more useful than others, so they have to be divided into manageable units.
Since external data cannot be reshaped, it has no place in the data model. The best that can be done here seems to be to keep references to the external data.
Summarizations and trends of external data are often more useful than the raw feed and are stored with the external data. Take the Dow Jones index, for example: the daily reading is important, but trends such as the monthly average can be calculated and kept together with the external data.
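A minimal sketch of that kind of summarization in C#, grouping hypothetical daily readings (the figures are made up) into monthly averages that would be stored alongside the external data:

using System;
using System.Collections.Generic;
using System.Linq;

class ExternalTrends
{
    static void Main()
    {
        // Hypothetical daily closes from an external index feed.
        var readings = new List<(DateTime Date, double Close)>
        {
            (new DateTime(2013, 11, 29), 16100.0),
            (new DateTime(2013, 12, 02), 16000.0),
            (new DateTime(2013, 12, 31), 16550.0),
        };

        // The monthly averages are computed once and kept with the external data,
        // so the trend does not have to be recomputed from the daily readings each time.
        var monthlyAverages = readings
            .GroupBy(r => new { r.Date.Year, r.Date.Month })
            .Select(g => new { g.Key.Year, g.Key.Month, Average = g.Average(r => r.Close) })
            .OrderBy(m => m.Year).ThenBy(m => m.Month);

        foreach (var m in monthlyAverages)
            Console.WriteLine($"{m.Year}-{m.Month:D2}: average close {m.Average:F2}");
    }
}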
Every piece of information, external or otherwise, has a lifetime at the end of which it can be discarded or archived. On the other hand, the cost of the metadata is quite low, so it can be kept intact.
One of the advantages of using external data is that it can be compared with internal data, which gives management a unique perspective. This comparison is usually done on a common key, but that common key is sometimes difficult to find. For example, the key structure of the external data may need to be converted to that of the internal data. Or the data may need cleansing, such as segregating products so that apples are compared with apples and not with oranges.
In short, we have seen that external data holds a lot of information but has many issues to deal with. Hence it is captured, processed, stored on different media, set up with notifications and referenced within the warehouse. The warehouse manages and tracks the external data without the latter pervasively invading the system.

---------------
In addition to our usual post today, I want to show sample code for using the UI automation library with iOS:
The following code tries to add a sentence six times to the Notes app.

var target = UIATarget.localTarget();
var app = target.frontMostApp();
var testName = 'Take Notes';
UIALogger.logStart(testName);
for( var i = 0; i < 6; i++)
{
if ( i == 0) app.navigationBar().buttons()["Add"].tap();
else app.mainWindow().tableViews()[0].cells()[0].tap();
var page = app.mainWindow().textFields()[0];
var currentContent = page.getValue();
currentContent +=  "The big brown bear ate the brownies in the big paper bag." ;
page.setValue(currentContent);
UIALogger.logMessage("Added sentence : " + i );
app.navigationBar().buttons()["Notes"].tap();
}
app.mainWindow().tableViews()[0].cells()[0].tap();
if (app.mainWindow().textFields()[0].getValue().match(/bear/g)) {    UIALogger.logPass(testName);}
else {    UIALogger.logFail(testName);}

Thursday, December 26, 2013

We return to the book on building the data warehouse:
We will now look into external data for the data warehouse. Up to now, all the data we have discussed comes from internal systems, i.e. systems internal to the corporation, and has already been processed into a regularly occurring format. There are several issues that come with external data, but first let's look at why the data warehouse is the right place to store it. The external data should not invade the system, and it should not lose the information about its source; this is best ensured in a single centralized data warehouse. Further, if it doesn't come to the warehouse, it could arrive in formats and through channels that are not managed. Lastly, we want tracking of external data, which is facilitated by a warehouse.
On the other hand, external data is difficult to deal with for the following reasons:
External data doesn't have a regular frequency of availability. It's not within the corporation's control. It may even call for constant monitoring and alerts.
The other problem is that external data is totally undisciplined. Considerable reformatting and restructuring may be required, and it breaks every so often.
The third factor that makes the external data hard to capture is its unpredictability. External data may come from practically any source at any time.
External data can be of the following types:
Records of excellent data collected by some source
External data from random reports, articles and other sources.
There are several ways to capture and store external data. Near-line storage can help make the external data accessible without the huge cost of keeping all of it on disk. Indexes can be created on the external data and kept on disk to reduce the traffic to the external data itself.
Another technique for handling external data is to create two stores - one to store all the data and another to store only a subset.

I will take a short break from our readings from the book on building a data warehouse to discuss the implementation of Matsuo's algorithm. I've mentioned that the algorithm proceeds with the following steps: preprocessing, selection of frequent terms, clustering of frequent terms, calculation of the expected probability as the ratio of term co-occurrences with a cluster to the total terms, calculation of the chi-square value based on the differences between expected and observed frequencies, and output of the keywords.
In this post, I want to revisit the step involving clustering of frequent terms. Note that we find the term frequencies, set the cutoff at the top 30% to pick the frequent terms, and then build the co-occurrence matrix between all terms and the frequent terms, where co-occurrences are counted as the number of sentences in which the terms co-occur.
Now we transpose the columns of this co-occurrence matrix and find the clusters taking two terms at a time. We find the mutual information measure between two terms x and y from P(x,y) / (P(x)P(y)).
We can take the P(x) as the expected probability which is the sum of the total number of terms in sentences where x appears divided by the total number of terms in the document.
We could use hierarchical clustering methods. We take terms pairwise and put them in a cluster if their mutual information is above a threshold, which we choose to be log(2.0). A term is said to co-occur with a cluster if it co-occurs with any term in the cluster.
Just as we populated the co-occurrence matrix, we take pairs of terms and cluster them. We maintain a term array containing cluster labels or numbers, starting with 1 for the first cluster and incrementing for every new cluster. If one of the terms is already in a cluster, the other term joins that cluster. If the other term belongs to a different cluster, the clusters are merged; the higher of the two cluster numbers is used for the merged cluster, and all members of both clusters are reassigned to it. If a term belongs to more than one cluster, it is merged with the nearest cluster. Eventually, every term is assigned to exactly one cluster, more than one term may be assigned to a single cluster, and all clusters have distinct labels or numbers.
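A minimal sketch of this pairwise clustering step in C#, assuming the term probabilities have already been computed elsewhere. I take the measure as log(P(x,y) / (P(x)P(y))) so that the log(2.0) threshold applies, and the label-merging rule follows the description above; the term names and probabilities in Main are made up.

using System;
using System.Collections.Generic;

class TermClustering
{
    // Cluster frequent terms pairwise: two terms join the same cluster when
    // log(P(x,y) / (P(x) * P(y))) exceeds the threshold log(2.0).
    static Dictionary<string, int> Cluster(
        IList<string> terms,
        Func<string, double> p,               // P(x): expected probability of a term
        Func<string, string, double> pJoint)  // P(x,y): co-occurrence probability of two terms
    {
        var label = new Dictionary<string, int>();   // term -> cluster number
        int nextLabel = 1;
        double threshold = Math.Log(2.0);

        for (int i = 0; i < terms.Count; i++)
        {
            for (int j = i + 1; j < terms.Count; j++)
            {
                string x = terms[i], y = terms[j];
                double joint = pJoint(x, y);
                if (joint <= 0) continue;

                double mi = Math.Log(joint / (p(x) * p(y)));
                if (mi <= threshold) continue;

                bool hasX = label.TryGetValue(x, out int lx);
                bool hasY = label.TryGetValue(y, out int ly);

                if (!hasX && !hasY)            // neither term clustered yet: start a new cluster
                    label[x] = label[y] = nextLabel++;
                else if (hasX && !hasY)        // y joins x's cluster
                    label[y] = lx;
                else if (!hasX && hasY)        // x joins y's cluster
                    label[x] = ly;
                else if (lx != ly)             // both clustered: merge, keeping the higher number
                {
                    int keep = Math.Max(lx, ly), drop = Math.Min(lx, ly);
                    foreach (var t in new List<string>(label.Keys))
                        if (label[t] == drop) label[t] = keep;
                }
            }
        }

        // Any frequent term never paired above the threshold gets a cluster of its own.
        foreach (var t in terms)
            if (!label.ContainsKey(t)) label[t] = nextLabel++;

        return label;
    }

    static void Main()
    {
        // Toy probabilities for three hypothetical frequent terms.
        var p = new Dictionary<string, double> { ["data"] = 0.2, ["warehouse"] = 0.15, ["granularity"] = 0.1 };
        var pJoint = new Dictionary<(string, string), double>
        {
            [("data", "warehouse")] = 0.12,      // well above independence: clustered together
            [("data", "granularity")] = 0.02,    // at independence: stays apart
            [("warehouse", "granularity")] = 0.015,
        };

        var labels = Cluster(new[] { "data", "warehouse", "granularity" },
                             x => p[x],
                             (x, y) => pJoint.TryGetValue((x, y), out var v) ? v
                                     : pJoint.TryGetValue((y, x), out v) ? v : 0.0);

        foreach (var kv in labels)
            Console.WriteLine($"{kv.Key} -> cluster {kv.Value}");
    }
}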




Wednesday, December 25, 2013

We looked at distributed data warehouses in detail in the previous post. We will look at executive information systems (EIS) and the data warehouse next. EIS was the predecessor of the data warehouse. EIS was the notion that computational information should be available to executives via appealing screens, but it was more about the presentation than the computation. When the data warehouse first appeared, it was not accepted by the EIS community, yet the warehouse provided exactly what EIS was lacking. EIS, it could be argued, later morphed into business intelligence, especially in terms of key-ratio indicators, trend analysis, drill-down analysis, problem monitoring, competitive analysis and key performance indicators. Trend analysis works on time-sequenced measurements, drill-down analysis is slicing and dicing of the data, and key performance indicators can be such things as cash on hand, customer pipeline, length of sales cycle, collection time, product channel and competitive products.
To support such EIS operations, there must be a data infrastructure. In fact, it has been observed that for every $1 spent on EIS software, $9 is spent on data preparation. Furthermore, executive analysis shifts focus every now and then, exacerbating the preparation requirements.
This is mitigated with a data warehouse. The warehouse takes care of searching for the definitive source of data, creating special extract programs from existing systems, dealing with un-integrated data, compiling and linking detailed and summary data, finding an appropriate time basis for the data, and coping with changing business needs.
The EIS analyst can go to various levels in the warehouse to find such information: for example, the departmental level of processing, the lightly summarized level, or the archival or dormant level of data. Drill-down follows this same order of levels in the warehouse.
One specific technique in the data warehouse that is useful to EIS processing is event mapping. The simplest way to show event mapping is to use a trend line - for example, revenues varying over the months. Superimposing trend lines also gives correlated information.
Detailed data is a prerequisite for EIS/DSS processing, but just how much detail is needed has also been discussed in this book. One argument is that there is no limit to the amount of detail desired: since the requirements can vary, and vary frequently, being prepared with as much detail as possible is important. However, another approach says to keep just as much as the processing requires, argued by analogy to Zeno's paradox, in which it is suggested that a rabbit cannot outrun a turtle as long as the turtle has a head start. The underlying considerations in these arguments are storage and processing costs, volume that presents a challenge to readability, and the difficulty of reusing previous analysis.
We mentioned the types of distributed data warehouses in the earlier post. We will look into them now. First, the data warehouse may be affected by many different development efforts, and these call for the data architect to manage the changes. For example, the development efforts could be based on different lines of products which are unintegrated, or the same warehouse could be distributed based on geography such as north, south, east or west. There could also be different levels of data within the same data warehouse, such as lightly summarized data, detailed data and OLAP, that are built by different groups. Even the detailed non-distributed part of the same data warehouse could be built by different groups. The unintegrated lines of business pose little or no conflict; however, that case is rare for data warehouses. The second case, multiple data warehouses across locations, is more common: different groups want to isolate their development effort, often at the risk of repeating effort across groups. The third case, building multiple levels of data simultaneously, is much easier to manage than either of the two earlier cases; however, due to the difference in the levels, there are different uses and expectations. The fourth case requires the most attention from the data warehouse architect, because the non-distributed warehouse is being built like slices of a pie and they all need to come together. The architect is responsible for the consistency and growth of the data warehouse. A different approach from the above is to develop completely independent warehouses, integrated by a minimal common corporate warehouse. The corporate warehouse also requires metadata just like the others, but it is simply not tied to the others since there is no business integration.
Now we will look at building the warehouse in multiple levels. Here the main tool for the architect is the set of definitions for interconnectivity between the levels, and the primary data model is driven by the group that is building the current level of detail, because they get to use the data warehouse data model. Interconnectivity addresses compatibility of access at the call level, and the current level of detail must be sufficient to generate the lightly summarized data. The coordination proceeds with agreements, either informal or formal, that are time-sequenced so that no team is left waiting for data that has not been made available yet.
In the case where multiple groups form the current level of detail, if the data is mutually exclusive between the groups, then little or no coordination is needed. Otherwise the overlap creates redundancy and cost. Further, this redundancy introduces inconsistencies and the so-called spider web into the DSS environment. Therefore, a common data model needs to be established. With the separation of a common data model from the local data models, there is no redundancy across the local groups.
The same technology can be used for all the data models. Another strategy is to use different platforms for the different types of data found at the detailed level. There are then data transfers between platforms, and the boundary between technologies may be crossed many times. In all these cases, metadata sits on top of all of the data.

Tuesday, December 24, 2013

After the data warehouse is built, the focus shifts from building the warehouse to day-to-day operations. It turns out that the cost of operating and maintaining a warehouse is high, and the volume of data keeps increasing. Usage by DSS analysts may cause contention for resources, but the biggest expense comes from the periodic refreshes of legacy data.
In order to refresh the data, it must be read from old legacy databases. Repeated and direct reads of old legacy databases are costly. First, the legacy DBMS must be active and online during the read process, which must occur in a window of time, and stretching that window is not welcome. Furthermore, the refreshment scan must process the entire legacy file when only a small percentage of it is required.
A mitigation for such repeated reads is to trap the data as it is being updated. The trapped data can then be processed offline.
Data can be trapped either by data replication or by change data capture. Both techniques are effective. The first requires the data of interest to be identified prior to the update, and a trigger causes the update activity to be trapped; the content and format of the trapped data are clean.
The second approach is change data capture. This approach captures all the changes that have occurred, say, in a day. The log or journal tape is read, and while that can be cumbersome, it happens offline and captures all updates.
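A minimal sketch of the journal-driven variant in C#, assuming a hypothetical pipe-delimited journal file with one change record per line; the point is only that the trapped changes are processed offline and the legacy database itself is never re-scanned:

using System;
using System.IO;
using System.Linq;

class ChangeDataCapture
{
    record Change(DateTime At, string Table, string Key, string Operation, string Payload);

    // Read a day's journal offline and return the trapped changes,
    // so that the refresh never has to re-scan the legacy database directly.
    static Change[] ReadJournal(string journalPath) =>
        File.ReadLines(journalPath)
            .Select(line => line.Split('|'))
            .Select(f => new Change(DateTime.Parse(f[0]), f[1], f[2], f[3], f[4]))
            .ToArray();

    static void Main()
    {
        // Hypothetical journal file name; each line: timestamp|table|key|operation|payload
        foreach (var c in ReadJournal("legacy-journal-2013-12-24.log"))
            Console.WriteLine($"{c.At:u} {c.Operation} {c.Table}[{c.Key}]");
    }
}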
As with any production database environment, data warehouse testing seems to call for two parallel environments to be set up, one for staging and one for production. The staging environment is where programmers test out new programs.
However, due to the size of the warehouse, providing resources for even one of these environments is sometimes hard. Moreover, programs are run on the warehouse in a heuristic manner, not in the repetitive manner of conventional databases. If there is an issue with the data in the warehouse, the program is simply redone.
As a summary, data warehouses must support:
1) robust language interface
2) compound keys and variable length data
3) manage large amounts of data
4) easily index and monitor data
5) Interface with a wide number of technologies
6) allow the data to be placed directly on disk
7) store and access data in parallel
8) have metadata control in the warehouse
9) efficiently load the warehouse
10) efficiently use the indexes
11) store data in a compact way
12) support compound keys
13) selectively turn off the lock manager
14) do index-only processing
15) quickly restore from bulk-storage

Also we discussed that multi-dimensional OLAP technology is suited for data mart processing and not data warehouse processing and that metadata plays a different role in data warehouse.
Most organizations build a single centralized data warehouse environment.

Data warehouses are not always centralized: businesses are distributed geographically or over multiple, differing product lines.
Distributed data warehouses are of three different types:
1) local and global data warehouses, where the former represents data and processing at a remote site and the latter represents the integration across the business
2) the technologically distributed data warehouse, where a single logical data warehouse is spread out over different physical data warehouses
3) the independently evolving distributed data warehouse, where one warehouse comes up and then another, in an uncoordinated manner.
 

Monday, December 23, 2013

The data warehouse has special needs, so DBMSs have also evolved to include data warehouse-specific features. Instead of transaction processing, the data warehouse calls for load-and-access processing. Data is integrated, transformed and loaded into the warehouse from the operational environments and the ODS. The data warehouse does not have updates; instead, data is stored in a series of snapshot records. When there is a change, a new snapshot record is added rather than an update being done.
Different storage media are required for a data warehouse, often used in a dual environment: one processing environment is the DASD environment, where online interactive processing is done, and the other is tape or mass storage, where the data is accumulated.
The DASD environment could use a different vendor from the mass storage environment. It's also possible that the DASD environment is split over more than one vendor. As long as these choices are deliberate rather than political or historical, they should not pose issues.
When comparing the operational environment to the data warehouse environment, the role of metadata is also very different. In the operational environment, the audience for the metadata is the IT professional, who is usually computer-savvy, whereas in the data warehouse the metadata is used by the DSS analyst, who can use all the help possible; hence the need for metadata services.
In the data warehouse, the metadata tracks the significant transformations  to the data as it passes from operational to data warehouse environments.
Moreover, data in a data warehouse exists for a lengthy span of time and the warehouse changes its structure in this time. Keeping track of this change in structure is a natural task for the metadata. In the operational environment on the other hand, the metadata keeps track of one and only one correct definition of the structure of the data.
Also, operational systems focus on current data, as in the current balance or the current inventory; however, trends are not apparent from current data. They become apparent only when data is accumulated over time. For the span of time over which this data is accumulated, a new dimension called data context becomes important. Data context is all the contextual information, such as changes in the data source, that may explain discrepancies in trends computed from data merely accumulated over time.
Three levels of context information must be maintained - simple, complex and external.
Simple contextual information relates to the basic structure of the data itself and includes structure, encoding, naming conventions and metrics. Complex contextual information includes such things as product definitions, marketing territories, pricing and packaging. External contextual information includes such things as inflation, financial trends, taxation and economic growth. Complex and external contextual information can vary a lot and are harder to keep in a structure.
In the past, collecting contextual information has been difficult because the audience was different, the interest was passive, the resources had vanished, and even when it was collected, it was restricted to simple contextual information.

Sunday, December 22, 2013

Some more technological requirements from the data warehouse involve:
the ability to load the warehouse efficiently
the ability to use indexes efficiently
the requirement to store data in a compact way
the requirement to support composite keys.
Data is loaded into the warehouse either record by record through a language interface or in bulk with a utility. The latter is faster, although it is sometimes better to move a little at a time repeatedly. As the volume of the load becomes a burden, the load is often parallelized: once the input data is divided, each job stream is executed independently of the others. If it's possible to defer loading until all the transformations are complete, that may be an advantage as well. In this vein, another approach to efficiently loading very large amounts of data is staging the data prior to loading. Staging helps when the complexity of processing is high or when merges of data are necessary.
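A minimal sketch of dividing staged input into independent job streams and loading them in parallel; the partition file names and the loader body are hypothetical placeholders:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelLoad
{
    // Each partition of the staged input becomes an independent job stream.
    static void LoadPartition(string partitionPath)
    {
        // A real load would bulk-insert this partition into the warehouse;
        // this placeholder just reports which thread picked it up.
        Console.WriteLine($"Loading {partitionPath} on thread {Environment.CurrentManagedThreadId}");
    }

    static void Main()
    {
        var stagedPartitions = new List<string>   // hypothetical staged files
        {
            "staging/sales_2013_q1.dat",
            "staging/sales_2013_q2.dat",
            "staging/sales_2013_q3.dat",
            "staging/sales_2013_q4.dat",
        };

        // Job streams execute independently of one another.
        Parallel.ForEach(stagedPartitions, LoadPartition);
    }
}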
The data warehouse may need to support the creation and loading of new indexes, and these indexes must be accessed efficiently. Indexes can be bitmaps or have multiple levels; an index may be stored entirely in memory, or compacted when the order of the data being indexed allows it; and there can be selective indexes and range indexes. Compaction of data is a crucial tool for an efficient data warehouse, and since data in the warehouse is seldom updated once inserted, it comes in useful. Compaction and decompaction are CPU-bound rather than I/O-bound, so their cost is not the main concern.
Similarly, compound keys are important to the data warehouse because subjects are usually in different tables related by foreign keys and because the data is time variant. Both these factors require compound keys.
We take a look at some of the database features.
In the data warehouse, since data is relatively stable, variable length records pose no problems.
 Moreover, due to the variety of data found in the data warehouse, variable length structuring of data must be supported.
The lock manager in databases is widely used to prevent updates to the same record at the same time. Since the data warehouse doesn't have updates, and because the lock manager consumes a significant amount of resources, being able to turn this feature on or off helps the performance of the warehouse.
Index-only processing is another feature. It is possible to service a request by using only the index instead of the associated data, and in a data warehouse such a feature can improve performance tremendously. For the feature to work, the query should carry a hint that the index is to be used.
Fast restore - a restore from a direct access storage device (DASD) is usually the norm, which is why there is copying between the recovery repository and DASD. But when the restore can occur directly from secondary storage, the data warehouse benefits tremendously. Fast restore also helps with error detection.
Finally, it is preferable not to use transacted operations in the data warehouse. They become too complicated and are prone to heavy resource consumption. Instead, robust non-transacted operations with retry logic can be used.
We looked at indexing and monitoring in a data warehouse. We now look at the interfaces to many technologies. The data warehouse receives data from the operational systems and the ODS and passes it to the data marts, DSS applications and alternate storage. The passage of data through the warehouse should be smooth and easy. This is an important criterion for the data warehouse; note that it does not mean online passage of data, which is not supported. The tests of whether data passes smoothly through the DBMS are whether the passage requires changes to the format, whether the data can be passed to multidimensional processing, whether change data capture can be applied, and whether context is lost in translation.
When the data is placed, the developer would like to improve the efficiency of access and update by controlling the placement of data at the physical block or page level.
Another requirement is to manage the data in parallel. This will improve performance.
Another technological requirement is to have a solid metadata control. And yet another technological requirement is to have a rich language interface to the data warehouse.
Parallel storage lets the volume grow arbitrarily. The whole issue of parallel storage and management of data is complex. Performance improves as the data is spread over more devices; elapsed access time is roughly inversely proportional to the number of physical devices over which the data is physically distributed.
The use of metadata becomes all the more important with the heuristic, iterative development cycle of the data warehouse. The metadata must always represent the warehouse with up-to-date and accurate information. Metadata contains such things as the following (a sketch of one such entry in code follows the list):
table structures
table attributes
source data
mapping from source data to warehouse
data model
extract logging
common routines for access of data
definitions or descriptions of data
relationships of one unit of data to another.
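A minimal sketch of how one table's metadata entry might be represented, with hypothetical property names covering the items listed above:

using System;
using System.Collections.Generic;

// Hypothetical shape of one warehouse table's metadata entry.
class WarehouseTableMetadata
{
    public string TableName { get; set; }
    public Dictionary<string, string> TableAttributes { get; set; }          // column -> type / description
    public List<string> SourceSystems { get; set; }                          // where the data comes from
    public Dictionary<string, string> SourceToWarehouseMapping { get; set; } // source field -> warehouse field
    public string DataModelEntity { get; set; }                              // entity in the corporate data model
    public List<string> ExtractLog { get; set; }                             // extract logging entries
    public List<string> CommonAccessRoutines { get; set; }                   // common routines for access of data
    public Dictionary<string, string> Definitions { get; set; }              // definitions / descriptions of data
    public List<string> RelatedUnitsOfData { get; set; }                     // relationships of one unit of data to another
    public DateTime StructureEffectiveFrom { get; set; }                     // the structure changes over time
}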
As with data, every technology in the business environment has its own metadata. Report writers, ODS environments and ETL all have their own metadata.
Similarly, the language interface can tremendously improve the usage of data. The requirements of a language interface include the ability to access data a set at a time or a record at a time, to use one or more indexes, to have a SQL interface, and to perform update, delete and insert operations.
Note that the use of CLR and the SQL interfaces satisfy these conditions and were developed to that end.

Saturday, December 21, 2013

We saw how the data warehouse granularity affects the design and how it is decided. The next step is to anticipate the needs of the different architectural entities that will be fed from the data warehouse. This means that the data warehouse has to serve as the lowest common denominator to suit all the entities. Data that is too fine can always be summarized but the reverse is harder.
There is a feedback loop between these systems and the warehouse. We attempt to make it harmonious with the following techniques:
Start with a rapid prototype of very small steps of developing the warehouse, seek early feedback and make fast adjustments.
Use the existing granularity in the participating systems to gain insight into what works
Make recommendations based on the iteration experience and keep the changes visible to the users.
Meet for joint application design and simulate the output for feedback.
If the granularity has to be raised, it can be done in many ways such as :
Summarizing the data from the sources as it goes to the target.
averaging or selecting the minimum and maximum values into the target.
Selecting only the data that is needed into the target.
Selecting a subset of data using conditional logic.
Thus aggregation is used for summarizing, and it can be based on well-known operations such as average, min and max, or on user-defined aggregations.
The feedback loop is probably the single most distinguishing characteristic of a process for building the warehouse. Without the feedback loop with the DSS analyst as described above, it becomes very hard to build it right.
There are often dual levels of granularity used in different industries. The first level is, say, sixty days' worth of operational data, and the second level is a lightly summarized 10-year history.
  Since data from a data warehouse is usually fed into the data marts, the granularity of the data warehouse has to be at least at the lowest level required by any of the data marts.
When the data spills over to the overflow, the granularity is not a consideration any more. To store in the overflow, two pieces of software are used  - a cross-media storage manager (CMSM) that manages the traffic between the disk and the alternate storage and an activity monitor to determine the spill cutoff.
The data warehouse has to manage large amounts of data. The size of the data could be in terabytes and petabytes. The issue of storing and managing large amounts of data plays predominantly into every technology used in the warehouse. There are technologies for managing volumes, managing multiple media, to index and monitor freely and for receiving data and passing to a wide variety of technologies.
The hierarchy of data storage in terms of speed and access is as follows: main memory, expanded memory, cache, DASD, magnetic tape, near-line storage, optical disk and fiche.
Technology that supports indexing is also important. The cost of creating and using an index cannot be significant, and a wide variety of indexes may need to be supported. Monitoring the data warehouse is equally important and involves such things as detecting when reorganization is required, whether an index is poorly structured, whether there is very little spillover data, the composition of the data, and the available remaining space.
We now look at the granularity in the data warehouse. We choose this based on what matters to the environments that depend on the warehouse data, as well as on what is efficient for the system given the volume of the data. By setting the granularity at the right level and spilling the excess volume of data to overflow storage, other aspects of the design and implementation can flow smoothly.
To determine the volume of data, we estimate the number of rows. If there are a few thousand rows, any granularity will do. If there are millions of rows, a low level of granularity can be chosen. If there are billions of rows, a higher level of granularity is required.
The rule of thumb cited in the book suggests the following:
1) For each known table:
 we find how big a row can be and find both the biggest and smallest estimates
For the 1-year horizon, we find the number of rows possible
For the 5-year horizon, we find the number of rows possible
for each key of the table, we find the size in bytes.
With the number of rows from above, we calculate the max 1-year space as the biggest row times the maximum number of 1-year rows, and the min 1-year space as the smallest row times the minimum number of 1-year rows plus index space.
2) We repeat this for all the known tables.
This way we can estimate the volume of data to manage. The author has some interesting caveats here: first, projections of the size of the database are usually low, and second, the growth rate of the warehouse is usually faster than the projection.
These row estimates and the number of direct access storage devices help with the planning process. Note that more than the size of the records, it is the number of records that matters. In fact, as long as all rows are reasonably sized, the calculation of index entries depends entirely on the number of rows, and together these affect the choice of granularity and the overflow storage.
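A minimal sketch of that rule of thumb in C#, with hypothetical table names, row sizes and row counts:

using System;
using System.Collections.Generic;

class SpaceEstimate
{
    // One entry per known table: row-size range, 1-year row-count range, and key size.
    record TableEstimate(string Name, int SmallestRowBytes, int BiggestRowBytes,
                         long MinRowsOneYear, long MaxRowsOneYear, int KeyBytes);

    static void Main()
    {
        var tables = new List<TableEstimate>   // hypothetical tables and figures
        {
            new("Sales",    120, 400, 20_000_000, 50_000_000, 24),
            new("Shipment",  80, 250,  5_000_000, 12_000_000, 16),
        };

        long maxSpace = 0, minSpace = 0;
        foreach (var t in tables)
        {
            // max 1-year space: biggest row times the maximum number of rows
            maxSpace += t.BiggestRowBytes * t.MaxRowsOneYear;
            // min 1-year space: smallest row times the minimum number of rows, plus index space
            minSpace += t.SmallestRowBytes * t.MinRowsOneYear + t.KeyBytes * t.MinRowsOneYear;
        }

        Console.WriteLine($"1-year estimate: {minSpace / 1e9:F1} GB to {maxSpace / 1e9:F1} GB");
    }
}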

Friday, December 20, 2013

We continue our discussion on the data warehouse with the indirect use of data warehouse data. We see an emerging pattern in the examples mentioned in the previous post. The warehouse is periodically analyzed by a program that examines relevant characteristics and criteria. This analysis creates a small file in the online environment that contains succinct information about the business of the enterprise. The small online file is used quickly and efficiently, as is customary in the operational environment.
We look at these in more detail now. First, the program that runs on the warehouse is typically an analysis or machine learning program and has free rein to run anywhere in the warehouse. It runs in the background and in harmony with the rate at which the data warehouse changes. The refreshment occurs periodically and relatively infrequently, and it replaces existing data as opposed to updating it.
The online pre-analyzed data file contains only a small amount of data per unit of data and may contain only what the online user/clerk needs. Its ownership and access are entirely within the operational environment.
In the data warehouse, a normalized approach produces the optimal design, with flexibility. It fits well with granular data, it is not optimized for any given set of processing requirements, and it fits very nicely with the data model. Deviations from the normalized design usually come from the common requirements of several departments, for example storing data for all months. The multidimensional approach, in contrast, typically involves snowflake-like designs, which include star joins, fact tables and dimensions.
Unlike data warehouses, data marts are very much shaped by processing requirements, whereas data warehouses have a much broader audience. If we look at the entity relationship diagram for a data mart, all the entities look equal. In reality, however, some entities have a much larger volume of data associated with them than others. The design that best exploits this characteristic is the star join. As a simple example of a star join, the entity with the most data can have foreign keys to the identifiers of all the associated entities.
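A minimal sketch of that shape as C# record types, with hypothetical dimensions; the fact table carries the bulk of the rows along with foreign keys to each dimension:

using System;

// Hypothetical star-join shape: the high-volume entity (the fact) carries
// foreign keys to the identifiers of the associated entities (the dimensions).
record DateDim(int DateKey, DateTime CalendarDate);
record ProductDim(int ProductKey, string Name, string Category);
record StoreDim(int StoreKey, string Region);

record SalesFact(        // the entity with the most data
    int DateKey,         // foreign key to DateDim
    int ProductKey,      // foreign key to ProductDim
    int StoreKey,        // foreign key to StoreDim
    decimal Amount,
    int Quantity);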
One of the issues with the data warehouses and data marts is how data gets from the data warehouse to the data marts. Data in the data warehouse is very granular.  Data in the data mart is compact and summarized. Periodically data must be moved from the data warehouse to the data mart.
As mentioned earlier, data marts are designed based on the requirements of a department, and different departments will have different data marts. Star joins involving fact tables and dimensions are suitable for a data mart but not for a warehouse, simply because the data warehouse does not cater to one department. Thus data mart data structures are in general not reusable or flexible for reconciliation across departments, while the granular data in the warehouse is all of those. When data passes from the data warehouse to the data marts, it moves from a normalized world to a multidimensional world.
Unlike the processing requirements for the data mart, the data warehouse is designed from the enterprise requirements. The enterprise requirements are gathered from what is known as the Zachman framework.
This is a convenient device that lays out a matrix to include all perspectives; the output is perspective and context. As an example, the columns could correspond to data, function, network, people, time and motivation, while the rows could range over the scope, the enterprise model, the system model, the technology model, the components and the functioning system. By exhaustively covering all perspectives, this approach attempts to let the enterprise requirements be articulated from the blueprint. These requirements subsequently lead to the enterprise data model, which helps in building the data warehouse. A lot of order and discipline is therefore required in using this framework.
Managing volume in a data warehouse
We discussed profile records in a previous post, and they come in handy for managing volume. Many detail records are reduced to a single profile record in the warehouse, which reduces volume. This comes at a loss of detail, but that isn't a bad thing as long as the DSS analyst gets what he wants from the summary. What goes into a profile record may need to be worked out iteratively with the DSS analysts.
Another approach, used together with the iterative approach, is to create another level of historical data: an alternate level of detail that is not designed to be used frequently. This additional level may not be used often, but it saves the architect whenever additional fields are demanded.
Multiple profile records could also be created from the same detail, depending on the subject. The metadata records written for profile records are similar to those written for single-activity snapshots. These records are usually generated on the operational server and require sorting and merging of data.
In general, data flows forward from the operational systems to the warehouse. In rare cases, it is possible for data to flow backwards.
When data from the data warehouse is required by the operational systems, it can be queried directly by the applications. Alternatively, the results of the query can be cached and the application can make do with those results. Such direct queries are typically answered in a time frame of, say, 24 hours, and the requested data should be small, not on the order of megabytes or gigabytes. There must be compatibility between the operational and warehouse systems for direct query, and the results may need to be transformed.
Indirect access of the data warehouse, on the other hand, is entertained. For example, in an airline reservation system, an airline employee talks to a travel agent and has a window of 2-3 minutes to complete the order, and the flight date and flight status need to be looked up from historical information. A flight status file is created periodically by reading the historical data; it then becomes easy for the employee to read the current bookings and the flight status table to facilitate the transaction.
Another example of the indirect use of the data warehouse in the operational environment is credit scoring in banking. When a customer applies for a loan, there is quite a bit of background checking involved, and this requires extensive historical data such as account history, payment history, job history, salary history, asset management history, etc. To satisfy most customers in a short time, a program can run periodically and produce a pre-qualification file that has the customer identification, the approved credit limit, and a specially approved limit. When the customer applies for a loan, if the loan is within the approved limit, it is granted.
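A minimal sketch of that pattern in C#: a background job reduces warehouse history to a small pre-qualification file, and the online loan check consults only that file. All names and figures are hypothetical:

using System;
using System.Collections.Generic;
using System.Linq;

class PreQualification
{
    record PreQualifiedCustomer(string CustomerId, decimal ApprovedLimit, decimal SpecialLimit);

    // Built periodically (say, overnight) from the warehouse's account, payment,
    // job and salary history; the online system never touches that history itself.
    static Dictionary<string, PreQualifiedCustomer> BuildPreQualificationFile(
        IEnumerable<PreQualifiedCustomer> warehouseAnalysisResults) =>
        warehouseAnalysisResults.ToDictionary(c => c.CustomerId);

    // Online path: a quick lookup against the small pre-analyzed file.
    static bool ApproveLoan(Dictionary<string, PreQualifiedCustomer> preQualified,
                            string customerId, decimal requestedAmount) =>
        preQualified.TryGetValue(customerId, out var c) && requestedAmount <= c.ApprovedLimit;

    static void Main()
    {
        var file = BuildPreQualificationFile(new[]
        {
            new PreQualifiedCustomer("C-1001", 25_000m, 40_000m),
            new PreQualifiedCustomer("C-1002", 10_000m, 0m),
        });

        Console.WriteLine(ApproveLoan(file, "C-1001", 15_000m));  // True: within the approved limit
        Console.WriteLine(ApproveLoan(file, "C-1002", 12_000m));  // False: exceeds the approved limit
    }
}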


Thursday, December 19, 2013

We continue our discussion on data transformation from operational to warehouse:
We mentioned summarizing data by properly sequencing the operational data as one of the functionalities required. Others are:
Input records can be of different proprietary or non-standard formats, syntax and even semantics.
Conversions for these records need to be authored, and they can be complex.
Semantic relationships in the operational data are harder to translate and often come at the price of reverse engineering.
Data format conversion must be done to keep the data consistent in the warehouse.
Massive volumes of input must be accounted for including parallel loads or reads.
The warehouse must conform to a corporate data model. Since business changes more frequently than when the application was written, these mismatches need to be accounted for in designing and building a warehouse. There is order and discipline to the design and structure of the data warehouse.
The data warehouse reflects the historical need for information and during transformation, we add this element of time where missing.
The warehouse must be built for the informational need of the organization and not the clerical needs.
Since hardware, operating systems and environments change between the operational systems and the data warehouse systems, the newly generated output files from the transformation may need some local conversion.
ETL software is often used for these extract-transform-load operations.
An alternative to ETL is ELT software, which performs extract-load-transform operations where the transformation can be deferred or skipped; however, skipping it does not benefit the warehouse.
The population of the data warehouse is triggered by an 'event-snapshot' interaction, i.e. as and when snapshots become available, they are moved into the warehouse.
Events can be activity generated or time generated.
Events that are triggered by time are usually cyclic.
We look at the components of the snapshot next. As mentioned earlier, a snapshot contains a unit of time, a key, non-key primary data and secondary data. The first usually marks the moment when the snapshot is taken. The key identifies the snapshot. The third is the data that relates to the key, and the last is the data that is circumstantial to the actual data.
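As a rough illustration of that layout (the type and field names are mine, not the book's), a snapshot could be modeled as:

using System;
using System.Collections.Generic;

// A warehouse snapshot: unit of time + key + non-key primary data + secondary data.
class Snapshot
{
    public DateTime UnitOfTime { get; init; }              // the moment the snapshot was taken
    public string Key { get; init; }                       // identifies the snapshot, e.g. a customer id
    public Dictionary<string, object> PrimaryData { get; init; } = new();   // non-key data that relates to the key
    public Dictionary<string, object> SecondaryData { get; init; } = new(); // circumstantial data captured alongside
}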
We look at the components of the profile records as well.
Data in a warehouse is not all stable; some of it is susceptible to change due to various factors. For such data, some kind of aggregate record is preferable. Such a record is called a profile record. For example, the monthly call records of a subscriber are aggregated into a single composite record. Notice how this is similar to a column-family store (such as HBase, which sits on top of HDFS), where the columns and column counts don't matter. Records can have a varying number of columns because they are structured as key-value pairs in a column family, and the records allow access to individual values by key lookup within that record in that collection. A profile record in the data warehouse is a single composite record of all the changes captured up to the event that triggers the profile record. Note that the comparison is being drawn in terms of the layout and not the content.
Since the profile record contains grouping of many detail records, the aggregation can take many different forms:
Values from the operational data can be summarized
Frequencies can be maintained; units of the operational data can be tallied; their min, max and average can be found; and they can be put into ranges with boundaries (as sketched after this list).
A desirable benefit of using a single profile record is that it saves the end user from having to calculate it.
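Here is a hedged sketch of such a roll-up, assuming hypothetical subscriber call-detail records; the grouping and the count/min/max/average are exactly the pre-computations mentioned above:

using System;
using System.Collections.Generic;
using System.Linq;

record CallDetail(string SubscriberId, DateTime CallTime, int DurationSeconds);

record CallProfile(string SubscriberId, int Year, int Month,
                   int CallCount, int MinSeconds, int MaxSeconds, double AvgSeconds);

static class Profiler
{
    // One composite record per subscriber per month, pre-computing what the
    // DSS analyst would otherwise have to calculate from the detail.
    public static IEnumerable<CallProfile> Summarize(IEnumerable<CallDetail> details) =>
        details.GroupBy(d => (d.SubscriberId, d.CallTime.Year, d.CallTime.Month))
               .Select(g => new CallProfile(
                   g.Key.SubscriberId, g.Key.Year, g.Key.Month,
                   CallCount: g.Count(),
                   MinSeconds: g.Min(d => d.DurationSeconds),
                   MaxSeconds: g.Max(d => d.DurationSeconds),
                   AvgSeconds: g.Average(d => d.DurationSeconds)));
}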


Interview question : Find the minimum in a shifted (rotated) sorted array, e.g. {26, 33, 49, 63, 3, 7, 13, 18, 19, 23}

int Min(int[] numbers)
{
    if (numbers == null || numbers.Length == 0) throw new ArgumentException("numbers must be non-empty");
    int min = numbers[0];
    for (int i = 1; i < numbers.Length; i++)
    {
        if (numbers[i] < min)
        {
            min = numbers[i];
            break; // in a shifted sorted array, the first drop is the minimum
        }
    }
    return min;
}

// binary chop: O(log n) minimum in a shifted sorted array
int Min(int[] numbers, int start, int end)
{
    // (parameter validation as above)
    // termination condition
    if (start == end) return numbers[start];
    if (numbers[start] > numbers[end])
    {
        // the rotation point lies between start and end
        int mid = start + (end - start) / 2;
        if (numbers[start] > numbers[mid])
            return Min(numbers, start, mid);   // minimum is in the left half
        else
            return Min(numbers, mid + 1, end); // left half is sorted; minimum is on the right
    }
    else
        return numbers[start];                 // already sorted; the first element is the minimum
}




Wednesday, December 18, 2013

We continue our discussion on data warehouse.
Extraction of data from one place to another seems deceptively simple but usually turns out to be much more complex and larger than initially assumed.
We see some of these transformations now:
1) The extraction of data from the operational (legacy) environment to the data warehouse environment requires a change in technology, including operating systems, hardware and data.
2) The selection of data from the operational environment may require reading several tables, following relational constraints, and may have to be done during the online operational window.
3) Input keys from operational data have to be restructured and converted before they are written to the data warehouse.
4) Non-key data is reformatted - for example, date-time formats may be changed to be consistent across the warehouse (see the sketch after this list).
5) Data is also cleansed before passing to the warehouse with cross-record verification and domain checks.
6) Multiple sources of data exist and must pass into the data warehouse. This means that we tap into each data source with defined logic.
7) When there are multiple input files, key resolution has to be done before the files can be merged.
8) If there are multiple files, the sequencing of data in the files may not be compatible, and this may involve record migration as well.
9) The transformation logic that creates the data warehouse may produce different summarization levels for different data sources, and hence different outputs that need to be reconciled.
10) There have to be default values when there is no source of data.
11) The efficiency of the selection of data is often a performance consideration, because only a few records from the original source may need to be pulled.
12) Data also needs to be summarized and combined into the profile record.
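To make a few of these concrete, here is a small sketch of points 3, 4 and 10 above; the source formats and the target key scheme are assumptions for illustration only:

using System;
using System.Globalization;

static class Transform
{
    // 3) Restructure an operational key into the warehouse key scheme.
    //    Here the warehouse key is assumed to prefix the source system id.
    public static string RestructureKey(string sourceSystem, string operationalKey) =>
        $"{sourceSystem}-{operationalKey.Trim().ToUpperInvariant()}";

    // 4) Reformat non-key data, e.g. normalize assorted date formats to one convention.
    public static string NormalizeDate(string raw)
    {
        string[] formats = { "MM/dd/yyyy", "dd-MMM-yy", "yyyyMMdd" };   // assumed source formats
        return DateTime.TryParseExact(raw, formats, CultureInfo.InvariantCulture,
                                      DateTimeStyles.None, out var d)
            ? d.ToString("yyyy-MM-dd")
            : "0001-01-01";                                             // 10) default when no usable source exists
    }
}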

Data warehouses are built for a wide variety of systems. However, each maintains a structure of data called a snapshot. This has four components: a key, a unit of time, primary data that relates only to the key, and secondary data captured as part of the snapshot process with no direct relationship to the primary data.
The key identifies the record and the primary data.
The unit of time refers to the moment when the event being described by the snapshot occurred.
The primary data is the non-key data that relates directly to the key.
The secondary data is circumstantial data and is typically used with DSS processing.
A relationship between the primary and secondary data can be inferred because they are adjacent; this is called an artifact.
We now discuss metadata. Metadata is very useful to describe the data in the warehouse. It acts like an index to the warehouse contents and stores the structure of the data as known to the programmer and the DSS analyst, a hint at the source data, the data model, and a history of extracts.
Reference data is another component commonly used with data warehouses; it helps keep track of referential integrity. Reference data tends to be ignored as data accumulates; however, it should be made time-variant just like any other part of the warehouse.
There are two different designs for reference data.
The first approach takes a snapshot of the reference data, say every six months. This approach is quite simple, but it is not sufficient because it can miss the activity between snapshots.
The second approach keeps track of all the activities against the reference table.
The actual implementation might be something in between the first and the second approach.
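One hedged way to implement that middle ground is to keep each reference entry as effective-dated rows, writing a new row whenever the value changes instead of overwriting it (the field names are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

// A time-variant reference entry: the value that was in force between two dates.
record ReferenceVersion(string Code, string Description, DateTime EffectiveFrom, DateTime? EffectiveTo);

static class ReferenceData
{
    // Look up the description that was valid as of a given point in time.
    public static string DescriptionAsOf(IEnumerable<ReferenceVersion> history, string code, DateTime asOf) =>
        history.Where(v => v.Code == code
                        && v.EffectiveFrom <= asOf
                        && (v.EffectiveTo == null || asOf < v.EffectiveTo))
               .Select(v => v.Description)
               .FirstOrDefault();
}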
There is another important factor pervasive in the data warehouse. This is the length of time a change of data in the operational environment takes to be reflected in the data warehouse and is referred to as the cyclicity of data.
The data warehouse contains historical information. Changes in the operational system should take at least 24 hours to show up in the warehouse. This introduced delay is referred to as the wrinkle of time.
A wrinkle of time smaller than 24 hours is expensive to achieve, and the larger goal is to avoid doing operational processing in the warehouse and warehouse processing in the operational environment.

Tuesday, December 17, 2013

In today's blog post, we continue our discussion of the data warehouse. We mentioned that the warehouse is built iteratively. This has been proved correct by the industry track record, because the end user is unable to give all the requirements up front, management doesn't commit until there are some results, and visible results must be seen quickly.
We will now see the role of the data model in iterative development. The data model serves as a roadmap for each of the development efforts and plays a critical role in them. First, the data model tells us what needs to be done.
Second the data model tells how one development effort is to be integrated with the other.
The development efforts are like puzzle pieces and the data model makes sure the efforts fall in place.
If the data model were absent, many of the efforts would not fit well with others and there would be overlap and redundancies.
The output of a data model is a set of tables - in fact, a lot of small tables with little data in each. If a program has to reach several of these tables, it has to interconnect with all of them.
A better approach would be to physically merge some of the tables so that minimal I/O is consumed.
Merging tables is one approach. Another is to create an array of data. An array lets data that occurs in a regular sequence be stored together so that it is easier to access directly; for example, creating an array by month is easy. It might be interesting to note that sometimes redundant data is deliberately inserted. For example, if a base parts table has a non-redundant field called description, then several other parts tables have to access the base parts table for that field, which is expensive. Instead, if the description field is redundantly placed in many tables, access to it is independent and less expensive. Note, however, that this benefit applies to access; updates to a redundant field become more expensive.
Likewise, if some fields in a table are used much more (or much less) frequently than others, they can be stored in separate tables so that the most frequently used ones stand by themselves.
Another technique is to use a derived data with the physical database design. For example, if we calculate the annual taxes at the year-end every year, we can store the calculated data. Then all subsequent access can use the calculated field.
Another technique is to use a creative index or a creative profile. This type of index is created as data is passed from the operational to the warehouse environment. Since each unit of data is touched anyway during the pass, the index is easy and inexpensive to create, and it can be based on the users' interests.
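Purely as an illustration (not the book's example), a creative index might track the largest orders seen while each record streams past on its way into the warehouse:

using System;
using System.Collections.Generic;

record OrderRecord(string OrderId, string CustomerId, decimal Amount);

class CreativeIndex
{
    private readonly SortedSet<(decimal Amount, string OrderId)> _top = new();
    private readonly int _capacity;

    public CreativeIndex(int capacity) => _capacity = capacity;

    // Called once per record as it passes from the operational system to the warehouse,
    // so building the index costs almost nothing extra.
    public void Observe(OrderRecord order)
    {
        _top.Add((order.Amount, order.OrderId));
        if (_top.Count > _capacity)
            _top.Remove(_top.Min);     // drop the smallest to keep only the top N
    }

    public IEnumerable<(decimal Amount, string OrderId)> Largest => _top;
}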
Finally, referential integrity can be used to establish relationships between tables. Each solution can be tested with a few representative use cases, such as looking up by name or department.
We continue our discussion of the data warehouse. We look into the "design" of the interface from the operational systems. The design is actually heuristic: the requirements for a warehouse are not known until it is partially populated and in use, so a phase-by-phase approach is used instead. One portion of data is populated. It is then used and scrutinized by the DSS analyst. Feedback from the user leads to data modifications, and then another portion of the data warehouse is built. This feedback loop continues throughout the life of the warehouse.
On the other hand, anticipating requirements is still important, so there is a trade-off.
Design usually begins with operational data, the source for the warehouse. However, it is not a simple extract-and-load operation; there are several transformations involved. Operational data comes from different applications, and they are often not integrated. Data, when it makes its way into the warehouse, has to be integrated across applications. Integration means standardizing the conventions across systems, keeping the syntax and semantics consistent on merging, and correctly mapping source fields to destination fields.
Three types of loads are made into the data warehouse from the operational environments:
archival data
data currently in the operational environment
periodic refreshes from updates to the data in the operational environment
Loading archival data is often skipped, or done only once.
Loading current data is not disruptive because the data is loaded only once and usually via downloaded files.
Loading periodic refreshes, on the other hand, is a major concern because we have to find what is already present and what is not. Some common techniques used to limit the scanning of data are as follows (a sketch of the first two follows the list):
1) Scan the time-stamped data of the source and exclude the records that are not in range; however, the source may not have timestamps.
2) Scan the delta file that contains only the changes made to the application as a result of the transactions.
3) Scan the log file or the audit file that is generated as a by-product of transaction processing.
4) Modify the application code to manage the amount of data scanned; however, application code can be legacy, old and fragile.
5) The last option is literally a last resort: take before and after images of the operational file and compare them, which requires dedicating a large amount of resources.
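A hedged sketch of techniques 1 and 2 above; the delta-file shape and the cutoff handling are my assumptions:

using System;
using System.Collections.Generic;
using System.Linq;

record SourceRow(string Key, DateTime? LastUpdated, string Payload);

static class RefreshScan
{
    // Technique 1: use source timestamps to pull only rows changed since the last refresh.
    public static IEnumerable<SourceRow> ChangedSince(IEnumerable<SourceRow> source, DateTime lastRefresh) =>
        source.Where(r => r.LastUpdated.HasValue && r.LastUpdated.Value > lastRefresh);

    // Technique 2: if the application writes a delta file of changed keys,
    // scan only that file instead of the whole source.
    public static IEnumerable<SourceRow> FromDelta(IDictionary<string, SourceRow> source, IEnumerable<string> deltaKeys) =>
        deltaKeys.Where(source.ContainsKey).Select(k => source[k]);
}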
 

Monday, December 16, 2013

In data warehouses, exploration and data mining are used to analyze masses of historical data and to discover patterns of unknown business activity. The data warehouse contains cleansed, integrated and organized data.
The data warehouse can be designed in a hybrid form to create what is known as a living sample database, which is especially useful when the data has grown very large. The data in this database could be true archival data or lightly summarized data. Since it is a subset of the full warehouse, it needs to be refreshed periodically. It is also not a general-purpose database; it is useful only for analysis and for finding trends, and queries that would take too long against the full database can be run against this smaller one.
The selection of data for this database is usually random, but in some cases a "judgement sample" is taken, in which the records must meet certain criteria.
This translates into improved productivity for the DSS analyst, with reduced turnaround time.
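The random subset can be drawn in a single pass over the detail; reservoir sampling is one common way to do it (my choice of technique, not the book's), and a "judgement sample" is just a filter applied before the sampling:

using System;
using System.Collections.Generic;

static class LivingSample
{
    // Reservoir sampling: pick k records uniformly at random from a stream of unknown length.
    public static List<T> Sample<T>(IEnumerable<T> stream, int k, Random rng)
    {
        var reservoir = new List<T>(k);
        int seen = 0;
        foreach (var item in stream)
        {
            seen++;
            if (reservoir.Count < k)
                reservoir.Add(item);                 // fill the reservoir first
            else
            {
                int j = rng.Next(seen);              // 0 .. seen-1
                if (j < k) reservoir[j] = item;      // replace with probability k/seen
            }
        }
        return reservoir;
    }
}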
A second major issue for the data warehouse is partitioning. Partitioning of data refers to the breakup of data into separate physical units that can be handled independently.
Proper partitioning helps the warehouse in the following ways:
Loading data
Accessing data
Archiving data
Deleting data
Monitoring data
and Storing data
Independently managed partitions of data are portable for different processing environments. When data is monolithic, it becomes harder to restructure, index, sequentially scan, reorganize, recover and monitor.
Flexible access to data is a key design goal, and partitions help with that goal. Data can be partitioned in many different ways: by date, by line of business, by geography, by organizational unit, or by a combination of these. The choice of how to partition data is up to the developer.
Partitioning can be done at the system level or at the application level. If the partitioning is at the system level, the DBMS requires a single definition of the data. If the partitioning is at the application level, the data can be moved around between environments. One test of a good partitioning scheme is to see whether an index can be added without major restructuring or hampering of operations.
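As a hedged illustration of application-level partitioning, the application itself can route each record to a physical unit, say one table per month and line of business (the naming scheme is mine):

using System;

static class Partitioning
{
    // Application-level partitioning: the application decides which physical
    // table (partition) a record belongs to, here by line of business and date.
    public static string PartitionFor(DateTime processingDate, string lineOfBusiness) =>
        $"claims_{lineOfBusiness.ToLowerInvariant()}_{processingDate:yyyyMM}";
}

// Example: PartitionFor(new DateTime(2013, 12, 15), "health") returns "claims_health_201312"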

Sunday, December 15, 2013

We describe a simple tweak to Matsuo's paper on keyword extraction from a single document, where we replace the unconditional probability of a term in the document with a prior Bayesian probability from a corpus. We could pre-compute and store the Bayesian probabilities of all terms in a table. We include all words from an English dictionary and assign a default value to terms not found in the corpus. We use the nltk.NaiveBayesClassifier to populate this table beforehand. Since the original approach finds the probabilities of only the top 30% most frequent terms, the number of documents whose frequent terms are missing from the corpus should be fairly small.
Said another way, we don't just set the weights on the terms based on their frequency in the document; we take into account the overall likelihood of that term appearing in any document. In Matsuo's approach, this substitution does not alter the meaning of the expected probability much. If anything, it improves on the probability by drawing on a wider source than the single document.
We also aim to differentiate terms that occur a similar number of times in a document. This is especially useful for short text, where frequency does not give as much information as in a larger text. Further, the original approach found the term weights from the document itself and avoided the corpus. In this case too, we avoid the corpus at runtime by using a pre-computed prior probability table, which is useful where we need to discriminate between terms.
Note that this tweak is suited to the approach we modify and may not apply in all cases. For example, the KLD distance is about relative probability distributions and is a measure of how different one is from the other; in that case we want the probabilities drawn from the same sample space.
Here we cluster sentences based on the KLD similarity between pairs of sentences, where each sentence is represented by its term vector of probabilities. We compute the probability that a term occurs in a sentence as the ratio of the term's frequency in the sentence to its total frequency across all the sentences. Depending on the scope, we could work with paragraphs instead of sentences, or treat them as short text documents. By clustering, we categorize the text and therefore select the keywords that represent the categories. In both symmetric and pairwise clustering, we use co-occurrence of terms; the difference is that symmetric clustering groups phrases while pairwise clustering groups relevant terms as in the original approach.
KLD equations come in many forms, but all of them use the same probability distributions.
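For reference, here is a small sketch of the divergence itself over two term-probability vectors, with a symmetrized variant for clustering; the smoothing constant is my choice:

using System;
using System.Collections.Generic;
using System.Linq;

static class Kld
{
    // Kullback-Leibler divergence D(P || Q) over a shared vocabulary.
    // A small epsilon keeps zero probabilities from blowing up the log.
    public static double Divergence(IDictionary<string, double> p, IDictionary<string, double> q, double eps = 1e-9)
    {
        var vocab = p.Keys.Union(q.Keys);
        double d = 0.0;
        foreach (var term in vocab)
        {
            double pi = p.TryGetValue(term, out var a) ? a + eps : eps;
            double qi = q.TryGetValue(term, out var b) ? b + eps : eps;
            d += pi * Math.Log(pi / qi);
        }
        return d;
    }

    // One symmetric variant usable for clustering sentences: D(P||Q) + D(Q||P).
    public static double Symmetric(IDictionary<string, double> p, IDictionary<string, double> q) =>
        Divergence(p, q) + Divergence(q, p);
}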

In today's post, we read from the book "Building the Data Warehouse" by Inmon. We briefly discussed the CIF architecture and its comparison with the Star (Kimball) model. We now look at the data warehouse in more detail.
A data warehouse is an accumulation of data with different levels of detail: an older level of detail, a current level of detail, a level of lightly summarized data (the data mart level), and a level of highly summarized data. Before data makes its way into the warehouse, there is tremendous transformation involved. When data ages, it moves from current detail to older detail. When data is summarized, it moves from current detail to lightly summarized and then to highly summarized. The data across levels is organized by subject areas. Typical subject areas in a corporate data model are Customer, Product, Transaction, Policy, Claim and Account. A subject area could consist of several (on the order of tens or hundreds of) physical tables. If we take a subject area such as Customer, there could be a physical table for base customers from, say, 1985-87, another for 1988-1990, a cumulative table for 1986-1990, and a summary record for each customer based on customer activity per month. All of the physical tables are related by a common key - the customer ID. Direct access storage devices, magnetic tape, fiche and optical disk can all be used to store the data; fiche is good for storing data that does not have to be reproduced in an electronic medium again.
Some tables are organized on a from-date-to-date basis, while others are cumulative.
Due to their accumulation and size, Data warehouses are not built all at once. For example, on day 1 there may be just the legacy systems doing operational, transactional processing. On day 2, the first few tables of the first subject area of data warehouse are populated. Users start to discover warehouse and analytical processing. On day 3, there are more subjects and more curiosity drawing more users and a DSS analyst. On day 4, more data is extracted from operational environment, transformed and placed in the data warehouse. Many DSS applications spring up. Access needs to be granted for usage and some users are
On day 5, departmental databases come up. Departments now extract the data from the warehouse to their level for analysis.
On day 6, the land rush to departmental, multidimensional systems takes place. This is because the departmental data is now directly available from the warehouse. And users no longer look at the data level details but focus on the departmental analytical processing.
On day n, the architecture is fully developed. Although at this point it looks like a spider web, it is not one. "The Cabinet Design" article on May 1991 explains why.
Note that the level of the detail or the summarization unit of the data plays a very critical and pervasive role in the design of the data warehouse. This is referred to as Granularity.
In operational systems, the data is mandatorily stored at the lowest granularity. In data warehouses, no such assumption is made. For example, a subscriber's call records for a month are a high level of detail and a low level of granularity, while the summary of the same subscriber's calls is a low level of detail and a high level of granularity. Note that the number of records to search for the existence of a particular record is orders of magnitude greater in the first case than in the second.
In most retail companies, there are usually dual levels of granularity: a true archival level of detail and a lightly summarized level. The lightly summarized data can be derived from, say, the 30 days' detail maintained by the operational data.

 

Saturday, December 14, 2013

We will talk about choosing cache engines in a LAMP stack:
We briefly discussed a few of them with the modules. We will consider their usage now.
Code cleanup and performance tuning in PHP scripts go only so far in making pages fast enough. Caching engines avoid redoing work for content that is already available. Caching can be done for internally compiled PHP scripts, for the generated output, or for intermediary data-processing objects.
APC is an opcode caching engine and comes from the PECL/PEAR repository. The compiled code from the PHP runtime engine is stored in shared memory for later access. It can be compiled into PHP or loaded as a Dynamic Shared Object (DSO) by specifying the extension as apc.so in the php.ini file.
eAccelerator is a similar compiled-state PHP 4 caching engine that also optimizes and encodes code. The resulting encoded files can be distributed to servers that have eAccelerator installed. Again, the compiled objects are stored in shared memory, and settings such as memory size, cache directory, time-to-live and prune period can be described in the php.ini file; the extension entry is accelerator.so. Similarly, Zend Optimizer uses the native Zend engine that powers PHP to cache compile-time objects.
JPCache works by storing the generated output of a PHP script to disk or in a SQL database. It is written as a PHP library. In addition to caching, it also provides compression, resulting in further improvements in page load time. Configuration options include those for the database, the credentials and the compression level.
Memcached is a distributed memory object caching system and is available as a PHP DSO. We can add entries to memcache using the add method, which takes a key and a value as parameters.
The get_stats() method on the memcache is particularly helpful and gives such information as current connections, total connections, hits, misses, current items, total items, etc.
As we see now, there is overlap between these caching frameworks. Complementary ones can increase performance together, but others may not work well with each other.
JPCache and memcached can work well with each other, but because eAccelerator and Zend have their own encoders, they are mutually exclusive. Moreover, you only want one opcode cache. Memcached is a great choice for large objects with varied lifetimes, or for small objects that would otherwise require frequent database access. If there are static pages, JPCache works well. If there is a lot of PHP code, an opcode cache like APC works well.

In this post, we talk about Ajax (Asynchronous JavaScript and XML) in the LAMP stack.
Ajax helps make calls to the server to update a page without a page refresh. This allows the client side of the application to remain responsive even while data is exchanged with the server.
With full page refreshes, the entire page was transmitted several times, consuming bandwidth, sending the same elements over and over again, and making the web application unresponsive during transfers. With Ajax, the page stays responsive to user actions during transfers.
Ajax allows rich, fat clients to be written. By that we mean the client is capable of a lot of functionality. Take Gmail, for instance. It is a web client that uses Ajax: the page isn't refreshed, but the elements are manipulated via JavaScript within the application and XML communication with the Gmail server.
Prior to Ajax, the techniques used to communicate with the server without a refresh included the following. The image-source trick: by manipulating the src property of an image in the HTML, developers could make the browser send a web request and package a response from a server-side script that accessed a data store. However, it was one-way, and it wasn't robust because it couldn't handle errors.
Hidden frames were another technique that developers used to send and receive information from the web server. They are simple to use and debug, and the frame load adds a browser history entry.
The downsides are that the user can manipulate the content.
Hidden iFrames instead of Frames help alleviate the issues with the behavior of the back button.
The XMLHttp and XMLHttpRequest interfaces enable the following methods, which are supported in all major browsers:
abort() - aborts the request
getResponseHeader(key) - returns the value of the HTTP response header named key
open(method, uri) - sets the HTTP method and the request URI
overrideMimeType(type) - forces the interpretation of the response as the given MIME type
send(requestContent) - sends the HTTP request
setRequestHeader(key, value) - assigns a request header; it is called after open() and before send()
Since the interface goes by different names and is instantiated differently for different browsers, the AJAX code handles this by using the native XMLHttpRequest or the window.ActiveXObject for instantiating the interfaces.
Similarly the status code from the request is checked before the response is parsed.
The responseXML property of the request is parsed and elements may be added to the DOM. Knowing the XML structure may or may not be required. For example, handlers can be written at the level of an XML element instead of the response level.
Some common Ajax libraries are: SAJAX (Simple Ajax toolkit), which allows functions to be defined in PHP and called from the client; CPAINT (Cross-Platform Asynchronous INterface Toolkit), which supports both plain-text and XML response types; and JPSPAN, which bridges the gap between JavaScript and PHP and allows for a seamless importing of PHP classes into JavaScript.

Friday, December 13, 2013

This post is about site security on LAMP stacks.
Authentication is important for site security. We discussed Apache authentication with the mod_auth_mysql module, which bolsters site security by using a user database together with Apache's authentication. We will now see PHP's built-in security mechanisms, which allow us to build our own authentication system.
Security includes restricting access. We can do this with Apache, with PHP, or with both.
The easiest way to lock down a website is to use basic authentication. Using a combination of .htaccess and password files, we can restrict access to any folder within the website. This is easy to implement and comes with the default installation. The downside to the built-in Apache authentication is managing a complex hierarchy of restricted folders and users.
Instead of using this file-based mechanism, we can use mod_auth_mysql, which manages the username/password storage better; for example, it gives us more complex permission-matching abilities.
In the PHP-based mechanism, we emulate basic authentication by sending the proper response headers and checking for the right server variables. We set up a table to store usernames and passwords and write a script that queries the database for a given username and password.
Other user parameters can also be checked. For example, we could restrict access based on IP address, and server variables such as HTTP_Referer can come in very useful. The thing to note is that anything you expect from a client can be tampered with; information gathering, cross-site scripting and SQL injection are just some examples.
CAPTCHA functionality is available in PEAR.
Other kinds of website attacks include abusing register_globals. This is a setting in php.ini that controls the auto-population of variables with the same names as form elements or cookies.
Keeping register_globals set to on is a bad practice. For example, it lets a form field like username appear directly as the variable $username. Turning it off forces the values to be read from the request input instead. If register_globals cannot be disabled globally, the .htaccess file can be used to turn it off.
A SQL injection attack by itself is damaging, but together with a register_globals attack it can be even more critical, because attackers can inject malicious SQL statements through unchecked variables.
With input scrubbing and data validation, this can be avoided.
Some settings in PHP reduce this threat, such as magic_quotes_gpc in php.ini and applying addslashes() to all data that is passed to MySQL.
In general, the policy for user privileges should grant only the minimum required for access. This practice limits the damage if elevated privileges are ever obtained.
While SQL injection and register_globals attacks target the input data to the system, cross-site scripting targets unclean dynamic output. The attacker changes the page's generated markup by adding JavaScript that discloses sensitive information. PHP gives you an assortment of options for cleaning dynamic output: htmlspecialchars(), which escapes ampersands, angle brackets and quotes; htmlentities(), which escapes any character that has an HTML entity equivalent; strip_tags(), which removes all HTML and PHP tags from a string; and utf8_decode(), which converts UTF-8 encoded characters to ISO-8859-1 characters.
In today's post, we look at the top ten web design mistakes as listed by Jakob Nielsen (in 1999).
1. Breaking or slowing down the back button. He listed this as the second most used navigation feature, hence its importance. Sites tend to break the button by opening a new browser window, using an immediate redirect, or preventing caching.
2. Opening new browser windows: users are hostile to a site taking over their machine, and it defeats the very purpose of keeping the user glued to the site. This includes opening PDF files in browser windows, because a PDF is an 'undifferentiated blob' of content that is hard to navigate.
3. Not changing the color of visited links: Users want to know which links they visited because they want to exclude the ones that didn't work for them.
4. Non-scannable text: this is when a page (such as this blog post) renders text without subheads, bulleted lists, highlighted keywords, short paragraphs, the inverted pyramid, a simple writing style, and non-verbose language.
5. Fixed font size: readability for people over 40 is critical to gaining their attention. CSS that disables the browser's change-font-size feature is yet another disappointment.
6. Page titles with low search engine visibility: search is one of the most popular ways users discover new sites and navigate within sites. The title and the first 66 characters of micro-content are the most important part for the user. If all the pages have similar titles, they are hard to tell apart in search results and across open windows.
7. Anything that looks like an advertisement: users tend to avoid it. Banners, animations and pop-ups are turn-offs for a user in his goal-driven navigation.
8. Violating design conventions: Consistency sets the users' expectation and prevents any surprises, removes insecurity and gives more feeling of control.
9. Bad search: search is described by the author as the user's lifeline when navigation fails. The simpler and less literal the search (and the easier its results are for the user to read), the better.
10. Not answering users' questions: if the website doesn't provide what the user is looking for, it defeats their purpose. A simple question like "Where's the price?" should be answered right away.
Now a look at how content management systems work on the LAMP stack:
Content Management Systems come as two types - Enterprise CMS and Web CMS/Portals.
The former is a high-powered software package that provides comprehensive solutions, integrating functions such as shipping and delivery systems, invoicing, employee and human-resources information, document management and transactional systems. These are usually highly customized, company-wide solutions.
The portals are mostly created for use on the web. They allow users to collaborate on a website and enable developers to bring functionality to a website quickly.

Thursday, December 12, 2013

Continue from the book reading:
Here are some of the Apache tricks from the book:
Enabling mod_rewrite: this is an Apache module used for URL rewriting - keeping query-string input safe, keeping the site search-engine-friendly, making URLs friendly, and stopping hot-linking.
A RewriteRule is a configuration directive that provides a wide range of rewriting abilities; rewrites range from simple redirects to complex multiple-rule pattern-matching substitutions.
A RewriteCond specifies a condition for a rule. HTTP header values, request variables, server variables, time variables and some special variables can all be used inside a RewriteCond. RewriteLog and RewriteLogLevel are directives that specify a log file where a running record of the internal rewrite processing is sent, and the verbosity of that log.
URL spell checking is accomplished with the mod_speling module (the single 'l' is intentional). With mod_speling, the Apache server takes care of finding the intent of the user and suggesting a correction; CheckSpelling is the configuration directive to use.
Content Compression is achieved with the mod_deflate module. Since HTML and the resource are in an uncompressed state, this feature allows us to transparently compress the page markup, images, stylesheets and content and send it across the internet where it is uncompressed before rendering.
We use the AddOutputFilterByType directive to accomplish this. The DeflateCompressionLevel sets the level of compression used to shrink the files.
For databases, we create security zones using Apache and basic authentication. This is available via the mod_auth_mysql module. The configuration directives in this case are AuthName for naming a security zone, AuthType Basic, AuthMySQLDB for the database name, AuthMySQLUser for the user id, AuthMySQLPassword for the password, AuthMySQLEnable On, and require valid-user.
The communication between client and server can be encrypted using SSL with mod_ssl, an Apache module that SSL-enables a website that Apache controls. mod_ssl builds on the OpenSSL library; a certificate is generated for the server, and then a VirtualHost section referencing that certificate is specified in the config file.
Apache can also be used as a file repository with WebDAV. WebDAV stands for Web-based Distributed Authoring and Versioning; it enables Apache to let users treat an enabled directory as a remote directory or drive on their own computers.
Continuing from the book reading :
This book talks about code efficiency. Performance of code may depend on the machine configuration; for example, a little more memory can help tremendously with page swapping. To speed up a site, there are generally two categories of techniques: benchmarking and profiling. Benchmarking is experimenting to determine the best approach for something before implementing it for real. Profiling is experimenting on the real thing to see how well it performs. In benchmarking, a timer is started before some code executes and stopped right after it, and this start-stop is repeated to accumulate durations. Further, the timers can be labeled so that they can be started and stopped independently. The PEAR benchmarking class provides this functionality with its Benchmark_Timer.
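The same start/stop/accumulate pattern can be sketched in any language; here is a rough analogue of labeled timers using C#'s Stopwatch (this is not PEAR's API, just the idea):

using System;
using System.Collections.Generic;
using System.Diagnostics;

// Labeled timers in the spirit of Benchmark_Timer: start and stop named
// sections independently and accumulate their durations.
class SectionTimers
{
    private readonly Dictionary<string, Stopwatch> _timers = new();

    public void Start(string label)
    {
        if (!_timers.TryGetValue(label, out var sw))
            _timers[label] = sw = new Stopwatch();
        sw.Start();                       // Start resumes, so repeated start/stop accumulates
    }

    public void Stop(string label) => _timers[label].Stop();

    public void Report()
    {
        foreach (var kv in _timers)
            Console.WriteLine($"{kv.Key}: {kv.Value.ElapsedMilliseconds} ms");
    }
}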
Profiling can help you plan hardware improvements for the existing code. Things like connections, processors and memory can tremendously improve actual performance.
Web server improvements also help. Apache is highly configurable: it runs several httpd processes to handle requests, more processes can be started when the load becomes high, and the excess can subsequently be shut down. In some cases, these processes may be shut down even while there is work to do, in order to reclaim memory that has been leaking. An httpd process may also crash, failing its request, but another process can pick the same request up again. The configuration options that control the number of httpd processes include MaxClients, MaxRequestsPerChild, StartServers and MinSpareServers; these are specified in the httpd.conf file.
With improvements in hardware, the database and Apache in place, we can focus on code improvements next. PEAR coding standards come in useful here.
PEAR provides a caching framework. Static versions of pages can be generated and served instead of regenerating them each time. The source code for a page can be stored in a pre-parsed format that the Zend engine can readily execute. Even the browser cache can be used to improve the experience. The Last-Modified header is useful for determining the last time the content was changed.
The Expires header can be used to determine how long the page should remain valid. Clients can also send the If-Modified-Since header, which the server uses to decide whether to generate a full response or send a Not Modified response.
Furthermore, output buffering can be used to gather up the contents of the page and send them out all at once. This is done by placing ob_start() and ob_end_flush() around different sections of the HTML; ob_start('ob_gzhandler') can be used to make sure the buffered output is compressed.
 There are several different caching engines available to choose from. For example, these include Alternative PHP Cache (APC) which is a PHP opcode caching engine and comes with PECL/PEAR, Zend Engine which has also become part of PECL, eAccelerator which is another compiled state PHP caching engine, JPCache which is a memory/disk caching solution and is a PHP library that can be used to store the generated output of a PHP script to disk or in a SQL Server instead of saving the compiled state, and memcached which unlike the output caching systems of APC, eAccelerator and JPCache relies on the caching of backend code objects such as the database result objects and the data-model entities.