Tuesday, December 17, 2013

In today's blog post, we continue our discussion of the data warehouse. We mentioned that the warehouse is built iteratively. The industry's track record has borne this out: end users are unable to give all the requirements up front, management does not commit until there are some results, and visible results must come quickly.
We will now look at the role of the data model in this iterative development. The data model serves as a roadmap for each of the development efforts and plays a critical role in them. First, the data model tells what needs to be done.
Second, it tells how one development effort is to be integrated with the others.
The development efforts are like puzzle pieces, and the data model makes sure the pieces fall into place.
If the data model were absent, many of the efforts would not fit well with the others, and there would be overlap and redundancy.
The output of a data model is a set of tables: in fact, many small tables, each with only a little data. If a program has to reach several of those tables, it has to join across all of them, which costs extra I/O.
A better approach would be to physically merge some of the tables so that minimal I/O is consumed.
Merging tables is one approach. Another is to create an array of data. An array lets occurrences that form a sequence reside next to each other so that each one can be accessed directly; creating an array by month, for example, is easy. It is also worth noting that redundant data is sometimes inserted deliberately. For example, if a base parts table holds a non-redundant field called description, then every other parts table has to go back to the base parts table for that field, which is expensive. If instead the description field is placed redundantly in the many parts tables, each access is local and less expensive. Note, however, that this favors access; updating the field now means touching every copy.
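As a minimal sketch of this trade-off (the table and column names here are hypothetical, not from any particular system), compare a non-redundant schema that forces a join with one that carries the description redundantly:

```python
import sqlite3

# Hypothetical schema: a base parts table holding the single, non-redundant
# description, and a regional parts table that would otherwise join back to it.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE base_parts   (part_id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE region_parts (part_id INTEGER, region TEXT, qty INTEGER);
    -- Redundant variant: the description travels with each regional row,
    -- so reads no longer pay for the join.
    CREATE TABLE region_parts_denorm (part_id INTEGER, region TEXT, qty INTEGER,
                                      description TEXT);
""")

# Non-redundant access: every read pays for the join back to base_parts.
normalized_query = """
    SELECT r.part_id, r.qty, b.description
    FROM region_parts r JOIN base_parts b ON r.part_id = b.part_id"""

# Redundant access: the description is read locally; the cost moves to updates,
# which must now touch every copy of the field.
denormalized_query = "SELECT part_id, qty, description FROM region_parts_denorm"
```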
Likewise, if some fields in a table are used far more frequently than others, the table can be split so that the most frequently used fields sit in a table by themselves.
Another technique is to introduce derived data into the physical database design. For example, if annual taxes are calculated at year-end every year, the calculated result can be stored. All subsequent accesses then use the stored field instead of recomputing it.
Another technique is to use a creative index or a creative profile. This kind of index is created as data is passed from the operational environment to the warehouse. Since every unit of data has to be handled anyway during that pass, building the index is easy and inexpensive, and it can be tailored to the end users' interests.
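A minimal sketch of the idea, where the record layout and the choice of profile (tracking the largest purchases seen during the load) are illustrative assumptions:

```python
from heapq import nlargest

def load_with_profile(operational_records, write_to_warehouse, top_n=10):
    """Build a creative profile while records stream from the operational
    system into the warehouse; the (amount, customer_id) record layout is an
    assumption made for this sketch."""
    purchases = []
    for record in operational_records:   # each record is touched exactly once
        write_to_warehouse(record)       # the normal load step
        purchases.append((record["amount"], record["customer_id"]))
    # The profile is nearly free because the data was already flowing past.
    return nlargest(top_n, purchases)
```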
Finally, referential integrity can be used to establish relationships between tables. Each of these physical design choices can be validated against a few representative use cases, such as lookups by name or by department.
We continue our discussion on the data warehouse and look into the "design" of the interface from operational systems. The design is actually heuristic: the requirements for a warehouse are not known until it is partially populated and in use, so a phase-by-phase approach is used instead. One portion of data is populated. It is then used and scrutinized by the DSS analyst. Feedback from the user leads to data modifications, and then another portion of the data warehouse is built. This feedback loop continues throughout the life of the warehouse.
On the other hand, anticipating requirements is still important, so there is a trade-off.
Design usually begins with operational data, the source for the warehouse. However, it is not a simple extract-and-load operation; several transformations are involved. Operational data comes from different applications, and those applications are often not integrated. Data, as it makes its way into the warehouse, has to be integrated across applications. Integration means standardizing the conventions across systems, keeping the syntax and semantics consistent when merging, and correctly mapping source fields to destination fields.
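A minimal sketch of such an integration step, where the applications, field names and encodings are assumptions made up for illustration:

```python
# Assumed encodings: application A stores gender as m/f in a field called
# "sex", application B stores 1/0 in "gender_cd"; the warehouse standardizes
# on M/F/U.
GENDER_MAP = {"m": "M", "male": "M", "1": "M",
              "f": "F", "female": "F", "0": "F"}

def integrate(source_row, application):
    """Map one operational row onto the warehouse's standard representation."""
    if application == "app_a":
        cust_id, gender = source_row["cust_no"], source_row["sex"]
    else:
        cust_id, gender = source_row["customer_id"], source_row["gender_cd"]
    return {
        "customer_id": str(cust_id),                         # same syntax across systems
        "gender": GENDER_MAP.get(str(gender).lower(), "U"),  # same semantics on merge
    }
```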
Three types of loads are made into the data warehouse from the operational environments:
archival data
data currently in the operational environment
periodic refreshes from updates to the data in the operational environment
Loading archival data is often skipped, or done only once.
Loading current data is not disruptive because the data is loaded only once and usually via downloaded files.
Loading periodic refreshes, on the other hand, is a major concern because we have to work out what has changed since the last load. Some common techniques used to limit the scanning of data are as follows (the first one is sketched in code after the list):
1) scan time-stamped data at the source and exclude the records whose timestamps fall outside the refresh window; however, the source may not carry timestamps
2) scan the delta file that contains only the changes made to the application as a result of the transactions.
3) scan the log file or the audit file that is generated as a by-product of transaction processing.
4) modify the application code so that it limits the amount of data scanned; however, application code is often legacy, old and fragile.
5) the last option is literally a last resort: take before and after snapshot images of the operational file and compare them, which dedicates a large amount of resources to the task.
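A minimal sketch of the time-stamp technique, assuming an operational "orders" table with an "updated_at" column (both names are made up for the sketch):

```python
import sqlite3
from datetime import datetime, timedelta

def refresh_since(con: sqlite3.Connection, last_load: datetime):
    """Pull only the rows stamped after the previous load instead of
    scanning the whole operational table."""
    cur = con.execute(
        "SELECT * FROM orders WHERE updated_at > ?",
        (last_load.isoformat(),),
    )
    return cur.fetchall()

# Usage: pick up everything changed since yesterday's load.
# rows = refresh_since(con, datetime.utcnow() - timedelta(days=1))
```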
 

Monday, December 16, 2013

In data warehouses, exploration and data mining are used to analyze masses of historical data and to discover patterns of previously unknown business activity. The data warehouse contains cleansed, integrated and organized data.
A data warehouse can also be designed in a hybrid form to create what is known as a living sample database, which is especially useful when the data has grown very large. The data in this database could be true archival data or lightly summarized data. Since it is a subset of the original data warehouse, it needs to be refreshed periodically. It is not a general-purpose database either; it is useful only for statistical analysis and for spotting trends. Queries that would be impractical to run over the full warehouse can be run against this smaller database instead.
The selection of data for this database is usually random, but in some cases a "judgement sample" is taken, in which the selected records meet certain criteria.
This translates into improved productivity for the DSS analyst, with much shorter turnaround times.
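A minimal sketch of the two selection styles, where the 1% fraction and the record fields are assumptions:

```python
import random

def random_sample(records, fraction=0.01, seed=7):
    """Living-sample selection: keep roughly `fraction` of the warehouse rows."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

def judgement_sample(records, predicate):
    """Judgement sample: keep only the records meeting a chosen criterion,
    e.g. predicate = lambda r: r["balance"] > 100_000 (field name assumed)."""
    return [r for r in records if predicate(r)]
```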
A second major issue for the data warehouse is partitioning. Partitioning of data refers to the breakup of data into separate physical units that can be handled independently.
Proper partitioning helps the warehouse in the following ways:
Loading data
Accessing data
Archiving data
Deleting data
Monitoring data
and Storing data
Independently managed partitions of data are portable for different processing environments. When data is monolithic, it becomes harder to restructure, index, sequentially scan, reorganize, recover and monitor.
Flexible access to data is a key design goal, and partitions help with that goal. Data can be partitioned in many different ways, such as by date, by line of business, by geography, by organizational unit, or by a combination of these. The choice of how to partition the data rests with the developer.
Partitioning can be done at the system level or at the application level. If the partitioning is at the system level, the DBMS requires a single definition of the data. If the partitioning is at the application level, the data can be moved around between environments. One test for a good partitioning of data is to see whether an index can be added without major restructuring or disruption of operations.
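A minimal sketch of application-level partitioning by date, where the record layout and the partition naming scheme are assumptions:

```python
from collections import defaultdict

def partition_by_month(records, date_key="txn_date"):
    """Route each record to a physical unit named after its year and month."""
    partitions = defaultdict(list)
    for record in records:
        d = record[date_key]                               # a datetime.date
        partitions[f"sales_{d.year}_{d.month:02d}"].append(record)
    return partitions  # each key stands in for a separate table or file
```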

Sunday, December 15, 2013

We describe a simple tweak to Matsuo's paper on keyword extraction from a single document, where we replace the unconditional probability of a term in the document with a prior Bayesian probability estimated from a corpus. We can pre-compute the Bayesian probabilities of all terms and store them in a table. We include all words from an English dictionary and assign a default value to the terms not found in the corpus. We use the nltk.NaiveBayesClassifier to populate this table beforehand. Since the original approach finds the probabilities of only the top 30% most frequent terms, the number of documents whose frequent terms are missing from the corpus should be fairly small.
Said another way, we don't set the weights on the terms based only on their frequency in the document; we take into account the overall likelihood of that term appearing in any document. In Matsuo's approach, this substitution does not alter the meaning of the expected probability very much. If anything, it improves the estimate by drawing on a wider source than the single document.
We also aim to differentiate terms that occur a similar number of times in a document. This is especially useful for short text, where frequency does not give as much information as it does in a longer text. Further, the original approach derived the term weights from the document itself precisely to avoid a corpus. In this case too, we avoid corpus lookups at extraction time by using the precomputed prior-probability table, which is useful wherever we need to discriminate between terms.
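A minimal sketch of the substitution, leaving out the co-occurrence and chi-square machinery of Matsuo's method; the default prior and the handling of the 30% cutoff are assumptions of the sketch:

```python
from collections import Counter

def term_weights(document_terms, corpus_prior, default_prior=1e-6):
    """Weight the frequent terms of a document by their precomputed corpus
    prior probabilities instead of their in-document probability."""
    counts = Counter(document_terms)
    if not counts:
        return {}
    # keep roughly the top 30% most frequent terms, as in the original approach
    ranked = sorted(counts.values(), reverse=True)
    cutoff = ranked[max(1, int(0.3 * len(ranked))) - 1]
    weights = {}
    for term, count in counts.items():
        if count < cutoff:
            continue
        # the original approach would use the in-document frequency here
        weights[term] = corpus_prior.get(term, default_prior)
    return weights
```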
Note that this tweak is suited to the approach we modify and may not apply in all cases. For example, in a KLD calculation, the KL divergence compares two probability distributions and measures how different one is from the other; in that case we want both probabilities drawn from the same sample space.
Here we cluster sentences based on the KLD similarity between pairs of sentences, where each sentence is represented by its term vector of probabilities. We compute the probability that a term occurs in a sentence as the ratio of the term's frequency in the sentence to its total frequency across all the sentences. Depending on the scope, we could work with paragraphs instead of sentences, or treat them as short text documents. By clustering, we categorize the text and then select the keywords that represent the categories. Both symmetric and pairwise clustering use the co-occurrence of terms; the difference is that symmetric clustering groups phrases, while pairwise clustering groups relevant terms as in the original approach.
KLD equations come in many forms, but all of them operate on the same probability distributions.
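The standard form is D(P || Q) = sum over i of P(i) * log(P(i) / Q(i)). A minimal sketch of the sentence-to-sentence computation described above, with a small epsilon added as an assumption to guard against terms missing from one distribution:

```python
import math
from collections import Counter

def sentence_distribution(sentence_terms, total_term_counts):
    """Probability of a term in a sentence: its frequency in the sentence
    divided by its total frequency across all sentences."""
    return {t: c / total_term_counts[t]
            for t, c in Counter(sentence_terms).items()}

def kld(p, q, eps=1e-12):
    """D(P || Q) = sum_i P(i) * log(P(i) / Q(i)); eps guards missing terms."""
    return sum(pi * math.log((pi + eps) / (q.get(t, 0.0) + eps))
               for t, pi in p.items())

def kld_similarity(p, q):
    """Symmetrized divergence used as the pairwise distance between sentences."""
    return 0.5 * (kld(p, q) + kld(q, p))
```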

In today's post, we read from the book "Building the Data Warehouse" by Inmon. We briefly discussed the CIF architecture and its comparison with the Kimball star-schema model earlier. We now look at the data warehouse in more detail.
A data warehouse is an accumulation of data held at different levels of detail: an older level of detail, a current level of detail, a level of lightly summarized data (the data mart level), and a level of highly summarized data. Before data makes its way into the warehouse, there is a tremendous amount of transformation involved. As data ages, it moves from current detail to older detail; as it is summarized, it moves from current detail to lightly summarized and then to highly summarized. The data across these levels is organized by subject area. Typical subject areas in a corporate data model are Customer, Product, Transaction, Policy, Claim and Account. A subject area could consist of several (on the order of tens or hundreds of) physical tables. If we take a subject area such as Customer, there could be a physical table for base customers for, say, 1985-1987, another for 1988-1990, a cumulative table for 1986-1990, and a summary record for each customer based on monthly activity. All of these physical tables are related by a common key, the customer ID. Direct access storage devices (DASD), magnetic tape, fiche and optical disk can all be used to store the data; fiche is a good fit for data that does not have to be reproduced in an electronic medium again.
Some tables are organized on a from-date-to-date basis while others are cumulative.
Due to their accumulation and size, data warehouses are not built all at once. For example, on day 1 there may be just the legacy systems doing operational, transactional processing. On day 2, the first few tables of the first subject area of the data warehouse are populated, and users start to discover the warehouse and analytical processing. On day 3, there are more subject areas and more curiosity, drawing in more users and a DSS analyst. On day 4, more data is extracted from the operational environment, transformed and placed in the data warehouse; many DSS applications spring up, and access needs to be granted for their usage.
On day 5, departmental databases come up. Departments now extract the data from the warehouse to their level for analysis.
On day 6, the land rush to departmental, multidimensional systems takes place, because departmental data is now directly available from the warehouse, and users no longer look at the detailed data but focus on departmental analytical processing.
On day n, the architecture is fully developed. Although at this point it looks like a spider web, it is not one; the May 1991 article "The Cabinet Design" explains why.
Note that the level of detail, or the summarization unit of the data, plays a very critical and pervasive role in the design of the data warehouse. This is referred to as granularity.
In operational systems, the data is necessarily stored at the lowest level of granularity. In data warehouses, no such assumption is made. For example, a month's worth of subscriber call records is a high level of detail and a low level of granularity, whereas a summary of the same subscriber's calls is a low level of detail and a high level of granularity. Note that the number of records you must search to check the existence of one record is orders of magnitude greater in the first case than in the second.
In most retail companies there is usually a dual level of granularity: one level for the true archival data and another for lightly summarized data. The lightly summarized data can be produced from, say, the 30 days of detail maintained by the operational systems.
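A minimal sketch of how detail rolls up into the lightly summarized level, where the call-record field names are assumptions:

```python
from collections import defaultdict

def summarize_calls(call_records):
    """Roll per-call detail (low granularity, many rows) up to one row per
    subscriber per month (high granularity, far fewer rows)."""
    summary = defaultdict(lambda: {"calls": 0, "minutes": 0})
    for call in call_records:                    # one row per call
        key = (call["subscriber_id"], call["date"].strftime("%Y-%m"))
        summary[key]["calls"] += 1
        summary[key]["minutes"] += call["minutes"]
    return summary  # far fewer records to scan when checking a month's activity
```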

 

Saturday, December 14, 2013

We will talk about choosing cache engines in a LAMP stack:
We briefly discussed a few of them with the modules. We will consider their usage now.
Code cleanup and performance tuning in PHP scripts go only so far in making pages fast enough. Caching engines avoid redoing work for content that is already available: the internally compiled PHP scripts, the generated output, or intermediate data-processing objects can all be cached.
APC is an opcode caching engine that comes from the PECL/PEAR repository. The code compiled by the PHP runtime engine is stored in shared memory for later reuse. APC can be compiled into PHP or loaded as a Dynamic Shared Object (DSO) by specifying the extension as apc.so in the php.ini file.
eAccelerator is a similar compiled-state PHP 4 caching engine that also optimizes and encodes code; the resulting encoded code can be distributed to machines that have eAccelerator installed. Again the compiled objects are stored in shared memory, and settings such as memory size, cache directory, time-to-live and prune period can be configured in the php.ini file; the extension entry is accelerator.so. Similarly, Zend Optimizer uses the native Zend engine that powers PHP to cache compile-time objects.
JPCache works by storing the generated output of a PHP script to disk or in a SQL database. It is written as a PHP library. In addition to caching, it also provides compression, resulting in further improvements in page load time. Configuration options include those for the database, the credentials and the compression level.
Memcached is a distributed memory object caching system and is available as a PHP DSO. We can add entries to memcache using the add method, which takes a key and a value as parameters.
The get_stats() method on the memcache is particularly helpful and gives such information as current connections, total connections, hits, misses, current items, total items, etc.
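The discussion above is for the PHP client; as an analogous sketch, the python-memcached client exposes a very similar surface (add, get, get_stats), assuming a memcached server listening on 127.0.0.1:11211:

```python
import memcache  # python-memcached, used here as an analogue of the PHP client

mc = memcache.Client(["127.0.0.1:11211"])       # assumes a local memcached server

mc.add("user:42", {"name": "alice"}, time=300)  # add(): store only if the key is new
profile = mc.get("user:42")                     # fetch the cached object back

# get_stats() reports connections, hits, misses, current and total items, etc.
for server, stats in mc.get_stats():
    print(server, stats.get("get_hits"), stats.get("get_misses"))
```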
As we can see, there is overlap between these caching frameworks. Complementary ones can be combined to increase performance, but others do not work well together.
JPCache and memcached can work well alongside the others, but because eAccelerator and Zend Optimizer have their own encoders, they are mutually exclusive. Moreover, you only want one opcode cache. Memcached is a great choice for large objects with varied lifetimes, or for small objects that would otherwise require frequent database access. If the pages are largely static, JPCache works well. If there is a lot of PHP code, an opcode cache such as APC works well.

In this post, we talk about Ajax (Asynchronous JavaScript and XML) in the LAMP stack.
Ajax makes calls to the server to update a page without a full page refresh. This allows the client side of the application to remain responsive even while data is being exchanged with the server.
With full page refreshes, the entire page was transmitted again and again, consuming bandwidth by resending the same elements and making the web application unresponsive during transfers. With Ajax, the page stays responsive to user actions during transfers.
Ajax allows rich, fat clients to be written; by that we mean the client is capable of a lot of functionality. Take Gmail, for instance. It is a web client that uses Ajax: the page isn't refreshed, but its elements are manipulated via JavaScript within the application, with XML communication back to the Gmail server.
Prior to Ajax, the techniques used to communicate with the server without a refresh included the following. The image-source trick: by manipulating the src property of an image in the HTML, developers made the browser send a web request for the image and packaged its response with a server-side script that accessed a data store. However, it was one-way, and it wasn't robust because it couldn't handle errors.
Hidden frames were another technique developers used to send and receive information from the web server. They are simple to use and debug, and each frame load adds a browser history entry.
The downside is that the user can manipulate the frame's content.
Hidden iFrames instead of Frames help alleviate the issues with the behavior of the back button.
The XMLHttp and XMLHttpRequest interfaces expose the following methods, which are supported across browsers:
abort(): aborts the request
getResponseHeader(key): returns the value of the HTTP response header named key
open(method, uri): sets the HTTP method and the request URI
overrideMimeType(type): forces the interpretation of the response as the given MIME type
send(requestcontent): sends the HTTP request
setRequestHeader(key, value): assigns a request header element; it must be called after open() and before send()
Since the interface goes by different names and is instantiated differently in different browsers, Ajax code handles this by using the native XMLHttpRequest where it is available and falling back to window.ActiveXObject to instantiate the interface otherwise.
Similarly the status code from the request is checked before the response is parsed.
The responseXML property of the request is parsed and elements may be added to the DOM. Knowing the XML structure may or may not be required. For example, handlers can be written at the level of an XML element instead of the response level.
Some common Ajax libraries are SAJAX (the Simple Ajax toolkit), which allows functions to be defined in PHP and called from the client; CPAINT (the Cross-Platform Asynchronous INterface Toolkit), which supports both plain-text and XML response types; and JPSPAN, which bridges the gap between JavaScript and PHP and allows a seamless importing of PHP classes into JavaScript.