Wednesday, February 13, 2013
libraries for data mining
Some of the data mining packages available on Google Code are nifty utilities for mining 2D data. They come with implementations of various algorithms and techniques that can be used directly with different kinds of data. One such approach involves computing cluster centroids and points, then finding the Euclidean distance between the points and the centroids. The Euclidean distance is the square root of the sum of the squares of the x and y offsets from the centroid. In the case of finding keywords in a text, the WordNet similarity distance between the selected candidates could be considered instead, populating a 2D array (points, clusters) of similarities, each ranging between 0 and 1. We could iterate through all these points to compute, for each point, the sum of its distances to all of the clusters. Then, for the same point, we could compute the inverse of the fraction of its distance to each cluster over that sum. Finally, we could normalize the matrix to tag each data point with its membership in the different clusters. Once we know the membership of each word in the different clusters, we could populate the index with the words that maximize cluster membership.
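Here is a minimal sketch of that membership computation in Python. The distance matrix below is made up for illustration; in practice it would come from WordNet similarity distances or Euclidean distances to centroids.

```python
import numpy as np

def cluster_membership(dist):
    """dist: (n_points, n_clusters) array of distances in [0, 1].
    Returns a row-normalized membership matrix where smaller
    distances yield larger memberships."""
    eps = 1e-9                                   # guard against division by zero
    total = dist.sum(axis=1, keepdims=True)      # sum of distances per point
    inv = total / (dist + eps)                   # inverse of the fraction dist/total
    return inv / inv.sum(axis=1, keepdims=True)  # normalize each row to 1

# Example: 3 candidate words, 2 concept clusters (hypothetical distances).
dist = np.array([[0.1, 0.9],
                 [0.5, 0.5],
                 [0.8, 0.2]])
membership = cluster_membership(dist)
labels = membership.argmax(axis=1)   # tag each word to its best cluster
print(membership.round(2), labels)
```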
Monday, February 11, 2013
Glossary
Glossary of terms in computing
Automation: This refers to tasks that have been specified in a way that lets them run without manual intervention. It comes in very useful for test setup, execution, teardown, storing results, and generating reports. Such tasks are frequently delegated to a test harness.
Balancing: This refers to sharing the load between two or more processors or units of control so that work is distributed, resulting in more concurrency and faster execution as well as fault tolerance.
Cloud Computing: This refers to the use of one or more virtual machines in a hosted environment, where the provider of the VMs, their network, and the blades and hardware takes away the onus of maintaining the application on a proprietary system, while assuring high availability and geographic reachability.
Degree of Parallelism: This refers to the number of concurrent units of control invoked in the execution of a task, usually by partitioning the data.
Execution: This refers to the actions needed to complete a task. The actions are usually outlined ahead of time, together with the context and data to operate on, and run in a tightly controlled environment.
Flow diagrams: This refers to diagrams of either the control or the data flows in a system, where system interactions and vulnerabilities can be studied by observing how the data changes, and under what controls, as it moves through the system. This is one way to white-box a system, in that its internal workings are made clear with examples of data and control. Such diagrams are useful for threat analysis.
Gigabyte: This is a unit of storage equal to 10^9 bytes and is a common way to describe the size of, say, files or databases.
Hashing: This is a technique for making lookups faster by keeping track of where insertions were made for entries in a collection, so that they can be retrieved at a fixed cost. Typically, a range of hash values is used; a hash is computed on the given object, which is then put into the corresponding bucket. Hashing is very effective when a good hash function is used; otherwise, chaining and linear search may result.
Instruction: This is a specific task given to the processor. It can be considered 'bare metal' or 'native' in that it is independent of any programming language and closely tied to the hardware for which it is specified.
Jump: This is a specific instruction that tells the processor to move from the next instruction in the sequence to one outside the sequence.
Keys, public and private: This refers to cryptographic key pairs, usually used for encryption, such that only the sender and receiver can interpret the encrypted traffic. The private key, also called the secret, is used for decrypting, while the public key can be shared.
Locking: This refers to locking a set of instructions or resources to provide mutual exclusion, preventing clobbering by interfering reads or writes. Interestingly, lock-free implies a guarantee of atomic reads and writes without conflicts, achieved by tightly controlling the sequence in which they occur but without the use of locks.
Monitoring: This is ensuring the continuous availability of, say, a web service by frequently checking for heartbeat events or by calling the service directly at regular intervals.
Nightly run: This refers to running large analytical workloads, such as OLAP, on a system at a time when there is less chance of conflict with update statements.
Proxy: This is a software entity that sits between a sender and a receiver and usually passes the traffic through. Proxies can be either transparent or opaque and are commonly used with HTTP and HTTPS.
Query: This is the language for retrieving data from a data store; queries can be data-definition as well as data-manipulation in nature. They can be as granular as a batch or a statement in T-SQL.
Resource: A resource is known by its cost and is helpful for management. The term can refer to the memory or CPU required for execution, and can loosely be used for the humans assigned to different tasks.
Scale out/up: A scale-out refers to additional servers added horizontally, say to a cluster, whereas a scale-up increases the capacity of a single server in terms of memory or processors.
Ticket: A ticket is a snapshot that captures enough information to identify and track an incident. It can be associated with the user on whose behalf it was created, and it is tracked as it is worked on and closed.
User: A user of a computer system knows only the functionality, not the mechanics, of the system. In other words, the user should not be expected to figure out whether the system failed to complete a specified action or what the remedial actions are. The user sees the system only as a tool for solving a problem.
Violation, access: An access violation occurs when a bad address is specified to the processor; most commonly it is a null reference.
Zero: A zero is the state of a bit in which there is no signal, and it is a convenient test for the Boolean state false. It provides a contrast to a signal.
Friday, February 8, 2013
Abstract Data Types
Abstract Data Types (ADTs) can be used directly in the relational model. They don't change the algebra, just the expressions over attribute values. However, deeper support for non-relational structured types is provided via XML, with languages like XPath and XQuery. The following three approaches enumerate some of the ways structured types like XML can be handled. The first is to build a custom database system that operates on data with structured types, but this is perhaps more laborious and intensive than supporting XML in the relational model. The second approach is to treat the complex type as an ADT, where, say, a column of type XML is used to store one XML document per row; all operations on the XML are handled outside the query optimizer. A third approach is for the DBMS to normalize the nested structure into a set of relations upon insertion, which is often called shredding. With shredding, the sub-objects are mapped with foreign keys, and ordering information between elements at the same level is removed, which improves performance.
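As a rough illustration of the shredding approach, here is a minimal Python sketch. The single-table element schema is an assumption for illustration, not how any particular DBMS shreds XML:

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(xml_text, conn):
    """Shred an XML document into a single 'element' relation.
    Each row gets a surrogate key and a foreign key to its parent."""
    conn.execute("""CREATE TABLE IF NOT EXISTS element (
                        id INTEGER PRIMARY KEY,
                        parent_id INTEGER,   -- foreign key to element.id
                        tag TEXT,
                        text TEXT)""")
    root = ET.fromstring(xml_text)
    next_id = [0]

    def walk(node, parent_id):
        next_id[0] += 1
        my_id = next_id[0]
        conn.execute("INSERT INTO element VALUES (?, ?, ?, ?)",
                     (my_id, parent_id, node.tag, (node.text or "").strip()))
        for child in node:       # note: sibling order is not stored
            walk(child, my_id)

    walk(root, None)
    conn.commit()

conn = sqlite3.connect(":memory:")
shred("<book><title>ADTs</title><author>anon</author></book>", conn)
for row in conn.execute("SELECT * FROM element"):
    print(row)
```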
Thursday, February 7, 2013
a new way to tag words
When you want to select keywords in a text, there are several techniques available, such as linguistic search, as mentioned earlier. Usually words are tagged with the various parts of speech (POS) available from, say, WordNet. I propose a slightly different tag that might make document indexing easier. This is a very simple metric based on how frequently these words are used across the internet. With this metric, I'm not suggesting a better way to find hapaxes. Instead, I'm proposing to differentiate the words that occur frequently in a text block but are not common elsewhere.
The metric could be a simple Poisson distribution, where the formula is

P(k) = (λ^k · e^(−λ)) / k!

where k is the number of occurrences of the word, k! is the factorial of k, and λ is a positive real number.
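A quick sketch of how such a tag might be computed, assuming λ is estimated from a word's background rate on the web; the rates and the default below are made up for illustration:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of seeing k occurrences under a Poisson(lam) model."""
    return (lam ** k) * exp(-lam) / factorial(k)

# Hypothetical: expected occurrences per 1,000 words, from web-scale counts.
background_rate = {"the": 60.0, "lattice": 0.05}

def surprise_tag(word, k):
    """A low probability means the word is unusually frequent for this block."""
    lam = background_rate.get(word, 0.1)   # assumed default rate
    return poisson_pmf(k, lam)

print(surprise_tag("lattice", 4))   # rare word seen 4 times: tiny probability
print(surprise_tag("the", 60))      # common word at its expected rate
```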
Tuesday, February 5, 2013
indexing methods
Among the various algorithms for automated document indexing, the following variation could be considered:
1) Use clusters based on the number of words sharing similar concepts and their cumulative frequency. (So this is a histogram based on topic, with weights assigned to individual candidates based on the fraction of occurrences of each candidate over the overall occurrences of all the words in its cluster; a short sketch of this weighting follows below.) This idea is based on the paper "Document Indexing with a Concept Hierarchy" by Gelbukh, Sidorov, and Guzmán. However, this variation plans to use existing concepts, as detected by clustering functions, using any dictionary or ontology to find relevance between index candidates.
One way to improve performance is to keep the entire dictionary in memory, or perhaps just those pages that are relevant to the candidates at hand. To find those pages, we know the synsets of the words and their hierarchy. Alternatively, we could load fixed-size sections of the dictionary into memory, one at a time. This way only the sections are swapped in and out as required for processing.
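A minimal sketch of the cluster-weight idea, assuming the cluster assignments have already been produced by some clustering function; the clusters below are made up:

```python
from collections import Counter

# Hypothetical output of a clustering step: concept -> member word occurrences.
clusters = {"vehicle": ["car", "truck", "car", "bus"],
            "color":   ["red", "red", "blue"]}

def candidate_weights(clusters):
    """Weight of a candidate = its occurrences / overall occurrences
    of all words in its cluster (a per-topic histogram)."""
    weights = {}
    for concept, words in clusters.items():
        counts = Counter(words)
        total = sum(counts.values())
        for word, k in counts.items():
            weights[(concept, word)] = k / total
    return weights

for (concept, word), w in candidate_weights(clusters).items():
    print(f"{concept}/{word}: {w:.2f}")
```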
Monday, February 4, 2013
Design of skip lists
There are sample implementations of skip lists in many languages:
In C++, here is one reference.
In C#, here is another reference.
In Python, here is another reference.
The common features across all these implementations are as follows:
1. Each skip list node has a height and a set of neighboring skip list nodes, precisely as many neighboring nodes as its height, some of which may be null.
2. Each skip list node has some piece of data associated with it.
3. Each skip list node has a key used to look it up.
The skip list is initialized with the maximum height of any node.
The head and tail are initialized and attached.
Insert operation:
Find where the node goes by comparing keys.
If it's a duplicate, don't insert; otherwise perform the insert.
Use a random number to generate the new node's height.
Remove operation:
Find the node to be deleted, traversing from the highest available level down to the lowest until you reach the node.
For each of the levels on the current node, update the forward references that point to it, then delete the node.
If the node is not found, return false.
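Putting these operations together, here is a minimal, self-contained Python sketch; it is a simplified variant that uses a head sentinel but no explicit tail node, and a coin flip to generate heights:

```python
import random

class Node:
    def __init__(self, key, value, height):
        self.key, self.value = key, value
        self.next = [None] * height      # one forward pointer per level

class SkipList:
    def __init__(self, max_height=16, p=0.5):
        self.max_height = max_height
        self.p = p
        self.head = Node(None, None, max_height)   # sentinel head

    def _random_height(self):
        h = 1
        while h < self.max_height and random.random() < self.p:
            h += 1
        return h

    def _find_prev(self, key):
        """Collect the rightmost node before `key` at every level."""
        prev = [None] * self.max_height
        node = self.head
        for lvl in range(self.max_height - 1, -1, -1):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]
            prev[lvl] = node
        return prev

    def insert(self, key, value):
        prev = self._find_prev(key)
        cand = prev[0].next[0]
        if cand and cand.key == key:
            return False                  # duplicate: don't insert
        node = Node(key, value, self._random_height())
        for lvl in range(len(node.next)):
            node.next[lvl] = prev[lvl].next[lvl]
            prev[lvl].next[lvl] = node
        return True

    def remove(self, key):
        prev = self._find_prev(key)
        node = prev[0].next[0]
        if not node or node.key != key:
            return False                  # not found
        for lvl in range(len(node.next)):
            prev[lvl].next[lvl] = node.next[lvl]
        return True

    def search(self, key):
        prev = self._find_prev(key)
        node = prev[0].next[0]
        return node.value if node and node.key == key else None

sl = SkipList()
for k in [3, 1, 2]:
    sl.insert(k, f"value-{k}")
print(sl.search(2))     # value-2
print(sl.remove(2))     # True
print(sl.search(2))     # None
```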
Sunday, February 3, 2013
automatic document indexing
Automatic document indexing of an unstructured document involves cherry-picking words from various parts of the document and classifying them. Research has proposed statistical and linguistic methods, machine learning, neural networks, self-organizing maps, and connectionist models. Among the linguistic methods, natural language processing libraries have enabled semantic interpretation of index candidates. In the absence of other input in the form of mark-up, metadata, or punctuation, the index candidates could be picked based on clusters and outliers among the topics presented in the text. To detect the semantic relevance of the words, we can use concept hierarchy structures such as those from WordNet. In a set of related words representing a cluster, all other things being equal, the one appearing in the text with the highest frequency can be picked.
Design: I propose a single table with a record for every word occurring in the document, along with its attributes. Attributes will include the following: word offsets, page numbers, frequency, tags, emphasis, etc. Candidates will be unigrams. Classifier logic could be written in separate modules to determine the clusters and outliers in the above table. Tables will be implemented as skip lists in memory for fast traversal, insert, and update. Document parsing, word stemming, canonicalization, and population of the table is the first step. Clustering and index selection is the next step.
Advantages: The index selected will be semantically organized and connected. Entries and sub-entries can be computed with different interchangeable classifier algorithms. Index candidates could be a common subset of various classifiers.
Disadvantages: The index selected will be single-word representations. Clusters could miss newly introduced words that are not in the dictionary or ontology used. Outliers could miss an idea that is not mentioned anywhere else.
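A rough sketch of the first step of this design, namely parsing, stemming, and populating the word table. Here an ordinary dict stands in for the in-memory skip list, the stemmer is NLTK's PorterStemmer (assumes nltk is installed), and the words_per_page parameter is a made-up stand-in for real pagination:

```python
import re
from collections import defaultdict
from nltk.stem import PorterStemmer   # assumes nltk is installed

stemmer = PorterStemmer()

def build_word_table(text, words_per_page=300):
    """One record per canonical word: offsets, page numbers, frequency.
    A skip list keyed on the word would replace this dict in the design."""
    table = defaultdict(lambda: {"offsets": [], "pages": set(), "frequency": 0})
    for word_idx, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        word = stemmer.stem(match.group().lower())   # stem + canonicalize
        rec = table[word]
        rec["offsets"].append(match.start())
        rec["pages"].add(word_idx // words_per_page + 1)
        rec["frequency"] += 1
    return table

table = build_word_table("Indexing documents requires indexed words.")
print(table["index"])   # 'indexing' and 'indexed' both canonicalize to 'index'
```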