Wednesday, February 13, 2013
libraries for data mining
Some of the data mining packages available on Google Code are nifty utilities for mining 2D data. They come with implementations of various algorithms and techniques that can be used directly with different kinds of data. One such approach involves computing cluster centroids and points, then finding the Euclidean distance between the points and the centroids. The Euclidean distance is the square root of the sum of the squares of the x and y offsets from the centroid. In the case of finding keywords in a text, the WordNet similarity distance between the selected candidates could be considered instead, populating a 2D array (points, clusters) of similarities, each ranging between 0 and 1. We could iterate through all these points to compute, for each point, the sum of its distances to all of the clusters. Then, for the same point, we could compute the inverse of the fraction of its distance to each cluster over that sum. Finally, we could normalize the matrix to tag each data point with its membership in the different clusters. Once we know the membership of each word in the different clusters, we could populate the index with the words that maximize cluster membership.
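Here is a minimal sketch of that membership computation in Python. The distance matrix below is made up for illustration; in practice it would come from WordNet similarity distances or Euclidean distances to centroids.

```python
import numpy as np

def cluster_membership(dist):
    """dist: (n_points, n_clusters) array of distances in [0, 1].
    Returns a row-normalized membership matrix where smaller
    distances yield larger memberships."""
    eps = 1e-9                                   # guard against division by zero
    total = dist.sum(axis=1, keepdims=True)      # sum of distances per point
    inv = total / (dist + eps)                   # inverse of the fraction dist/total
    return inv / inv.sum(axis=1, keepdims=True)  # normalize each row to 1

# Example: 3 candidate words, 2 concept clusters (hypothetical distances).
dist = np.array([[0.1, 0.9],
                 [0.5, 0.5],
                 [0.8, 0.2]])
membership = cluster_membership(dist)
labels = membership.argmax(axis=1)   # tag each word to its best cluster
print(membership.round(2), labels)
```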
Monday, February 11, 2013
Glossary
Glossary of terms in computing
Automation: This refers to tasks that have been specified in a way that lets them run without manual intervention. It comes in very useful for test setup, execution, teardown, storing results, and generating reports. Such tasks are frequently delegated to a test harness.
Balancing: This refers to sharing the load between two or more processors or units of control so that work is distributed, resulting in more concurrency and faster execution as well as fault tolerance.
Cloud Computing: This refers to the use of one or more virtual machines in a hosted environment, where the provider of the VMs, their network, and the blades and hardware takes away the onus of maintaining the application on a proprietary system, while assuring high availability and geographic reachability.
Degree of Parallelism: This refers to the number of concurrent units of control invoked in the execution of a task, usually by partitioning the data.
Execution: This refers to the actions needed to complete a task. The actions are usually outlined ahead of time, together with the context and data to operate on, and run in a tightly controlled environment.
Flow diagrams: This refers to diagrams of either the control or the data flows in a system, where system interactions and vulnerabilities can be studied by observing how the data changes, and under what controls, as it moves through the system. This is one way to white-box a system, in that its internal workings are made clear with examples of data and control. Such diagrams are useful for threat analysis.
Gigabyte: This is a unit of storage equal to 10^9 bytes and is a common way to describe the size of, say, files or databases.
Hashing: This is a technique for making lookups faster by keeping track of where insertions were made for entries in a collection, so that they can be retrieved at a fixed cost. Typically, a range of hash values is used; a hash is computed on the given object, which is then put into the corresponding bucket. Hashing is very effective when a good hash function is used; otherwise, chaining and linear search may result.
Instruction: This is a specific task given to the processor. It can be considered 'bare metal' or 'native' in that it is independent of any programming language and closely tied to the hardware for which it is specified.
Jump: This is a specific instruction that tells the processor to move from the next instruction in the sequence to one outside the sequence.
Keys, public and private: This refers to cryptographic key pairs, usually used for encryption, such that only the sender and receiver can interpret the encrypted traffic. The private key, also called the secret, is used for decrypting, while the public key can be shared.
Locking: This refers to locking a set of instructions or resources to provide mutual exclusion, preventing clobbering by interfering reads or writes. Interestingly, lock-free implies a guarantee of atomic reads and writes without conflicts, achieved by tightly controlling the sequence in which they occur but without the use of locks.
Monitoring: This is ensuring the continuous availability of, say, a web service by frequently checking for heartbeat events or by calling the service directly at regular intervals.
Nightly run: This refers to running large analytical workloads, such as OLAP, on a system at a time when there is less chance of conflict with update statements.
Proxy: This is a software entity that sits between a sender and a receiver and usually passes the traffic through. Proxies can be either transparent or opaque and are commonly used with HTTP and HTTPS.
Query: This is the language for retrieving data from a data store; queries can be data-definition as well as data-manipulation in nature. They can be as granular as a batch or a statement in T-SQL.
Resource: A resource is known by its cost and is helpful for management. The term can refer to the memory or CPU required for execution, and can loosely be used for the humans assigned to different tasks.
Scale out/up: A scale-out refers to additional servers added horizontally, say to a cluster, whereas a scale-up increases the capacity of a single server in terms of memory or processors.
Ticket: A ticket is a snapshot that captures enough information to identify and track an incident. It can be associated with the user on whose behalf it was created, and it is tracked as it is worked on and closed.
User: A user of a computer system knows only the functionality, not the mechanics, of the system. In other words, the user should not be expected to figure out whether the system failed to complete a specified action or what the remedial actions are. The user sees the system only as a tool for solving a problem.
Violation, access: An access violation occurs when a bad address is specified to the processor; most commonly it is a null reference.
Zero: A zero is the state of a bit in which there is no signal, and it is a convenient test for the Boolean state false. It provides a contrast to a signal.
Friday, February 8, 2013
Abstract Data Types
Abstract Data Types (ADTs) can be used directly in the relational model. They don't change the algebra, just the expressions over attribute values. However, deeper support for non-relational structured types is provided via XML, with languages like XPath and XQuery. The following three approaches enumerate some of the ways structured types like XML can be handled. The first is to build a custom database system that operates on data with structured types, but this is perhaps more laborious and intensive than supporting XML in the relational model. The second approach is to treat the complex type as an ADT, where, say, a column of type XML is used to store one XML document per row; all operations on the XML are handled outside the query optimizer. A third approach is for the DBMS to normalize the nested structure into a set of relations upon insertion, which is often called shredding. With shredding, the sub-objects are mapped with foreign keys, and ordering information between elements at the same level is removed, which improves performance.
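As a rough illustration of the shredding approach, here is a minimal Python sketch. The single-table element schema is an assumption for illustration, not how any particular DBMS shreds XML:

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(xml_text, conn):
    """Shred an XML document into a single 'element' relation.
    Each row gets a surrogate key and a foreign key to its parent."""
    conn.execute("""CREATE TABLE IF NOT EXISTS element (
                        id INTEGER PRIMARY KEY,
                        parent_id INTEGER,   -- foreign key to element.id
                        tag TEXT,
                        text TEXT)""")
    root = ET.fromstring(xml_text)
    next_id = [0]

    def walk(node, parent_id):
        next_id[0] += 1
        my_id = next_id[0]
        conn.execute("INSERT INTO element VALUES (?, ?, ?, ?)",
                     (my_id, parent_id, node.tag, (node.text or "").strip()))
        for child in node:       # note: sibling order is not stored
            walk(child, my_id)

    walk(root, None)
    conn.commit()

conn = sqlite3.connect(":memory:")
shred("<book><title>ADTs</title><author>anon</author></book>", conn)
for row in conn.execute("SELECT * FROM element"):
    print(row)
```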
Thursday, February 7, 2013
a new way to tag words
When you want to select keywords in a text, there are several techniques available, such as linguistic search, as mentioned earlier. Usually words are tagged with the various parts of speech (POS) available from, say, WordNet. I propose a slightly different tag that might make document indexing easier. This is a very simple metric based on how frequently these words are used across the internet. With this metric, I'm not suggesting a better way to find hapaxes. Instead, I'm proposing to differentiate the words that occur frequently in a text block but are not common elsewhere.
The metric could be a simple Poisson distribution, where the formula is

P(k) = (λ^k · e^(−λ)) / k!

where k is the number of occurrences of the word, k! is the factorial of k, and λ is a positive real number.
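A quick sketch of how such a tag might be computed, assuming λ is estimated from a word's background rate on the web; the rates and the default below are made up for illustration:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of seeing k occurrences under a Poisson(lam) model."""
    return (lam ** k) * exp(-lam) / factorial(k)

# Hypothetical: expected occurrences per 1,000 words, from web-scale counts.
background_rate = {"the": 60.0, "lattice": 0.05}

def surprise_tag(word, k):
    """A low probability means the word is unusually frequent for this block."""
    lam = background_rate.get(word, 0.1)   # assumed default rate
    return poisson_pmf(k, lam)

print(surprise_tag("lattice", 4))   # rare word seen 4 times: tiny probability
print(surprise_tag("the", 60))      # common word at its expected rate
```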
Tuesday, February 5, 2013
indexing methods
Among the various algorithms for automated document indexing, the following variation could be considered:
1) Use clusters based on the number of words sharing similar concepts and their cumulative frequency. (So this is a histogram based on topic, with weights assigned to individual candidates based on the fraction of occurrences of each candidate over the overall occurrences of all the words in its cluster; a short sketch of this weighting follows below.) This idea is based on the paper "Document Indexing with a Concept Hierarchy" by Gelbukh, Sidorov, and Guzmán. However, this variation plans to use existing concepts, as detected by clustering functions, using any dictionary or ontology to find relevance between index candidates.
One way to improve performance is to keep the entire dictionary in memory, or perhaps just those pages that are relevant to the candidates at hand. To find those pages, we know the synsets of the words and their hierarchy. Alternatively, we could load fixed-size sections of the dictionary into memory, one at a time. This way only the sections are swapped in and out as required for processing.
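A minimal sketch of the cluster-weight idea, assuming the cluster assignments have already been produced by some clustering function; the clusters below are made up:

```python
from collections import Counter

# Hypothetical output of a clustering step: concept -> member word occurrences.
clusters = {"vehicle": ["car", "truck", "car", "bus"],
            "color":   ["red", "red", "blue"]}

def candidate_weights(clusters):
    """Weight of a candidate = its occurrences / overall occurrences
    of all words in its cluster (a per-topic histogram)."""
    weights = {}
    for concept, words in clusters.items():
        counts = Counter(words)
        total = sum(counts.values())
        for word, k in counts.items():
            weights[(concept, word)] = k / total
    return weights

for (concept, word), w in candidate_weights(clusters).items():
    print(f"{concept}/{word}: {w:.2f}")
```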
Monday, February 4, 2013
Design of skip lists
There are sample implementations of skip lists in many languages:
In C++, here is one reference.
In C#, here is another reference.
In Python, here is another reference.
The common features across all these implementations are as follows:
1. Each skip list node has a height and a set of neighboring skip list nodes, precisely as many neighboring nodes as its height, some of which may be null.
2. Each skip list node has some piece of data associated with it.
3. Each skip list node has a key used to look it up.
The skip list is initialized with the maximum height of any node.
The head and tail are initialized and attached.
Insert operation:
Find where the node goes by comparing keys.
If it's a duplicate, don't insert; otherwise perform the insert.
Use a random number to generate the new node's height.
Remove operation:
Find the node to be deleted, traversing from the highest available level down to the lowest until you reach the node.
For each of the levels on the current node, update the forward references that point to it, then delete the node.
If the node is not found, return false.
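Putting these operations together, here is a minimal, self-contained Python sketch; it is a simplified variant that uses a head sentinel but no explicit tail node, and a coin flip to generate heights:

```python
import random

class Node:
    def __init__(self, key, value, height):
        self.key, self.value = key, value
        self.next = [None] * height      # one forward pointer per level

class SkipList:
    def __init__(self, max_height=16, p=0.5):
        self.max_height = max_height
        self.p = p
        self.head = Node(None, None, max_height)   # sentinel head

    def _random_height(self):
        h = 1
        while h < self.max_height and random.random() < self.p:
            h += 1
        return h

    def _find_prev(self, key):
        """Collect the rightmost node before `key` at every level."""
        prev = [None] * self.max_height
        node = self.head
        for lvl in range(self.max_height - 1, -1, -1):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]
            prev[lvl] = node
        return prev

    def insert(self, key, value):
        prev = self._find_prev(key)
        cand = prev[0].next[0]
        if cand and cand.key == key:
            return False                  # duplicate: don't insert
        node = Node(key, value, self._random_height())
        for lvl in range(len(node.next)):
            node.next[lvl] = prev[lvl].next[lvl]
            prev[lvl].next[lvl] = node
        return True

    def remove(self, key):
        prev = self._find_prev(key)
        node = prev[0].next[0]
        if not node or node.key != key:
            return False                  # not found
        for lvl in range(len(node.next)):
            prev[lvl].next[lvl] = node.next[lvl]
        return True

    def search(self, key):
        prev = self._find_prev(key)
        node = prev[0].next[0]
        return node.value if node and node.key == key else None

sl = SkipList()
for k in [3, 1, 2]:
    sl.insert(k, f"value-{k}")
print(sl.search(2))     # value-2
print(sl.remove(2))     # True
print(sl.search(2))     # None
```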
Sunday, February 3, 2013
automatic document indexing
Automatic document indexing of an unstructured document involves cherry-picking words from various parts of the document and classifying them. Research has proposed statistical and linguistic methods, machine learning, neural networks, self-organizing maps, and connectionist models. Among the linguistic methods, natural language processing libraries have enabled semantic interpretation of index candidates. In the absence of other input in the form of mark-up, metadata, or punctuation, the index candidates could be picked based on clusters and outliers among the topics presented in the text. To detect the semantic relevance of the words, we can use concept hierarchy structures such as those from WordNet. In a set of related words representing a cluster, all other things being equal, the one appearing in the text with the highest frequency can be picked.
Design: I propose a single table with a record for every word occurring in the document, along with its attributes. Attributes will include the following: word offsets, page numbers, frequency, tags, emphasis, etc. Candidates will be unigrams. Classifier logic could be written in separate modules to determine the clusters and outliers in the above table. Tables will be implemented as skip lists in memory for fast traversal, insert, and update. Document parsing, word stemming, canonicalization, and population of the table is the first step. Clustering and index selection is the next step.
Advantages: The index selected will be semantically organized and connected. Entries and sub-entries can be computed with different interchangeable classifier algorithms. Index candidates could be a common subset of various classifiers.
Disadvantages: The index selected will be single-word representations. Clusters could miss newly introduced words that are not in the dictionary or ontology used. Outliers could miss an idea that is not mentioned anywhere else.
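A rough sketch of the first step of this design, namely parsing, stemming, and populating the word table. Here an ordinary dict stands in for the in-memory skip list, the stemmer is NLTK's PorterStemmer (assumes nltk is installed), and the words_per_page parameter is a made-up stand-in for real pagination:

```python
import re
from collections import defaultdict
from nltk.stem import PorterStemmer   # assumes nltk is installed

stemmer = PorterStemmer()

def build_word_table(text, words_per_page=300):
    """One record per canonical word: offsets, page numbers, frequency.
    A skip list keyed on the word would replace this dict in the design."""
    table = defaultdict(lambda: {"offsets": [], "pages": set(), "frequency": 0})
    for word_idx, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        word = stemmer.stem(match.group().lower())   # stem + canonicalize
        rec = table[word]
        rec["offsets"].append(match.start())
        rec["pages"].add(word_idx // words_per_page + 1)
        rec["frequency"] += 1
    return table

table = build_word_table("Indexing documents requires indexed words.")
print(table["index"])   # 'indexing' and 'indexed' both canonicalize to 'index'
```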