Sunday, January 26, 2014

In this post, we will continue the discussion on AWS. The AWS cloud services platform consists of Database; Storage and CDN; Cross-Service; Analytics; Compute and Networking; Deployment and Management; and Application Services.
The AWS global physical infrastructure consists of geographical regions, availability zones and edge locations.
The databases offered by AWS include DynamoDB, a predictable and scalable NoSQL store; ElastiCache, an in-memory cache; RDS, a managed relational database; and Redshift, a managed petabyte-scale data warehouse. Storage and CDN consist of S3, scalable storage in the cloud; EBS, a network-attached block device; CloudFront, a global content delivery network; Glacier, archive storage in the cloud; Storage Gateway, which integrates on-premises IT with cloud storage; and Import/Export, which ships large datasets.
The cross-service offerings include Support; Marketplace, where you can buy and sell software apps; the Management Console, a UI to manage AWS services; SDKs; IDE toolkits; and CLIs. Analytics includes Elastic MapReduce, a popular managed Hadoop framework; Kinesis, for real-time data processing; and Data Pipeline, an orchestration service for data-driven workflows.
Compute and networking include EC2, the most popular way to access a large array of virtual servers directly, including remote desktop with admin access; VPC, an isolated virtual private cloud network; ELB, a load-balancing service; WorkSpaces, a virtual desktop in the cloud; Auto Scaling, which automatically scales capacity up and down; Direct Connect, a dedicated network connection to AWS; and Route 53, a scalable domain name system.
The deployment and management services include CloudFormation, for templated AWS resource creation; CloudWatch, for resource and application monitoring; Elastic Beanstalk, an AWS application container; IAM, for secure AWS access control; CloudTrail, for user activity logging; OpsWorks, a DevOps application management service; and CloudHSM, hardware-based key storage for compliance.
In this post, we review a white paper on AWS from their website.
AWS solves IT infrastructure needs. Applications have evolved from desktop-centric installations to client/server models, then to loosely coupled web services, and then to service-oriented architectures. Each step increases the scope and dimension of the infrastructure. Cloud computing builds on many of these advances, such as virtualization and failover. As Gartner puts it, cloud computing delivers scalable and elastic IT-enabled capabilities as a service to external customers using Internet technologies.
The capabilities include compute power, storage, databases, messaging and other building block services that run business applications.
When coupled with a utility-style pricing and business model, cloud computing delivers an enterprise-grade IT infrastructure in a reliable, timely, and cost-effective manner.
Cloud computing is about outsourcing infrastructure services while keeping them decentralized. Development teams can access compute and storage resources on demand. Using AWS, you can request compute power, storage, and services in minutes.
AWS is known to be
Flexible - different programming models, operating systems, databases, and architectures can be used. Application developers don't have to learn new skills, and SOA-based solutions with heterogeneous implementations can be built.
Cost-effective - with AWS, organizations pay only for what they use. Costs such as power, cooling, real estate, and staff are no longer borne by the organization. There is no up-front investment or long-term commitment, and only minimal spend to get started.
Scalable and elastic - organizations can quickly add and subtract AWS resources in order to meet customer demand and manage costs. AWS can absorb a spike or two in traffic without hampering normal business operations.
Secure - AWS provides end-to-end security and end-to-end privacy. Confidentiality, integrity, and availability of customer data are of the utmost importance to AWS, as is maintaining customer trust and confidence. AWS takes the following approaches to secure the cloud infrastructure:
certifications and accreditations - in the realm of public sector certifications
physical security - it has many years of experience designing, constructing and operating large scale data centers.
secure services - each service is designed to restrict unauthorized access or usage without sacrificing the flexibility customers demand.
privacy - encrypt personal and business data in the AWS cloud.
Experienced - when using AWS, organizations can leverage Amazon's years of experience in this field.
AWS provides a low friction path to cloud computing. Scaling on demand is an advantage.

Saturday, January 25, 2014

In this post, we complete the discussion on Teradata and its SQL, covering stored procedures briefly. Teradata follows SQL conventions, so writing stored procedures is similar to writing them elsewhere, but Teradata also includes a comprehensive stored procedure language. Procedures are stored in either a database's or a user's PERM space and are invoked with the CALL statement. A procedure may return one or more values to the client as parameters; IN, OUT, and INOUT parameters can be used.
Triggers can be authored similarly. A trigger is an event-driven maintenance operation: when a table is modified, the trigger executes. The original data modification, the trigger, and any subsequent triggers are all part of the same transaction. Triggers can be specified as 'FOR EACH STATEMENT' as well as for each modified row.
The hashing functions on Teradata are also very popular. The HASHROW function produces the 32-bit binary row hash that is stored as part of the data row, and it can be executed on data column values. If the count of distinct row-hash values over a column matches the row count (a ratio of 1), it is a great sign that the data is evenly distributed or even unique.
The HASHBUCKET function produces the binary hash bucket number (16 bits on older systems, for 65,536 buckets; 20 bits on newer systems, for just over 1,000,000) that is used with the hash map to determine the AMP that should store and retrieve the data row. The HASHAMP function returns the identification number of the primary AMP for any hash bucket number, and HASHBAKAMP returns the identification number of the fallback AMP. Together they are a great way to see the distribution of primary and fallback rows.
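As a quick illustration, here is a hedged sketch in Python of the usual distribution check built from these functions, assuming a DB-API style connection (for example one opened with the teradatasql driver); the table and column names are made up:

def rows_per_amp(con, table, index_cols):
    # Hypothetical helper: counts rows per primary AMP for the given table,
    # using Teradata's HASHROW -> HASHBUCKET -> HASHAMP chain.
    sql = (
        "SELECT HASHAMP(HASHBUCKET(HASHROW({cols}))) AS amp_no, COUNT(*) AS row_count "
        "FROM {table} GROUP BY 1 ORDER BY 2 DESC"
    ).format(cols=index_cols, table=table)
    cur = con.cursor()
    cur.execute(sql)
    return cur.fetchall()   # a skewed table shows a few AMPs with far larger counts

# example (hypothetical names): rows_per_amp(con, "sales_db.orders", "order_id")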
The Batch Teradata Query tool (BTEQ) has been around for some time. It is a report writer that imports and exports data from a Teradata server one row at a time. Queries can be run interactively or in a batch script; after you log on, you can execute queries directly in interactive mode. The WITH and WITH...BY clauses are available for totaling and sub-totaling, and scripts can be batched and saved in a text file. This concludes our reading of the book on Teradata.

 
Having looked at occurrence and co-occurrence based keyword selection, let's recap: both Zipf's law of word occurrences and the refined studies of Booth demonstrate that mid-frequency terms are closely related to the conceptual content of a document. Therefore, it is possible to hypothesize that terms close to the transition point (TP) can be used as index terms, where TP = (-1 + square-root(8 * I + 1)) / 2 and I is the number of words with frequency 1. Alternatively, TP can be found as the first frequency that is not repeated when the vocabulary is sorted by non-increasing frequency.
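A minimal sketch of both ways of locating the transition point, assuming a plain Python dictionary of term frequencies (the helper names are made up):

from collections import Counter
from math import sqrt

def transition_point(freqs):
    # closed form: TP = (-1 + sqrt(8*I + 1)) / 2, with I = number of words of frequency 1
    I = sum(1 for f in freqs.values() if f == 1)
    return (-1 + sqrt(8 * I + 1)) / 2

def transition_point_by_repetition(freqs):
    # alternative: first frequency that is not repeated in the non-increasing frequency list
    ordered = sorted(freqs.values(), reverse=True)
    counts = Counter(ordered)
    return next((f for f in ordered if counts[f] == 1), None)

# terms whose frequency lies close to TP are then kept as candidate index terms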
 We will now attempt to explain a simple clustering based method for co-occurrence based keyword selection.
We start with a co-occurrence matrix in which we count pairwise term occurrences. The matrix is N x M, where N is the total number of terms and M is the number of words selected as keywords using the transition point above, with M << N. We ignore the diagonal of this matrix. The row-wise total for a term is the sum of all its co-occurrences with the selected set.
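A sketch of building such a matrix, assuming the documents are already split into sentences of tokens and that `selected` holds the M keywords chosen near the transition point (all names here are illustrative):

import numpy as np

def cooccurrence_matrix(sentences, vocab, selected):
    # rows: all N terms; columns: the M selected keywords; cell (i, j) counts how
    # often term i appears in a sentence together with selected keyword j
    row = {t: i for i, t in enumerate(vocab)}
    col = {t: j for j, t in enumerate(selected)}
    C = np.zeros((len(vocab), len(selected)))
    for sent in sentences:
        terms = set(sent) & set(row)
        for t in terms:
            for kw in terms & set(col):
                if t != kw:                     # ignore the diagonal
                    C[row[t], col[kw]] += 1
    return C    # a row's sum is that term's total co-occurrence with the selected set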
If terms w1 and w2 have a similar distribution of co-occurrences with the other terms, w1 and w2 are considered to be in the same cluster. Similarity is measured with the Kullback-Leibler divergence, KLD(p || q) = sum over all terms t of p(t) * log(p(t) / q(t)), where p and q are the (normalized) attribute-wise co-occurrence rows from the matrix we populated. We then use the standard k-means clustering method: the initial cluster centroids are chosen far from each other, the terms are assigned to the nearest cluster, the centroids are recomputed, and the cycle is repeated until the cluster centers stabilize.
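Continuing the sketch, here is a hedged version of the divergence and the clustering loop over the rows of the matrix built above (random initial centroids and a small smoothing constant are my simplifications, not the source's):

import numpy as np

def kld(p, q, eps=1e-12):
    # Kullback-Leibler divergence between two co-occurrence distributions (counts are normalized)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def cluster_terms(C, k, iters=20):
    # k-means-style loop over the rows of the co-occurrence matrix C
    rng = np.random.default_rng(0)
    centroids = C[rng.choice(len(C), size=k, replace=False)]
    for _ in range(iters):
        labels = np.array([np.argmin([kld(row, c) for c in centroids]) for row in C])
        centroids = np.array([C[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels    # terms sharing a label are taken to share a latent component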
The distribution of co-occurring terms for a term t, in particular, is given by the weighted average of the attribute-wise distributions:
P(t) = Sum over z of (n_z / n) * (n(t, z) / n_z)
     = (1 / n) * Sum over z of n(t, z)
where n(t, z) is the number of co-occurrences of t with the selected term z, n_z is the total count for z, and n is the overall total.

 

Friday, January 24, 2014


We continue with the vector space model next. In this model, we represent the documents of a collection by vectors whose weights correspond to the terms appearing in each document. The similarity between two documents d1 and d2 with respect to a query q is measured as the cosine of the angle between each document and the query; a document is said to be closer to the query when this angle is smaller, that is, when its cosine is larger. Each query is modeled as a vector in the same attribute space as the documents.
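A minimal sketch of cosine ranking under this model, assuming documents and the query are already represented as weight vectors over the same terms (names are illustrative):

import numpy as np

def cosine(a, b):
    # cosine of the angle between two term-weight vectors
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def rank(doc_vectors, query_vector):
    # higher cosine = smaller angle = more relevant; return document indices in that order
    scores = [cosine(d, query_vector) for d in doc_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)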
Groups of documents that are similar with respect to the users' information needs can be obtained by clustering; the documents are represented as points in a multidimensional space whose dimensions are the selected features. There are many clustering algorithms, and almost any of them will produce clusters, so their fitness or quality has to be measured.
Sometimes it is better to reduce the number of dimensions without significant loss of information; this transformation to a subspace helps with clustering and reduces the computation. Another technique uses principal component analysis: here the document and query vectors are projected onto a different subspace spanned by k principal components, where the principal components are defined by a matrix such as the covariance matrix.
Notice that reducing the number of dimensions this way does not shift the origin, only the number of dimensions, so the vectors keep the same relative spacing. Principal component analysis, on the other hand, shifts the origin to the center of the data in the subspace, so the documents are better differentiated.
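Below is a minimal runnable sketch of the two projections, assuming `vectors` is a dense documents-by-terms numpy array and `k` is the target number of dimensions: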

import numpy as np

def LSI(vectors, k):  # project onto the top-k right singular vectors (k largest singular values)
    _, _, Vt = np.linalg.svd(vectors, full_matrices=False)
    return vectors @ Vt[:k].T

def COV(vectors, k):  # project mean-centered vectors onto the k leading principal components
    X = vectors - vectors.mean(axis=0)
    w, V = np.linalg.eigh(np.cov(X, rowvar=False))
    return X @ V[:, np.argsort(w)[::-1][:k]]
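Both sketches keep to dense numpy arrays for readability; on a large, sparse term-document collection one would instead compute only the few largest singular values with a truncated SVD, since the full decomposition does not scale.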
In this post, we will begin with a way to differentiate between LSI and PLSI. Latent semantic indexing is an approach where we try to associate terms with the same main hidden semantic component, as found from co-occurrences in similar contexts; LSI finds these components with a linear-algebraic decomposition based on singular values. In PLSI, we use similar occurrence and co-occurrence information, but with non-negative weights that sum to 1, i.e., conditional probabilities. Comparing and clustering these co-occurrence distributions gives us the latent components, represented by the centers of the clusters.
We use similar clustering methods with document classification as well. Formally, we use the vector space model. This model represents documents of a collection by using vectors with weights according to the terms appearing in each document.
The vector space model comes in both boolean and term-weighted variants. In the boolean model, the presence or absence of each term is given a boolean value. In the term-weighted model, the same term may be assigned different weights by different weighting methods, such as tf-idf or frequency ratios. When the weights add up to one, they have been normalized; typically probabilities or conditional probabilities that sum to one are used.
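A short sketch of one such weighting, tf-idf, assuming the documents are given as lists of tokens (this is the usual formulation, not necessarily the exact variant the surveyed papers use):

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))    # document frequency of each term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf})
    return vectors

# the weights can then be normalized so that each document's weights sum to one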
One of the advantages of the vector space model is that it enables relevance ranking of documents of heterogeneous formats with respect to user queries, as long as the attributes are well-defined characteristics of the documents.

Thursday, January 23, 2014

In continuation of our discussion on keywords and documents:
We looked at a few key metrics for selecting keywords based on content, frequency and distribution.
However, keywords can also be selected based on the co-occurrence of similar terms in the same context,
i.e., words with similar meanings will occur with similar neighbors.

This model proposes that the distribution of a word together with other words and phrases is highly indicative of its meaning. The method represents the meaning of a word with a vector where each feature corresponds to a context and its value to the frequency of the word occurring in that context.
This vector is referred to as the term profile. The profile P(t) of a term t is the set of terms from the list T that co-occur in sentences together with t, that is, P(t) = {t' : t' belongs to s(t)}, where s(t) is the set of sentences containing t.
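A small sketch of building these profiles, assuming sentences are lists of tokens and T is the term list of interest (names are illustrative):

def term_profiles(sentences, T):
    # profile P(t): the set of terms from T co-occurring with t in at least one sentence
    terms = set(T)
    profiles = {t: set() for t in terms}
    for sent in sentences:
        present = terms & set(sent)
        for t in present:
            profiles[t] |= present - {t}
    return profiles

# counting occurrences instead of collecting a set yields the frequency-valued
# context vector described above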

Once such a representation is built, machine learning techniques can be used to automatically classify or cluster words according to their meaning. However, these representations are often noisy, owing to polysemy and synonymy. Synonymy refers to the existence, in most languages, of several terms that express the same idea or object. Polysemy refers to the fact that some words have multiple unrelated meanings. If we ignore synonymy, we end up with many small disjoint clusters, some of which could have been merged; if we ignore polysemy, unrelated documents may be clustered together.

Another challenge in applying this technique to massive databases is the efficient and accurate computation of the few largest singular values and their vectors in a highly sparse matrix. In practice, scalability beyond a few hundred thousand documents is difficult.

Note that the technique above inherently tries to find keywords based on latent semantics and hence is referred to as latent semantic indexing. The analogue of clustering in this approach (in its probabilistic variant) is to find conditional probability distributions such that the observed word distributions can be decomposed into a few latent distributions plus a noisy remainder. The latent components are expected to be the same for co-occurring words.

Next, we will also look at a popular way to represent documents in a collection: the vector space model.

Material read from survey of text mining, papers by Pinto, papers by Wartena