Wednesday, January 29, 2014

In today's post, I want to briefly discuss the differences between SOAP-based and REST-based services.
SOAP follows a messaging convention: data is exchanged as XML messages, and operations are typically exposed as methods on objects. Data can be exchanged as both text and binary, and specialized tools are needed to inspect the messages.
REST, by contrast, is a lightweight architectural style: it describes resources and maps operations onto a small set of predefined verbs, usually the HTTP verbs. Instead of a heavyweight XML format, it can use JSON, and its traffic can be inspected with standard HTTP tools such as Fiddler.
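For illustration, here is a minimal sketch in Python contrasting the two styles; the endpoints, the GetUser operation, and the user ID are hypothetical, not a real service.

```python
import requests

# SOAP: the operation and its arguments travel inside an XML envelope,
# always POSTed to a single service endpoint.
soap_body = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetUser xmlns="http://example.com/users">
      <UserId>42</UserId>
    </GetUser>
  </soap:Body>
</soap:Envelope>"""
soap_response = requests.post(
    "http://example.com/UserService",
    data=soap_body,
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "http://example.com/users/GetUser"})

# REST: the resource is identified by the URL and the operation by the
# HTTP verb; the payload is typically JSON.
rest_response = requests.get("http://example.com/users/42")
user = rest_response.json()
```

Note how the REST request needs no envelope or tooling: the URL, the verb, and the JSON payload carry everything, which is why ordinary HTTP proxies can intercept it.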

Tuesday, January 28, 2014

In this post, I wanted to start on a book about Splunk, but I will have to defer that to the next few posts. Instead I want to add a few more items on AWS, specifically Amazon EC2 instance store volumes. These are called ephemeral drives and provide block-level storage. The blocks are preconfigured and pre-attached to the same physical server that hosts the EC2 instance. The instance type determines the size of the storage. Some instance types, such as the micro instance, may have no instance storage at all, in which case Amazon EBS storage is used instead.
HI1 instances offer very fast solid-state drives; HS1 instances, in contrast, are optimized for very high storage density and sequential I/O.
Local instance store volumes are ideal for temporary storage of information or for data that is continually changing, such as buffers, caches, scratch data, and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of servers.
The instance owns the data and the storage, unlike EBS, where volumes can be attached and detached. High I/O instances typically use instance store volumes backed by SSD and are ideally suited to many high-performance database workloads such as Cassandra and MongoDB. Data warehouses, Hadoop storage nodes, and cluster file systems involve a great deal of high sequential I/O; these are better served by high storage instances. EC2 instance storage has the same performance characteristics as Amazon EBS volumes, RAID striping is supported, and the bandwidth to the disks is not limited by the network.
Unlike EBS volume data, data on instance store volumes persists only for the lifetime of the instance. For durability, the main concern is persistence across instance lifecycle events: instance store data survives a reboot but not a stop or termination. The cost of the Amazon EC2 instance includes any local instance store volumes.
Storage volumes are fixed and defined by the instance type, so scaling the number of store volumes is not an option; however, the instance and its data can be replicated multiple times to scale out. In short, local instance store volumes are tied to a particular Amazon EC2 instance and are fixed in number and size.
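As an illustration of that fixed, instance-type-defined mapping, here is a minimal sketch using the boto3 Python SDK to request an ephemeral volume at launch; the AMI ID and device name are placeholders, and the available ephemeral volumes vary by instance type.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-12345678",           # placeholder AMI
    InstanceType="m3.medium",         # the type fixes number and size of ephemeral volumes
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sdb",
        "VirtualName": "ephemeral0",  # map the first instance store volume
    }],
)
print(response["Instances"][0]["InstanceId"])
```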
I'm going to pick up a book on Splunk next, but first I will complete the discussion of AWS and Amazon S3 (Simple Storage Service). We have talked about usage patterns, performance, durability, and availability. We will look at cost model, scalability and elasticity, interfaces, and anti-patterns next. Then we will review Amazon Glacier.
S3 supports a virtually unlimited number of objects in a bucket. Unlike a disk drive, which restricts the size of the data before partitioning, S3 can store an unlimited number of bytes. Objects are stored in a single bucket, and S3 will scale and distribute redundant copies of the information.
In terms of interfaces, standard REST and SOAP-based interfaces are provided; these support both management and data operations. Objects are stored in buckets, and each object has a unique key. Objects are addressed over the web rather than through a file system, although key naming can give a file-system-like hierarchy. SDKs that wrap these interfaces are available for most popular programming languages.
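A minimal sketch of the data operations through the Python SDK (boto3) follows; the bucket name and key are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Write an object under a unique key, then read it back.
s3.put_object(Bucket="my-example-bucket",
              Key="reports/2014/january.txt",
              Body=b"hello S3")
obj = s3.get_object(Bucket="my-example-bucket",
                    Key="reports/2014/january.txt")
print(obj["Body"].read())
```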
There are cases where S3 is not the right choice: it is not a standalone file system; it cannot be queried to retrieve a specific object unless you know the bucket name and key; it does not suit rapidly changing data; and it is not a backup or archival store. While it is ideal for websites, it is used to store the static content, with the dynamic content served from EC2.
Amazon Glacier is an extremely low-cost storage service for backup and archival. Customers can reliably store their data for as little as 1 cent per gigabyte per month. You store data in Amazon Glacier as archives. Archives are limited to 4 TB each, but there is no limit on their number.
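A minimal sketch of storing an archive through the Python SDK (boto3); the vault name and archive contents are hypothetical.

```python
import boto3

glacier = boto3.client("glacier")
result = glacier.upload_archive(vaultName="my-backup-vault",
                                archiveDescription="january logs",
                                body=b"...archive bytes...")
# Keep the archive ID: it is required later to retrieve or delete the archive.
print(result["archiveId"])
```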

Monday, January 27, 2014

In this post, we continue our discussion of AWS, particularly the storage options. As we have seen, AWS is a flexible, cost-effective, easy-to-use cloud computing platform. The different choices for storage are:
memory-based storage - such as file caches, object caches, in-memory databases, and RAM.
message queues - temporary durable storage for data sent asynchronously between computer systems or application components.
storage area networks (SAN) - virtual disk LUNs often provide the highest level of disk performance and durability.
direct-attached storage - local hard disk drives or arrays residing in each server provide higher performance than a SAN.
network-attached storage (NAS) - a file-level interface that can be shared across multiple systems.
databases - where structured data resides, whether a traditional relational database, a NoSQL non-relational database, or a data warehouse.
backup and archive - including non-disk media such as tapes or optical media.

These options differ in performance, durability, and cost, as well as in their interfaces. Architects consider all these factors when making choices, and sometimes the combinations form a hierarchy of data tiers. Amazon Simple Storage Service (S3) is storage for the internet. It can store any amount of data, at any time, from within the compute cloud or from anywhere on the web. Writing, reading, and deleting objects of any size is possible, and it is highly scalable, allowing concurrent read and write access. Amazon S3 is commonly used for financial transactions, clickstream analytics, and media transcoding.

The performance of Amazon S3 accessed from Amazon EC2 in the same region is fast. It is also built to scale storage, requests, and users. To speed access to the relevant data, Amazon S3 is often paired with a database such as DynamoDB or Amazon RDS: Amazon S3 stores the actual information, while the database serves as the repository for metadata. The metadata can be easily indexed and queried, so a query locates an object's reference, which is then used to fetch the object itself.
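A hedged sketch of this pairing, with hypothetical bucket, table, and attribute names: the object body lives in S3 while the queryable metadata lives in DynamoDB.

```python
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("media-metadata")

# Store the object in S3, then index its key and attributes in DynamoDB.
s3.put_object(Bucket="media-bucket", Key="videos/clip-001.mp4", Body=b"...")
table.put_item(Item={"media_id": "clip-001",
                     "s3_key": "videos/clip-001.mp4",
                     "duration_sec": 34,
                     "codec": "h264"})

# Later: query the metadata to locate the object's reference, then fetch it.
item = table.get_item(Key={"media_id": "clip-001"})["Item"]
video = s3.get_object(Bucket="media-bucket", Key=item["s3_key"])
```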

Durability is guaranteed via automatic, synchronous saving of data across multiple devices and multiple facilities; S3 is designed for 99.999999999% durability of objects. Availability for mission-critical data is designed for 99.99%, so downtime is minuscule. For non-critical data, reduced redundancy storage (RRS) in Amazon S3 can be used.
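A minimal sketch of opting a non-critical object into reduced redundancy storage via the Python SDK (boto3); the bucket and key are hypothetical.

```python
import boto3

s3 = boto3.client("s3")
# Thumbnails can be regenerated from the originals, so RRS is acceptable here.
s3.put_object(Bucket="my-example-bucket",
              Key="thumbnails/clip-001.jpg",
              Body=b"...jpeg bytes...",
              StorageClass="REDUCED_REDUNDANCY")
```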

Sunday, January 26, 2014

I tested my KLD clusterer with real data. There are some interesting observations I wanted to share.
First let's consider a frequency distribution of words from some text as follows:
categories, 4
clustering, 3
data, 2
attribute, 1

Next let's consider their co-occurrences as follows:
categories clustering 1
data categories 1
data clustering 2

Since the terms 'categories' and 'clustering' occur together and with other terms, they should belong to the same cluster. The clustering is based on the KLD divergence measure: each measure falls into one of several K-means clusters chosen over the ranges of that measure.

Furthermore, the term 'attribute' occurs only once and does not co-occur with anything. Therefore it must differ from the terms 'categories' and 'clustering' and should fall in a cluster different from theirs.

The cluster for the term 'data' is ambiguous. If the K-means clustering chose, say, only two partitions, it would be merged with one of the clusters. If the K-means clustering had many partitions, it could end up in its own cluster.

Items are clustered together based on their KLD measure: pairs of terms with similar KLD measures should end up in the same cluster.
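The sketch below is one way to realize this approach in Python; it is an illustration under my own assumptions, not the exact clusterer from these experiments. Each term gets a smoothed co-occurrence distribution, each pair of terms gets a symmetric KL divergence, and a 1-D K-means over those divergence values would then bucket the pairs into clusters by range.

```python
import math

# Term frequencies and co-occurrence counts from the post above.
freq = {"categories": 4, "clustering": 3, "data": 2, "attribute": 1}
cooccur = {("categories", "clustering"): 1,
           ("data", "categories"): 1,
           ("data", "clustering"): 2}

terms = sorted(freq)

def distribution(term, smoothing=1.0):
    """Smoothed probability of seeing each other term alongside `term`."""
    counts = {t: smoothing for t in terms}
    for (a, b), c in cooccur.items():
        if a == term:
            counts[b] += c
        elif b == term:
            counts[a] += c
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def sym_kld(p, q):
    """Symmetrized Kullback-Leibler divergence between two distributions."""
    kl_pq = sum(p[t] * math.log(p[t] / q[t]) for t in terms)
    kl_qp = sum(q[t] * math.log(q[t] / p[t]) for t in terms)
    return kl_pq + kl_qp

dists = {t: distribution(t) for t in terms}
measures = {(a, b): sym_kld(dists[a], dists[b])
            for i, a in enumerate(terms) for b in terms[i + 1:]}

# Print the pairwise measures; a 1-D K-means over these values would
# then bucket the pairs into clusters by divergence range.
for pair, m in sorted(measures.items(), key=lambda kv: kv[1]):
    print(pair, round(m, 3))
```

The smoothing constant matters: how zero co-occurrences are smoothed strongly affects which pairs look close, which is consistent with the ambiguity noted above for the term 'data'.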
In this post, we will continue the discussion on AWS. The AWS cloud services platform consists of Database; Storage and CDN; Cross-Service; Analytics; Compute and Networking; Deployment and Management; and Application Services.
The AWS global physical infrastructure consists of geographical regions, availability zones and edge locations.
The databases offered by AWS include DynamoDB, a predictable and scalable NoSQL store; ElastiCache, an in-memory cache; RDS, a managed relational database; and Redshift, a managed petabyte-scale data warehouse. Storage and CDN consist of S3, scalable storage in the cloud; EBS, a network-attached block device; CloudFront, a global content delivery network; Glacier, archive storage in the cloud; Storage Gateway, which integrates on-premises IT with cloud storage; and Import/Export, which ships large datasets.
The cross-service offerings include Support; Marketplace, where you can buy and sell software apps; the Management Console, a UI to manage AWS services; and SDKs, IDE kits, and CLIs. Analytics includes Elastic MapReduce, a popular managed Hadoop framework; Kinesis, real-time data processing; and Data Pipeline, orchestration for data-driven workflows.
Compute and networking include EC2, the most popular way to access a large array of virtual servers directly, including remote desktop with admin access; VPC, an isolated virtual private cloud for your resources; ELB, a load-balancing service; WorkSpaces, a virtual desktop in the cloud; Auto Scaling, which automatically scales capacity up and down; Direct Connect, a dedicated network connection to AWS; and Route 53, a scalable domain name system.
Deployment and management services include CloudFormation, templated AWS resource creation; CloudWatch, resource and application monitoring; Elastic Beanstalk, an AWS application container; IAM, secure AWS access control; CloudTrail, user activity logging; OpsWorks, a DevOps application management service; and CloudHSM, hardware-based key storage for compliance.
In this post, we review a white paper on AWS from their website.
AWS addresses IT infrastructure needs. Applications have evolved from desktop-centric installations to client/server models, to loosely coupled web services, and then to service-oriented applications. Each step increases the scope and dimension of the infrastructure. Cloud computing builds on many of these advances, such as virtualization and failover. As Gartner puts it, cloud computing delivers scalable and elastic IT-enabled capabilities as a service to external customers using internet technologies.
The capabilities include compute power, storage, databases, messaging and other building block services that run business applications.
When coupled with a utility-style pricing and business model, cloud computing delivers an enterprise-grade IT infrastructure in a reliable, timely, and cost-effective manner.
Cloud computing is about outsourcing infrastructure services while keeping them decentralized. Development teams can access compute and storage resources on demand. Using AWS, you can request compute power, storage, and services in minutes.
AWS is known to be
Flexible - different programming models, operating systems, databases, and architectures can be enabled. Application developers don't have to learn new skills, and SOA-based solutions with heterogeneous implementations can be built.
Cost-effective - with AWS, organizations pay only for what they use. Costs such as power, cooling, real estate, and staff are lifted from organizations. There is no up-front investment or long-term commitment, and ongoing spend is minimal.
Scalable and elastic - organizations can quickly add and subtract AWS resources to meet customer demand and manage costs. A spike in traffic, or several, can be handled without hampering normal business operations.
Secure - AWS provides end-to-end security and end-to-end privacy. The confidentiality, integrity, and availability of your data are of the utmost importance to AWS, as is maintaining customer trust and confidence. AWS takes the following approaches to secure the cloud infrastructure:
certifications and accreditations - including those in the public-sector realm.
physical security - AWS has many years of experience designing, constructing, and operating large-scale data centers.
secure services - unauthorized access or usage is restricted without sacrificing the flexibility that customers demand.
privacy - personal and business data can be encrypted in the AWS cloud.
Experienced - when using AWS, organizations can leverage Amazon's years of experience in this field.
AWS provides a low-friction path to cloud computing, and scaling on demand is an advantage.