Tuesday, February 10, 2015

Today we are going to read from the WRL research report "Why Aren't Operating Systems Getting Faster As Fast As Hardware?" The WRL research laboratory tests its ideas by designing, building, and using real systems. This paper was written in 1989. It evaluated several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The benchmarks are mostly micro-benchmarks in the sense that each one measures a particular hardware or operating system feature. Such benchmarks can suggest the strengths and weaknesses of a system but cannot by themselves indicate how good a system is. However, one benchmark was used to assess the overall speed of the operating system by exercising a variety of file system features. The author finds that the different hardware platforms show significant speedup but that the operating system performance of RISC workstations does not keep up. The benchmarks used in this study suggested two possible factors: one is memory bandwidth, which does not seem to scale with processor speed, and the second is the file system, some operations of which require synchronous disk I/O in common situations. Here the processor gets faster but the disks don't.

The hardware platforms used for the benchmarks are listed with an abbreviation for each platform, an approximate MIPS rating, and an indication of whether the machine is based on a RISC or CISC processor. The hardware used was the MIPS M2000 (M2000), DECstation 3100 (DS3100), Sun-4/280 (Sun4), VAX 8800 (8800), Sun-3/75 (Sun3) and MicroVAX II (MVAX2). The first three are RISC machines and the last three are CISC machines. The operating systems used were Ultrix, SunOS, RISC/os and Sprite. Ultrix and SunOS are the DEC and Sun derivatives of Berkeley's 4.2 BSD Unix. RISC/os was developed for the M2000, and Sprite was an experimental operating system at UC Berkeley. The differences between SunOS 4.0 and 3.5, both used with the Sun machines, included a major restructuring of the virtual memory system and file system.

With these operating systems and hardware, a set of benchmarks was studied, the first of which was kernel entry/exit. This measures the cost of entering and leaving the operating system kernel. It does this by repeatedly invoking the getpid kernel call, which returns the caller's process identifier. The average time for this benchmark ranged from 18 microseconds on the M2000 to 207 microseconds on the MVAX2. The Sun machines also showed higher average times than the DS3100 and 8800. The hardware performance was also expressed as a number relative to the MIPS ratings, computed by taking the ratio of the Sun3 time to the particular machine's time and dividing that by the ratio of the machine's MIPS rating to the Sun3's MIPS rating.
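Out of curiosity, the same idea is easy to sketch today. Here is a minimal Python illustration (my own, not the paper's benchmark code) that times a tight loop of getpid calls and divides by the iteration count; note that the interpreter adds call overhead and some C libraries cache getpid, so the numbers only approximate the true kernel entry/exit cost.

import os
import time

def time_getpid(iterations=1000000):
    # Estimate the average round-trip cost of the getpid kernel call.
    start = time.perf_counter()
    for _ in range(iterations):
        os.getpid()   # enters and leaves the kernel on each call (unless cached by libc)
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6   # microseconds per call

print("average getpid time: %.2f microseconds" % time_getpid())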

Monday, February 9, 2015

Today I would like to post about Django applications and SSO integration. In SSO integration, there is usually a SAML exchange between an identity provider (the federated IdP) and a service provider (the Django app). The SAML assertions are authentication assertions and attribute assertions. There is usually a protocol (such as SOAP over HTTP) and a binding (how messages are exchanged) that define the way SAML asks for and gets assertions. In the Django implementation, there are a few things we need to configure to get it to talk to the IdP.
First it must fetch the IdP's metadata configuration. This is usually an XML document.
The SP must implement both metadata URLs and SSO login URLs. The former describes the service provider's SAML configuration, and the latter implements the callback/relay that allows the login to complete.
The Django app implements two methods in its view - one to call auth.login() and the other to handle the callback from Okta.
Note, however, that a Django app implemented this way handles SSO but not OAuth.
By the way, the integration of SSO with a Django application is something I learned from a colleague at my workplace.
Typically OAuth is for APIs and SSO is for portals.
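Here is a rough sketch of the two views described above for a Django app. The saml_client() and get_or_create_user() helpers are hypothetical stand-ins for whatever SAML library is actually used; only django.contrib.auth.login and the redirect plumbing are real Django APIs.

from django.contrib.auth import login
from django.http import HttpResponseRedirect
from django.views.decorators.csrf import csrf_exempt

from .saml import saml_client, get_or_create_user   # hypothetical helpers around the SAML library

def sso_login(request):
    # Send the browser to the IdP with an authentication request.
    redirect_url = saml_client().prepare_login_request(relay_state=request.GET.get("next", "/"))
    return HttpResponseRedirect(redirect_url)

@csrf_exempt
def sso_callback(request):
    # The IdP (Okta) posts the SAML assertion back to this SSO login / callback URL.
    assertion = saml_client().parse_assertion(request.POST["SAMLResponse"])
    user = get_or_create_user(assertion.subject, assertion.attributes)
    login(request, user)   # the auth.login() call mentioned above
    return HttpResponseRedirect(request.POST.get("RelayState", "/"))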
#codingexercise
Double GetOddNumberRangeSumPower(Double[] A, int n, int m)
{
    if (A == null) return 0;
    return A.OddNumberRangeSumPower(n, m);
}
We now look at an API gateway. This is both a central authenticator and an aggregator. It is typically deployed in a demilitarized zone. There are different layers and different places at which the gateway can apply. For example, it can apply to different protocols such as FTP, POP, and HTTP, and it can be applied to a web stack hosted locally or by a third party. An HTTP proxy is a typical example of an API gateway for a REST-based web stack. An API gateway can be inside a CDN, filtering traffic before it reaches the servers. It can be a proxy hosted by a third party before the traffic reaches the web service. It can be a dedicated machine in our own cloud or infrastructure. It can even be within the application stack if the API gateway is a module that can be invoked within the application.
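As a toy illustration of the last case - a gateway sitting in front of a REST web stack - here is a minimal authenticating HTTP proxy in Python. The upstream address, header name and API keys are made up for the example; a real gateway would add rate limiting, aggregation, and support for more verbs.

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

UPSTREAM = "http://localhost:8000"    # assumed backend web stack
VALID_KEYS = {"example-api-key"}      # illustrative API keys

class GatewayHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Central authentication: reject requests without a valid API key.
        if self.headers.get("X-Api-Key") not in VALID_KEYS:
            self.send_error(401, "missing or invalid API key")
            return
        # Forward the request to the upstream service and relay its response.
        with urlopen(Request(UPSTREAM + self.path)) as upstream:
            body = upstream.read()
            status = upstream.status
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("0.0.0.0", 8080), GatewayHandler).serve_forever()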

Wednesday, February 4, 2015

Today we read the paper on the Gold project by Barbara, Molina and Mehrotra. The purpose of this project was to develop a set of tools that interoperate with unstructured and semi-structured data and services. It is mailer software that allows users to send and receive messages using different media (e-mail, faxes, phones), efficiently store and retrieve these messages, and access a variety of sources of other useful information. It addresses the problems of information overload, organization of messages, and multiple interfaces. The mailer doesn't provide relational-like query facilities, but it does provide a database system interface, which means querying is convenient from a programmability standpoint. E-mail was chosen as the target for these tools because it is ubiquitous and provides a convenient form of data. Moreover, messages need to be stored or filed. Most mailers have primitive filing facilities, and stored messages are seldom found again. Because of their sheer volume, variations in size and number, and organization, e-mails are often not easily retrievable. This paper was written prior to the indexing and full-text search that are now common in operating systems and mailing software.

At the same time, this software uses a mail database. Each stored message is viewed as a tuple with fixed fields and treated as a structured and semi-structured sequence of words. The paper mentions that the unstructured model is more appropriate for message searches and that it can also encompass a wide variety of message formats, even unknown formats. They also improved search by introducing operations that are different from SQL and relations. Another goal was to allow the use of existing interfaces for electronic mail. The project also provided a graphical user interface, but the added convenience was not limited to search and UI. By abstracting the messages and putting them in a store, a user could now search based on words, and the results would appear as a set of virtual messages. When the user responds to any of these virtual cards, the corresponding e-mail address is looked up and the mail is dispatched by the mailer. The mailer also provides a flexible model for the stored objects. This is another reason why they favor the unstructured approach over looking messages up based on customary fields. The paper details the indexing issues encountered in designing the mailer.
#codingexercise
Double GetAlternateOddNumberRangeSumPower(Double[] A, int n, int m)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeSumPower(n, m);
}
We continue our discussion on the mailer. The architecture of the mailer is described as follows:
Users interact with a GUI that issues commands to read new mail messages and retrieve others from the database. When a user decides to store a new message, the user interface passes it to a preprocessor, which breaks it up into tokens and gives it to the index engine. The preprocessor can also receive messages directly for parsing and subsequent indexing. It produces a canonical file that the index engine uses to create the appropriate data structures. Messages are stored in an object store or database. The user interface also submits queries to the index engine, which returns a list of matching file identifiers; the user interface then retrieves the objects for display. The index engine can receive queries and storage requests from many different user interfaces and hence implements concurrency control. The message that is indexed has headers and a body. The user may see it as a bag of words, or alternatively may want to view the message as a semistructured object with separate fields. This distinction enables different types of queries: the first allows queries involving relative position anywhere in the message, and the second restricts matches to occurrences within the same segments.
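A small sketch of the preprocessor stage, assuming Python and the standard email module; the canonical record format (keyword, document id, position, header) follows the description above, but the function names and field choices are my own, not the paper's.

import re
from email import message_from_string

def preprocess(raw_message, doc_id):
    # Break a mail message into (keyword, doc_id, position, header) records,
    # the canonical form handed to the index engine.
    msg = message_from_string(raw_message)
    records = []
    position = 0
    # Tokenize selected header fields, remembering which header each token came from.
    for header in ("From", "To", "Subject"):
        for word in re.findall(r"[A-Za-z0-9']+", msg.get(header, "")):
            records.append((word.lower(), doc_id, position, header.lower()))
            position += 1
    # Tokenize the body; header is None for body words.
    body = msg.get_payload() if not msg.is_multipart() else ""
    for word in re.findall(r"[A-Za-z0-9']+", body):
        records.append((word.lower(), doc_id, position, None))
        position += 1
    return records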
#codingexercise
Double GetAlternateOddNumberRangeProductPower(Double[] A, int n, int m)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeProductPower(n, m);
}

We will continue our discussion with the challenges encountered during indexing. We noted that the user interface allows messages to flow to the preprocessor, which tokenizes the words in the message. In addition, news feeds can flow directly into the preprocessor. The preprocessor then gives these messages to the index engine, which interacts with the user interface for querying and retrieving based on the user's searches. The index engine also maintains its index structures so that querying and retrieval can be efficient. In addition, the front end has access to an object store. The messages themselves consist of a header and a body. The Gold mailer handles structured fragments within a document as well as treating the document as a bag of words. The index engine maintains attributes for each keyword encountered. These are represented in a relational table as keyword, document_id, position and header. The document_id points to the indexed object, while the position points to where the keyword occurs within the document. The header attribute references the header, if present. This table is used only by the indexer; the query engine has no knowledge of this internal data structure. The values in any cell of this table of attributes don't have to be discrete or atomic, which comes in useful to the indexer. At the same time, the query engine can differentiate between words by their position as hinted in the query. For example, if the engine sees subject, lunch and time as three words to be searched where they must appear consecutively, it can do so by setting the positional difference between the words to zero. This differentiates this particular query from other queries where the three words may co-occur at different positions in the document.
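To make the table and the positional query concrete, here is a toy index engine in Python. It is only an illustration of the idea - a per-keyword table of (document_id, position, header) entries and a consecutive-words lookup - and not the Gold mailer's actual data structures.

from collections import defaultdict

class IndexEngine:
    def __init__(self):
        # keyword -> list of (document_id, position, header) entries
        self.table = defaultdict(list)

    def insert(self, records):
        for keyword, doc_id, position, header in records:
            self.table[keyword].append((doc_id, position, header))

    def consecutive(self, words):
        # Return document ids where the given words appear at consecutive positions,
        # i.e. the zero-positional-difference case described above.
        matches = set()
        for doc_id, start, _ in self.table.get(words[0], []):
            if all(any(d == doc_id and p == start + offset
                       for d, p, _ in self.table.get(word, []))
                   for offset, word in enumerate(words[1:], start=1)):
                matches.add(doc_id)
        return matches

For example, after inserting the preprocessor's records, engine.consecutive(["lunch", "time"]) returns only the documents in which lunch is immediately followed by time, rather than any document where the two words merely co-occur.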
#codingexercise
Double GetAlternateOddNumberRangeProductCube(Double[] A, int n)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeProductCube(n);
}

Furthermore, the query engine can select from the table mentioned above, where the header field is subject, to look for the occurrence of the three keywords. A mail message can be considered a collection of labeled fragments, and the preprocessor will report the word subject as the label for the words and provide byte offsets of the words within the fragments. Whether or not the structured fields are used, the indexing engine works the same way. The index engine stores a variety of documents, and a user may be interested only in documents of a specific type.

The Gold Mailer supports the feature of message annotation. When a new message is to be stored, the user interface allows the user to enter additional words for indexing. For instance, the user may want to associate the words budget and sales with a message even though these two words don't appear in the message itself. The user interface appends these added keywords to the end of the message, and the index engine does not treat them differently. Mailers usually provide the capability of defining folders, and messages flow into folders for organization. The Gold mailer uses annotation to implement folders. This means that a virtual folder can also be created via message annotations, but it is not the same as an actual folder. For instance, querying the system for all messages that contain the keyword database retrieves messages that are annotated with the keyword as well as those that contain the keyword somewhere in the text. Folders and annotations give the user enough flexibility to organize their mail.
Next we look at their usage. Users may choose to redirect their e-mails into folders that are classified appropriately. This gives them the flexibility to look for e-mails in narrower domains, but it is merely a convenience for the user. The annotations, on the other hand, are used as labels, which allows both the system and the user to search.
Not all messages have the same format. The Gold mailer deals with messages that come from fax and scans. These are run through an optical character recognition server. This software, with the help of an online dictionary, recognizes as many words from the electronic document as possible and creates a file with those words. Even a handful of recognized keywords are enough for indexing. The keyword file and the image are both stored in a fax directory owned by the addressee. The same folder can be shared. The Gold mailer also sends out faxes using the same media.
Messages can be composed and received using both text and images and through a variety of media.
The query language was developed with a simple and flexible message model. The goal was to keep queries easy to specify. For example, the command mail friend Princeton would search the database for objects containing the words friend and Princeton.
The Gold mailer expects most queries not to involve proximity searches. If proximity is required, it can be specified by grouping together the terms involved. The user interface allows the user to incorporate new messages, send messages using the information stored in address cards, edit address cards, and pose queries to the database to retrieve messages.

Tomorrow we wrap up our discussion on this project. 

The information stored in the index engine is viewed, as mentioned, as a tuple of keyword, document identifier, position and header. The index engine supports the following three operations:
insert document : this inserts all the records pertaining to a specific document 
delete document : this deletes all the records pertaining to a specific document
retrieve query : this command instructs the engine to find all documents that satisfy the query and return the list of names to the mailer. 
This structure and operations enable efficient evaluation of queries which is critical to the mailer.

The index engine cannot be treated as a warehouse where insertions are batched at low-traffic times and deletions are never done. Rather, the mailer makes newly inserted documents available immediately. Index reconstruction is considered maintenance and is not part of the normal workflow.

The file organization required for the mailer depends on the types of queries and their space requirements. The queries are keyword-based lookups and don't involve ranges between keywords. Therefore the keywords are indexed and the records are stored in a hashed file. For each keyword, a list of positions where the keyword appears is maintained, resulting in efficient disk access, as opposed to keeping one record per position. If this reminds us of big data principles, it would behoove us to note that this paper was published in 1993.

The records corresponding to a document in a hashed file are chained together, forming a list referred to as the next-key list for the document. This makes deletions easier.

When records are inserted, they are not placed directly in the hashed files, because the number of keywords in a document can be arbitrary. Instead, an overflow list is maintained, one per keyword, in a block or more of main memory, and the records are inserted at the tail. The records of a document therefore appear contiguously in these lists. Compare this to the buckets, where the records are kept contiguous by keyword rather than by document. When documents are deleted, the records are not deleted one by one; instead, whole blocks can be returned. This way the mailer avoids having to request n blocks for n keywords.
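A rough in-memory model of the overflow lists: one chain of blocks per keyword, tail insertion, and whole-block reclamation on deletion. This is a simplified stand-in for the paper's on-disk organization; the block size and the record layout (document id first) are assumptions of the sketch.

from collections import defaultdict

BLOCK_SIZE = 4   # records per block; illustrative only

class OverflowLists:
    def __init__(self):
        self.lists = defaultdict(lambda: [[]])   # keyword -> list of blocks

    def insert(self, keyword, record):
        # record is assumed to be (doc_id, position, header); new records go to the tail.
        blocks = self.lists[keyword]
        if len(blocks[-1]) == BLOCK_SIZE:   # tail block is full, start a new one
            blocks.append([])
        blocks[-1].append(record)

    def delete_document(self, doc_id):
        # Return whole blocks rather than deleting records one by one; in this toy
        # version a block is reclaimed only if every record in it belongs to the document.
        for keyword in list(self.lists):
            kept = [b for b in self.lists[keyword] if not all(r[0] == doc_id for r in b)]
            self.lists[keyword] = kept or [[]]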

Concurrency control has to accommodate the fact that the insert and delete operations may not only be user-invoked but may also be background tasks. This implies that operations cannot simply be run sequentially, because a user operation might then have to wait until the background job completes. Therefore locks are introduced, and they are taken at page level. Failures can also occur in processes, and, as with concurrency, we cannot roll back all operations, since background jobs might lose a substantial amount of work. This is overcome by restarting the jobs from where they were prior to the failure. Logging is supported both for physical page-level operations and for logical inserts and deletes.


This system shows how to organize mail messages in a way that makes retrieval efficient, as opposed to the user having to browse previously created folders. As an extension, it could be made to index not only mail messages but also library catalogs and databases. The mailer software was also intended to investigate improvements in accessing data from other information sources such as file systems.
Today I will continue to discuss the AWS security mechanisms. We were discussing server-side encryption. An S3 object can be passed to the server with a per-object key over SSL and encrypted on the server side before being put into the S3 bucket. Client-side encryption is also possible, so long as the client maintains its own keys and encrypts objects without sending or sharing the keys.
Even data maintained on file systems or databases can be encrypted. Both databases and file systems can provide transparent encryption. Whole-disk or file-level encryption is possible. With AWS, volumes can be encrypted too, and snapshots of encrypted volumes can also be taken. Optionally, S3 bucket access can be logged and the access logs collected. Hardware security modules (HSMs) in the virtual private cloud let us manage the AWS-provisioned HSM, and customers can be single tenants in these VPCs. If we are deploying a service, we can isolate our service and simplify security groups and forensics. We can also connect two VPCs in the same region, bridged by a routing table. Services that have APIs can also use logging. Logging can be enabled with CloudTrail. CloudTrail can monitor who made an API call, when the call was made, what the API call was, and what resources were acted upon. CloudTrail event collection can be configured in JSON, just like a policy. CloudTrail partners with some of the major log reading, parsing and indexing vendors.

To summarize the security recommendations: we should turn on multi-factor authentication for our root account, create IAM users, groups and policies, never use the root account API keys, and scope-limit our policies. In addition, if there are any geographical requirements, such as the whitepapers on auditing, logging, risk, compliance, and security for, say, Australia, then those should be followed as well. Lastly, IAM groups and security are one thing, and a user managing their resources is another. The tools available to the user are to chain their resources so that different accounts can have access, or to enable signed provisioning of their resources for granting access to specific people outside the users and groups. In the latter case, the user is not really sharing the access keys and at the same time is not granting public read or write access. This method is also favored in cases where an S3 resource needs to be downloadable via HTTP. In addition, signed access can be time-constrained, so that access is revoked on expiration of the specified duration. This is convenient for revocation and avoids having to do bookkeeping of the recipients.
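For the signed, time-constrained access mentioned above, here is what a presigned URL looks like with the Python SDK (boto3); the bucket and key are placeholders, and elsewhere in these posts the PHP SDK is used instead.

import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-reports-bucket", "Key": "q1/budget.pdf"},
    ExpiresIn=3600,   # the link stops working after an hour, so access revokes itself
)
print(url)   # share this URL; no access keys are disclosed and the object stays private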

Tuesday, February 3, 2015

Today we will discuss AWS security as read from the slides of James Bromberger at the AWS Summit. In it he describes not only the security considerations for users of AWS but also what AWS itself undertakes. He says that AWS's primary concern is compliance, while the audience's is account management, service isolation, visibility and auditing. AWS concerns itself with securing the physical assets such as facilities, infrastructure, network, and virtualization. The customer secures the operating system, applications, security groups, the OS firewall, network configuration and account management. AWS compliance information is actually available for everyone to see on their website; it showcases certifications and approving industry organizations. On the other hand, customers have primarily been interested in securing accounts - which in a way are the keys to the kingdom. Identity and Access Management (IAM) consequently enables users and groups, unique security credentials, temporary security credentials, policies and permissions, roles, as well as multi-factor authentication (MFA). Recommendations for account security include securing our master account with MFA, creating an IAM group for our admin team, creating IAM users for our admin staff as members of our admin group, and even turning on MFA for these users. Enhanced password management with expiry, reuse checks, and change on next log-in is also available.
Next we look at temporary credentials. These are used for running an application. We remove hardcoded credentials from scripts and config files, create an IAM role and assign a restricted policy, then launch the instance into that role. The AWS SDKs transparently obtain temporary credentials.
IAM policies are applied with least privilege. This means the resources will be qualified and the actions will grant privileges incrementally. Policies can have conditions, which can restrict access even further, and this is good. AWS has a policy generator tool which can generate policies, and there's even a policy simulator tool that can be used to test them.
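As a sketch of what a scoped-down policy with a condition might look like, here is a boto3 example; the policy name, bucket and IP range are invented for the illustration.

import json
import boto3

# A least-privilege policy: one bucket, read-only actions, plus a condition
# restricting access to a single address range.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::my-reports-bucket",
                     "arn:aws:s3:::my-reports-bucket/*"],
        "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}}
    }]
}

iam = boto3.client("iam")
iam.create_policy(PolicyName="reports-read-only",
                  PolicyDocument=json.dumps(policy_document))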
Another set of credentials we use is the access key and secret key. These are used to sign requests and to make API calls. They are generally issued once; if they are needed again, a new set is issued. SSL is required for all traffic because data is exchanged and we want it encrypted. Even database connections and data transfers over them are to be encrypted.
In addition, AWS provides server-side encryption using 256-bit AES that is transparent to customers. Keys are generated, encrypted and then stored using a master key, and the generated key is used to encrypt the data.
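Requesting that transparent server-side encryption on upload is a one-line addition to the put call; a boto3 sketch with placeholder names follows.

import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-reports-bucket",
    Key="q1/budget.pdf",
    Body=open("budget.pdf", "rb"),
    ServerSideEncryption="AES256",   # S3 generates and manages the data key
)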

Sunday, February 1, 2015

Command line tools for object storage such as s3cmd and awscli provide almost all the functionality required to interact with objects and storage. However, using an SDK enables integration with different types of user interfaces. For example, if we want to get ACLs, we use:
  public function get($bucket, $key = null, $accessKey = null, $accessSecret = null){
      if ($bucket == null) return array('Grants' => array());
      $client = $this->getInstance($accessKey, $accessSecret);
      if ($key != null){
          // Object-level ACL when a key is given
          $acls = $client->getObjectAcl(array(
              'Bucket' => $bucket,
              'Key' => $key));
          if ($acls == null) $acls = array('Grants' => array());
          return $acls;
      }
      else {
          // Bucket-level ACL otherwise
          $acls = $client->getBucketAcl(array(
              'Bucket' => $bucket));
          if ($acls == null) $acls = array('Grants' => array());
          return $acls;
      }
  }
If we want to set the ACLs we could use:
 // 'private', 'public-read', 'project-private', 'public-read-write', 'authenticated-read', 'bucket-owner-read', 'bucket-owner-full-control'
  public function set($bucket, $key, $acl, $accessKey=null, $accessSecret=null){
    $client = $this->getInstance($accessKey, $accessSecret);
    $result = $client->putObjectAcl(array(
    'ACL'    => $acl,
    'Bucket' => $bucket,
    'Key'    => $key,
    'Body'   => '{}'
    ));
    return $result;

  }

#codingexercise
Double GetAlternateEvenNumberRangeSumPower(Double[] A, int n, int m)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSumPower(n, m);
}
Today we cover object storage vs block storage.
The rate of adoption for object storage can be very exciting both for an IT admin and for a user. Here are some use cases where it may prove more beneficial than, say, block storage:
  1. If you have static content of varying sizes and you cannot label the workload with a category other than miscellaneous, then you can use object storage. It lets you add metadata to each piece of content, now treated as an object with some context around the data. It doesn't split files into raw blocks of data; the entire content is treated as an object. Whether it's a few photos, uploaded music videos, backup files or just other data from your PC, they can now be archived and retrieved at will.
  2. No matter how many objects there are or what their sizes, each object can be uniquely and efficiently retrieved by its ID.
  3. When the basket of miscellaneous objects grows to a few hundred terabytes or even a few petabytes, storage systems that rely on adding block storage cannot keep up. Object storage does not require you to mount drives, manage volumes or remap volumes. Besides, objects can be stored as multiple copies of the data, which improves availability and durability. Whether it's S3, Swift or Atmos, most vendors give this assurance.
  4. Object storage can work with NAS and commodity nodes, where scaling out is just the addition of new compute rather than new storage.
  5. That brings us to the point that if your data is read-write intensive, such as, say, databases, then block storage, such as with a SAN, will be more helpful.
  6. You can sign a link to the object and share it with others to download with their web browser of choice.
If you have access keys to the object storage, you can upload the objects this way :

       $s3 = S3Client::factory(array(
            'base_url' => $host,
            'key'    => $access,
            'secret' => $secret,
            'region' => $region,
            'ssl.certificate_authority' => $ssl_certificate_authority,
        ));

        $result = $s3->putObject(array(
            'Bucket'     => $bucket,
            'Key'        => $key,
            'SourceFile' => $file,
            'Metadata'   => array(
                'source' => 'internet',
                'dated' => 'today'
            ) 
        ));

        $s3->waitUntilObjectExists(array(
            'Bucket' => $bucket,
            'Key'    => $key,
        ));

This is written using the S3 API.
Most S3-compatible vendors of object storage maintain their own set of access keys, so you cannot use one vendor's access keys against another vendor's endpoint or storage.