Tuesday, August 14, 2018

We were discussing the suitability of Object Storage to various workloads and the programmability convenience that enables migration of old and new workloads. In particular, we discussed connectors for various data sources and their bidirectional data transfer. Duplicity is a command-line tool that is an example of a connector, but we were discussing the availability of an SDK with the object storage. Writing the connectors for each data source is very much like an input-output model. The data flows either from the external source to the object storage or from the object storage to the external source. In each of these directions, a connector changes only with the type of external source. The object-storage-facing part of the connector is already implemented in the form of the S3 APIs for read and write; the API varies only with the data source. This makes it easy to write a connector as an amalgam of a source-facing API for bidirectional transfer and the Object-Storage-facing S3 APIs. A read from the external data source is written to object storage with the S3 PUT API, and a write to the external destination has its data come from object storage with a read using the S3 GET API. Since each connector varies by the type of external data platform, connectors can be written one per platform so that each is easy to use with that platform. Also, SDKs facilitate development by providing language-based convenience, so the same connector SDK may be offered in more than one language.
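A minimal sketch of this input-output pattern, in Python with the boto3 S3 client; the source_read and source_write callables are hypothetical stand-ins for whatever source-facing API a given data platform exposes:

import boto3

s3 = boto3.client("s3")  # the Object-Storage-facing half: standard S3 APIs

def ingest(source_read, bucket, key):
    # External source -> Object Storage: read from the source, then S3 PUT.
    data = source_read()  # source-facing API; the only part that varies per platform
    s3.put_object(Bucket=bucket, Key=key, Body=data)

def egress(source_write, bucket, key):
    # Object Storage -> external destination: S3 GET, then write to the destination.
    data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    source_write(data)  # destination-facing API; again the only varying part

Only source_read and source_write change per data platform; the S3-facing half of the connector stays the same.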
SDKs may be offered in any language for the convenience of writing data transfer in any environment, and it does not stop there. A UI widens the audience for the same purposes, bringing in administrators and systems engineers without the need for writing scripts or code. ETL, for example, is a very popular use of designer tools, with drag-and-drop logic facilitating the wiring and transfer of data. The SDK may power the UI as well, and both can be adapted to the data source, environment and tasks.
#codingexercise
bool isDivisibleBy55(uint n)
{
    // 55 = 5 * 11, and 5 and 11 are coprime
    return isDivisibleBy5(n) && isDivisibleBy11(n);
}
bool isDivisibleBy77(uint n)
{
    // 77 = 7 * 11, and 7 and 11 are coprime
    return isDivisibleBy7(n) && isDivisibleBy11(n);
}
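These exercises assume helper predicates (isDivisibleBy2 through isDivisibleBy11) defined in earlier posts; as a reference, here is a minimal Python sketch of plausible digit-rule implementations, assuming non-negative integers:

def is_divisible_by_2(n):
    return n % 10 in (0, 2, 4, 6, 8)  # last digit is even

def is_divisible_by_3(n):
    return sum(int(d) for d in str(n)) % 3 == 0  # digit sum divisible by 3

def is_divisible_by_5(n):
    return n % 10 in (0, 5)  # last digit is 0 or 5

def is_divisible_by_7(n):
    # Repeatedly subtract twice the last digit from the remaining digits.
    while n >= 70:
        n = abs(n // 10 - 2 * (n % 10))
    return n % 7 == 0

def is_divisible_by_11(n):
    # Alternating sum of digits divisible by 11.
    s = str(n)
    return (sum(int(d) for d in s[::2]) - sum(int(d) for d in s[1::2])) % 11 == 0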

Monday, August 13, 2018

We were discussing the suitability of Object Storage to various workloads.
We said that connectors for these data sources are not offered out of the box with object storage products, but they could immensely benefit data ingestion. The S3 API deals exclusively with the namespace, buckets and objects, even when the APIs are made available as part of an SDK, but something more is needed for the connectors.
Writing the connectors for each data source is very much like an input-output model. The data flows either from the external source to the object storage or from the object storage to the external source. In each of these directions, a connector changes only with the type of external source. The object-storage-facing part of the connector is already implemented in the form of the S3 APIs for read and write; the API varies only with the data source. This makes it easy to write a connector as an amalgam of a source-facing API for bidirectional transfer and the Object-Storage-facing S3 APIs. A read from the external data source is written to object storage with the S3 PUT API, and a write to the external destination has its data come from object storage with a read using the S3 GET API. Since each connector varies by the type of external data platform, connectors can be written one per platform so that each is easy to use with that platform. Also, SDKs facilitate development by providing language-based convenience, so the same connector SDK may be offered in more than one language.
The connectors are just one example of the programmability convenience for data ingestion from different workloads. Specifying metadata for the objects and showing sample queries on object storage as part of the SDK is another convenience for developers using Object Storage. Well-written examples in the SDK, and documentation that eases search and analytics with Object Storage, will tremendously help the advocacy of Object Storage in different software stacks and offerings. Moreover, it will be helpful to log all SDK activities for data and queries so that they can make their way to a log store for convenience with audit and log analysis. Using the SDK to improve automatic tagging and logging is a powerful technique for improving usability and maintaining history.
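A sketch of both conveniences together, again assuming Python with boto3; the bucket name, key, body and metadata values are placeholders:

import logging
import boto3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("s3-sdk-audit")
s3 = boto3.client("s3")

def put_with_metadata(bucket, key, body, metadata):
    # Attach user-defined metadata to the object and log the activity
    # so the record can flow to a log store for audit and log analysis.
    s3.put_object(Bucket=bucket, Key=key, Body=body, Metadata=metadata)
    log.info("PUT s3://%s/%s metadata=%s", bucket, key, metadata)

put_with_metadata("my-bucket", "reports/q3.csv", b"col1,col2\n",
                  {"source": "etl", "owner": "analytics"})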
#codingexercise
bool isDivisibleBy22(uint n)
{
    // 22 = 2 * 11, and 2 and 11 are coprime
    return isDivisibleBy2(n) && isDivisibleBy11(n);
}
bool isDivisibleBy33(uint n)
{
    // 33 = 3 * 11, and 3 and 11 are coprime
    return isDivisibleBy3(n) && isDivisibleBy11(n);
}

Sunday, August 12, 2018

We were discussing the suitability of Object Storage to various workloads, after having discussed its advantages and its position as a perfect storage tier.
The data sources can include:
Backup and restore workflows
Data warehouse ETL loads
Log stores and indexes
Multimedia libraries
Other file systems
Relational database connections
NoSQL databases
Graph databases
All upstream storage appliances excluding aging tiers.
Notice that the connectors for these data sources are not offered out of the box with object storage. In reality, the S3 API deals exclusively with the namespace, buckets and objects, even when the APIs are made available as part of an SDK.
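For contrast, here is the S3 API surface itself, sketched with boto3; it speaks only in terms of buckets, keys and objects, with no notion of the external data platforms listed above (bucket and key names are placeholders):

import boto3

s3 = boto3.client("s3")

s3.create_bucket(Bucket="ingest-bucket")                          # namespace: a bucket
s3.put_object(Bucket="ingest-bucket", Key="a/b.txt", Body=b"hi")  # write an object
obj = s3.get_object(Bucket="ingest-bucket", Key="a/b.txt")        # read it back
listing = s3.list_objects_v2(Bucket="ingest-bucket")              # enumerate keys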
Writing the connectors for each data source is very much like an input-output model. The data flows either from the external source to the object storage or from the object storage to the external source. In each of these directions, a connector changes only with the type of external source. The object-storage-facing part of the connector is already implemented in the form of the S3 APIs for read and write; the API varies only with the data source. This makes it easy to write a connector as an amalgam of a source-facing API for bidirectional transfer and the Object-Storage-facing S3 APIs. A read from the external data source is written to object storage with the S3 PUT API, and a write to the external destination has its data come from object storage with a read using the S3 GET API. Since each connector varies by the type of external data platform, connectors can be written one per platform so that each is easy to use with that platform. Also, SDKs facilitate development by providing language-based convenience, so the same connector SDK may be offered in more than one language.



#codingexercise
bool isDivisibleBy14(uint n)
{
    // 14 = 2 * 7, and 2 and 7 are coprime
    return isDivisibleBy2(n) && isDivisibleBy7(n);
}

Saturday, August 11, 2018

Object Storage is very popular with certain content. Files directly map to objects. Multimedia content is also well served from object storage. Large files, such as those from Artifactory, are also suitable for Object Storage. An entire cluster-based file system may also be exported and used with an Object Store. A deduplication appliance may also provide benefits in conjunction with an Object Storage.
Object Storage is usually viewed as a storage appliance in itself. Therefore it provides a form of raw storage suitable for whatever can be viewed as objects. However, a suite of connectors may be made available in the form of an SDK that enables data to move into object storage from well-known platforms. For example, data in a content library can be moved into object storage with the help of a connector in the SDK. This is just one example; there are several more.
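Because files map directly to objects, a content-library connector can be little more than a directory walk; a sketch under that assumption, with the library path and bucket name as hypothetical inputs:

import os
import boto3

s3 = boto3.client("s3")

def upload_library(root, bucket):
    # Each file becomes an object; the directory hierarchy is preserved
    # as a key prefix, and the file bytes become the object body.
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, "rb") as f:
                s3.put_object(Bucket=bucket, Key=key, Body=f)

upload_library("/var/content-library", "media-bucket")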

The data sources can include:
Backup and restore workflows
Data warehouse ETL loads
Log stores and indexes
Multimedia libraries
Other file systems
Relational database connections
NoSQL databases
Graph databases
All upstream storage appliances excluding aging tiers

#codingexercise
bool isDivisibleBy21(uint n)
{
    // 21 = 3 * 7, and 3 and 7 are coprime
    return isDivisibleBy3(n) && isDivisibleBy7(n);
}

Friday, August 10, 2018

We were discussing application virtualization and the migration of workloads:
We brought up how both the application and the storage tier benefit from virtualization and from the automation of workload migration using tools. Object storage itself may be on a container, facilitating easy migration across hosts. Since object storage virtualizes datacenters and storage arrays, it is at once a storage application and a representation of unbounded storage space. Once the workloads have been migrated to object storage, both can be moved around the cloud much more nimbly than if they used raw storage volumes.
One of the challenges associated with migration is that the Application Server - Storage Tier model has evolved into far more complex paradigms. There is no longer just an application server and a database. Servers are replaced by clusters and nodes, applications are replaced by modules, and modules run on containers. Platform as a service has evolved to using Mesos and Marathon, where even the storage volumes are moved around if they are not shared volumes. Data usually resides in the form of files, and database connectivity is re-established because the connection string does not change as the nodes are rotated. Marathon monitors the health of the nodes as the application and storage are moved around. In object storage, the location of an object is arbitrary once the underlying storage is virtualized. Object storage itself may use a container, which may make it portable, but it is generally not the norm to move Object Storage around in a Marathon framework. If anything, Object Storage is akin to a five-hundred-pound gorilla in the room.
Object Storage is very popular with certain content. Files directly map to objects. Multimedia content is also well served from object storage. Large files, such as those from Artifactory, are also suitable for Object Storage. An entire cluster-based file system may also be exported and used with an Object Store. A deduplication appliance may also provide benefits in conjunction with an Object Storage.
Object Storage is usually viewed as a storage appliance in itself. Therefore it provides a form of raw storage suitable for whatever can be viewed as objects. However, a suite of connectors may be made available in the form of an SDK that enables data to move into object storage from well-known platforms. For example, data in a content library can be moved into object storage with the help of a connector in the SDK. This is just one example; there are several more.

#codingexercise
bool isDivisibleBy12(uint n)
{
    // 12 = 3 * 4, and 3 and 4 are coprime
    return isDivisibleBy3(n) && isDivisibleBy4(n);
}

Thursday, August 9, 2018

We were discussing application virtualization and the migration of workloads:
There are a few other caveats with application virtualization. The storage volumes usually move with the rotation of the servers, as demonstrated by Mesos. This is very different from object storage, where the storage is virtualized. When the storage volumes are moved around, the data usually resides in the form of a file, such as a database file. The database connectivity is re-established because the connection string does not change as the nodes are rotated. Furthermore, the servers rotated in and out may use the same database file. In object storage, the location of an object is arbitrary once the underlying storage is virtualized. This might explain why object storage provides a storage tier as opposed to end-to-end virtualization. There are tools that help with workload migration. These tools provide what is termed "smart availability" by enabling dynamic movement of workloads between physical, virtual and cloud infrastructure. This is an automation of all the tasks required to migrate a workload. Even the connection string can be retained when moving the workload, so long as the network name can be reassigned between servers. What this automation does not do is perform storage- and OS-level data replication, because the source and destination are something the users may want to specify themselves, and that is beyond what is needed for migrating the workloads. Containers and shared volumes come close to providing this kind of ease, but they do not automate all the tasks needed on the container to perform seamless migration regardless of the compute. Also, this automation makes no distinction between Linux containers and Docker containers. These tools are often used for high availability and for separating read-only data access to be performed from the cloud.
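The connection-string point can be made concrete with a small sketch: clients address a stable network name rather than a node, so a rotation behind that name only costs a reconnect. The hostname and port here are hypothetical:

import socket
import time

STABLE_NAME = "db.internal.example"  # reassigned to whichever node is active
PORT = 5432

def connect_with_retry(retries=5, delay=2.0):
    # The client keys off the stable name, not a node address, so a
    # server rotation behind the name is invisible beyond a reconnect.
    for attempt in range(retries):
        try:
            return socket.create_connection((STABLE_NAME, PORT), timeout=5)
        except OSError:
            time.sleep(delay)
    raise ConnectionError("could not reach %s:%d" % (STABLE_NAME, PORT))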
With the above explanation of workload migration, we have shown how both the application and the storage tier benefit from virtualization and from the automation of migration using tools. Object storage itself may be on a container, facilitating easy migration across hosts. Since object storage virtualizes datacenters and storage arrays, it is at once a storage application and a representation of unbounded storage space. Once the workloads have been migrated to object storage, both can be moved around the cloud much more nimbly than if they used raw storage volumes.
#codingexercise
bool isDivisibleBy4(uint n)
{
    // A number is divisible by 4 iff its last two decimal digits are.
    uint m = n % 100;
    return (m % 4 == 0);
}

Wednesday, August 8, 2018

We were discussing the cloud-first strategy for newer workloads as well as migrating older workloads to Object Storage. We did not mention any facilitators of workload migrations, but there are many tools out there that help with the migration. We use IO capture and playback tools to study workload profiles; this can be performed in a lab environment or in production, as permitted. In addition, there are virtualizers that take a single instance of an application or service and enable it to be migrated without any concern for the underlying storage infrastructure.

It is these kinds of tools we make note of today. These tools provide what is termed "smart availability" by enabling dynamic movement of workloads between physical, virtual and cloud infrastructure. This is an automation of all the tasks required to migrate a workload. Even the connection string can be retained when moving the workload, so long as the network name can be reassigned between servers. What this automation does not do is perform storage- and OS-level data replication, because the source and destination are something the users may want to specify themselves, and that is beyond what is needed for migrating the workloads. Containers and shared volumes come close to providing this kind of ease, but they do not automate all the tasks needed on the container to perform seamless migration regardless of the compute. Also, this automation makes no distinction between Linux containers and Docker containers. These tools are often used for high availability and for separating read-only data access to be performed from the cloud.

It should be noted that application virtualization does not depend on the hypervisor layer. There are ways to use one, but it is not required. In fact, the host can be just about any compute as long as the migration is seamless, which means it can be on-premise or in the cloud. There is generally a one-to-one requirement for the app to have a host. Seamless execution of one application across many hosts is excluded unless the application runs in serverless mode, and even then, different functions may be executed one-on-one over a spun-up host. The host is not taken to be a cluster without some automation of which nodes execute the serverless functions. An application that is virtualized this way is agnostic of the host. This is therefore an extension of server virtualization, but with the added benefit of fine-grained control.

We noted that workload patterns can change over time. There may be certain seasons where the peak load occurs annually. Planning for the day-to-day load as well as the peak load therefore becomes important. Workload profiling can be repeated year-round so that the average and the maximum are known for effective planning and estimation.
Storage systems planners know their workload profiles. While deployers view applications, services and access control, storage planners see workload profiles and make their recommendations based exclusively on IO, costs and performance. In the object storage world, we have the luxury of comparison with file systems. In a file system, we have several layers, each contributing to the overall I/O of data. A bucket, on the other hand, is independent of the file system. As long as the bucket is filesystem-enabled, users can get the convenience of a file system as well as the object storage. Moreover, the user account accessing the bucket can also be set up. Only IT can determine the correct strategy for the workload, because they can profile the workload.
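The account-setup point might look like the following, again assuming boto3; the account ARN and bucket name are placeholders:

import json
import boto3

s3 = boto3.client("s3")

# Grant a specific user account read access to the bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:user/analyst"},
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::workload-bucket/*",
    }],
}
s3.put_bucket_policy(Bucket="workload-bucket", Policy=json.dumps(policy))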