Thursday, July 7, 2016

Backup services in a private cloud 
A private cloud has several valuable assets in the form of virtual machine instances. Backing up these VMs enables point-in-time restore, disaster recovery and other benefits. Backups differ from snapshots in that they may include storage volumes and application data. Both backups and snapshots involve large files, often hundreds of gigabytes. 
Many cloud providers offer their own backup products. However, a private cloud may be built on more than one cloud provider, and it may shift weight among its vendors over time. Relying on one or another of these cloud-specific products is therefore inflexible. A combination of them could work, but each would introduce a new variable for the clients. Some commercial backup products claim to provide a central solution across heterogeneous clouds, but they may impose their own restrictions on backup activities. Consequently, a do-it-yourself backup product that can do all that we want in a private cloud becomes the only satisfactory solution. 
As is conventional with offerings from the private cloud, a service seems better than a product. In addition, if this is written as a microservice, it becomes much more manageable and testable, given that it is a dedicated service. 
Moreover, clients can then call it via the command line, a user interface or direct API calls. The service offers two kinds of backup: frequent backups via scheduled jobs and infrequent, on-demand push-button backups. These differentiate a customer's workloads and make it easier for the customer to realize their backups. A frequent schedule relieves the customer of taking repeated actions and identifies the workloads that can be siphoned off to automation. 
By design, the backup service does not take the source of the data into consideration and performs the same operations once data is made available to it. The data is differenced, then compressed, encrypted and stored on destination file storage. The data in this case happens to be virtual machine files. The type of virtual machine, its flavor of operating system and the cloud provider from which it was provisioned do not matter; the virtual machine is exported as a set of files.  
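As a minimal sketch of this source-agnostic pipeline (the function names, the tiny chunk size, and the XOR stand-in for encryption are illustrative assumptions, not the service's actual tooling):

```python
import hashlib
import zlib

CHUNK = 4  # tiny chunk size for illustration; real backups use MB-sized chunks

def difference(data: bytes, baseline_hashes: set) -> list:
    """Keep only the chunks not already in the baseline (incremental backup)."""
    changed = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        if hashlib.sha256(chunk).hexdigest() not in baseline_hashes:
            changed.append((i, chunk))
    return changed

def compress(chunks):
    return [(offset, zlib.compress(chunk)) for offset, chunk in chunks]

def encrypt(chunks, key: bytes):
    # Placeholder XOR "cipher" for illustration only; a real service
    # would use an authenticated cipher such as AES-GCM.
    return [(offset, bytes(b ^ key[i % len(key)] for i, b in enumerate(blob)))
            for offset, blob in chunks]

def backup(data: bytes, baseline_hashes: set, key: bytes):
    """Difference, then compress, then encrypt: the order described above."""
    return encrypt(compress(difference(data, baseline_hashes)), key)

# The service treats exported VM disk files and in-guest files identically:
baseline = {hashlib.sha256(b"aaaa").hexdigest()}
payload = backup(b"aaaabbbb", baseline, key=b"secret")
# Only the changed second chunk survives differencing.
```

The same three functions apply whether the bytes came from an exported disk file or from files inside a guest.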
Consequently, the backup service can be used with all kinds of data, and the architecture for the service takes this into consideration. If the source of the data is not virtual machines but files and folders within a virtual machine, the service performs the same set of operations. The data is made available to the backup service by agents within the virtual machine, or by some delegate for the cloud provider that exports the virtual machines. There is thus a decoupling between the data packager (in this case, the component tasked with exporting the virtual machines from the cloud provider) and the backup activities performed once the data is made available.  
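One way to express this decoupling is to hide the provider-specific export behind a common interface; the class and method names below are illustrative assumptions, with the exports stubbed out:

```python
from abc import ABC, abstractmethod

class DataPackager(ABC):
    """Anything that can make source data available as a set of files."""
    @abstractmethod
    def export(self, source_id: str) -> list:
        ...

class VCenterPackager(DataPackager):
    """A delegate for the cloud provider: exports a VM as disk files (stubbed)."""
    def export(self, source_id: str) -> list:
        return [f"{source_id}.vmdk", f"{source_id}.vmx"]

class InGuestAgentPackager(DataPackager):
    """An agent inside the VM exports selected files and folders (stubbed)."""
    def export(self, source_id: str) -> list:
        return [f"{source_id}/etc.tar"]

def run_backup(packager: DataPackager, source_id: str) -> list:
    # The backup side never inspects the source; it only consumes exported files.
    files = packager.export(source_id)
    return [f"processed:{name}" for name in files]

result = run_backup(VCenterPackager(), "vm01")
```

Swapping in `InGuestAgentPackager` changes what is exported but not the backup side, which is the decoupling described above.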
That said, the task of packaging the virtual machines is complex. Different cloud providers have different nuances. These involve tasks such as interacting with the cloud center to take a snapshot of the virtual machine. Often a backup can be taken only when the virtual machine is powered off, which is sometimes unacceptable because virtual machines may be required to be active 24x7. In such cases the virtual machines may need to be cloned after 'quiescing' the file system. Different cloud providers have different techniques or mechanisms for this. Some allow only a snapshot to be taken, which works only with the originating datacenter. Others allow the virtual machine to be exported in a more portable format that can be used to restore it to other datacenters. Moreover, the set of tasks to export a VM may also differ based on the flavor of the virtual machine and the tasks associated with it. The packaging tasks also vary based on where they are performed; for example, there are differences between what is packaged by a request to the operating system and what is packaged by a request to the cloud provider, such as a VMware vCenter. While public clouds offer snapshots of virtual machines as well as an export of the storage volumes associated with them, different cloud providers may have different ways to request the export of all files relevant to a virtual machine. It just so happens that the primary disk of a virtual machine in a public cloud has all the information needed for the virtual machine to be launched. Therefore virtual machines can be exported as a set of virtual disk files.  
The task of packaging does not culminate with the availability of disk files. Sometimes descriptors need to be added, or the files need to be packaged into an archive, all of which requires downloading the relevant files to a local disk where the operation can be performed. Such copying can involve large files and require a lot of time and bandwidth.  
If a download is involved, there must be enough space on the local disk or a remote share. If it is a remote share, there will be contention from different copiers. Consequently, it is better to direct each copier's destination to a different file share. For example, files in the source may be categorized by the vCenters they belong to. This gives some distribution of source files to destinations and reduces contention at any single file share.   
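A deterministic mapping from vCenter to file share achieves this distribution; the share names below are hypothetical:

```python
import hashlib

# Hypothetical pool of destination file shares.
FILE_SHARES = ["//share-a/backups", "//share-b/backups", "//share-c/backups"]

def share_for(vcenter: str) -> str:
    """Map a vCenter deterministically to one file share so its copiers
    never contend with copiers working on behalf of another vCenter."""
    digest = int(hashlib.md5(vcenter.encode()).hexdigest(), 16)
    return FILE_SHARES[digest % len(FILE_SHARES)]
```

Because the mapping is a pure function of the vCenter name, every copier computes the same destination without any coordination.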
Moreover, each file copy may span several minutes, which means the task requires its own state. Fortunately the state does not need to be updated very often. Task parallel libraries like Celery maintain this state in their backend; typically a message broker is involved. If each file copy-and-transfer is considered a processor, and each processor does its bookkeeping with a message broker, an enterprise-grade message broker is required so that all tasks can share the same broker endpoint. Message brokers can scale to form a cluster, just like databases, and do not need to be partitioned. 
However, if the source and destination loads are independent, they can be partitioned across different processors, databases and message brokers, so there can be more than one such server doing the backups. This involves a tradeoff in how many units of such servers are maintained: each unit adds complexity, maintenance and additional points of failure, whereas brokers and databases are designed to scale as a central resource. Having one server that targets different regions for processing is simpler and gives clients a single endpoint while the distribution is managed internally. Additionally, only one script with the prepopulated instance names is required, and the same script can be used in parallel by different processors running for different regions. 
The suggestion here is to leverage virtualization for the different regions but provide one coordinator for all clients. The coordinator itself is no more than a distributor, and the virtualization goes only as deep as cloned processors for each copy-transfer, with the associated resources being independent for each region. Since the processors are clones, the script does not change across workers, which lets the tasks be parallelized with a task parallel library. A processor is invoked with the appropriate parameters at the time the user makes the request to backup. 
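The coordinator-plus-cloned-processors idea can be sketched with the standard library; in the actual service a task parallel library such as Celery with a message broker would play this role, and the names here are illustrative:

```python
import queue
import threading

def make_region_worker(region, jobs, results, lock):
    """A cloned processor: the same code runs for every region; only the
    region-scoped resources (here, its own queue) differ."""
    def run():
        while True:
            vm = jobs.get()
            if vm is None:          # sentinel: no more work for this region
                jobs.task_done()
                return
            with lock:
                results.append((region, vm))  # stand-in for copy-transfer
            jobs.task_done()
    return threading.Thread(target=run)

def coordinate(workload: dict) -> list:
    """One coordinator distributes VMs to per-region queues; each region's
    cloned processor drains its own queue independently."""
    results, lock, threads = [], threading.Lock(), []
    for region, vms in workload.items():
        jobs = queue.Queue()
        for vm in vms:
            jobs.put(vm)
        jobs.put(None)
        t = make_region_worker(region, jobs, results, lock)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results

done = coordinate({"east": ["vm1", "vm2"], "west": ["vm3"]})
```

The clients see one entry point (`coordinate`), while the regions proceed in parallel with independent queues.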
The job id issued by the processor is maintained along with the backup request until the processor changes the state of the request from pending to completed. The processors reside on machines earmarked one per region so as to distribute the network and disk I/O of virtual machines in the same region. The database and message broker used by the processors do NOT need to be virtualized per region because they hold only metadata and do not affect the processing or data operations.  
An enterprise-grade message broker will be required for any such task parallelization and asynchronous processing. This message broker maintains a job id for each job in its queue. It can handle fair queuing, retries and queue chaining. Each job has states that can be queried for completion status. The job enqueuers can fire and forget, then later check whether the job is done by looking it up with the id issued at the time the job was queued. This is how task parallel libraries enable asynchronous processing. Since each job can be completed by a separate worker, it also enables parallelization.   
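The fire-and-forget pattern can be sketched as follows; a task parallel library would provide this bookkeeping through its broker and result backend, so the in-memory job store below is only an illustrative stand-in:

```python
import itertools

class JobQueue:
    """Minimal stand-in for a broker's job bookkeeping: issue an id at
    enqueue time and let the enqueuer poll that id later."""
    def __init__(self):
        self._ids = itertools.count(1)
        self._jobs = {}

    def enqueue(self, payload) -> int:
        job_id = next(self._ids)
        self._jobs[job_id] = {"payload": payload, "state": "pending"}
        return job_id

    def work(self):
        # A worker completes pending jobs; each could go to a separate worker.
        for job in self._jobs.values():
            if job["state"] == "pending":
                job["state"] = "completed"

    def status(self, job_id: int) -> str:
        return self._jobs[job_id]["state"]

q = JobQueue()
jid = q.enqueue("backup vm42")   # fire...
q.work()                         # ...a worker runs at some later point
state = q.status(jid)            # ...then look the job up by its id
```

The enqueuer holds nothing but the id, which is exactly what the backup request records until the state flips to completed.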
Progress can be reported if the transfer is done in chunks. A simple stdout message from the script during large chunked transfers can be sufficient to provide progress information. Consequently, the script's output may be piped as a job progress indicator. 
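A chunked copy that prints progress to stdout might look like this; the function name and chunk size are illustrative:

```python
import io

def copy_with_progress(src, dst, total_size: int, chunk_size: int = 4):
    """Copy in chunks, emitting progress to stdout so the job runner can
    pipe the script's output as a progress indicator."""
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        print(f"progress: {100 * copied // total_size}%")
    return copied

src = io.BytesIO(b"0123456789abcdef")  # stands in for a large disk file
dst = io.BytesIO()
n = copy_with_progress(src, dst, total_size=16)
```

In production the chunk size would be large (megabytes) so the progress lines stay infrequent relative to the transfer.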
We alleviated bottlenecks and improved the partitioning and distribution of loads, along with asynchronous processing. Therefore this system can scale to a thousand virtual machine backups in fifteen-minute windows each, measured from the time the lease is acquired to the state being updated as completed. 
The task of backup does not merely involve copying; it also includes differencing the source data, storing the full or incremental data set at storage separate from the source, and encrypting it. Consequently, data differencing, compression and encryption tools are required. Moreover, since thousands of virtual machines may be backed up at any given time, the activities performed on each must be transparent and readily diagnosable via detailed processing logs kept current up to the present moment. 
Together, the tasks of packaging the virtual machines and backing them up to remote file storage complete the backup of individual virtual machines. The rest is about partitioning and scalability to handle workloads, which are specified in a database where state can be maintained for each operation. 
  
#codingquestion

  
  1. Given an unsorted array of size 'n' containing only positive integers, we can form groups of two or three elements such that the sum of all elements in a group is a multiple of 3. Find the maximum number of groups that can be formed this way.

// Requires: using System; using System.Collections.Generic; using System.Linq;
static void Combine(List<int> a, List<int> b, int start)
{
    for (int i = start; i < a.Count; i++)
    {
        b.Add(a[i]);
        if ((b.Count == 2 || b.Count == 3) && b.Sum() % 3 == 0)
            Console.WriteLine(string.Join(",", b));
        if (b.Count < 3)              // no point growing past three elements
            Combine(a, b, i + 1);     // recurse from i + 1, not start + 1
        b.RemoveAt(b.Count - 1);      // backtrack
    }
}
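If the groups must be disjoint (each element used at most once), which the phrase "maximum number of groups" suggests, counting by residues modulo 3 avoids enumeration altogether; this interpretation and the greedy pairing below are assumptions about the intended problem:

```python
def max_groups(a):
    """Count elements by residue mod 3, then greedily form the smallest
    (two-element) groups first: pairs of residue-0 elements, and pairs of
    one residue-1 with one residue-2 element. Whatever remains in the
    larger of the residue-1/residue-2 classes can only form triples."""
    c = [0, 0, 0]
    for x in a:
        c[x % 3] += 1
    pairs = c[0] // 2 + min(c[1], c[2])
    leftover = abs(c[1] - c[2])       # all the same residue, so only triples
    return pairs + leftover // 3
```

Two-element groups are preferred because each group consumes fewer elements, so pairing first never yields fewer groups than forming triples.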
