Tuesday, June 21, 2016

Design considerations for backup of large files from vCenter:
Files in vCenter can be upwards of a hundred gigabytes each. Several such files may need to be downloaded before they can be archived and shipped with tools like rsync or bbcp. When many large files must be downloaded simultaneously, even multiple parallel streams may not be enough to collapse the transfer time. Moreover, the local storage needed for repackaging these files can grow arbitrarily large.
The optimal solution would be to read the source as a stream and write it to the destination on the same stream. However, since repackaging is involved, a local copy is unavoidable, and the cost of the copy-and-transfer can be addressed with the following techniques:
1.       Keep the local file only until it is repackaged and transferred to the destination.
2.       Compress and archive the packaged file before transfer.
3.       Maintain a database that maps each source file to its destination, along with metadata and a status field for retries.
4.       Parallelize per file rather than per folder; this gives finer granularity.
5.       Use a task parallel library such as Celery, together with a message broker, so that each file transfer is its own task (a sketch follows this list).
6.       Tools like duplicity require either the source or the destination to be local, so they need to be invoked twice when a local copy is used as a temporary staging area. If repackaging is not permissible at the source, it can be attempted at the destination instead; this works well for remote file storage.
7.       The local storage must be large enough to support n active downloads of an earmarked size.
8.       There must be a policy that prevents more than a certain number of active downloads; this can be enforced with the bookkeeping status in the database (see the throttling sketch further below).
9.       Instead of using transactions, it helps to track states and retries for such transfers.
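To make points 3, 5 and 9 concrete, here is a minimal sketch in Python. It assumes Celery with a Redis broker and a SQLite bookkeeping table; the table layout, broker URL and the download/repackage/upload/cleanup helpers are hypothetical placeholders, not an actual implementation:

import sqlite3
from celery import Celery

app = Celery("transfers", broker="redis://localhost:6379/0")  # assumed broker URL
DB = "transfers.db"                                           # assumed bookkeeping database

def init_db():
    # One row per file: source-to-destination mapping, metadata, status and retry count.
    with sqlite3.connect(DB) as conn:
        conn.execute("""CREATE TABLE IF NOT EXISTS transfers (
                            source      TEXT PRIMARY KEY,
                            destination TEXT,
                            size_bytes  INTEGER,
                            status      TEXT,     -- pending | active | done | failed
                            retries     INTEGER DEFAULT 0)""")

def set_status(source, status):
    with sqlite3.connect(DB) as conn:
        conn.execute("UPDATE transfers SET status = ? WHERE source = ?", (status, source))

def download(source):              # placeholder: pull the file from vCenter to local disk
    raise NotImplementedError

def repackage(local_path):         # placeholder: compress/archive the local copy
    raise NotImplementedError

def upload(package, destination):  # placeholder: ship the package, e.g. via rsync or bbcp
    raise NotImplementedError

def cleanup(*paths):               # placeholder: delete local copies to free storage
    pass

@app.task(bind=True, max_retries=3, default_retry_delay=300)
def transfer_file(self, source, destination):
    # One task per file: download, repackage, upload, then drop the local copy.
    local = package = None
    set_status(source, "active")
    try:
        local = download(source)
        package = repackage(local)
        upload(package, destination)
        set_status(source, "done")
    except Exception as exc:
        with sqlite3.connect(DB) as conn:
            conn.execute("UPDATE transfers SET retries = retries + 1, status = 'failed' "
                         "WHERE source = ?", (source,))
        raise self.retry(exc=exc)  # state plus retry instead of a transaction
    finally:
        cleanup(local, package)

Each call of transfer_file.delay(source, destination) becomes an independent unit of work, which is what makes the per-file parallelism in item 4 straightforward.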
Overall, the local storage option is expensive. Where it is unavoidable, the transfer speed, the number of active transfers, the ease of parallelization, and robustness against failures with retry logic together address these pain points.
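The cap on active downloads from item 8 can then be enforced by consulting the same bookkeeping table before dispatching more work; again a sketch under assumed names (the transfers table above, an illustrative limit of four) rather than a prescribed implementation:

import sqlite3

def next_batch(db_path="transfers.db", max_active=4):
    # Return pending (source, destination) pairs that can start without
    # exceeding the cap on simultaneous downloads.
    with sqlite3.connect(db_path) as conn:
        active = conn.execute(
            "SELECT COUNT(*) FROM transfers WHERE status = 'active'").fetchone()[0]
        budget = max(0, max_active - active)
        return conn.execute(
            "SELECT source, destination FROM transfers WHERE status = 'pending' LIMIT ?",
            (budget,)).fetchall()

A periodic scheduler can call this and hand each returned pair to the per-file transfer task sketched earlier.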
#codingexercise
 Shuffle a deck of cards:
           Knuth shuffle:
                  conceptually split the deck into an unshuffled prefix 0..i and a shuffled suffix i+1..n-1
                  pick a position uniformly at random from 0..i
                  swap that card with the card at position i, growing the shuffled suffix by one
           Fisher-Yates shuffle (the same algorithm, stated iteratively):
                  loop over the array from the last position down to the second
                  swap each element with a uniformly random element at or before the iteration point

using System;
using System.Collections.Generic;

void Shuffle(List<int> cards)
{
    var random = new Random();
    // Fisher-Yates / Knuth shuffle: walk backwards, swapping each position
    // with a uniformly chosen position at or before it.
    for (int i = cards.Count - 1; i > 0; i--)
    {
        int j = random.Next(0, i + 1); // j in [0, i]; Next's upper bound is exclusive
        int temp = cards[i];
        cards[i] = cards[j];
        cards[j] = temp;
    }
}
