Cluster computing

Friday, February 22, 2019

Today we continue discussing the best practice from storage engineering:

493) To find similar people to form a group, we use some form of a similarity score. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart. Then the people who are close together on the chart can form a group. These scores can then be used with tags. The same applies to resources.

494) To determine the closeness a couple of mathematical formula help. In this case, we could use the Euclidean distance or the Pearson co-efficient. The Euclidean distance finds the distance between two points in a multidimensional space by taking the sum of the square of the differences between the coordinates of the points and then calculating the square root of the result.

495) The Pearson correlation co-efficient is a measure of how highly correlated the two variables are. It’s generally a value between -1 and 1 where -1 means that there is a perfect inverse correlation and 1 means there is a perfect correlation while 0 means there is no correlation. It is computed with the numerator as the sum of the two variables taken together minus the average of their individual sums and this is divided by the square-root of the product of the squares of the substitutions to the numerator by using the same variable instead of the other.

496) Tags can be used to make recommendations against the data to be searched. Tags point to groups and the preferences of the group is used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keeping track of preferences is a nested dictionary. This dictionary could use a quantitative ranking say on a scale of 1 to 5 to denote the preferences of the participants in the selected group.

Thursday, February 21, 2019

Today we continue discussing the best practice from storage engineering:

489) Tags can generate more tags. Background processing and automation can work with tags to generate more tags. For example, a clustering operation on the existing data using similarity measures on existing tags will generate more tags.

490) Tags also work as friendly names for resources that are not visible or tracked at the billing level. For example, if a virtual machine has several network interface cards (NIC) then keeping track of the different models of the virtual machines may not be sufficient granularity for the tags. On the other hand keeping track of all the models of the NIC albeit software device with their identifiers may be too many to keep track off. Instead tags could represent hierarchical information by masking different tags at lower levels. Thus hierarchical tags can be used to have a sliding scale of granularity on the associated resources. This way search can be expanded to include sub-resources

491) We can assign tags only to resources that already exist. If we add a tag that has the same key as an existing tag on that resource, the new value overwrites the old value. We can edit tag keys and values, and we can remove tags from a resource at any time. We can set a tag's value to the empty string, but we can't set a tag's value to null. We can even control who can see these tags.

492)Tagging unlike relational data can come in very helpful for NoSQL like querying and batch processing. Since it does not involve operational data on the resources for the cloud provider, it does not have any performance impact and is more suited for analytics, offline processing and reporting.

Wednesday, February 20, 2019

Today we continue discussing the best practice from storage engineering:

485) Tags don't have any semantic meaning to the functional aspects of the resource and are interpreted strictly as a string of characters. Also, tags are not automatically assigned to our resources.

486) Tags can easily be authored and managed by console, command line interface or API

487) Resources have their identifiers but the metadata on the resources can even be added after the instance has been created. If we treat tags as friendly names for these data types then we have more tags than earlier and thus expanding the options mentioned above.

488) Tags are also lines of search. When a user gives a search term or terms, very often she is trying to find one item that is not being found. The user has to improve the search terms or invoke a lot more options or dig through voluminous results. Instead, if the lines of search were available as intentions, then we can show results corresponding to them.

489) Tags can generate more tags. Background processing and automation can work with tags to generate more tags. For example, a clustering operation on the existing data using similarity measures on existing tags will generate more tags.

Tuesday, February 19, 2019

Today we continue discussing the best practice from storage engineering:

481) The nature of the query language determines the kind of resolving that the data virtualization needs to do. In addition, the type of storage that the virtualization layer spans also depend on the query language.

482) In order to explain the difference between data virtualization over structured and unstructured storage types, we look at metadata in structure storage. All data types used are registered. Whether they are system builtin types or user defined types, the catalog helps with the resolution.

483) A query describing the selection of entries with the help of predicates does not necessarily have to be bound to structured or unstructured query languages. Yet the convenience and universal appeal of one language may dominate another. Therefore, in such cases whether the query language is agnostic or predominantly biased, it can be modified or rewritten to suit the needs of the storage stacks described earlier.

484) Delegation doesn’t have to be the only criteria for the virtualization layer. Both the administrator and the system may maintain rules and configurations with which to locate the store for the data. More importantly the rules can be both static and dynamic. The former refers to rules that are declared ahead of the launch of the service and the service merely loads it in. The latter refers to the evaluations that dynamically assign queries to store based on classifiers and connection attributes.

Monday, February 18, 2019

Today we continue discussing the best practice from storage engineering:

473) Storage products are also prone to increasing their test matrix with new devices such as solid state drive and emerging trends such as IoT

474) Storage products have to be limitless for their customers but they cannot say how they will be used. They will frequently run into usages where customers use them inappropriately and go against their internal limits such as the number of policies that can be applied to their organizational units.

475) There was a time when content addressable storage was popular. It involved generating a PEA file to save contents that could be looked up by their hash. The use of object storage made it easier to access the objects directly.

476) Data is increasingly being produced as fixed content Emails and faxes are examples of these. The lifecycle of content such as from system, personal computing, Network centric and content centric are progressively higher and higher in their durations

477) Drop and create of user artifacts helps the user to cleanup. This is not the case for say system catalog. Still the storage artifacts used on behalf of the user is also the same as the storage artifacts used for system itself. Creating and dropping such artifacts would be helpful even if they are internal.

478) The retention policy is typically 6 months for email, 3 years for financial data, 5 years for legal. The retention period for object storage is user defined.

479) Object Storage is touted as best for static content. Data that changes often is then said to be preferred in NoSQL or other unstructured storage. With object versioning, API and SDK, this is no longer the case.

480) Data Transfers have never been considered a virtual storage since they belong to the source. Data in transit can live in queues, cache and object storage which is good for vectorized execution .

Sunday, February 17, 2019

Today we continue discussing the best practice from storage engineering :

469) Storage products are tested for size and usage under all circumstances. These can be fine grained or aggregated and can be queried at different scopes and levels.

470) Storage product is generally the source of truth for all upstream data sources and workflows.

471) Storage products manage to be the source of truth even with different consistency models. They just need to meet their usages.

472) Storage products have evolved from being purely disk based solutions to compute and network embracing software defined stacks. Consequently they are better able to be tested but the complexity increases.

473) Storage products are also prone to increasing their test matrix with new devices such as solid state drive and emerging trends such as IoT

474) Storage products have to be limitless for their customers but they cannot say how they will be used. They will frequently run into usages where customers use them inappropriately and go against their internal limits such as the number of policies that can be applied to their organizational units.

#codingexercise
Friends Pairing problem:
Given n friends, each one can remain single or can be paired up with some other friend. Each friend can be paired only once so ordering is irrelevant
The total number of ways in which the friends can be paired is given by;
Int GetPairs(int n)
{
If (n <=2) return n;
Return GetPairs(n-1) + GetPairs(n-2)*(n-1);
}

Saturday, February 16, 2019

Today we continue discussing the best practice from storage engineering :

466) Storage products have multiple paths of data entry usually via protocols. These are each tested using their respective protocol tools

467) Storage products are usually part of tiered storage. As such data aging and validation needs to be covered

468) Storage products are tested with different batches of loads. They are also tested using continuous loads with varying rate over time

469) Storage products are tested for size and usage under all circumstances. These can be fine grained or aggregated and can be queried at different scopes and levels.

470) Storage product is generally the source of truth for all upstream data sources and workflows.

#codingexercise
The coin selection problem can scale to any constant number of coins that can be picked in each turn using the methods below.
int GetCoinsKWithDP(List<int> coins, int i, int j, int k)
{
if (i > j) return 0;
if (i==j) return coins[i];
if (j - i +1 <= k) {
List <int> change = new ArrayList <int>();
for (int c = i ; c <= j; c++) {
change.add (coins [c]) ;
}
return Collections.sum (change);
}
List <int> options = new ArrayList ();
for (int m = 0; m < k; m++) {
int change = 0;

for (int left = 0; left < k; left++) {
change += coins [i+left];
int option = change + GetCoinsDP (sequence, i+left, j);
options.add (option);
}

for (int right = 0; right < k; right++) {
change += coins [j-right];
int option = change + GetCoinsDP (sequence, i, j-right);
options.add (option);
}

for (int left = 0; left < k; left++) {
for (int right = 0; right < k-left right++) {
change += coins [j-right];
int option = change + GetCoinsDP (sequence, i+left, j-right);
options.add (option);
}
}
}
return Collections.max (options);
}
The selections in each initiation level of this GetCoinsKWithDP can be added to a list and alternate additions can be skipped as belonging to the other player since the method remains the same for both.