Tuesday, February 26, 2019

We continue the discussion on the ledger parser.
Checks and balances also need to be performed by some readers. Entries may become spurious, and there are ways a ghost entry can be made. Therefore, there is a trade-off in deciding whether the readers, the writers, or both do the cleanup work.
The trouble with cleanup is that it requires modifications to the ledger, and while these can be made independently by designated agents, they do not inspire as much confidence as changes coming from the source of truth, the online writer. Since cleanup is a read-write task, it involves some form of writer.
If all the writes were uniform, clean entries, there would be no need for designated agents and their lags and delays. At this end of the spectrum, the writer-only solution trades off performance. At the other end, high-performance online writing introduces all kinds of background processors, stages, and cascading complexity in analysis.
The savvy online writer may choose partitioned ledgers for clean, accurate, differentiated entries, or quick translation to uniform entries by piggy-backing the information on the transactions, where it can keep a trace of the entries for that transaction. The cleaner the data, the simpler the solution for downstream systems.
The elapsed time for analysis is the sum of its parts, so all lags and delays from background operations are included. Streamlining the analysis operations requires the data to be clean so that the analysis operations are read-only transactions themselves.
If the operations themselves contribute to delays, streamlining the overall analysis requires persisting the results of these operations wherever their input does not change. The operations then need not run each time, and the results can be reused for other queries. If the stages are sequential, the overall analysis becomes streamlined.
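A minimal sketch of this idea, assuming the analysis is organized as named stages whose results can be persisted under a fingerprint of their input; the stage names and the in-memory cache here are stand-ins for a real persistence layer:

```python
import hashlib
import json

# In-memory stand-in for a persisted result store, keyed by (stage name, input fingerprint).
_stage_cache = {}

def run_stage(name, func, payload):
    """Run an analysis stage, reusing the persisted result when its input is unchanged."""
    fingerprint = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    key = (name, fingerprint)
    if key not in _stage_cache:
        _stage_cache[key] = func(payload)  # recompute only when the input changed
    return _stage_cache[key]

# Sequential stages: later queries reuse earlier results when the inputs repeat.
entries = [{"amount": 5}, {"amount": None}, {"amount": 7}]
cleaned = run_stage("clean", lambda rows: [r for r in rows if r["amount"] is not None], entries)
total = run_stage("total", lambda rows: sum(r["amount"] for r in rows), cleaned)
print(total)  # 12
```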
Unfortunately, the data from users cannot be confined to a fixed size. Passing the results from one stage to another breaks down when there is a large amount of data to be processed in a stage, and sequential processing does not scale. In such cases, partial results retrieved earlier enable downstream systems to have something meaningful to do. Often, this is adequate even if all the data has not yet been processed.

Monday, February 25, 2019

Ledger Parser
Bookkeeping is an essential routine in software computing. A ledger facilitates bookkeeping between components. The ledger may have entries that are written by one component and read by another. The syntax and semantics of the writes need to be the same for the reads. However, the writer and the reader may have very different loads and capabilities. Consequently, some shorthand may sneak into the writing, and some parsing may be required on the reading side.
When the writer is busy, it takes extra effort to massage the data with calculations to come up with concise and informative entries. This is evident when the writes are part of online activities that the business depends on. On the other hand, it is easier for the reader to take on the chores of translation and the expensive routines, because the reads are usually for analysis and do not have to impact online transactions.
This makes the ledger more like a table with different attributes so that all kinds of entries can be written. The writer merely captures the relevant entries without performing any calculations. The readers can use the same table and perform selection of rows or projection of columns to suit their needs. The table sometimes ends up being very generic and keeps expanding its columns.
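A minimal sketch of this arrangement, where the writer only appends raw attributes and the readers do the selection and projection; the attribute names here are made up:

```python
from datetime import datetime, timezone

ledger = []  # the shared, generic table: one dict of attributes per entry

def write_entry(**attributes):
    """The online writer appends whatever attributes it has, with no calculations."""
    ledger.append({"recorded_at": datetime.now(timezone.utc).isoformat(), **attributes})

def read_entries(predicate, columns):
    """Readers take on the work: selection of rows and projection of columns."""
    return [{c: e.get(c) for c in columns} for e in ledger if predicate(e)]

write_entry(component="billing", amount=42, currency="USD")
write_entry(component="audit", note="manual adjustment")
report = read_entries(lambda e: e.get("component") == "billing", ["recorded_at", "amount"])
print(report)
```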
The problem is that if the writers don't do any validation or tidying, the ledger becomes bloated and dirty. The readers then find it more and more onerous to look back even a few records because there is more range to cover. This most often manifests as delays and lags between when an entry was recorded and when it made it into the reports.
Checks and balances also need to be performed by some readers. Entries may become spurious, and there are ways a ghost entry can be made. Therefore, there is a trade-off in deciding whether the readers, the writers, or both do the cleanup work.
The trouble with cleanup is that it requires modifications to the ledger, and while these can be made independently by designated agents, they do not inspire as much confidence as changes coming from the source of truth, the online writer. Since cleanup is a read-write task, it involves some form of writer.
If all the writes were uniform, clean entries, there would be no need for designated agents and their lags and delays. At this end of the spectrum, the writer-only solution trades off performance. At the other end, high-performance online writing introduces all kinds of background processors, stages, and cascading complexity in analysis.
The savvy online writer may choose partitioned ledgers for clean, accurate, differentiated entries, or quick translation to uniform entries by piggy-backing the information on the transactions, where it can keep a trace of the entries for that transaction.

Sunday, February 24, 2019

Today we continue discussing the best practice from storage engineering:

500) There are no containers for native support of decision-tree, classification, and outlier data in unstructured storage, but since these can be represented as key-values, they can be assigned to the objects themselves or maintained in dedicated metadata (sketched below, after this list).
501) The instructions for setting up any web application on the object storage are easy to follow because they involve the same steps. On the other hand, performance optimization for such a web application has to be decided on a case-by-case basis.
502) Application optimization is probably the only layer that truly remains with the user, even in a full-service stack using a storage product. Scaling, availability, backup, patching, installation, host maintenance, and rack maintenance remain with the storage provider.
503) The use of HTTP headers, attributes, protocol-specific syntax and semantics, REST conventions, OAuth, and other such standards is well known and covered in their respective RFCs. A Content Delivery Network can be provisioned straight from the object storage. Application optimization is about using both judiciously.
504) An out-of-box service can facilitate administrator-defined rules for choosing the type of optimizations to perform. Moreover, rules need not be written as declarative configuration; they can be dynamic, in the form of a module (a sketch of such a module follows this list).
505) The application-optimization layer also acts as a gateway when appropriate. Any implementation of a gateway has to maintain a registry of destination addresses. As HTTP-accessible objects proliferate with their geo-replications, this registry becomes granular at the object level, while rules determine the site from which each object should be accessed. Finally, the gateway gathers access statistics and metrics, which are very useful for understanding the HTTP accesses of specific content within the object storage (a sketch of such a registry follows this list).
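For 500), a minimal sketch of attaching such derived results as key-values on the object itself or in a dedicated metadata object; it assumes an S3-compatible store accessed through boto3, and the bucket, keys, and labels are made up:

```python
import json
import boto3

s3 = boto3.client("s3")

# Derived results expressed as plain key-values rather than native containers.
derived = {"class-label": "premium", "outlier": "false", "tree-path": "region=us/size=large"}

# Option 1: assign the key-values to the object itself as user-defined metadata.
s3.copy_object(Bucket="analytics-bucket",
               Key="records/cust-001.json",
               CopySource={"Bucket": "analytics-bucket", "Key": "records/cust-001.json"},
               Metadata=derived,
               MetadataDirective="REPLACE")

# Option 2: maintain the key-values in a dedicated metadata object alongside the data.
s3.put_object(Bucket="analytics-bucket",
              Key="metadata/cust-001.json",
              Body=json.dumps(derived).encode("utf-8"))
```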
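For 504), a minimal sketch of rules supplied as a dynamic module rather than declarative configuration; the rule signatures, request shape, and optimization names are hypothetical:

```python
# admin_rules.py -- a rule module the optimization service could load dynamically.
# Each rule inspects a request and returns the optimizations to apply.

def cache_small_objects(request):
    if request.get("size_bytes", 0) < 1_000_000:
        return ["edge-cache"]
    return []

def compress_text(request):
    if request.get("content_type", "").startswith("text/"):
        return ["gzip"]
    return []

RULES = [cache_small_objects, compress_text]

def optimizations_for(request):
    """Evaluate all administrator-defined rules against one request."""
    return [opt for rule in RULES for opt in rule(request)]

print(optimizations_for({"size_bytes": 2048, "content_type": "text/html"}))
# ['edge-cache', 'gzip']
```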
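For 505), a minimal sketch of a gateway registry that tracks destination sites per object, picks a site by a simple rule, and counts accesses; all names and the site-selection rule are illustrative:

```python
from collections import Counter

# Registry: object key -> candidate sites holding a geo-replicated copy.
registry = {
    "videos/intro.mp4": ["https://us-east.store.example", "https://eu-west.store.example"],
}
access_stats = Counter()

def resolve(object_key, client_region):
    """Pick the destination site for an object and record the access."""
    sites = registry[object_key]
    # Rule: prefer a site whose address mentions the client's region, else the first site.
    chosen = next((s for s in sites if client_region in s), sites[0])
    access_stats[(object_key, chosen)] += 1
    return chosen

print(resolve("videos/intro.mp4", "eu-west"))
print(access_stats)
```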

Saturday, February 23, 2019

Today we continue the discussion on the best practice from storage engineering:

496) Tags can be used to make recommendations against the data to be searched. Tags point to groups, and the preferences of the group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keeping track of preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the participants in the selected group.

497) A useful data structure for mining the logical data model is the decision tree. Its structure involves interior nodes drawn from a set (A1, ..., An) of categorical attributes; a leaf is a class label from domain(C); an edge carries a value from domain(Ai), where Ai is the attribute associated with the parent node. The tree behaves like a search tree: its defining property is that it maps the tuples in R to the leaves, i.e. to class labels. The advantage of using a decision tree is that it can work with heterogeneous data and its decision boundaries are parallel to the axes.
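A minimal sketch of such a tree over made-up categorical attributes, with interior nodes testing an attribute, edges labeled by attribute values, and leaves holding class labels:

```python
# Interior node: (attribute, {value -> child}); leaf: a class label string.
tree = ("region", {
    "us": ("size", {"large": "premium", "small": "standard"}),
    "eu": "standard",
})

def classify(node, tuple_):
    """Follow the edge matching the tuple's attribute value until a leaf (class label)."""
    while isinstance(node, tuple):
        attribute, children = node
        node = children[tuple_[attribute]]
    return node

print(classify(tree, {"region": "us", "size": "large"}))  # premium
```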

498) Clustering is a technique for categorization and segmentation of tuples. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find a set of groups of rows in R such that the groups are cohesive and not coupled: the tuples within a group are similar to each other, and the tuples across groups are dissimilar. The constraint is that the number of clusters may be given and the clusters should be significant.
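A minimal sketch of grouping rows by a similarity function so that groups are cohesive; a greedy threshold grouping stands in for a real clustering algorithm, and the rows, attributes, and threshold are made up:

```python
def similarity(r1, r2):
    """Fraction of attributes on which two rows agree (a stand-in similarity function)."""
    keys = set(r1) | set(r2)
    return sum(r1.get(k) == r2.get(k) for k in keys) / len(keys)

def cluster(rows, threshold=0.5):
    """Greedy grouping: a row joins the first group it is similar enough to."""
    groups = []
    for row in rows:
        for group in groups:
            if similarity(row, group[0]) >= threshold:
                group.append(row)
                break
        else:
            groups.append([row])
    return groups

rows = [{"region": "us", "size": "large"}, {"region": "us", "size": "small"},
        {"region": "eu", "size": "small"}]
print(cluster(rows))
```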

499) Outliers are the rows that are most dissimilar. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find the rows in R which are dissimilar to most points in R. The objective is to maximize the dissimilarity, with a constraint on the number of outliers, or on significant outliers if given.
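A minimal sketch of selecting the rows most dissimilar to most other rows, with a constraint on the number of outliers; the similarity function and data are illustrative:

```python
def outliers(rows, similarity, k=1):
    """Return the k rows with the lowest average similarity to all other rows."""
    def avg_similarity(row):
        others = [r for r in rows if r is not row]
        return sum(similarity(row, r) for r in others) / len(others)
    return sorted(rows, key=avg_similarity)[:k]

points = [1.0, 1.1, 0.9, 9.0]
print(outliers(points, lambda a, b: 1.0 / (1.0 + abs(a - b))))  # [9.0]
```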

500) There are no containers for native support of decision-tree, classification, and outlier data in unstructured storage, but since these can be represented as key-values, they can be assigned to the objects themselves or maintained in dedicated metadata.

Friday, February 22, 2019

Today we continue discussing the best practice from storage engineering:

493) To find similar people to form a group, we use some form of a similarity score. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart. The people who are close together on the chart can then form a group. These scores can then be used with tags. The same applies to resources.

494) To determine this closeness, a couple of mathematical formulas help. In this case, we could use the Euclidean distance or the Pearson coefficient. The Euclidean distance finds the distance between two points in a multidimensional space by taking the sum of the squares of the differences between the coordinates of the points and then calculating the square root of the result.
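A minimal sketch of the Euclidean distance over the items two people have ranked in common; the people and rankings are made up:

```python
from math import sqrt

def euclidean_distance(p1, p2):
    """Square root of the sum of squared differences over items both people have ranked."""
    common = set(p1) & set(p2)
    return sqrt(sum((p1[item] - p2[item]) ** 2 for item in common))

alice = {"movie-a": 5, "movie-b": 3}
bob = {"movie-a": 4, "movie-b": 1}
print(euclidean_distance(alice, bob))  # sqrt(1 + 4) = 2.236...
```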

495) The Pearson correlation coefficient is a measure of how highly correlated two variables are. It is generally a value between -1 and 1, where -1 means a perfect inverse correlation, 1 means a perfect correlation, and 0 means no correlation. It is computed with the numerator as the sum of the products of the two variables minus the product of their individual sums divided by the number of samples, and the denominator as the square root of the product of the two terms obtained by substituting the same variable for the other in the numerator (that is, each variable's sum of squares minus its sum squared divided by the number of samples).
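A minimal sketch of the Pearson coefficient computed over the items two people have rated in common; the sample data is made up:

```python
from math import sqrt

def pearson(p1, p2):
    """Pearson correlation coefficient between two people's ratings on common items."""
    common = set(p1) & set(p2)
    n = len(common)
    if n == 0:
        return 0.0
    sum1 = sum(p1[i] for i in common)
    sum2 = sum(p2[i] for i in common)
    sum1_sq = sum(p1[i] ** 2 for i in common)
    sum2_sq = sum(p2[i] ** 2 for i in common)
    product_sum = sum(p1[i] * p2[i] for i in common)
    numerator = product_sum - (sum1 * sum2 / n)
    denominator = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    return numerator / denominator if denominator else 0.0

alice = {"movie-a": 5, "movie-b": 3, "movie-c": 4}
bob = {"movie-a": 4, "movie-b": 1, "movie-c": 2}
print(pearson(alice, bob))  # close to 1.0 for similar preferences
```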

496) Tags can be used to make recommendations against the data to be searched. Tags point to groups, and the preferences of the group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keeping track of preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the participants in the selected group.
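A minimal sketch of the nested dictionary of preferences and of ranking suggestions from a tagged group; the group tag, members, items, and 1-to-5 rankings are made up:

```python
# Nested dictionary: group tag -> member -> item -> ranking on a 1 to 5 scale.
preferences = {
    "photographers": {
        "alice": {"tripod": 5, "filter-kit": 4},
        "bob": {"tripod": 4, "drone": 5},
    },
}

def suggestions(group_tag, exclude=()):
    """Rank items by the average ranking given by members of the tagged group."""
    totals, counts = {}, {}
    for member_prefs in preferences[group_tag].values():
        for item, rank in member_prefs.items():
            if item in exclude:
                continue
            totals[item] = totals.get(item, 0) + rank
            counts[item] = counts.get(item, 0) + 1
    return sorted(totals, key=lambda i: totals[i] / counts[i], reverse=True)

print(suggestions("photographers"))  # ['drone', 'tripod', 'filter-kit']
```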

Thursday, February 21, 2019

Today we continue discussing the best practice from storage engineering:

489) Tags can generate more tags. Background processing and automation can work with tags to generate more tags. For example, a clustering operation on the existing data using similarity measures on existing tags will generate more tags.
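A minimal sketch of one such automation: grouping resources by the Jaccard similarity of their existing tag sets and emitting a new group tag per cluster; the resources, tags, and threshold are made up:

```python
def jaccard(a, b):
    """Similarity of two tag sets: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

resources = {
    "vm-1": {"linux", "web", "prod"},
    "vm-2": {"linux", "web", "test"},
    "db-1": {"windows", "sql"},
}

def generate_group_tags(resources, threshold=0.4):
    """Cluster resources by tag similarity and emit a new group tag per cluster."""
    groups = []
    for name, tags in resources.items():
        for group in groups:
            if jaccard(tags, resources[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return {name: f"group-{i}" for i, group in enumerate(groups) for name in group}

print(generate_group_tags(resources))
# {'vm-1': 'group-0', 'vm-2': 'group-0', 'db-1': 'group-1'}
```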

490) Tags also work as friendly names for resources that are not visible or tracked at the billing level. For example, if a virtual machine has several network interface cards (NICs), then keeping track of the different models of the virtual machines may not provide sufficient granularity for the tags. On the other hand, keeping track of all the models of the NICs, albeit software devices, with their identifiers may be too many to keep track of. Instead, tags can represent hierarchical information by masking different tags at lower levels. Thus hierarchical tags provide a sliding scale of granularity on the associated resources, and a search can be expanded to include sub-resources.
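A minimal sketch of hierarchical tag values whose prefixes mask lower-level detail, so a search can slide between coarse and fine granularity; the tag values and separator are made up:

```python
# Hierarchical tag values: coarser levels on the left, finer detail on the right.
tags = {
    "vm-1/nic-0": "vm/standard-d2/nic/intel-82599",
    "vm-1/nic-1": "vm/standard-d2/nic/mellanox-cx4",
    "vm-2/nic-0": "vm/standard-b1/nic/intel-82599",
}

def search(prefix):
    """Slide the granularity: a shorter prefix matches more resources."""
    return [resource for resource, tag in tags.items() if tag.startswith(prefix)]

print(search("vm/standard-d2"))            # both NICs of the standard-d2 machine
print(search("vm/standard-d2/nic/intel"))  # narrows to the Intel NIC only
```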

491) We can assign tags only to resources that already exist. If we add a tag that has the same key as an existing tag on that resource, the new value overwrites the old value. We can edit tag keys and values, and we can remove tags from a resource at any time. We can set a tag's value to the empty string, but we can't set a tag's value to null. We can even control who can see these tags.

492) Tagging, unlike relational data, can be very helpful for NoSQL-like querying and batch processing. Since it does not involve the cloud provider's operational data on the resources, it has no performance impact and is better suited for analytics, offline processing, and reporting.

Wednesday, February 20, 2019

Today we continue discussing the best practice from storage engineering:

485) Tags don't have any semantic meaning to the functional aspects of the resource and are interpreted strictly as a string of characters. Also, tags are not automatically assigned to our resources.

486) Tags can easily be authored and managed via the console, command-line interface, or API.

487) Resources have their identifiers, but metadata on the resources can be added even after the instance has been created. If we treat tags as friendly names for these data types, then we have more tags than before, which expands the options mentioned above.

488) Tags are also lines of search. When a user gives a search term or terms, very often she is trying to find one item that is not turning up. The user has to refine the search terms, invoke many more options, or dig through voluminous results. Instead, if the lines of search were available as intentions, then we could show results corresponding to them.

489) Tags can generate more tags. Background processing and automation can work with tags to generate more tags. For example, a clustering operation on the existing data using similarity measures on existing tags will generate more tags.