Thursday, February 28, 2019

Today we continue discussing the best practice from storage engineering:

515) Most storage products have embraced APIs in one form or another. Their use for protocols with external agents, for internal diagnostics and for manageability makes them valuable as online tools, and they merit the same if not better appreciation than scripts and offline tools.

516) Storage products solve only a piece of the puzzle, and customers don’t always have boilerplate problems. Consequently, there needs to be some bridging somewhere.

517) Customers also prefer the ability to switch products and stacks. They are willing to try out new solutions but have become increasingly wary of tying themselves to any one product or its increasing encumbrances.

518) Customers have a genuine problem with data being sticky. They cannot keep up with data transfers.

519) Customers want the expedient solution first, but they are not willing to pay for re-architectures.

520) Customers need to evaluate even the cost of data transfer over the network. Their own priorities and severities are what matter most to them.

Wednesday, February 27, 2019

Today we continue discussing the best practice from storage engineering:

510) The storage product just like any other software product is a culmination of efforts from a forum of roles and people playing those roles. The recognition and acceptance of the software is their only true feedback.

511) Almost every entry of user data in storage is sandwiched between a header and a footer in some container, and the data segments are read with an offset and a length. This mechanism is repeated at various layers and becomes all the more useful when the data is encrypted (a minimal sketch appears after this list).

512) Similarly entries of data are interspersed with routine markers and indicators from packaging and processing perspectives. Many background jobs frequently stamp what’s relevant to them in between data segments so that they can continue their processing in a progressive manner.

513) It must be noted that the packaging of data inside the storage product has many artifacts that are internal to the product and certainly not readable in raw form. Therefore, an offline command-line tool to dump and parse the contents could prove very helpful (see the sketch after this list).

514) The argument above also holds true for message passing between shared libraries inside the storage product. While logs help capture the conversations, their entries may end up truncated. An offline tool to fully record, replay and interpret large messages would be helpful for troubleshooting.

515) Most storage products have embraced APIs in one form or another. Their use for protocols with external agents, for internal diagnostics and for manageability makes them valuable as online tools, and they merit the same if not better appreciation than scripts and offline tools.
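
As an illustration of the framing in 511) and the offline inspection in 513), here is a minimal Python sketch. The header and footer layouts, the magic values and the dump routine are hypothetical and not taken from any particular product; real container formats carry many more fields such as versions and checksums.

import struct

HEADER_FMT = ">4sII"    # magic, segment id, payload length (hypothetical layout)
FOOTER_FMT = ">4sI"     # magic, checksum placeholder
HEADER_MAGIC = b"SEGH"
FOOTER_MAGIC = b"SEGF"

def write_segment(buf, segment_id, payload):
    # sandwich the payload between a header and a footer
    offset = len(buf)
    buf += struct.pack(HEADER_FMT, HEADER_MAGIC, segment_id, len(payload))
    buf += payload
    buf += struct.pack(FOOTER_FMT, FOOTER_MAGIC, 0)
    return offset, len(buf) - offset      # the offset and length a reader needs

def dump_segment(buf, offset):
    # offline parse: interpret the raw bytes at the given offset
    header_size = struct.calcsize(HEADER_FMT)
    magic, segment_id, length = struct.unpack_from(HEADER_FMT, buf, offset)
    assert magic == HEADER_MAGIC, "offset is not a segment boundary"
    payload = bytes(buf[offset + header_size : offset + header_size + length])
    return segment_id, payload

container = bytearray()
offset, length = write_segment(container, 1, b"user data")
print(dump_segment(container, offset))    # (1, b'user data')

The same offset and length recorded at write time are all that a reader, or an offline dump tool, needs to recover the segment even when the payload itself is opaque or encrypted.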

Tuesday, February 26, 2019

We continue with the discussion on ledger parser
Checks and balances need to be performed as well by some readers. The entries may become spurious and there are ways a ghost entry can be made. Therefore, there is some trade-off in deciding whether the readers or the writers or both do the cleanup work.
The trouble with cleanup is that it requires modifications to the ledger, and while these can be done independently by designated agents, they don’t inspire as much confidence as when they come from the source of truth, the online writer. Since cleanup is a read-write task, it involves some form of writer.
If all the writes were uniform and clean entries, that would do away with the need for designated agents and their lags and delays. At this end, the writer-only solution trades off performance. At the other end, high-performance online writing introduces all kinds of background processors, stages and cascading complexity into the analysis.
The savvy online writer may choose partitioned ledgers for clean and accurately differentiated entries, or quick translations to uniform entries by piggy-backing the information on the transactions, where it can keep a trace of entries for that transaction, as sketched below. The cleaner the data, the simpler the solution for downstream systems.
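A minimal sketch of such a writer in Python; the partition keys, entry fields and the per-transaction trace are assumptions made here for illustration, not a prescribed format.

from collections import defaultdict

class PartitionedLedger:
    # one clean, uniform stream of entries per partition key
    def __init__(self):
        self.partitions = defaultdict(list)

    def append(self, partition, entry):
        self.partitions[partition].append(entry)

class Transaction:
    # piggy-back a trace of entries on the transaction and flush them together
    def __init__(self, ledger, txn_id):
        self.ledger, self.txn_id, self.trace = ledger, txn_id, []

    def record(self, kind, **fields):
        self.trace.append({"txn": self.txn_id, "kind": kind, **fields})

    def commit(self):
        for entry in self.trace:
            self.ledger.append(entry["kind"], entry)

ledger = PartitionedLedger()
txn = Transaction(ledger, txn_id=42)
txn.record("debit", amount=100)
txn.record("credit", amount=100)
txn.commit()
print(ledger.partitions["debit"])    # [{'txn': 42, 'kind': 'debit', 'amount': 100}]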
The elapsed time for analysis is the sum of the whole pipeline; therefore, all lags and delays from background operations are included. Streamlining the analysis operations requires the data to be clean so that the analysis operations are read-only transactions themselves.
If the operations themselves contribute to delays, streamlining the overall analysis requires persisting the results from these operations wherever the input does not change. Then the operations need not be run each time and the results can be re-used for other queries. If the stages are sequential, the overall analysis becomes streamlined.
Unfortunately, the data from users cannot be confined to a size. Passing the results from one stage to another breaks down when there is a large amount of data to be processed in that stage, and sequential processing does not scale. In such cases, partial results retrieved earlier enable downstream systems to have something meaningful to do, and often this is adequate even if all the data has not been processed (see the sketch below).
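A minimal Python sketch of the two ideas above: a stage persists its result against a fingerprint of its input so that unchanged inputs are not recomputed, and the analysis yields partial results so that downstream systems can start before everything is processed. The cache, the stage and the batch size are illustrative assumptions.

import hashlib, json

results = {}    # persisted stage results, keyed by stage name and input fingerprint

def cached_stage(name, func, rows):
    key = (name, hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest())
    if key not in results:          # recompute only when the input has changed
        results[key] = func(rows)
    return results[key]

def analyze(rows, batch=1000):
    # yield partial results so downstream work has something meaningful to do early
    for i in range(0, len(rows), batch):
        yield cached_stage("sum", sum, rows[i:i + batch])

for partial in analyze(list(range(5000))):
    print(partial)                  # each partial sum arrives before the whole finishes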

Monday, February 25, 2019

Ledger Parser
Bookkeeping is an essential routine in software computing. A ledger facilitates book-keeping between components. The ledger may have entries that are written by one component and read by another. The syntax and semantics of the writes need to be the same for the reads. However, the writer and the reader may have very different loads and capabilities. Consequently, some short-hand may sneak into the writing and some parsing may be required on the reading side.
When the writer is busy, it takes extra effort to massage the data with calculations to come up with concise and informative entries. This is evident when the writes are part of online activities that the business depends on. On the other hand, it is easier for the reader to take on the chores of translation and other expensive routines, because the reads are usually for analysis and do not have to impact online transactions.
This makes the ledger more like a table with different attributes so that all kinds of entries can be written. The writer merely captures the relevant entries without performing any calculations. The readers can use the same table and perform selection of rows or projection of columns to suit their needs, as sketched below. The table sometimes ends up being very generic and keeps expanding its columns.
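A minimal sketch of that division of labor in Python, with a list of dictionaries standing in for the generic table; the attribute names are made up for illustration.

ledger = []    # the generic, ever-widening table

def write(entry):
    # the busy online writer just captures the raw attributes, no calculations
    ledger.append(entry)

def read(predicate, columns):
    # readers take on the expensive work: selection of rows, projection of columns
    return [{c: row.get(c) for c in columns} for row in ledger if predicate(row)]

write({"ts": 1, "op": "put", "bytes": 4096, "node": "a"})
write({"ts": 2, "op": "get", "bytes": 512, "node": "b"})
print(read(lambda r: r["op"] == "put", ["ts", "bytes"]))    # [{'ts': 1, 'bytes': 4096}]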
The problem is that if the writers don’t do any validation or tidying, the ledger becomes bloated and dirty. On the other hand, the readers may find it more and more onerous to look back even a few records because there is more range to cover. This most often manifests as delays and lags between when an entry was recorded and when it made it into the reports.
Checks and balances need to be performed as well by some readers. The entries may become spurious and there are ways a ghost entry can be made. Therefore, there is some trade-off in deciding whether the readers or the writers or both do the cleanup work.
The trouble with cleanup is that it requires modifications to the ledger, and while these can be done independently by designated agents, they don’t inspire as much confidence as when they come from the source of truth, the online writer. Since cleanup is a read-write task, it involves some form of writer.
If all the writes were uniform and clean entries, that would do away with the need for designated agents and their lags and delays. At this end, the writer-only solution trades off performance. At the other end, high-performance online writing introduces all kinds of background processors, stages and cascading complexity into the analysis.
The savvy online writer may choose partitioned ledgers for clean and accurately differentiated entries, or quick translations to uniform entries by piggy-backing the information on the transactions, where it can keep a trace of entries for that transaction. The cleaner the data, the simpler the solution for downstream systems.

Sunday, February 24, 2019

Today we continue discussing the best practice from storage engineering:

500) There are no containers with native support for decision-tree, classification and outlier data in unstructured storage, but since these can be represented as key-values, they can be assigned to the objects themselves or maintained in dedicated metadata.
501) The instructions for setting up any web application on the object storage are easy to follow because they involve the same steps. On the other hand, performance optimization for such a web application varies on a case-by-case basis.
502) Application optimization is probably the only layer that truly remains with the user even in a full-service stack using a storage product. Scaling, availability, backup, patches, install, host maintenance, rack maintenance do remain with the storage provider.
503) The use of http headers, attributes, protocol-specific syntax and semantics, REST conventions, OAuth and other such standards is well-known and covered in their respective RFCs on the net. A Content-Delivery-Network can be provisioned straight from the object storage. Application optimization is about using both judiciously.
504) An out-of-box service can facilitate administrator-defined rules for choosing the type of optimizations to perform. Moreover, rules need not be written in the form of declarative configuration; they can be dynamic, in the form of a module.
505) The application optimization layer also acts as a gateway when appropriate. Any implementation of a gateway has to maintain a registry of destination addresses. As http-accessible objects proliferate with their geo-replications, this registry becomes granular at the object level while rules determine the site from which each object should be accessed. Finally, the gateway gathers access statistics and metrics which are very useful for understanding the http accesses of specific content within the object storage (a minimal sketch follows).
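
As a minimal sketch of the registry and rule hook in 504) and 505), in Python; the object keys, site names, rule signature and counter shown here are assumptions for illustration only.

from collections import Counter

registry = {    # per-object registry of geo-replicated destinations
    "bucket1/video.mp4": ["eu-central.example.com", "us-west.example.com"],
}
access_stats = Counter()

def nearest_site(sites, client_region):
    # the rule could be a dynamically loaded module; this default matches the region prefix
    return next((s for s in sites if s.startswith(client_region)), sites[0])

def route(object_key, client_region):
    site = nearest_site(registry[object_key], client_region)
    access_stats[object_key] += 1    # metrics for understanding http accesses per object
    return "https://" + site + "/" + object_key

print(route("bucket1/video.mp4", "us"))    # served from the us-west replica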

Saturday, February 23, 2019

Today we continue the discussion on the best practice from storage engineering:

496) Tags can be used to make recommendations against the data to be searched. Tags point to groups, and the preferences of the group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keeping track of preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the participants in the selected group.

497) A useful data structure for mining the logical data model is the decision tree. Its interior nodes are drawn from a set (A1, …, An) of categorical attributes, each leaf is a class label from domain(C), and each edge carries a value from domain(Ai), where Ai is the attribute associated with the parent node. The tree behaves as a search tree, associating the tuples in R with the leaves, i.e. the class labels. The advantage of using a decision tree is that it can work with heterogeneous data and the decision boundary is parallel to the axes.

498) Clustering is a technique for categorization and segmentation of tuples. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find a set of groups of rows in R such that the groups are cohesive and not coupled: the tuples within a group are similar to each other, and the tuples across groups are dissimilar. The constraint is that the number of clusters may be given and the clusters should be significant.

499) Outliers are the rows that are most dissimilar. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find the rows in R which are dissimilar to most points in R. The objective is to maximize the dissimilarity function, subject to a constraint on the number of outliers, or to report the significant outliers if that is given (a brute-force sketch appears after this list).

500) There are no containers with native support for decision-tree, classification and outlier data in unstructured storage, but since these can be represented as key-values, they can be assigned to the objects themselves or maintained in dedicated metadata.
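
As a sketch of the outlier search in 499), this brute-force Python version scores each row by its average distance to every other row and keeps the k most dissimilar. The quadratic scan and the use of Euclidean distance as the dissimilarity function are simplifications for illustration.

import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def outliers(rows, k=1):
    # score each row by its average dissimilarity to every other row
    scored = []
    for i, p in enumerate(rows):
        distances = [euclidean(p, q) for j, q in enumerate(rows) if j != i]
        scored.append((sum(distances) / len(distances), p))
    scored.sort(reverse=True)        # most dissimilar rows first
    return [row for _, row in scored[:k]]

rows = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]
print(outliers(rows, k=1))           # [(9, 9)] stands apart from the tight cluster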

Friday, February 22, 2019

Today we continue discussing the best practice from storage engineering:

493) To find similar people to form a group, we use some form of a similarity score. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart. Then the people who are close together on the chart can form a group. These scores can then be used with tags. The same applies to resources.

494) To determine the closeness, a couple of mathematical formulas help. In this case, we could use the Euclidean distance or the Pearson coefficient. The Euclidean distance between two points in a multidimensional space is found by taking the sum of the squares of the differences between the coordinates of the points and then calculating the square root of the result: d(x, y) = sqrt(sum over i of (xi - yi)^2).

495) The Pearson correlation coefficient is a measure of how highly correlated two variables are. It is generally a value between -1 and 1, where -1 means there is a perfect inverse correlation, 1 means there is a perfect correlation and 0 means there is no correlation. Over n paired samples, the numerator is the sum of the products of the pairs minus the product of the individual sums divided by n, and the denominator is the square root of the product of the corresponding terms for each variable taken alone: r = (sum(xy) - sum(x)sum(y)/n) / sqrt((sum(x^2) - (sum(x))^2/n) * (sum(y^2) - (sum(y))^2/n)). A worked sketch appears at the end of this post.

496) Tags can be used to make recommendations against the data to be searched. Tags point to groups, and the preferences of the group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keeping track of preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the participants in the selected group, as sketched below.
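
A minimal sketch of 495) and 496) together in Python: a nested dictionary of 1-to-5 rankings and a Pearson score computed over the items two people have ranked in common. The people, items and rankings are made up for illustration.

import math

# nested dictionary: person -> {item: ranking on a scale of 1 to 5}
prefs = {
    "alice": {"backup": 5, "tiering": 3, "dedupe": 4},
    "bob":   {"backup": 4, "tiering": 2, "dedupe": 3},
    "carol": {"backup": 1, "tiering": 5, "dedupe": 2},
}

def pearson(p1, p2):
    shared = [item for item in prefs[p1] if item in prefs[p2]]   # items ranked in common
    n = len(shared)
    if n == 0:
        return 0.0
    x = [prefs[p1][item] for item in shared]
    y = [prefs[p2][item] for item in shared]
    sx, sy = sum(x), sum(y)
    sxx, syy = sum(v * v for v in x), sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    den = math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return 0.0 if den == 0 else (sxy - sx * sy / n) / den

print(round(pearson("alice", "bob"), 2))      # 1.0, bob ranks the items in the same order as alice
print(round(pearson("alice", "carol"), 2))    # about -0.96, carol's preferences run opposite

A ranked list of suggestions for a person can then weight the group's rankings by these similarity scores.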