Cluster computing

Sunday, February 24, 2019

Today we continue discussing the best practice from storage engineering:

500) Unstructured storage has no containers with native support for decision trees, classifications and outlier data, but since these can be represented as key-value pairs, they can be assigned to the objects themselves or maintained in dedicated metadata.
501) The instructions for setting up any web application on object storage are easy to follow because they involve the same steps. Performance optimization for such a web application, on the other hand, has to be decided case by case.
502) Application optimization is probably the only layer that truly remains with the user, even in a full-service stack using a storage product. Scaling, availability, backup, patching, installation, host maintenance and rack maintenance remain with the storage provider.
503) The use of HTTP headers, attributes, protocol-specific syntax and semantics, REST conventions, OAuth and other such standards is well known and covered in their respective RFCs. A content delivery network (CDN) can be provisioned straight from the object storage. Application optimization is about using both judiciously.
504) An out-of-the-box service can let an administrator define rules for the type of optimizations to perform. Moreover, the rules need not be written as declarative configuration. They can be dynamic, in the form of a module.
505) Application optimization also acts as a gateway when appropriate. Any implementation of a gateway has to maintain a registry of destination addresses. As HTTP-accessible objects proliferate along with their geo-replicas, this registry becomes granular at the object level, while rules determine the site from which each object needs to be accessed. Finally, the gateway gathers access statistics and metrics, which are very useful for understanding the HTTP accesses of specific content within the object storage.
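
The gateway registry described in 505 could be sketched roughly as follows. This is only an illustration: the object key, the site names, the example.com URLs and the resolve function are all hypothetical, and the rule used here (prefer the client's own region, otherwise fall back to a default site) is just one possible choice.

```python
from collections import Counter

# Hypothetical registry: object key -> {site: URL of the replica at that site}.
registry = {
    "videos/intro.mp4": {
        "us-east": "https://us-east.example.com/videos/intro.mp4",
        "eu-west": "https://eu-west.example.com/videos/intro.mp4",
    },
}

# Access statistics gathered by the gateway, keyed by (object key, site).
access_counts = Counter()

def resolve(key, client_region, default_site="us-east"):
    """Pick the replica for a request and record the access.

    Assumed rule: use the replica in the client's region when one exists,
    otherwise use the default site (which must hold a replica).
    """
    replicas = registry[key]
    site = client_region if client_region in replicas else default_site
    access_counts[(key, site)] += 1
    return replicas[site]
```

A request from a region without a replica, for example resolve("videos/intro.mp4", "ap-south"), falls back to the default site, and access_counts then shows which sites serve which objects.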

Saturday, February 23, 2019

Today we continue the discussion on the best practice from storage engineering:
496) Tags can be used to make recommendations against the data to be searched. Tags point to groups, and the preferences of the group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keeping track of preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the participants in the selected group.
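
As a minimal sketch of the nested dictionary in 496, the following ranks items by their average rating within a group. The member names, item names and ratings are made up for illustration, and averaging is only one simple way to turn group preferences into a ranked list.

```python
# Hypothetical preferences: group member -> {item: rating on a 1-5 scale}.
preferences = {
    "alice": {"report.pdf": 5, "logs.tar": 3, "photo.png": 1},
    "bob": {"report.pdf": 4, "logs.tar": 2, "video.mp4": 5},
    "carol": {"report.pdf": 3, "video.mp4": 4},
}

def ranked_suggestions(prefs, group):
    """Rank items by their average rating across the given group members."""
    totals, counts = {}, {}
    for member in group:
        for item, rating in prefs.get(member, {}).items():
            totals[item] = totals.get(item, 0) + rating
            counts[item] = counts.get(item, 0) + 1
    averages = {item: totals[item] / counts[item] for item in totals}
    # Highest average first; ties are broken by item name for a stable order.
    return sorted(averages.items(), key=lambda pair: (-pair[1], pair[0]))

suggestions = ranked_suggestions(preferences, ["alice", "bob", "carol"])
# suggestions[0] is ("video.mp4", 4.5)
```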

497) A useful data structure for mining the logical data model is the decision tree. Its interior nodes are drawn from a set (A1, …, An) of categorical attributes. Each leaf is a class label from domain(C). Each edge is a value from domain(Ai), where Ai is the attribute associated with the parent node. The tree is a search tree: it associates each tuple in R with a leaf, that is, a class label. The advantage of using a decision tree is that it can work with heterogeneous data and its decision boundaries are parallel to the axes.
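
The structure in 497 can be sketched as below, with interior nodes holding an attribute, edges keyed by attribute values and leaves holding class labels. The attributes ("media", "access") and labels ("hot", "warm", "cold") are hypothetical, and the tree is written by hand rather than learned from data.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class Leaf:
    label: str  # class label from domain(C)

@dataclass
class Node:
    attribute: str  # categorical attribute Ai tested at this node
    # Each edge is a value from domain(Ai) leading to a subtree or a leaf.
    children: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)

def classify(tree, row):
    """Follow the edge matching the row's value at each interior node."""
    while isinstance(tree, Node):
        tree = tree.children[row[tree.attribute]]
    return tree.label

# Hand-built example tree (illustrative only).
tree = Node("media", {
    "ssd": Leaf("hot"),
    "hdd": Node("access", {"frequent": Leaf("warm"), "rare": Leaf("cold")}),
})
```

Here classify(tree, {"media": "hdd", "access": "rare"}) walks two edges and returns "cold".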

498) Clustering is a technique for categorization and segmentation of tuples. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find a set of groups of rows in R such that the groups are cohesive and not coupled: the tuples within a group are similar to each other, and the tuples across groups are dissimilar. As constraints, the number of clusters may be given, and the clusters should be significant.
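
One common way to realize 498 when the number of clusters is given is k-means, sketched below. This is an assumption on our part, not something the item prescribes: it treats rows as numeric tuples, uses Euclidean distance as the dissimilarity, and seeds the centers with the first k rows so that the result is deterministic.

```python
import math

def kmeans(rows, k, iterations=10):
    """Group numeric rows into k clusters; the first k rows seed the centers."""
    centers = [tuple(r) for r in rows[:k]]
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for row in rows:
            # Assign each row to the nearest center.
            nearest = min(range(k), key=lambda i: math.dist(row, centers[i]))
            groups[nearest].append(row)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups

rows = [(1, 1), (9, 9), (1, 2), (8, 9), (2, 1), (9, 8)]
groups = kmeans(rows, k=2)
# groups[0] holds the rows near (1, 1); groups[1] holds the rows near (9, 9)
```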

499) Outliers are the rows that are most dissimilar. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find the rows in R that are dissimilar to most points in R. The objective is to maximize the dissimilarity function, subject to a constraint on the number of outliers or on their significance, if given.
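
A minimal sketch of 499, assuming numeric rows and taking the mean Euclidean distance to all other rows as the dissimilarity score. The function name and the sample rows are illustrative; the constraint on the number of outliers is the parameter n.

```python
import math

def top_outliers(rows, n):
    """Return the n rows with the largest mean distance to all other rows.

    Assumes at least two rows.
    """
    def mean_distance(i):
        total = sum(math.dist(rows[i], rows[j]) for j in range(len(rows)) if j != i)
        return total / (len(rows) - 1)

    ranked = sorted(range(len(rows)), key=mean_distance, reverse=True)
    return [rows[i] for i in ranked[:n]]

rows = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
outliers = top_outliers(rows, 1)
# outliers is [(10, 10)]
```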

Friday, February 22, 2019

Today we continue discussing the best practice from storage engineering:

493) To find similar people to form a group, we use some form of a similarity score. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart. Then the people who are close together on the chart can form a group. These scores can then be used with tags. The same applies to resources.

494) To determine the closeness, a couple of mathematical formulas help. In this case, we could use the Euclidean distance or the Pearson coefficient. The Euclidean distance finds the distance between two points in a multidimensional space by taking the sum of the squares of the differences between the coordinates of the points and then calculating the square root of the result.
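
The Euclidean distance in 494 can be written out directly. The similarity function that maps a distance into the range (0, 1] is an added convenience and an assumption of this sketch, not part of the item above.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Assumed convention: 1.0 for identical points, approaching 0 as they move apart."""
    return 1 / (1 + euclidean_distance(a, b))

distance = euclidean_distance((1, 2), (4, 6))
# distance is 5.0, since sqrt(3**2 + 4**2) = 5
```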

495) The Pearson correlation coefficient is a measure of how highly correlated two variables are. It is always a value between -1 and 1, where -1 means that there is a perfect inverse correlation, 1 means there is a perfect correlation and 0 means there is no linear correlation. It is computed with the numerator as the sum of the products of the two variables minus the product of their individual sums divided by the number of samples. This is divided by the square root of the product of two terms, each obtained from the numerator by substituting the same variable for the other, that is, the sum of squares of a variable minus the square of its sum divided by the number of samples.
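
The computation in 495 can be sketched with the usual sums-based form of the Pearson coefficient. The variable names are ours, and returning 0.0 when the denominator is zero (one variable is constant) is an assumed convention.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    # Sum of products minus the product of the individual sums over n.
    numerator = sum_xy - sum_x * sum_y / n
    # Same expression with each variable substituted for the other.
    denominator = math.sqrt((sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n))
    return 0.0 if denominator == 0 else numerator / denominator
```

For example, pearson([1, 2, 3], [2, 4, 6]) gives 1.0 and pearson([1, 2, 3], [6, 4, 2]) gives -1.0.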

Thursday, February 21, 2019

Today we continue discussing the best practice from storage engineering:

489) Tags can generate more tags. Background processing and automation can work with tags to generate more tags. For example, a clustering operation on the existing data using similarity measures on existing tags will generate more tags.

490) Tags also work as friendly names for resources that are not visible or tracked at the billing level. For example, if a virtual machine has several network interface cards (NICs), then keeping track of only the different models of virtual machines may not give the tags sufficient granularity. On the other hand, the models of the NICs, which are software devices with their own identifiers, may be too many to keep track of. Instead, tags can represent hierarchical information, with higher-level tags masking the different tags at lower levels. Thus hierarchical tags provide a sliding scale of granularity on the associated resources. This way, search can be expanded to include sub-resources.
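
One way to picture the hierarchical tags in 490 is as path-like strings, where a search at a higher level also matches everything below it. The resource names, the tag values and the "/" separator are hypothetical choices for this sketch.

```python
# Hypothetical resources mapped to hierarchical tags; "/" separates levels.
resource_tags = {
    "vm-1": "compute/vm/model-a",
    "vm-1-nic-0": "compute/vm/model-a/nic/virtio",
    "vm-1-nic-1": "compute/vm/model-a/nic/e1000",
    "vm-2": "compute/vm/model-b",
}

def search(tags, prefix):
    """Return resources whose tag equals the prefix or sits below it."""
    return sorted(
        name for name, tag in tags.items()
        if tag == prefix or tag.startswith(prefix + "/")
    )

coarse = search(resource_tags, "compute/vm/model-a")
# coarse is ["vm-1", "vm-1-nic-0", "vm-1-nic-1"]
fine = search(resource_tags, "compute/vm/model-a/nic")
# fine is ["vm-1-nic-0", "vm-1-nic-1"]
```

Searching at the coarser level pulls in the sub-resources, while the finer level narrows to the NICs only.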

491) We can assign tags only to resources that already exist. If we add a tag that has the same key as an existing tag on that resource, the new value overwrites the old value. We can edit tag keys and values, and we can remove tags from a resource at any time. We can set a tag's value to the empty string, but we can't set a tag's value to null. We can even control who can see these tags.
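
The rules in 491 can be mimicked with a small in-memory model. This is not any provider's API: the TaggedResources class and its method names are hypothetical and only illustrate the overwrite, empty-string and no-null behaviors.

```python
class TaggedResources:
    """Toy model of tag assignment rules (illustrative, not a real API)."""

    def __init__(self):
        self._tags = {}  # resource id -> {key: value}

    def create(self, resource_id):
        self._tags.setdefault(resource_id, {})

    def tag(self, resource_id, key, value):
        if resource_id not in self._tags:
            raise KeyError("tags can only be assigned to existing resources")
        if value is None:
            raise ValueError("a tag value may be empty but not null")
        # Same key on the same resource: the new value overwrites the old one.
        self._tags[resource_id][key] = value

    def untag(self, resource_id, key):
        self._tags[resource_id].pop(key, None)

    def tags(self, resource_id):
        return dict(self._tags[resource_id])

store = TaggedResources()
store.create("vm-1")
store.tag("vm-1", "env", "test")
store.tag("vm-1", "env", "prod")  # overwrites "test"
store.tag("vm-1", "owner", "")    # empty string is allowed
```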

492) Tagging, unlike relational data, can be very helpful for NoSQL-like querying and batch processing. Since it does not involve the cloud provider's operational data on the resources, it has no performance impact and is better suited for analytics, offline processing and reporting.

Wednesday, February 20, 2019

Today we continue discussing the best practice from storage engineering:

485) Tags don't have any semantic meaning to the functional aspects of the resource and are interpreted strictly as a string of characters. Also, tags are not automatically assigned to our resources.

486) Tags can easily be authored and managed through the console, the command-line interface or the API.

487) Resources have their identifiers, but metadata on the resources can be added even after the instance has been created. If we treat tags as friendly names for these data types, then we have more tags than earlier, which expands the options mentioned above.

488) Tags are also lines of search. When a user gives a search term or terms, very often she is trying to find one item that is not being found. The user has to improve the search terms or invoke a lot more options or dig through voluminous results. Instead, if the lines of search were available as intentions, then we can show results corresponding to them.

Tuesday, February 19, 2019

Today we continue discussing the best practice from storage engineering:

481) The nature of the query language determines the kind of resolving that the data virtualization needs to do. In addition, the type of storage that the virtualization layer spans also depends on the query language.

482) In order to explain the difference between data virtualization over structured and unstructured storage types, we look at metadata in structured storage. All data types used are registered. Whether they are system built-in types or user-defined types, the catalog helps with the resolution.

483) A query describing the selection of entries with the help of predicates does not necessarily have to be bound to structured or unstructured query languages. Yet the convenience and universal appeal of one language may dominate another. Therefore, in such cases whether the query language is agnostic or predominantly biased, it can be modified or rewritten to suit the needs of the storage stacks described earlier.

484) Delegation doesn't have to be the only criterion for the virtualization layer. Both the administrator and the system may maintain rules and configurations with which to locate the store for the data. More importantly, the rules can be both static and dynamic. The former refers to rules that are declared ahead of the launch of the service, which the service merely loads. The latter refers to evaluations that dynamically assign queries to a store based on classifiers and connection attributes.
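
The static and dynamic rules in 484 might be combined along these lines. The store names, the dataset names, the toy classifier and the precedence (dynamic rules first, then static rules, then a default) are all assumptions made for this sketch.

```python
# Static rules: declared before launch and simply loaded by the service.
STATIC_RULES = {"orders": "sql-store", "images": "object-store"}

def dynamic_rule(query, connection):
    """Evaluate at request time using connection attributes and a toy classifier."""
    if connection.get("client") == "analytics":
        return "warehouse"
    if "select" in query.lower():  # crude stand-in for a query classifier
        return "sql-store"
    return None  # no dynamic decision; defer to the static rules

def locate_store(dataset, query, connection, default="object-store"):
    """Assumed precedence: dynamic rule, then static rule, then the default."""
    return dynamic_rule(query, connection) or STATIC_RULES.get(dataset, default)

store = locate_store("orders", "SELECT * FROM orders", {"client": "analytics"})
# store is "warehouse"
```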

Monday, February 18, 2019

Today we continue discussing the best practice from storage engineering:

473) Storage products are also prone to growing their test matrix with new devices such as solid-state drives and emerging trends such as IoT.

474) Storage products have to appear limitless to their customers, but their makers cannot predict how they will be used. They will frequently run into cases where customers use them inappropriately and exceed their internal limits, such as the number of policies that can be applied to organizational units.

475) There was a time when content addressable storage was popular. It involved generating a PEA file to save contents that could be looked up by their hash. The use of object storage made it easier to access the objects directly.
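
The idea of looking up content by its hash, as in 475, can be sketched with an in-memory store. The class name and the choice of SHA-256 are assumptions for illustration and do not describe any particular product's format.

```python
import hashlib

class ContentAddressableStore:
    """Toy store in which the address of a blob is the hash of its content."""

    def __init__(self):
        self._blobs = {}  # hex digest -> content bytes

    def put(self, content: bytes) -> str:
        address = hashlib.sha256(content).hexdigest()
        # Identical content always maps to the same address, so it is stored once.
        self._blobs[address] = content
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]

cas = ContentAddressableStore()
address = cas.put(b"quarterly report")
```

Storing the same bytes again returns the same address, which is why such stores suit fixed content that never changes.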

476) Data is increasingly being produced as fixed content. Emails and faxes are examples. The lifecycle of content, moving from system-centric through personal computing and network-centric to content-centric, is progressively longer in duration.

477) Dropping and creating user artifacts helps the user clean up. This is not the case for, say, the system catalog. Still, the storage artifacts used on behalf of the user are the same as the storage artifacts used for the system itself. Creating and dropping such artifacts would be helpful even if they are internal.

478) The retention policy is typically 6 months for email, 3 years for financial data, 5 years for legal. The retention period for object storage is user defined.

479) Object Storage is touted as best for static content. Data that changes often is then said to be preferred in NoSQL or other unstructured storage. With object versioning, API and SDK, this is no longer the case.

480) Data transfers have never been considered virtual storage since they belong to the source. Data in transit can live in queues, caches and object storage, which is good for vectorized execution.