Friday, February 8, 2019

Today we continue discussing best practices from storage engineering:

436) Event monitoring software can accelerate software development and test cycles. Event monitoring data is usually machine data generated by IT systems. Such data enables real-time searches that give insight into the user experience, and dashboards with charts then help analyze it. The data can be accessed over TCP, UDP, and HTTP, and it can also be warehoused for analysis. Frequently recurring issues can be documented and searched more quickly with such data available, leading to faster debugging and problem solving.

437) Data is available to be collected, indexed, searched, and reported on. Applications can target specific interests, such as security, or build correlations for rules and alerts. The data is also varied, coming from the network, from applications, and from enterprise infrastructure. Powerful querying increases the usability of such data.

438) Queries over such key-value data can be written using Pig commands such as load/read, store/write, foreach/iterate, filter/predicate, group/cogroup, collect, join, order, distinct, union, split, stream, dump, and limit; a sketch of such a pipeline follows after this list.

439) Some of the differentiators of such software include: a single platform, fast return on investment, the ability to use different data collectors, support for non-traditional flat-file data stores, the ability to create and modify existing reports, the ability to create baselines and study changes, programmability to retrieve information as appropriate, and the ability to cover compliance, security, fraud detection, and similar use cases.

440) Early-warning notifications, a rules engine, and trend detection are features that not only enhance popular use cases by providing feedback on deployed software but also increase customer satisfaction, since changes can be rolled out incrementally.
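
As a rough illustration of item 438, the following Java sketch mirrors a load, filter, group, aggregate, and dump pipeline over key-value event records using the Streams API; the Event record and its fields are hypothetical stand-ins for data loaded from the event store, not any particular product's schema.

import java.util.*;
import java.util.stream.*;

// Hypothetical event record standing in for a loaded key-value tuple.
record Event(String host, String level, long bytes) {}

public class PigStyleQuery {
    public static void main(String[] args) {
        // "load": in practice this would be read from the event store.
        List<Event> events = List.of(
                new Event("web01", "ERROR", 512),
                new Event("web02", "INFO", 128),
                new Event("web01", "ERROR", 2048));

        // "filter" by level, then "group" by host and aggregate, mirroring
        // FILTER ... BY level == 'ERROR'; GROUP ... BY host; SUM(bytes)
        Map<String, Long> errorBytesPerHost = events.stream()
                .filter(e -> e.level().equals("ERROR"))
                .collect(Collectors.groupingBy(Event::host,
                        Collectors.summingLong(Event::bytes)));

        // "dump": print the result, e.g. {web01=2560}
        System.out.println(errorBytesPerHost);
    }
}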

Thursday, February 7, 2019

Today we continue discussing best practices from storage engineering:

431) There are two flavors of the release consistency model: the sequential consistency flavor and the processor consistency flavor. All of the models in this group allow a processor to read its own write early. However, these two flavors are the only ones whose straightforward implementations allow a read to return the value of another processor's write early. These models distinguish memory operations based on their type and provide stricter ordering constraints for some types of operations.

432) The weak ordering model classifies memory operations into two categories: data operations and synchronization operations. Since the programmer is required to identify the synchronization operations, the model can reorder memory operations between synchronization points without affecting program correctness.

433) The other category of models, those relaxing all program orders, such as Alpha, RMO, and PowerPC, all provide explicit fence instructions as their safety nets. The Alpha model provides two different fence instructions: the memory barrier (MB) and the write memory barrier (WMB). The MB instruction can be used to maintain program order from any memory operation before the MB to any memory operation after the MB. The WMB instruction provides this guarantee only among write operations.

434) The PowerPC model provides a single fence instruction: the SYNC instruction. It is similar to the memory barrier instruction, with one exception: when SYNC is placed between two reads to the same location, the second read may still return the value of an older write than the first read. This model therefore requires read-modify-write semantics to enforce program order in such cases.

435) A key goal of the programmer-centric approach is to define which operations should be distinguished as synchronization. In other words, the user labels the operations of an otherwise sequentially consistent program as either synchronization operations or data operations, as sketched below.
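
As a minimal Java sketch of items 432 and 435: the flag below is labeled as a synchronization operation (volatile), the payload is a plain data operation, and the runtime must not let the data write appear to move past the synchronization write. The class and field names are illustrative only.

public class ProducerConsumer {
    private int payload;                     // data operation
    private volatile boolean ready = false;  // synchronization operation

    void produce(int value) {
        payload = value;   // data write: may be reordered freely among other data writes
        ready = true;      // synchronization write: earlier data writes become visible first
    }

    Integer consume() {
        if (ready) {           // synchronization read
            return payload;    // guaranteed to observe the data write above
        }
        return null;           // not yet produced
    }
}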

#codingexercise:
We were discussing the game of drawing coins from a sequence of coins to maximize our collection against an opponent:


And the implementation for GetCoins3OneTimeDraw, which returns the combined value of three or fewer coins drawn in one turn, can be as follows:
int GetCoins3OneTimeDraw(List<Integer> coins, int i, int j)
{
    if (i > j) return 0;
    if (i == j) return coins.get(i);
    if (j == i + 1) return coins.get(i) + coins.get(j);
    if (j == i + 2) return coins.get(i) + coins.get(i + 1) + coins.get(j);
    // using handpicking of three coins from the two ends
    int option1 = Math.max(
            Math.max(coins.get(i) + coins.get(i + 1) + coins.get(i + 2),
                     coins.get(i) + coins.get(i + 1) + coins.get(j)),
            Math.max(coins.get(i) + coins.get(j - 1) + coins.get(j),
                     coins.get(j - 2) + coins.get(j - 1) + coins.get(j)));
    // using GetCoins2 (the two-coin draw from an earlier post)
    int option2 = Math.max(
            coins.get(i) + GetCoins2(coins, i + 1, j),
            GetCoins2(coins, i, j - 1) + coins.get(j));
    // using GetCoins (the single-coin draw from an earlier post)
    int option3 = Math.max(
            coins.get(i) + coins.get(i + 1) + GetCoins(coins, i + 2, j),
            Math.max(coins.get(i) + GetCoins(coins, i + 1, j - 1) + coins.get(j),
                     GetCoins(coins, i, j - 2) + coins.get(j - 1) + coins.get(j)));
    return Math.max(option1, Math.max(option2, option3));
}


Wednesday, February 6, 2019

Today we continue discussing best practices from storage engineering:

425) Another factor in improving data residency has been compression, but this requires representations that are amenable to the data-processing internals.

426) There can be conflicts during a replication cycle. For example, server A creates an object with a particular name at roughly the same time that server B creates an object with the same name. The conflict-reconciliation process kicks in at the next replication cycle: the servers compare the version numbers of the updates, and the higher version wins the conflict. If the version numbers are the same, the attribute that was changed at the later time wins; a small sketch follows at the end of this list.

427) If an object is moved to a parent that has since been deleted, the object is placed in the lost-and-found container.

428) Single-master replication has the following drawbacks: a single point of failure, geographic distance between the master and the clients performing updates, and less efficient replication because all updates originate from a single location. With multi-master replication these can be avoided, but the masters must be made part of a topology and the way they replicate with each other must be defined.

429) Some replication techniques use background loading where the data is loaded offline before being made available online. This is especially useful when the process of loading can take a long time.

430) The availability of a service is improved by adding a cluster instead of a single server. On the other hand, processes involved in background loading can use a primary server together with secondary servers. In such cases, the primary server is authoritative, but a secondary server can serve the content when the primary is unavailable.
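
Returning to item 426, here is a minimal Java sketch of that reconciliation rule; the Update type and its fields are hypothetical placeholders for the replicated metadata.

// Hypothetical replicated-update metadata and the reconciliation rule.
record Update(long version, long changedAtMillis) {}

class ConflictResolver {
    // Higher version wins; on a tie, the later change wins.
    static Update resolve(Update a, Update b) {
        if (a.version() != b.version()) {
            return a.version() > b.version() ? a : b;
        }
        return a.changedAtMillis() >= b.changedAtMillis() ? a : b;
    }
}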

#codingexercise
In a game of collecting coins of different values from a sequence, is it better to go first or second?
int getPlayer(List<Integer> sequence) {
    int total = sequence.stream().mapToInt(Integer::intValue).sum();
    // best collection the first player can guarantee when both play optimally
    int player1Collection = GetCoins(sequence, 0, sequence.size() - 1);
    // the second player collects whatever the first player leaves behind
    int player2Collection = total - player1Collection;
    List<Integer> collections = Arrays.asList(player1Collection, player2Collection);
    return collections.indexOf(Collections.max(collections));
}
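
For example, assuming GetCoins implements the optimal two-player strategy from the earlier posts, the sequence 8, 15, 3, 7 gives the first player a guaranteed 22 (taking 7, then 15) against 11 for the second player, so getPlayer returns 0, meaning it is better to go first.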

Tuesday, February 5, 2019

Today we continue discussing best practices from storage engineering:

416) Key-value pairs are organized according to the key. Keys in turn are assigned to a partition. Once a key is assigned to a partition, it cannot be moved to a different partition. Because this assignment is not reconfigurable, the number of partitions in a store is decided up front; a sketch of such a fixed assignment follows at the end of this list.

417) The number of storage nodes in use by the store can, however, be changed. When this happens, the store undergoes reconfiguration: the partitions are rebalanced between the old and new shards, with partitions redistributed from one shard to another.

418) The more partitions there are, the finer the granularity available for reconfiguration. It is typical to have ten to twenty partitions per shard. Since the number of partitions cannot be changed afterwards, it is decided at design time.

419) The number of nodes belonging to a shard, called its replication factor, improves throughput. The higher the replication factor, the faster the reads, because more replicas are available to serve them. The same is not true for writes, since more copying is involved. Once the replication factor is set, the storage product takes care of creating the appropriate number of replication nodes for each shard.

420) A topology is a collection of storage nodes, replication nodes, and the associated services. At any point in time, a deployed store has one topology. The initial topology is chosen to minimize the possibility of a single point of failure for any given shard. If a storage node hosts more than one replication node, those replication nodes will not be from the same shard, so if the host machine goes down, the shard can continue to serve reads and writes.
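
Returning to item 416, a minimal Java sketch of a fixed key-to-partition assignment, assuming the partition count was fixed at design time; the hashing scheme here is illustrative rather than that of any particular product.

class Partitioner {
    // Map a key to its permanent partition; numPartitions never changes after design time.
    static int partitionFor(String key, int numPartitions) {
        // Math.floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}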

Monday, February 4, 2019

Today we continue discussing best practices from storage engineering:

409) Object storage provides local real-time writes while supporting a read-dominated workload. Greater geographical distribution and horizontal scalability help improve performance.
410) The objects can live in the cache as well.  A search engine can provide search over the catalog.  Functional data access can be provided by the API. The API and the engine can separately cover all operations on the catalog.
411) Almost every NoSQL database publishes a comparison between its product and its competitors. The differences enumerated between these products, alongside the similarities they retain, show how a product's design affects its position in the Gartner Magic Quadrant. Most of them start with the simple idea of emphasizing one design choice over another.
412) Object storage has similar competitors, differing mainly in their support for distributed and cluster file systems. Object storage is not merely an S3 façade: it brings durability, availability, and content distribution to the storage while enabling multi-protocol access with or without a file system enabled.
413) The catalog can be organized into Item, Variant, Price, Hierarchy, Facet, and Vendor entities in the object store. Applications can search for data via pre-joined objects, either in the cache or in the store, indexed through a search engine built on a Lucene/Solr architecture.
414) All catalog entities such as Item, Variant, Price, Hierarchy, Facet, and Vendor can be represented as key-value collections or document models; a sketch of such a document follows below.
415) The unit of storage in non-relational stores is the key-value collection, and each row can have a different number of columns from a column family. The option for performance and scalability has been to use sharding and partitions.
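
As a rough illustration of item 414, a catalog Item might be stored as a single document keyed by its SKU; the field names and values below are hypothetical.

import java.util.List;
import java.util.Map;

class CatalogItemDocument {
    // Hypothetical document model for a catalog Item, keyed by its SKU.
    static final Map<String, Object> ITEM = Map.of(
            "sku", "SKU-12345",
            "name", "Example shirt",
            "hierarchy", List.of("Apparel", "Shirts"),
            "variants", List.of(
                    Map.of("color", "blue", "size", "M", "price", 19.99),
                    Map.of("color", "blue", "size", "L", "price", 21.99)),
            "facets", Map.of("brand", "Acme", "material", "cotton"));
}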


Sunday, February 3, 2019

Today we continue discussing best practices from storage engineering:

406) Catalogs support data modeling, data synchronization, data standardization, and flexible workflows. Information management with a catalog is layered: print/translation workflows at the bottom layer; workflow and security for access to the assets, their editing, insertions, and bulk insertions; integrations and portals with flexible integration capabilities, full/data exports, and multiple exports; integration portals for imports/exports, data pools, and platforms; a digital asset management layer for asset on-boarding and delivery to channels; and finally data management for searches, saved searches, channel-based content, localized content, and the ability to author variants, categories, attributes, and relationships to stored assets.

407) Relational databases suffer from limitations in supporting catalog workflows for the following reasons:
The field inventory is a local view only until it makes its way to the central store.
The relational store typically relies on a once-a-day or otherwise periodic sync.
Stale views are served until the refresh happens, which is often not fast enough for consumers.
The stale view interferes with analytics and aggregations reports.
Downstream internal and external apps have to work around the delays and stale views with sub-optimal logic.

408) The purpose of the catalog is to form a single view of the product with one central service, a flexible schema, high read volume, tolerance for write spikes during catalog updates, advanced indexing and querying, and geographical distribution for high availability and low latency.

409) Object storage provides local real-time writes while supporting a read-dominated workload. Greater geographical distribution and horizontal scalability help improve performance.

410) The objects can live in the cache as well.  A search engine can provide search over the catalog.  Functional data access can be provided by the API. The API and the engine can separately cover all operations on the catalog.

Saturday, February 2, 2019

Today we continue discussing best practices from storage engineering:

401) A catalog can be maintained as a one-stop shop in a store. There need not be sub-catalogs, fragmentation, ETL, or a message bus.

402) Catalogs can remain equally available to application servers, API data and services, and web servers.

403) Catalogs can also be made available behind the store for supply chain management and data warehouse analytics.

404) Catalogs can be made available for browsing as well as searching via facilitators such as a Lucene search index.

405) Catalogs can support geo-sharding with persisted shard ids, or more granular store ids, to improve high availability; a sketch follows after this list.

406) Catalogs support data modeling, data synchronization, data standardization, and flexible workflows. Information management with a catalog is layered: print/translation workflows at the bottom layer; workflow and security for access to the assets, their editing, insertions, and bulk insertions; integrations and portals with flexible integration capabilities, full/data exports, and multiple exports; integration portals for imports/exports, data pools, and platforms; a digital asset management layer for asset on-boarding and delivery to channels; and finally data management for searches, saved searches, channel-based content, localized content, and the ability to author variants, categories, attributes, and relationships to stored assets.
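
As a minimal sketch of item 405, a persisted shard id could be derived from the region and, at finer granularity, the store id; the helper and its naming convention are purely illustrative rather than any particular product's scheme.

class GeoSharding {
    // Derive a stable shard id from the region and, at finer granularity, the store id.
    static String shardIdFor(String region, String storeId, int shardsPerRegion) {
        int bucket = Math.floorMod(storeId.hashCode(), shardsPerRegion);
        return region + "-" + bucket;   // e.g. "us-west-3", persisted alongside the record
    }
}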