Cluster computing

Monday, February 11, 2019

Today we continue discussing the best practice from storage engineering:

449) Object storage can serve as the storage for graph databases and object databases. Object storage then transforms from a passive storage layer into one that actively builds metadata, maintains organization, and rebuilds indexes from objects rather than from files.

450) File systems have long been the destination for storing artifacts on disk. While the file system has evolved to stretch over clusters and not just remote servers, it remains inadequate as blob storage. Data writers have to organize and interpret their files themselves while frequently relying on metadata stored separately from the files.

451) Files also tend to become binaries with proprietary interpretations. Files can only be bundled in an archive, and there is no object-oriented design over the data. If the storage supported organizational units in terms of objects, without requiring hierarchical declarations and with support for is-a or has-a relationships, it would become more usable than files. This modular storage enhances the use of object storage and does not compete with the usages of elastic file stores.

452) Such object storage will find niche usage in spatial databases, telecommunications and scientific computing, which require large-scale use of elementary organizational units that are not necessarily related. For example, spatial databases use polygons as a unit of organization and store large numbers of them.

453) Traditional relational databases have long been accepted for storing data that requires interpretation. However, the chores of converting data to a structured form that is amenable to querying can be relaxed with native support for rich non-hierarchical data organization in the storage layer and a transition to a different class of unstructured storage.


Sunday, February 10, 2019

Today we continue discussing the best practice from storage engineering:

443) Sections of a file can be locked for multi-process access, and sections of a file can even be mapped into virtual memory. The latter is called memory mapping and it enables multiple processes to share the data. Each sharing process's virtual memory map points to the same page of physical memory: the page that holds a copy of the disk block.
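
As a rough illustration of the sharing described above, here is a minimal Python sketch using the standard mmap module. It is only a stand-in: both mappings live in one process, whereas in practice each would belong to a different process. The scratch file and its contents are made up for the example.

```python
import mmap
import os
import tempfile

# Create a small scratch file to stand in for a shared data file (hypothetical content).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")
os.close(fd)

with open(path, "r+b") as f1, open(path, "r+b") as f2:
    # Two independent mappings of the same file, as two processes would hold.
    view_a = mmap.mmap(f1.fileno(), 0)
    view_b = mmap.mmap(f2.fileno(), 0)

    view_a[0:5] = b"HELLO"            # write through the first mapping
    seen_by_b = bytes(view_b[0:5])    # read the same bytes through the second mapping

    view_a.close()
    view_b.close()

os.remove(path)
```

The write made through one mapping is visible through the other because both are backed by the same pages of the file.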

444) File structure depends on the file type. Internal file structure is operating system dependent. Disk access is done in units of blocks. Since logical records vary in size, several of them are packed into a single physical block; for example, a system may treat a file as a stream of bytes, so that the logical record is one byte. The logical record size, the physical block size and the packing technique determine how many logical records are in each physical block. There are three major allocation methods: contiguous, linked and indexed. Internal fragmentation is a common occurrence, caused by the bytes wasted in the last block of a file.
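
The arithmetic behind packing and internal fragmentation can be sketched as follows. This is a simplified illustration that assumes fixed-size records which never span two blocks; the function names are made up for the example.

```python
def pack_records(record_size, block_size):
    """Return (records per block, wasted bytes per block) for fixed-size records."""
    if record_size <= 0 or block_size < record_size:
        raise ValueError("record_size must be positive and fit in one block")
    records_per_block = block_size // record_size
    wasted = block_size - records_per_block * record_size
    return records_per_block, wasted


def blocks_needed(file_size, block_size):
    """Return (blocks allocated, unused bytes in the last block)."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks, blocks * block_size - file_size
```

For instance, 100-byte records in a 512-byte block give five records and 12 wasted bytes per block, and a 1,949-byte file on 512-byte blocks occupies four blocks with 99 bytes of internal fragmentation.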

445) Access methods are either sequential or direct. The block number is relative to the beginning of the file. The use of relative block numbers lets the operating system decide where the file should be placed and helps to prevent users from accessing portions of the file system that are not part of their files.

446) A file system is broken into partitions. Each disk on the system contains at least one partition. Partitions are like separate devices or virtual disks. Each partition contains information about the files within it; this information is referred to as the directory structure. The directory can be viewed as a symbol table that translates file names into their directory entries. Directories are represented as a tree structure. Each user has a current directory. A tree structure prohibits the sharing of files or directories. An acyclic graph allows directories to have shared sub-directories and files. Sharing means there is one actual file and changes made by one user are visible to the other. Shared files can be implemented via a symbolic link, which is resolved via the path name. If cycles are allowed in the directory graph, garbage collection may be necessary to reclaim files that are no longer reachable.

447) Protection involves access lists and groups. Consistency is maintained by bracketing a series of file accesses between open and close operations.

448) The file system is layered in the following manner:
1) application programs, 2) logical file system, 3) file-organization module, 4) basic file system, 5) I/O control and 6) devices. The last layer is the hardware. The I/O control layer consists of device drivers and interrupt handlers. The basic file system issues generic commands to the appropriate device driver. The file-organization module knows about files and their logical blocks. The logical file system uses the directory structure to inform the file-organization module. The application program is responsible for creating and deleting files.

#codingexercise:

A player can draw a coin from either end of a sequence. Determine the winning strategy:

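int GetCoinsDP(List<int> coins, int i, int j)
{
    if (i > j) return 0;                                   // empty range
    if (i == j) return coins[i];                           // one coin left
    if (j == i + 1) return Math.Max(coins[i], coins[j]);   // two coins left: take the larger
    // Take a coin from either end, then recurse on the remaining range.
    return Math.Max(
        coins[i] + GetCoinsDP(coins, i + 1, j),
        coins[j] + GetCoinsDP(coins, i, j - 1));
}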
The selections at each recursion level of GetCoinsDP can be added to a list, and alternate additions can be skipped as belonging to the other player since the method remains the same for both.
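
For comparison, here is a short Python sketch of the textbook formulation of this game, in which the opponent's best reply is subtracted explicitly instead of skipping alternate selections afterwards. This is a swapped-in standard technique, not the method above, and the name best_total is made up for the example.

```python
from functools import lru_cache


def best_total(coins):
    """Largest total the player to move can guarantee when both players play optimally."""
    coins = tuple(coins)
    # prefix[k] is the sum of the first k coins, so any range sum is one subtraction.
    prefix = [0]
    for c in coins:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def best(i, j):
        if i > j:
            return 0
        remaining = prefix[j + 1] - prefix[i]
        # After we take one end, the opponent plays optimally on what is left;
        # we keep everything in the range except the opponent's best total.
        return remaining - min(best(i + 1, j), best(i, j - 1))

    return best(0, len(coins) - 1)
```

With the coins 8, 15, 3, 7 the first player can guarantee 22: take 7, and whichever end the opponent takes next, the 15 can still be secured.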

# potential trend with object storage:

https://1drv.ms/w/s!Ashlm-Nw-wnWuQB7rNqURAxQf9hF

Saturday, February 9, 2019

Today we continue discussing the best practice from storage engineering:

441) File-Systems continue to be a good source of organizational information on storage systems. File attributes include name, type, location, size, protection, and time, date and user identification. Operations supported are creating a file, writing a file, reading a file, repositioning within a file, deleting a file, and truncating a file.

442) Data structures include two levels of internal tables. The first is a per-process table of all the files that the process has opened. It points to the location inside a file where data is to be read or written. This table is arranged by file handle and has the name, permissions, access dates and a pointer to the disk block. The other is a system-wide table with the open count, file pointer, and disk location of the file.
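
A toy model of the two tables might look like the following Python sketch. The class and field names are assumptions made for illustration, and the layout is simplified: the current position is kept only in the per-process table here.

```python
class SystemOpenFileTable:
    """System-wide table: one entry per open file, with an open count."""

    def __init__(self):
        self.entries = {}  # path -> {"open_count": int, "disk_location": int}

    def open(self, path, disk_location):
        entry = self.entries.setdefault(
            path, {"open_count": 0, "disk_location": disk_location})
        entry["open_count"] += 1
        return entry

    def close(self, path):
        entry = self.entries[path]
        entry["open_count"] -= 1
        if entry["open_count"] == 0:
            del self.entries[path]  # the last close removes the entry


class Process:
    """Per-process table: file handle -> path and current position in the file."""

    def __init__(self, system_table):
        self.system_table = system_table
        self.handles = {}
        self.next_handle = 0

    def open(self, path, disk_location=0):
        self.system_table.open(path, disk_location)
        handle = self.next_handle
        self.next_handle += 1
        self.handles[handle] = {"path": path, "offset": 0}
        return handle

    def seek(self, handle, offset):
        self.handles[handle]["offset"] = offset

    def close(self, handle):
        path = self.handles.pop(handle)["path"]
        self.system_table.close(path)
```

Two processes that open the same file each get their own handle and position, while the system-wide entry simply counts how many opens are outstanding.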

#codingexercise:
In a game where two players take turns drawing two coins from the ends of a sequence, determine the strategy to win:

int GetCoins2DP(List<int> coins, int i, int j)
{
    if (i > j) return 0;
    if (i == j) return coins[i];
    if (j == i + 1) return coins[i] + coins[j];
    // Draw two coins: both from the left, both from the right, or one from each end.
    return Math.Max(
        coins[i] + coins[i + 1] + GetCoins2DP(coins, i + 2, j),
        Math.Max(
            coins[j] + coins[j - 1] + GetCoins2DP(coins, i, j - 2),
            coins[i] + coins[j] + GetCoins2DP(coins, i + 1, j - 1)));
}

The selections at each recursion level of GetCoins2DP can be added to a list, and alternate additions to this list can be skipped as belonging to the other player since the method remains the same for both. This then helps to determine whether to go first or second.

Friday, February 8, 2019

Today we continue discussing the best practice from storage engineering:

436) Event monitoring software can accelerate software development and test cycles. Event monitoring data is usually machine data generated by the IT systems. Such data can enable real-time searches to gain insights into user experience. Dashboards with charts can then help analyze the data. This data can be accessed over TCP, UDP and HTTP. Data can also be warehoused for analysis. Issues that frequently recur can be documented and searched more quickly with the availability of such data leading to faster debugging and problem solving.

437) Data is available to be collected, indexed, searched and reported. Applications can target specific interests such as security or correlations for building rules and alerts. The data is also varied: it comes from the network, from applications, and from enterprise infrastructure. Powerful querying increases the usability of such data.

438) Queries for such key-value data can be written using Pig commands such as load/read, store/write, foreach/iterate, filter/predicate, group-cogroup, collect, join, order, distinct, union, split, stream, dump and limit.
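
To give a feel for how a few of these operators compose, here is a plain Python sketch of a filter, group, foreach, order and limit pipeline. It is not Pig itself, only a rough equivalent, and the event records and field names are invented for the example.

```python
from collections import defaultdict

# Hypothetical machine-data events; the fields are made up for illustration.
events = [
    {"host": "web1", "status": 500},
    {"host": "web1", "status": 200},
    {"host": "web2", "status": 503},
    {"host": "web1", "status": 502},
    {"host": "web3", "status": 500},
    {"host": "web2", "status": 200},
]


def top_error_hosts(records, limit):
    # filter: keep only server errors
    errors = [r for r in records if r["status"] >= 500]
    # group: bucket the filtered records by host
    by_host = defaultdict(list)
    for r in errors:
        by_host[r["host"]].append(r)
    # foreach: produce one (host, count) pair per group
    counts = [(host, len(rows)) for host, rows in by_host.items()]
    # order + limit: highest counts first, ties broken by host name
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:limit]


top_two = top_error_hosts(events, 2)
```

Each step consumes the output of the previous one, which is the same dataflow style that these commands encourage.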

439) Some of the differentiators of such software include the ability to have one platform, a fast return on investment, the ability to use different data collectors, the use of non-traditional flat-file data stores, the ability to create and modify reports, the ability to create baselines and study changes, the programmability to retrieve information as appropriate, and the ability to include compliance, security, fraud detection and so on.

440) Early warning notifications, running a rules engine, and detecting trends are some of the features that not only enhance popular use cases by providing feedback on deployed software but also increase customer satisfaction, as changes are incremental.

Thursday, February 7, 2019

Today we continue discussing the best practice from storage engineering:

431) There are two flavors of the release consistency model: the sequential consistency flavor (RCsc) and the processor consistency flavor (RCpc). All of the models in this group allow a processor to read its own write early. However, RCpc and PowerPC are the only ones whose straightforward implementations allow a read to return the value of another processor's write early. These models distinguish memory operations based on their type and provide stricter ordering constraints for some types of operations.

432) The weak ordering model classifies memory operations into two categories: data operations and synchronization operations. To enforce program order between two operations, the programmer is required to identify at least one of them as a synchronization operation, so the model can reorder memory operations between these synchronization operations without affecting program correctness.

433) The models in the other category, which relax all program orders, such as Alpha, RMO and PowerPC, all provide explicit fence instructions as their safety nets. The Alpha model provides two different fence instructions: the memory barrier and the write memory barrier. The memory barrier (MB) instruction can be used to maintain program order from any memory operation before the MB to any memory operation after the MB. The write memory barrier instruction provides this guarantee only among write operations.

434) The PowerPC model provides a single fence instruction: the SYNC instruction. This is similar to the memory barrier instruction, with the exception that even when a SYNC is placed between two reads to the same location, the second read may return the value of an older write than the first read. This model may therefore require read-modify-write operations to enforce program order between two reads.

435) A key goal of the programmer-centric approach is to define the operations that should be distinguished as synchronization. In other words, the operations in a user's program are categorized either as synchronization operations or as data operations, in an otherwise sequentially consistent program.

#codingexercise:
We were discussing the game of drawing coins from a sequence of coins to maximize our collection against an opponent:

The implementation for GetCoins3, where we return the combined value of three or fewer coins, can be as follows:
// Requires: using System.Linq;
// GetCoins and GetCoins2 are the one-coin and two-coin helpers, assumed to be defined elsewhere.
int GetCoins3OneTimeDraw(List<int> coins, int i, int j)
{
    if (i > j) return 0;
    if (i == j) return coins[i];
    if (j == i + 1) return coins[i] + coins[j];
    if (j == i + 2) return coins[i] + coins[i + 1] + coins[j];
    // Option 1: hand-pick three coins from the ends.
    var option1 = new[] {
        coins[i] + coins[i + 1] + coins[i + 2],
        coins[i] + coins[i + 1] + coins[j],
        coins[i] + coins[j - 1] + coins[j],
        coins[j - 2] + coins[j - 1] + coins[j] }.Max();
    // Option 2: take one coin, then use GetCoins2 on the rest.
    var option2 = Math.Max(
        coins[i] + GetCoins2(coins, i + 1, j),
        GetCoins2(coins, i, j - 1) + coins[j]);
    // Option 3: take two coins, then use GetCoins on the rest.
    var option3 = new[] {
        coins[i] + coins[i + 1] + GetCoins(coins, i + 2, j),
        coins[i] + GetCoins(coins, i + 1, j - 1) + coins[j],
        GetCoins(coins, i, j - 2) + coins[j - 1] + coins[j] }.Max();
    return new[] { option1, option2, option3 }.Max();
}

Wednesday, February 6, 2019

Today we continue discussing the best practice from storage engineering:

425) Another factor in improving data residency has been compression, but this has required representations that are amenable to the data processing internals.

426) There can be conflicts during a replication cycle. For example, server A creates an object with a particular name at roughly the same time that server B creates an object with the same name. The conflict reconciliation process kicks in at the next replication cycle. The server compares the version numbers of the updates, and whichever is higher wins the conflict. If the version numbers are the same, whichever attribute was changed at a later time wins the conflict.
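
The reconciliation rule above can be sketched in a few lines of Python. The Update record and its field names are assumptions for illustration; the text does not say what happens when both the version and the change time are equal, so that case is left to fall back on the first argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    server: str        # originating server (hypothetical field)
    version: int       # version number of the update
    changed_at: float  # time of the change, e.g. seconds since the epoch


def resolve(a, b):
    """Higher version wins; with equal versions, the later change wins."""
    # max() returns the first argument on a complete tie (not specified by the text).
    return max(a, b, key=lambda u: (u.version, u.changed_at))
```

For example, an update at version 3 beats one at version 2 regardless of timestamps, while two version-3 updates are decided by whichever changed later.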

427) If an object is moved to a parent that is now deleted, that object is placed in the lost and found container.

428) Single-master replication has the following drawbacks: it has a single point of failure, there is geographic distance between the master and the clients performing the updates, and replication is less efficient because updates originate from a single location. These can be avoided with multi-master replication, but the masters must be made part of a topology and the way they replicate with each other must be defined.

429) Some replication techniques use background loading where the data is loaded offline before being made available online. This is especially useful when the process of loading can take a long time.

430) Availability of a service is improved by adding a cluster instead of a server. On the other hand, processes involved in background loading can use a primary server together with secondary servers. In such cases, the primary server is authoritative but a secondary server can serve the content when the primary is unavailable.

#codingexercise
In a game of collecting coins of different values from a sequence, is it better to go first or second?

// Returns 0 if going first collects more, 1 if going second does.
int getPlayer(List<Integer> sequence) {
    int player1Collection = GetCoins(sequence, 0, sequence.size() - 1);
    int player2Collection = Math.max(
        GetCoins(sequence, 1, sequence.size() - 1),
        GetCoins(sequence, 0, sequence.size() - 2));
    List<Integer> collections = Arrays.asList(player1Collection, player2Collection);
    int max = Collections.max(collections);
    return collections.indexOf(max);
}
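
Another way to answer the first-or-second question is to compute the best score margin directly. The Python sketch below uses the standard score-difference dynamic program for the one-coin-per-turn game with optimal play on both sides. It is a swapped-in technique, not the method above, and the function names are made up for the example.

```python
def first_player_margin(coins):
    """Best achievable (own total - opponent total) for the player who moves first."""
    n = len(coins)
    if n == 0:
        return 0
    # margin[j] holds the result for the range i..j while i moves from right to left.
    margin = list(coins)
    for i in range(n - 2, -1, -1):
        for j in range(i + 1, n):
            # Take the left coin or the right coin; the opponent then has the
            # margin of the remaining range, which counts against us.
            margin[j] = max(coins[i] - margin[j], coins[j] - margin[j - 1])
    return margin[n - 1]


def better_seat(coins):
    """Return 'first', 'second', or 'either' depending on the sign of the margin."""
    m = first_player_margin(coins)
    if m > 0:
        return "first"
    if m < 0:
        return "second"
    return "either"
```

With the coins 1, 3, 1 the first player must open up the 3 for the opponent, so going second is better; with 8, 15, 3, 7 the first player wins by 11.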

Tuesday, February 5, 2019

Today we continue discussing the best practice from storage engineering:

416) Key-value pairs are organized according to the key. Keys in turn are assigned to a partition. Once a key is assigned to a partition, it cannot be moved to a different partition. Since this is not configurable afterwards, the number of partitions in a store is decided upfront.

417) The number of storage nodes in use by the store can, however, be changed. When this happens, the store undergoes reconfiguration: partitions are rebalanced between new and old shards by redistributing them from one shard to another.

418) The greater the number of partitions, the finer the granularity of the reconfiguration. It is typical to have ten to twenty partitions per shard. Since the number of partitions cannot be changed afterwards, it is decided at design time.
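
The two levels of mapping described in the last three points can be sketched in Python as follows. This is only an illustration under stated assumptions: the hash function, the round-robin assignment and the function names are chosen for the example and do not describe how any particular product does it.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition; this never changes once num_partitions is fixed."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def assign_partitions(num_partitions, num_shards):
    """Spread partitions across shards as evenly as possible (simple round-robin)."""
    return {p: p % num_shards for p in range(num_partitions)}


NUM_PARTITIONS = 30                              # decided upfront, e.g. ten per shard
before = assign_partitions(NUM_PARTITIONS, 3)    # three shards initially
after = assign_partitions(NUM_PARTITIONS, 4)     # a shard is added later

p = partition_for("user:1001", NUM_PARTITIONS)
shard_before, shard_after = before[p], after[p]
```

The key's partition stays the same across the reconfiguration; only the partition's shard may change. A real store would likely try to move as few partitions as possible, which this round-robin sketch does not attempt.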

419) The number of nodes belonging to a shard, called its replication factor, improves read throughput. The higher the replication factor, the faster the reads, because more replicas are available to serve them. The same is not true for writes since there is more copying involved. Once the replication factor is set, the storage product takes care of creating the appropriate number of replication nodes for each shard.

420) A topology is a collection of storage nodes, replication nodes, and the associated services. At any point in time, a deployed store has one topology. The initial topology is such that it minimizes the possibility of a single point of failure for any given shard. If a storage node hosts more than one replication node, those replication nodes will not be from the same shard. That way, if the host machine goes down, each affected shard can continue to serve reads and writes from its other replication nodes.
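
The placement constraint above, that no storage node hosts two replication nodes of the same shard, can be sketched in Python as follows. The round-robin scheme and the function name are assumptions for illustration, not the algorithm of any particular store.

```python
def place_replicas(num_shards, replication_factor, num_storage_nodes):
    """Return {storage node: [(shard, replica), ...]} with no shard repeated on a node."""
    if replication_factor > num_storage_nodes:
        raise ValueError("need at least as many storage nodes as the replication factor")
    placement = {node: [] for node in range(num_storage_nodes)}
    slot = 0
    for shard in range(num_shards):
        for replica in range(replication_factor):
            # Consecutive slots land on distinct nodes because
            # replication_factor <= num_storage_nodes.
            placement[slot % num_storage_nodes].append((shard, replica))
            slot += 1
    return placement


topology = place_replicas(num_shards=3, replication_factor=3, num_storage_nodes=5)
```

Losing any single storage node in this layout removes at most one replication node from each shard, so every shard keeps its remaining replicas.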