Monday, January 28, 2019

Today we continue discussing the best practices from storage engineering:

378) Finance data is also heavily regulated. The Sarbanes-Oxley Act sought to bring controls to corporations in order to avoid accounting scandals. It specified disclosure controls, audits, and compliance terms.
379) ETL tools are widely used for in-house finance data. On the other hand, global indexes, ticker prices, and trades are constantly polled or refreshed. In these cases, it is hard to unify storage across departments, companies, and industries, so a messaging framework is preferred instead. Object storage could be put to use in these global stores, but its adoption and usage across companies is harder to enforce.
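As a minimal sketch of the messaging approach, assuming hypothetical TickerUpdate and TickerBus types (not from any specific product), a polling job can publish refreshed ticker prices onto a queue that downstream consumers drain, rather than forcing every consumer onto one shared store:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration: a polled ticker feed fans out through a message bus
// instead of requiring all departments to agree on a unified storage system.
class TickerUpdate {
    final String symbol;
    final double price;
    final long timestamp;
    TickerUpdate(String symbol, double price, long timestamp) {
        this.symbol = symbol;
        this.price = price;
        this.timestamp = timestamp;
    }
}

class TickerBus {
    private final BlockingQueue<TickerUpdate> queue = new LinkedBlockingQueue<>();
    // producer side: the job that polls the global index or exchange feed
    void publish(TickerUpdate update) { queue.add(update); }
    // consumer side: each downstream system takes updates at its own pace
    TickerUpdate take() throws InterruptedException { return queue.take(); }
}

A real deployment would put a durable broker in place of the in-process queue, but the decoupling between the feed and its consumers is the same.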

380) Web services and gateways are preferred for distributing data, and different kinds of finance data processing systems evolve downstream. These systems tend to have their own storage while transforming and contributing to the data in flight. The web services are also popular as data sources for parallel analysis. Since distribution and flow are more important for this data, the origin is considered the source of truth and data changes are not propagated back to the origin.

381) Health industry data is another example with its own needs around data compliance. The Health Insurance Portability and Accountability Act requires extensive controls around who can get access to personally identifiable information, and when and where.

382) Health data is often tightly integrated into proprietary stacks and organizations. Yet these organizations are also required to provide web access to all the information surrounding an individual in one place. This makes them require virtualized, cross-company data storage.

383) Health data is not just sharded by customer but also maintained in isolated, shared-nothing pockets with their own management systems. Integrating the data to represent a whole for the same customer is a new and emerging trend in the health industry. Organizations and companies are looking to converge the data for an individual without losing privacy or failing to comply with government regulations.

Sunday, January 27, 2019

Today we continue discussing the best practices from storage engineering:

375) Most storage products don’t differentiate between human and machine data because it involves upper layers of data management. However, dedicated differentiation between human and machine data can make the products more customized for these purposes.

376) Data storage requirements change from industry to industry. Finance data storage is largely in the form of distributed indexes and continuous data transfers. A cold storage product does not serve its needs even if the objects are accessible over the web.

377) Finance data is subject to many calculations by proprietary and often well-guarded calculators that have largely relied on relational databases. Yet these same companies have also adopted NoSQL storage alongside their warehouses. As their portfolios grow, they incubate new and emerging features and increasingly favor new technologies.

378) Finance data is also heavily regulated. The Sarbanes-Oxley Act sought to bring controls to corporations in order to avoid accounting scandals. It specified disclosure controls, audits, and compliance terms.

379) Health industry data is another example with its own needs around data compliance. The Health Insurance Portability and Accountability Act requires extensive controls around who can get access to personally identifiable information, and when and where.

380) Health data is often tightly integrated into proprietary stacks and organizations. Yet these organizations are also required to provide web access to all the information surrounding an individual in one place. This makes them require virtualized, cross-company data storage.

Saturday, January 26, 2019


Today we continue discussing the best practices from storage engineering:
371) Data management software such as Cloudera can be deployed and run on any cloud. It offers an enterprise data hub, an analytics DB, an operational DB, data science and engineering, and essentials. It is elastic and flexible, it has high-performance analytics, it can easily provision over multiple clouds, and it can be used for automated metering and billing. Essentially, it allows different data models, real-time data pipelines, and streaming applications with its big data platform. It enables data models to break free from vendor lock-in, with the flexibility to let them be community-defined.

372) The data science workbench offered by Cloudera involves a console in a web browser where users authenticate themselves using Kerberos against the cluster KDC. Engines are spun up and can connect seamlessly with Spark, Hive, and Impala. The engines are spun up based on engine kernels and profiles.

373) Cloudera Data Science Workbench uses Docker and Kubernetes. Cloudera is supported on dedicated Hadoop hosts. Cloudera also adds a data engineering service called Altus. It is a platform that works against a cloud by allowing clusters to be set up and torn down and jobs to be submitted to those clusters. Clusters may be Apache Spark, MR2, or Hive.

374) Containerization technologies and Backend-as-a-Service (aka lambda functions) can also be supported by products such as Cloudera, which makes them usable with existing public clouds while offering an on-premise solution.

375) Most storage products don’t differentiate between human and machine data because it involves upper layers of data management. However, dedicated differentiation between human and machine data can make the products more customized for these purposes.


Friday, January 25, 2019

Today we continue discussing the best practices from storage engineering:

366) Sometimes performance drives the necessity to create other storage products. Social engineering utilizes storage products that are not typical of enterprise or cloud storage. This neither means that social engineering applications cannot be built on cloud services, nor that on-premise storage products necessarily have to conform to organizational hardware or virtualware needs.

367) To improve performance and scalability, Facebook had to introduce additional parallelization in the runtime and the shared contexts, which they called "WorkerContext". Bottlenecks and overheads such as checkpointing were addressed by scheduling at a finer level than what the infrastructure provided.

368) Facebook even optimized the memory utilization of the graph infrastructure because it allowed arbitrary vertex id, vertex value, edge, and message classes. They did this by (1) serializing edges as byte arrays and (2) serializing messages on the server.
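A rough sketch of the byte-array idea, with hypothetical class names rather than Facebook's actual code: packing a vertex's out-edges into one primitive byte array avoids the per-object overhead of keeping a separate edge object for every edge.

import java.nio.ByteBuffer;

// Hypothetical illustration: a vertex's out-edges stored as one packed byte array
// (8-byte target id followed by a 4-byte weight per edge).
class PackedEdges {
    private static final int STRIDE = Long.BYTES + Float.BYTES;
    private final byte[] data;

    PackedEdges(long[] targets, float[] weights) {
        ByteBuffer buf = ByteBuffer.allocate(targets.length * STRIDE);
        for (int i = 0; i < targets.length; i++) {
            buf.putLong(targets[i]);
            buf.putFloat(weights[i]);
        }
        data = buf.array();
    }

    long target(int i) { return ByteBuffer.wrap(data).getLong(i * STRIDE); }
    float weight(int i) { return ByteBuffer.wrap(data).getFloat(i * STRIDE + Long.BYTES); }
    int size() { return data.length / STRIDE; }
}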

369) Facebook improved parallelization with sharded aggregators that provided efficient shared state across workers. With this approach, each aggregator gets assigned to a randomly picked worker, which then gathers the values, performs the aggregation, and distributes the final values to the master and other workers. This distributes the load that was otherwise entirely on the master.
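A minimal sketch of the assignment step, using a hypothetical helper rather than Giraph's actual API. The description above picks workers randomly; a hash of the aggregator name is used here only to keep the sketch deterministic, and the effect is the same: the gathering and combining work is spread across workers instead of funneled through the master.

import java.util.List;

// Hypothetical illustration: each named aggregator is owned by one worker,
// chosen from its name, so the aggregation load is sharded across workers.
class ShardedAggregators {
    static int ownerWorker(String aggregatorName, List<String> workers) {
        return Math.floorMod(aggregatorName.hashCode(), workers.size());
    }
}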

370) Many companies view graphs as an abstraction rather than an implementation of the underlying database. There are two reasons for this:
First, key-value stores suffice to capture the same information as a graph and can provide flexibility and speed for operations that can be translated into queries on these stores. These can then be specialized for the top graph features that an application needs (see the sketch after this item).
Second, different organizations within the company require different stacks built on the same logical data for reasons such as business impact, modular design, and ease of maintenance.
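As a minimal sketch of the first point, assuming a plain in-memory map stands in for the key-value store: the graph is just a naming convention over keys and values, and graph operations translate into gets and puts.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a graph abstraction layered over a key-value store.
// The key encodes the vertex ("adj:" + vertex id); the value is its adjacency list.
class KeyValueGraph {
    private final Map<String, List<String>> store = new HashMap<>(); // stand-in for any KV store

    void addEdge(String from, String to) {
        store.computeIfAbsent("adj:" + from, k -> new ArrayList<>()).add(to);
    }

    List<String> neighbors(String vertex) {
        return store.getOrDefault("adj:" + vertex, new ArrayList<>());
    }
}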

Thursday, January 24, 2019

Today we continue discussing the best practices from storage engineering:

360) A rewarding service increases appeal and usage by customers. This is what makes Blockchain popular.

361) Many database clusters are used as failover clusters and not as performance or scalable clusters. This is primarily because the storage server is designed for scale-up versus scale-out. This is merely an emphasis on the judicious choice of technology for design.

362) Some database servers use a SAN, which is shared storage and does not go offline. Most storage products have embraced Network-Attached Storage. Similar considerations are true when the database server is hosted in a container and the database is on a shared volume.

363) Clustering does not save space or effort for backup or maintenance. And it does not scale out reads for the database. Moreover, it does not give 100% uptime.

364) The reporting stack is usually a pull-and-transform operation on any database and is generally independent of the data manipulation from online transactions. Therefore, if a storage product can simplify its design by offloading the reporting stack to, say, a time-series database, Grafana, and a charting stack, then it can focus on storage-server-related design.
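As a rough sketch of this separation, with a placeholder endpoint and payload format that are assumptions rather than any particular product's API: the storage server only pushes raw counters to an external time-series endpoint, and reporting and charting happen downstream in a stack such as Grafana.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical illustration: push a raw counter to an external time-series store;
// the endpoint URL and line format below are placeholders.
class MetricsPusher {
    private final HttpClient client = HttpClient.newHttpClient();

    void push(String metric, double value, long timestampMillis) throws Exception {
        String line = metric + " value=" + value + " " + timestampMillis;
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://tsdb.example.local/write"))
                .POST(HttpRequest.BodyPublishers.ofString(line))
                .build();
        client.send(request, HttpResponse.BodyHandlers.discarding());
    }
}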

365) The previous point is not necessarily true for analysis stacks, which often produce a large number of artifacts during computations and as such are heavily engaged in read-write on the same storage stack.

Wednesday, January 23, 2019

Today we continue discussing the best practices from storage engineering:

355) P2P is considered a top-heavy network. Top-heavy means we have an inverted pyramid of layers where the bottom layer is the network layer.  This is the substrate that connects different peers.  The overlay nodes management layer handles the management of these peers in terms of routing, location lookup and resource discovery. The layer on top of this is the features management layer which involves security management, resource management, reliability and fault resiliency.

356) Let us take a look at contracts. We discussed this earlier in reference to Merkle trees.  A smart contract can even include code and evaluate dynamically just like a classifier with rules in program order.
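As a rough sketch of that idea, with hypothetical types and no particular blockchain API assumed: the contract is a set of predicate rules kept in program order, and evaluation returns the outcome of the first rule that matches.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration: a contract evaluated like a rule-based classifier.
// Rules are kept in insertion (program) order; the first matching rule decides the outcome.
class RuleContract {
    private final Map<String, Predicate<Map<String, Object>>> rules = new LinkedHashMap<>();

    void addRule(String outcome, Predicate<Map<String, Object>> condition) {
        rules.put(outcome, condition);
    }

    String evaluate(Map<String, Object> context) {
        for (Map.Entry<String, Predicate<Map<String, Object>>> rule : rules.entrySet()) {
            if (rule.getValue().test(context)) {
                return rule.getKey();
            }
        }
        return "no-op";
    }
}

For example, contract.addRule("release-escrow", ctx -> Boolean.TRUE.equals(ctx.get("paymentConfirmed"))) fires only when no earlier rule has already matched.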

357) A utility token issued for authorization need not be opaque. It can be made up of three parts: one which represents the user, another which represents the provider, and an irrefutable part that is like a stamp of authority. The section from the provider may also include the scope of the resources authorized.
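A minimal sketch of such a three-part token, where the formatting and the hash-based stamp are assumptions made purely for illustration (a real implementation would use a proper signature):

// Hypothetical illustration: a non-opaque utility token with a user section,
// a provider section that carries the authorized scope, and a stamp over both.
class UtilityToken {
    static String issue(String userPart, String providerPart, String scope) {
        String body = userPart + "." + providerPart + ":" + scope;
        String stamp = Integer.toHexString(body.hashCode()); // stand-in for a real signature
        return body + "." + stamp;
    }
}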

358) The attestation mechanism itself might vary. It might include, for example, a Merkle tree where each node of the tree represents an element of Personally Identifiable Information (PII) along with its hash and the hash of the hashes of its child nodes. The root hash then becomes the fingerprint of the data being attested.
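A minimal sketch of that fingerprinting, assuming SHA-256 and a simple pairwise fold (real Merkle tree implementations vary in how they pad odd levels):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: compute a Merkle root over PII elements by hashing
// each leaf and then repeatedly hashing pairs of child hashes up to a single root.
class MerkleFingerprint {
    static byte[] sha256(byte[] input) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(input);
    }

    static byte[] root(List<String> piiElements) throws Exception {
        List<byte[]> level = new ArrayList<>();
        for (String element : piiElements) {
            level.add(sha256(element.getBytes(StandardCharsets.UTF_8)));
        }
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left; // duplicate last node on odd levels
                byte[] combined = new byte[left.length + right.length];
                System.arraycopy(left, 0, combined, 0, left.length);
                System.arraycopy(right, 0, combined, left.length, right.length);
                next.add(sha256(combined));
            }
            level = next;
        }
        return level.get(0);
    }
}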

359) An immutable record whose integrity is checked and agreed upon on an ongoing basis provides a venerable source of truth.

360) A rewarding service increases appeal and usage by customers. This is what makes Blockchain popular.


List<List<Integer>> getAllArrangements(List<List<Integer>> selected) {
    List<List<Integer>> ret = new ArrayList<>();
    // Deduplicate the selected lists, then expand each into all of its permutations.
    selected.stream().distinct().forEach(x -> ret.addAll(getPermutations(x)));
    return ret;
}
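The helper getPermutations is not shown in the post; a straightforward recursive version, written here only as an assumption of what it might look like, is:

// Assumed helper (not shown above): all permutations of a list, built recursively
// by picking each element in turn and permuting the remainder.
List<List<Integer>> getPermutations(List<Integer> items) {
    List<List<Integer>> result = new ArrayList<>();
    if (items.isEmpty()) {
        result.add(new ArrayList<>());
        return result;
    }
    for (int i = 0; i < items.size(); i++) {
        List<Integer> rest = new ArrayList<>(items);
        Integer picked = rest.remove(i);
        for (List<Integer> perm : getPermutations(rest)) {
            perm.add(0, picked);
            result.add(perm);
        }
    }
    return result;
}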

Tuesday, January 22, 2019

Today we continue discussing the best practices from storage engineering:

351) P2P can be structured or unstructured.

352) In a structured topology, the P2P overlay is tightly controlled, usually with the help of a distributed hash table (DHT). The location information for the data objects is deterministic as the peers are chosen with identifiers corresponding to the data object's unique key. Content therefore goes to specified locations, which makes subsequent queries easier.
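As a minimal sketch of deterministic placement, with a trivial hash standing in for a real DHT's consistent hashing: the peer responsible for an object is computed from the object's key, so a lookup can be routed directly instead of broadcast.

import java.util.List;

// Hypothetical illustration: in a structured overlay, the owner of a key is a
// deterministic function of that key, so content location is known up front.
class SimpleDht {
    static String ownerPeer(String objectKey, List<String> peers) {
        int index = Math.floorMod(objectKey.hashCode(), peers.size());
        return peers.get(index);
    }
}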

353) Unstructured P2P is composed of peers joining based on some rules and usually without any knowledge of the topology. In this case the query is broadcast and peers that have matching content return the data to the originating peer. This is useful for highly replicated items but not appropriate for rare items. In this approach, peers become readily overloaded and the system does not scale when there is a high rate of aggregate queries.

354) Recently there have been some attempts at standardization with the key-based routing (KBR) API abstractions and OpenHash, an open, public DHT service that helps as a unification platform.

355) P2P is considered a top-heavy network. Top-heavy means we have an inverted pyramid of layers where the bottom layer is the network layer.  This is the substrate that connects different peers.  The overlay nodes management layer handles the management of these peers in terms of routing, location lookup and resource discovery. The layer on top of this is the features management layer which involves security management, resource management, reliability and fault resiliency.