Cluster computing

Friday, February 1, 2019

Today we continue discussing the best practice from storage engineering:

A strategy for a game
Consider a row of n coins of values v1 . . . vn, where n is even. We play a game against an opponent by taking turns. In each turn, a player selects either the first or last coin from the row, removes it from the row permanently, and receives the value of the coin. Determine the maximum possible amount of money we can definitely win if we move first.
Let us now take an example.
For the sequence 8, 15, 3, 7
the best total we can guarantee is 15 + 7 = 22,
but the players cannot simply be greedy. For example, if player one greedily chooses 8, then the opponent chooses 15, player one chooses 7 and the opponent chooses 3. Player one ends with 15, short of the maximum. Instead, let us consider a strategy in which each player makes the choice that leads to the lowest total for the other. When we take a smaller part of the coin sequence, we have exactly the same problem on a smaller scale. Let F(i, j) denote the solution to this subproblem, that is, the most the player to move can guarantee from the coins laid out from position i to position j. A player going first on this range can collect either Vi, the first coin, or Vj, the last coin.
Then we write a recursive solution as the maximum of the two choices, where the opponent's reply leaves us the minimum of what remains:
F(i,j) = max(Vi + min over the opponent's replies on (i+1, j),
Vj + min over the opponent's replies on (i, j-1))
After one turn by each player, two coins have been removed. Since the opponent also plays optimally, of the two ranges their reply can leave us we must plan on the one that is worse for us. Each pair of turns shrinks the range by two coins, so there is steady progress towards termination, and we can rewrite the recursion purely in terms of our own turns:
F(i,j) = max( Vi + min(F(i+2, j), F(i+1, j-1)),
Vj + min(F(i+1, j-1), F(i, j-2)))
The recursion terminates when only two coins are left from the entire even set, at which point we take the larger one.
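int GetCoins(List<int> coins, int i, int j)
{
    // Most the player to move can guarantee from coins[i..j].
    if (i > j) return 0;
    if (i == j) return coins[i];
    if (j == i + 1) return Math.Max(coins[i], coins[j]);
    // The opponent replies optimally, so we are left with the worse of the two remaining ranges.
    return Math.Max(
        coins[i] + Math.Min(GetCoins(coins, i + 2, j), GetCoins(coins, i + 1, j - 1)),
        coins[j] + Math.Min(GetCoins(coins, i + 1, j - 1), GetCoins(coins, i, j - 2)));
}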
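Taking our example of 8, 15, 3, 7,
we have max(8 + outcome after (15, 3, 7), 7 + outcome after (8, 15, 3)), where each outcome is what we collect once the opponent has moved on the remaining coins.
For (15, 3, 7), the opponent takes 15 and leaves us (3, 7), from which we take 7, so the first option is worth 8 + 7 = 15.
For (8, 15, 3), whichever end the opponent takes, 15 becomes available to us, so the second option is worth 7 + 15 = 22, which is the answer.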
We could also use a table to keep track of the choices and progression made.
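int GetBest(List<int> coins)
{
    int n = coins.Count;
    if (n == 0) return 0;
    int[,] table = new int[n, n];
    // Fill diagonals of increasing width; table[i, j] is the best guaranteed total on coins[i..j].
    for (int k = 0; k < n; k++)
    {
        for (int i = 0, j = k; j < n; i++, j++)
        {
            // Ranges that are empty contribute 0.
            int x = ((i + 2) <= j) ? table[i + 2, j] : 0;
            int y = ((i + 1) <= (j - 1)) ? table[i + 1, j - 1] : 0;
            int z = (i <= (j - 2)) ? table[i, j - 2] : 0;
            table[i, j] = Math.Max(coins[i] + Math.Min(x, y), coins[j] + Math.Min(y, z));
        }
    }

    return table[0, n - 1];
}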
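
The remark about using the table to keep track of the choices can be sketched as follows. This is a minimal Python sketch, not the author's code: it fills the same table with the min form of the recurrence (the opponent leaves us the worse of the remaining ranges) and records alongside each cell which end was taken. The names build_tables, replay, value and choice are hypothetical and introduced only for illustration.

```python
from typing import List, Tuple


def build_tables(coins: List[int]) -> Tuple[List[List[int]], List[List[str]]]:
    """value[i][j]: most the player to move can guarantee on coins[i..j].
    choice[i][j]: 'L' or 'R', the end that achieves it (ties go to 'L')."""
    n = len(coins)
    value = [[0] * n for _ in range(n)]
    choice = [[""] * n for _ in range(n)]

    def get(i: int, j: int) -> int:
        # Empty or out-of-range sub-rows are worth nothing.
        return value[i][j] if 0 <= i <= j < n else 0

    for k in range(n):              # k = j - i, the diagonal being filled
        for i in range(n - k):
            j = i + k
            left = coins[i] + min(get(i + 2, j), get(i + 1, j - 1))
            right = coins[j] + min(get(i + 1, j - 1), get(i, j - 2))
            if left >= right:
                value[i][j], choice[i][j] = left, "L"
            else:
                value[i][j], choice[i][j] = right, "R"
    return value, choice


def replay(coins: List[int]) -> Tuple[List[int], int, int]:
    """Play the game with both players following the choice table."""
    _, choice = build_tables(coins)
    i, j = 0, len(coins) - 1
    picks: List[int] = []
    totals = [0, 0]                 # totals[0] belongs to the first player
    while i <= j:
        if choice[i][j] == "L":
            coin, i = coins[i], i + 1
        else:
            coin, j = coins[j], j - 1
        totals[len(picks) % 2] += coin
        picks.append(coin)
    return picks, totals[0], totals[1]


picks, first, second = replay([8, 15, 3, 7])
print(picks, first, second)
```

For 8, 15, 3, 7 the replay has the first player open with 7 and finish with 22, while the opponent finishes with 11.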

This problem can be modified so that each player picks two coins per turn.
In that case the choices for picking the coins are:
1) Two from left
2) Two from right
3) One from left and one from right
The first two cases degenerate to picking a single coin whose value is the combined value of the two coins.
The last case merely shrinks the original sequence by one coin at each end.
Therefore this can be elaborated as:
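int GetCoins(List<int> coins, int i, int j)
{
    if (i > j) return 0;
    if (i == j) return coins[i];
    if (j == i + 1) return coins[i] + coins[j];
    // Each option adds our two coins to the worst range the opponent's reply can leave us.
    // Option 1: we take two from the left, leaving (i+2, j) to the opponent.
    int option1 = coins[i] + coins[i + 1] + Math.Min(GetCoins(coins, i + 4, j),
        Math.Min(GetCoins(coins, i + 2, j - 2), GetCoins(coins, i + 3, j - 1)));
    // Option 2: we take two from the right, leaving (i, j-2) to the opponent.
    int option2 = coins[j] + coins[j - 1] + Math.Min(GetCoins(coins, i + 2, j - 2),
        Math.Min(GetCoins(coins, i, j - 4), GetCoins(coins, i + 1, j - 3)));
    // Option 3: we take one from each end, leaving (i+1, j-1) to the opponent.
    int option3 = coins[i] + coins[j] + Math.Min(GetCoins(coins, i + 3, j - 1),
        Math.Min(GetCoins(coins, i + 1, j - 3), GetCoins(coins, i + 2, j - 2)));
    return Math.Max(option1, Math.Max(option2, option3));
}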
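Taking our example of 8, 15, 3, 7,
the three choices split the coins between us and the opponent as 23 + 10, 10 + 23, and 15 + 18, so we take the two coins from the left for 23.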
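
The two-coins-per-turn variant can be checked with a short memoized sketch. This is an illustrative Python version under the assumption that the opponent also plays optimally, so each option adds our two coins to the minimum over the opponent's three replies; the names best_two_at_a_time and solve are hypothetical.

```python
from functools import lru_cache
from typing import Sequence


def best_two_at_a_time(coins: Sequence[int]) -> int:
    """Most the first player can guarantee when each turn removes two coins."""
    values = tuple(coins)

    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        if i > j:
            return 0
        if i == j:
            return values[i]
        if j == i + 1:
            return values[i] + values[j]
        # We take two from the left; the opponent then replies on (i+2, j).
        two_left = values[i] + values[i + 1] + min(
            solve(i + 4, j), solve(i + 2, j - 2), solve(i + 3, j - 1))
        # We take two from the right; the opponent then replies on (i, j-2).
        two_right = values[j] + values[j - 1] + min(
            solve(i + 2, j - 2), solve(i, j - 4), solve(i + 1, j - 3))
        # We take one from each end; the opponent then replies on (i+1, j-1).
        one_each = values[i] + values[j] + min(
            solve(i + 3, j - 1), solve(i + 1, j - 3), solve(i + 2, j - 2))
        return max(two_left, two_right, one_each)

    return solve(0, len(values) - 1)


print(best_two_at_a_time([8, 15, 3, 7]))
```

With only four coins the opponent's reply leaves nothing behind, so the result is simply the best pair we can reach in one turn.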

Thursday, January 31, 2019

Today we continue discussing the best practice from storage engineering:

395) Catalogs also need to be served to a variety of devices. Websites tailored for mobile and desktop differ in the content that is presented, not just in the style, markup, script or logic. There is virtually no limit on how many resources can be stored in object storage, so these variants can co-exist.

396) Similar to catalogs but in the form of document collections, libraries of digital content are just as easy to collect in organizations as any other repository. Most of these document libraries use relational databases, but in terms of how the content is used they are no different from object storage, particularly since object storage supports versioning.

397) These libraries differ from the catalogs in that they serve not only read-only traffic but also read-write traffic on the documents in the collection. They are also internal to the organization, as opposed to public catalogs.

398) These libraries also participate in a variety of workflows which were earlier subject to the limitations of the service as well as of the portal where users sign in to access their documents. Using object storage, on the other hand, removes such restrictions.

399) Unlike catalogs, libraries have to provide significant resource access control. Object storage with its S3 API is suitable for this purpose.

400) Unlike catalogs, libraries don’t need to be served to multiple devices. However, libraries tend to grow in number. Object storage can therefore encompass them all and provide virtually limitless storage.

Wednesday, January 30, 2019

Today we continue discussing the best practice from storage engineering:

391) Some companies in the retail industry have a lot of catalogs. Although there is significant investment in Master Data management, solutions similar to those can be built on top of object storage. This is definitely a niche space and one that can support an emerging trend.

392) These retail companies process significant read-only traffic for their catalogs with the help of HTTP proxies and web services. This investment can be preserved as long as the read-only operations on the backend translate to fetching objects from the object store. This can ease the transition to serving the catalog directly from object storage.

393) Catalogs participate in a variety of workflows such as rewards service, promotions and campaigns and so on. When the catalogs are served from the service, they are subject to the limitations of the service. When the catalog is directly served from the object storage, then it becomes far easier to start new services.

394) Catalogs typically require no access controls since they are served to the public. This makes it more appealing to move them to object storage, where content distribution, replication and multi-site support are available out of the box.
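
The idea in 392, where the existing read-only web tier keeps its interface while the backend simply fetches objects, could look roughly like the sketch below. Everything here is an assumption for illustration: InMemoryObjectStore is a hypothetical stand-in for a real object store client, and catalog_key and handle_get are made-up names, not part of any product API.

```python
from typing import Dict, Optional, Tuple


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store: maps (bucket, key) to bytes."""

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def put_object(self, bucket: str, key: str, body: bytes) -> None:
        self._objects[(bucket, key)] = body

    def get_object(self, bucket: str, key: str) -> Optional[bytes]:
        return self._objects.get((bucket, key))


def catalog_key(path: str) -> str:
    """Translate a read-only catalog URL path into an object key."""
    return path.strip("/")


def handle_get(store: InMemoryObjectStore, bucket: str, path: str) -> Tuple[int, bytes]:
    """Serve a catalog GET by fetching the matching object, or report 404."""
    body = store.get_object(bucket, catalog_key(path))
    if body is None:
        return 404, b""
    return 200, body


store = InMemoryObjectStore()
store.put_object("catalog", "shoes/1234.json", b'{"name": "trail runner"}')
print(handle_get(store, "catalog", "/shoes/1234.json"))
```

Because the handler depends only on a get-by-key call, the same web tier could later be pointed at a real object store or bypassed entirely in favor of direct serving.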

Tuesday, January 29, 2019

Today we continue discussing the best practice from storage engineering:

383) Health data is not just sharded by customer but also maintained in isolated shared-nothing pockets with their own management systems. Integration of data to represent a whole for the same customer is the new and emerging trend in the health industry. Organizations and companies are looking to converge the data for an individual without losing privacy or failing to comply with government regulations.

384) Health data comes in numerous file types for the data captured from patients. These range from small text documents to large images. Unlike cluster file systems that consolidate data in a cluster, these data artifacts are scattered throughout repositories. In addition, a lot of logic governs who can access what data, leading to bulky user interfaces for reading and editing by providers, insurers, administrators and end users.

385) In addition to access to health data, agencies and providers frequently exchange health records, which leads to heavy data traffic from all the data sources. Virtually no data is erased from the system, and historical records going back several years are maintained. The accumulated records also have no chance to move to a warehouse because they are always active and online.

386) The retail industry has traditionally embraced mammoth databases, some even on large Storage Area Networks. Its embrace of databases and data warehouses has produced banner use cases for online transaction processing and online analytical processing. Yet vectorized execution models are gaining ground in nascent retail companies that want to wrap all purchases, rental payments and servicing fees as billing events that flow to processors. It is highly unlikely that they will switch overnight to management and analytics solutions based on key-value stores or object stores.

387) Unlike health industry data stores, retail industry data stores are all self-contained, homogeneous, full-service management systems. A new service written for the retail industry merely points to other existing services as data stores or to shared databases. Even storefront devices such as point-of-sale registers point to queues, which inevitably process their messages from back-end databases.

388) While these industries may view data stores, queues, services and management systems as data sources, they did not have the opportunity until recently to consolidate their data sources with storage first design.

#codingexercise

Maximize the coin collection when two players take turns picking coins from either end.

int GetCoins(List<int> coins, int i, int j)
{
    // Most the player to move can guarantee from coins[i..j].
    if (i > j) return 0;
    if (i == j) return coins[i];
    if (j == i + 1) return Math.Max(coins[i], coins[j]);
    // The opponent replies optimally, so we are left with the worse of the two remaining ranges.
    return Math.Max(
        coins[i] + Math.Min(GetCoins(coins, i + 2, j), GetCoins(coins, i + 1, j - 1)),
        coins[j] + Math.Min(GetCoins(coins, i + 1, j - 1), GetCoins(coins, i, j - 2)));
}

Monday, January 28, 2019

Today we continue discussing the best practice from storage engineering:
378) Finance data is also heavily regulated. The Sarbanes-Oxley Act sought to bring controls to corporations in order to avoid accounting scandals. It specified disclosure controls, audits and compliance terms.

379) ETL tools are widely used for in-house finance data. On the other hand, global indexes, ticker prices and trades are constantly polled or refreshed. In these cases, it is hard to unify storage across departments, companies and industries. A messaging framework is preferred instead. Object storage could be put to use in these global stores, but its adoption and usage across companies is harder to enforce.

380) Web services and gateways are preferred for distributing data, and different kinds of finance data processing systems evolve downstream. These systems tend to have their own storage while transforming and contributing to the data in flight. Web services are also popular as data sources for parallel analysis. Since distribution and flow matter more for this data, the origin is considered the source of truth and data changes are not propagated back to the origin.

381) Health industry data is another example with its own needs around data compliance. The Health Insurance Portability and Accountability Act requires extensive controls over who can access personally identifiable information, and when and where.

382) Health data is often tightly integrated into proprietary stacks and organizations. Yet they are also required to participate in providing web access to all the information surrounding an individual at the same place. This makes them require a virtualized cross company data storage.

Sunday, January 27, 2019

Today we continue discussing the best practice from storage engineering:

375) Most storage products don’t differentiate between human and machine data because it involves upper layers of data management. However, dedicated differentiation between human and machine data can make the products more customized for these purposes.

376) Data storage requirements change from industry to industry. Finance data storage is largely in the form of distributed indexes and continuous data transfers. A cold storage product does not serve its needs even if the objects are accessible over the web.

377) Finance data is subject to a lot of calculations by proprietary and often well-guarded calculators that have largely relied on relational databases. Yet these same companies have also adopted NoSQL storage in favor of their warehouses. As their portfolios grow, they incubate new and emerging features and increasingly favor new technologies.

378) Finance data is also heavily regulated. The Sarbanes-Oxley Act sought to bring controls to corporations in order to avoid accounting scandals. It specified disclosure controls, audits and compliance terms.

379) Health industry data is another example with its own needs around data compliance. The Health Insurance Portability and Accountability Act requires extensive controls over who can access personally identifiable information, and when and where.

380) Health data is often tightly integrated into proprietary stacks and organizations. Yet they are also required to participate in providing web access to all the information surrounding an individual at the same place. This makes them require a virtualized cross-company data storage.

Saturday, January 26, 2019

Today we continue discussing the best practice from storage engineering:
371) Data management software such as Cloudera can be deployed and run on any cloud. It offers an enterprise data hub, an analytics DB, an operational DB, data science and engineering, and essentials. It is elastic and flexible, it has high-performance analytics, it can easily provision over multiple clouds and it can be used for automated metering and billing. Essentially, it allows different data models, real-time data pipelines and streaming applications on its big data platform. It enables data models to break free from vendor lock-in, with the flexibility to let them be community defined.

372) The data science workbench offered by Cloudera provides a console in a web browser where users authenticate with Kerberos against the cluster KDC. Engines are spun up based on engine kernels and profiles, and we can seamlessly connect with Spark, Hive, and Impala.

373) Cloudera Data Science Workbench uses Docker and Kubernetes. Cloudera is supported on dedicated Hadoop hosts. Cloudera also adds a data engineering service called Altus. It is a platform that works against a cloud by allowing clusters to be set up and torn down and jobs to be submitted to those clusters. Clusters may be Apache Spark, MR2 or Hive.

374) Containerization technologies and Function as a Service, aka lambda functions, can also be supported by products such as Cloudera, which makes them usable with existing public clouds while also offering an on-premises solution.
