Sunday, March 3, 2019

We were discussing the implementation of a ledger in our first and second posts. We conclude that discussion today.
There have been two modes to improve sequential access. The first is batch processing, which allows data to be partitioned into batches and processed on parallel threads. This works very well when the partial results can themselves behave as data for the same calculations, so that the data can be abstracted into batches. This is called summation form. Second, the batches can be avoided if the partitions are tiled over. This is called streaming access, and it uses a window over the partitions to make calculations and adjust them as the window slides over the data in a continuous manner. This works well for data that is viewed as continuous and limitless, such as from a pipeline.
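As a quick sketch of the two modes, consider a running sum: the batch mode computes partial sums per partition that can be merged (summation form), while the streaming mode maintains a window total that is adjusted as the window slides. The partition and window sizes are illustrative, and the snippet assumes the usual System, System.Collections.Generic and System.Linq namespaces.

        // Batch mode: each partition yields a partial sum that behaves like data for the same calculation.
        static long BatchSum(int[] data, int partitionSize)
        {
            var partials = new List<long>();
            for (int start = 0; start < data.Length; start += partitionSize)
            {
                long partial = 0;
                int end = Math.Min(start + partitionSize, data.Length);
                for (int i = start; i < end; i++)
                    partial += data[i];
                partials.Add(partial);           // partials can be computed on parallel threads
            }
            return partials.Sum();               // summation form: merging the partial results
        }

        // Streaming mode: a window total is adjusted as the window slides over continuous data.
        static IEnumerable<long> SlidingWindowSums(IEnumerable<int> stream, int windowSize)
        {
            var window = new Queue<int>();
            long total = 0;
            foreach (var item in stream)
            {
                window.Enqueue(item);
                total += item;
                if (window.Count > windowSize)
                    total -= window.Dequeue();   // drop the element that slid out of the window
                if (window.Count == windowSize)
                    yield return total;          // one adjusted result per window position
            }
        }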
Operations on the writer side can also be streamlined when it has to scale to large volumes. Some form of parallelization is used here as well, after the load is split into groups of incoming requests. To facilitate faster and better ledger writes, entries are written once and in as much detail as possible so that they avoid conflicts with others and enable more operations to be read-only. This separation of read-write and read-only activities on the ledger not only improves the ledger but also lets it remain the source of truth. Finally, ledgers have grown to be distributed even while most organizations continue to keep the ledger in-house and open it up only for troubleshooting, inspection, auditing and compliance.
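A minimal sketch of the writer-side split, assuming the incoming requests can be grouped by a key such as an account (both types here are hypothetical): each group is appended by its own writer in parallel, and every entry is written once with full detail. The snippet assumes System.Linq and System.Threading.Tasks.

        // Hypothetical incoming request for illustration.
        class WriteRequest { public string Account; public string Details; }

        // Split the incoming load into groups and let each group be appended in parallel.
        static void WriteInParallel(IEnumerable<WriteRequest> requests, int writerCount,
                                    Action<int, WriteRequest> appendToLedger)
        {
            var groups = requests.GroupBy(r => Math.Abs(r.Account.GetHashCode()) % writerCount);
            Parallel.ForEach(groups, group =>
            {
                foreach (var request in group)
                    appendToLedger(group.Key, request);   // write-once, detailed entry per request
            });
        }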
Translations are one of the most frequently performed operations in the background. An example of a translation is one where two differently written entries are reconciled into one uniform entry so that the calculations can be simpler.
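As an illustration with hypothetical debit and credit shapes, a translation reconciles two differently written entries into one uniform entry that downstream calculations can treat alike.

        // Hypothetical raw entries written by different components.
        class DebitEntry  { public string Account; public double Amount; }
        class CreditEntry { public string Account; public double Amount; }

        // The uniform entry that downstream calculations work with.
        class UniformEntry { public string Account; public double SignedAmount; }

        static UniformEntry Translate(DebitEntry d)  => new UniformEntry { Account = d.Account, SignedAmount = -d.Amount };
        static UniformEntry Translate(CreditEntry c) => new UniformEntry { Account = c.Account, SignedAmount = +c.Amount };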
Some of these background operations involve forward-only scanning of a table or list with no skipping. They achieve this with the help of a progress marker of their own, where they keep track of the sequence number they last completed their actions on. This works well when the listing order remains unchanged.
Let us consider a case where this progressive scan may skip a range. Such a case might arise when the listing is ordered but not continuous. There are breaks in the table as it gets fragmented between writes, and the scanner does not see the writes between the reads. There are two ways to handle this. The first is to prevent the writes between the reads. This can be enforced with a simple sealing of the table prior to reading so that the writes cascade to a new page. The second is to revisit the range, check whether the count of processed table entries matches the sequence, and redo the range when it does not agree. Since the range is finite, the retries are not very expensive and require no alteration of the storage. Both approaches stamp the progress marker at the end of the last processed range. Typically there is only one progress marker, which moves from the end of one range to the next.
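A sketch of the second approach, assuming entries carry a monotonically increasing sequence number and that processing is idempotent: the scanner reads a finite range, checks that the count of entries agrees with the sequence span, redoes the range when it does not, and only then stamps the progress marker.

        // Hypothetical entry carrying a monotonically increasing sequence number.
        class LedgerEntry { public long SequenceNumber; }

        // Processes a finite range [rangeStart, rangeEnd] and returns the new progress marker.
        // Assumes process is idempotent so that a redo of the range is safe.
        static long ScanRange(Func<long, long, IList<LedgerEntry>> fetchRange,
                              long rangeStart, long rangeEnd, Action<LedgerEntry> process)
        {
            while (true)
            {
                var range = fetchRange(rangeStart, rangeEnd);      // re-read the range on every attempt
                foreach (var entry in range)
                    process(entry);
                if (range.Count == rangeEnd - rangeStart + 1)      // processed count agrees with the sequence span
                    return rangeEnd;                               // stamp the progress marker at the end of the range
                // counts disagree: writes landed between the reads, so revisit the finite range and redo it
            }
        }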
Sometimes it is helpful to check that the table is stable and serving even while it is under analysis. A very brief lock and release is sufficient in this regard.
#codingexercise
// Returns the number of k-permutations of n items: P(n, k) = n! / (n - k)!
int GetPermutations (int n, int k) {
            if (n < 0 || k < 0 || k > n) return 0;
            if (k == 0)
                return 1;
            return Factorial (n) / Factorial (n - k);
}

int Factorial (int n) {
            return (n <= 1) ? 1 : n * Factorial (n - 1);
}

Saturday, March 2, 2019

Today we continue discussing the best practices from storage engineering:


528) Live updates versus backup traffic, for instance, qualify for separate products. Aging and tiering of data also qualify for separate storage. Data for reporting can similarly be separated into its own stack. Generated data that drains into logs can likewise feed diagnostic stacks.

529) The number of processors or resources assigned to a specific stack is generally earmarked with T-shirt sizing. This is helpful for cases where the increment or decrement of resources doesn’t have to be done notch by notch.

530) Public cloud and hybrid cloud storage discussions are elaborated on many forums. The hybrid storage provider is focused on letting the public cloud appear as the front-end to harness the traffic from the users while following storage best practices for the on-premise data.

531) Data can be pushed or pulled from source to destination. If it’s possible to pull, it helps relieve the workload by shifting it to another process.

532) Lower level data transfers are favored over higher level data transfers involving say HTTP.

533) The smaller the data transfers, the larger their number, which results in chattier and potentially more fault-prone traffic. We are talking about a very small amount of data per request.

534) Larger reads and writes are best served by multipart transfers as opposed to long-running requests with frequent restarts.

535) The up and down traversals of the layers of the stack are expensive operations. These need to be curtailed.

        // Returns n choose k using Pascal's rule: C(n, k) = C(n - 1, k - 1) + C(n - 1, k)
        static int GetNChooseKDP(int n, int k)
        {
            if (k < 0 || k > n)
                return 0;
            if (k == 0 || k == n)
                return 1;
            return GetNChooseKDP(n - 1, k - 1) + GetNChooseKDP(n - 1, k);
        }

Friday, March 1, 2019

Today we continue discussing the best practices from storage engineering:

517) Customers also prefer the ability to switch products and stacks. They are willing to try out new solutions but have become increasingly wary of tying themselves to any one product or its increasing encumbrances.

518) Customers have a genuine problem with data being sticky. They cannot keep up with data transfers.

519) Customers want the expedient solution first, but they are not willing to pay for re-architectures.

520) Customers need to evaluate even the cost of data transfer over the network. Their own priorities and severities matter most to them.

521) Customers have concerns with the $/resource, whether it is network, compute or storage. They have to secure ownership of data and yet have it spread out between geographical regions. This means they make trade-offs from business perspectives rather than technical perspectives.

522) Sometimes the trade-offs are not even from the business but more so from the compliance and regulatory considerations around housing and securing data. Public cloud is great for harnessing traffic to the data stores, but there are considerations when data has to be on-premise.

523) Customers have a genuine problem with anticipating growth and planning for capacity. The success of an implementation done right enables future prospects, but implementations don’t always follow the design, and it is also hard to get the design right.

524) Similarly, customers cannot predict what technology will hold up and what won’t in the near and long term. They are more concerned about the investments they make and the choices they have to face.

525) Traffic, usage and patterns are good indicators for prediction once the implementation is ready to scale.

Thursday, February 28, 2019

Today we continue discussing the best practices from storage engineering:

515) Most of the storage products have embraced APIs in one form or another. Their usage for protocols with external agents, internal diagnostics and manageability is valuable as online tools, and they merit the same if not better appreciation than scripts and offline tools.

516) Storage products solve a piece of the puzzle. And customers don’t always have boilerplate problems. Consequently, there needs to be a bridging somewhere.

517) Customers also prefer the ability to switch products and stacks. They are willing to try out new solutions but have become increasingly wary of tying themselves to any one product or its increasing encumbrances.

518) Customers have a genuine problem with data being sticky. They cannot keep up with data transfers.

519) Customers want the expedient solution first, but they are not willing to pay for re-architectures.

520) Customers need to evaluate even the cost of data transfer over the network. Their own priorities and severities matter most to them.

Wednesday, February 27, 2019

Today we continue discussing the best practices from storage engineering:

510) The storage product, just like any other software product, is the culmination of efforts from a forum of roles and the people playing those roles. The recognition and acceptance of the software is their only true feedback.

511) Almost every entry in the storage for user data is sandwiched between a header and a footer in some container, and the data segments are read with offset and length. This mechanism is repeated at various layers and becomes all the more useful when data is encrypted.
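A minimal sketch of that sandwiching, with illustrative header and footer sizes: the data segment is addressed with an offset past the header and a length that stops short of the footer.

        // Illustrative container layout: [ header | data segment | footer ].
        const int HeaderSize = 16;
        const int FooterSize = 8;

        // Reads a data segment by offset and length, skipping the header and stopping before the footer.
        static byte[] ReadSegment(byte[] container, int offset, int length)
        {
            int start = HeaderSize + offset;
            int available = container.Length - HeaderSize - FooterSize - offset;
            int count = Math.Min(length, Math.Max(0, available));
            var segment = new byte[count];
            Array.Copy(container, start, segment, 0, count);
            return segment;
        }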

512) Similarly entries of data are interspersed with routine markers and indicators from packaging and processing perspectives. Many background jobs frequently stamp what’s relevant to them in between data segments so that they can continue their processing in a progressive manner.

513) It must be noted that the packaging of data inside the storage product has many artifacts that are internal to the product and certainly not readable in raw form. Therefore some offline command line tool to dump and parse the contents could prove very helpful.

514) The argument above also holds true for message passing between shared libraries inside the storage product. While logs help capture the conversations, their entries may end up truncated. An offline tool to fully record, replay and interpret large messages would be helpful for troubleshooting.

515) Most of the storage products have embraced APIs in one form or another. Their usage for protocols with external agents, internal diagnostics and manageability is valuable as online tools, and they merit the same if not better appreciation than scripts and offline tools.

Tuesday, February 26, 2019

We continue with the discussion on the ledger parser.
Checks and balances need to be performed as well by some readers. The entries may become spurious and there are ways a ghost entry can be made. Therefore, there is some trade-off in deciding whether the readers or the writers or both do the cleanup work.
The trouble with cleanup is that it requires modifications to the ledger, and while they can be done independently by designated agents, they don’t give as much confidence as when they come from the source of truth, the online writer. Since cleanup is a read-write task, it involves some form of writer.
If all the writes were uniform and clean entries, that would do away with the need for designated agents and their lags and delays. At this end, the writer-only solution trades off performance. At the other end of high-performance online writing, we introduce all kinds of background processors, stages and cascading complexity in analysis.
The savvy online writer may choose to use partitioned ledgers for clean and accurate differentiated entries, or quick translations to uniform entries by piggybacking the information on the transactions where it can keep a trace of the entries for that transaction. The cleaner the data, the simpler the solution for downstream systems.
The elapsed time for analysis is the sum of all its parts. Therefore, all lags and delays from background operations are included. The streamlining of analysis operations requires the data to be clean so that the analysis operations are read-only transactions themselves.
If the operations themselves contribute to delays, streamlining the overall analysis requires persisting the results from these operations where the input does not change. Then the operations need not be run each time, and the results can be re-used for other queries. If the stages are sequential, the overall analysis becomes streamlined.
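A sketch of persisting a stage's result keyed by its unchanged input, so the stage need not run each time and the result can be re-used for other queries; the cache and the input-version key are illustrative.

        // Re-uses a previously computed stage result when the input has not changed.
        static readonly Dictionary<string, object> StageResults = new Dictionary<string, object>();

        static T RunStage<T>(string stageName, string inputVersion, Func<T> stage)
        {
            string key = stageName + ":" + inputVersion;           // same input, same key, same result
            if (StageResults.TryGetValue(key, out var cached))
                return (T)cached;                                  // skip the stage; no delay contributed
            T result = stage();                                    // run the expensive stage once
            StageResults[key] = result;                            // persist for other queries
            return result;
        }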
Unfortunately, the data from users cannot be confined to a size. Passing the results from one stage to another breaks down when there is a large amount of data to be processed in that stage and sequential processing does not scale. In such cases, partial results retrieved earlier help enable downstream systems to have something meaningful to do. Often, this is adequate even if all the data has not been processed.

Monday, February 25, 2019

Ledger Parser
Bookkeeping is an essential routine in software computing. A ledger facilitates bookkeeping between components. The ledger may have entries that are written by one component and read by another. The syntax and semantics of the writes need to be the same for the reads. However, the writer and the reader may have very different loads and capabilities. Consequently, some shorthand may sneak into the writing and some parsing may be required on the reading side.
When the writer is busy, it takes extra effort to massage the data with calculations to come up with concise and informative entries. This is evident in the cases when the writes are part of online activities that the business depends on. On the other hand, it is easier for the reader to take on the chores of translations and routines that are expensive, because the reads are usually for analysis, which does not have to impact online transactions.
This makes the ledger more like a table with different attributes so that all kinds of entries can be written. The writer merely captures the relevant entries without performing any calculations. The readers can use the same table and perform selection of rows or projection of columns to suit their needs. The table sometimes ends up being very generic and expands its columns.
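As an illustration with hypothetical attributes, different readers can perform their own selection of rows and projection of columns over the same generic table while the writer performs no calculation at all. The snippet assumes System, System.Collections.Generic and System.Linq.

        // A generic ledger row: the writer merely captures attributes without any calculation.
        class LedgerRow { public DateTime When; public string Component; public string Kind; public double Amount; }

        // One reader's view: select only debit rows since a given time and project only the Amount column.
        static IEnumerable<double> DebitAmountsSince(IEnumerable<LedgerRow> ledger, DateTime since)
        {
            return ledger.Where(r => r.Kind == "debit" && r.When >= since)   // selection of rows
                         .Select(r => r.Amount);                             // projection of columns
        }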
The problem is that if the writers don’t do any validations or tidying, the ledger becomes bloated and dirty. On the other hand, the readers may find it more and more onerous to look back even a few records because there is more range to cover. This most often manifests as delays and lags between when an entry was recorded and when it made it into the reports.
Checks and balances need to be performed as well by some readers. The entries may become spurious and there are ways a ghost entry can be made. Therefore, there is some trade-off in deciding whether the readers or the writers or both do the cleanup work.
The trouble with cleanup is that it requires modifications to the ledger, and while they can be done independently by designated agents, they don’t give as much confidence as when they come from the source of truth, the online writer. Since cleanup is a read-write task, it involves some form of writer.
If all the writes were uniform and clean entries, that would do away with the need for designated agents and their lags and delays. At this end, the writer-only solution trades off performance. At the other end of high-performance online writing, we introduce all kinds of background processors, stages and cascading complexity in analysis.
The savvy online writer may choose to use partitioned ledgers for clean and accurate differentiated entries, or quick translations to uniform entries by piggybacking the information on the transactions where it can keep a trace of the entries for that transaction.