Thursday, February 28, 2019

Today we continue discussing the best practice from storage engineering:

515) Most storage products have embraced APIs in one form or another. Their use for protocols with external agents, for internal diagnostics and for manageability makes them valuable as online tools, and they merit the same if not better appreciation than scripts and offline tools.

516) Storage products solve only a piece of the puzzle, and customers don't always have boilerplate problems. Consequently, there needs to be some bridging somewhere.

517) Customers also prefer the ability to switch products and stacks. They are willing to try out new solutions but have become increasingly wary of being tied to any one product or its growing encumbrances.

518) Customers have a genuine problem with data being sticky. They cannot keep up with data transfers.

519) Customers want the expedient solution first, but they are not willing to pay for re-architectures.

520) Customers need to evaluate even the cost of data transfer over the network. Their own priorities and severities matter most to them.

Wednesday, February 27, 2019

Today we continue discussing the best practice from storage engineering:

510) The storage product, just like any other software product, is a culmination of efforts from a forum of roles and the people playing those roles. The recognition and acceptance of the software is their only true feedback.

511) Almost every entry of user data in storage is sandwiched between a header and a footer in some container, and the data segments are read with an offset and length. This mechanism is repeated at various layers and becomes all the more useful when data is encrypted.
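A minimal sketch of such framing follows; the header and footer fields (a magic number, a payload length and a checksum) are illustrative rather than taken from any particular product.

import java.nio.ByteBuffer;

// A minimal sketch of a framed data segment: a fixed-size header carrying the
// payload length, the payload bytes, and a footer carrying a checksum.
public class FramedSegment {
    static final int HEADER_SIZE = 8;   // 4-byte magic + 4-byte payload length
    static final int FOOTER_SIZE = 8;   // 8-byte checksum
    static final int MAGIC = 0xCAFED00D;

    // Wrap a payload with a header and a footer.
    static byte[] frame(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE + payload.length + FOOTER_SIZE);
        buf.putInt(MAGIC);
        buf.putInt(payload.length);
        buf.put(payload);
        buf.putLong(checksum(payload));
        return buf.array();
    }

    // Read the payload back using the offset and length recorded in the header.
    static byte[] unframe(byte[] segment) {
        ByteBuffer buf = ByteBuffer.wrap(segment);
        if (buf.getInt() != MAGIC) throw new IllegalArgumentException("bad header");
        int length = buf.getInt();
        byte[] payload = new byte[length];
        buf.get(payload);
        if (buf.getLong() != checksum(payload)) throw new IllegalStateException("bad footer checksum");
        return payload;
    }

    static long checksum(byte[] data) {
        long sum = 0;
        for (byte b : data) sum = 31 * sum + (b & 0xFF);
        return sum;
    }
}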

512) Similarly, entries of data are interspersed with routine markers and indicators from packaging and processing perspectives. Many background jobs frequently stamp what's relevant to them between data segments so that they can continue their processing in a progressive manner.

513) It must be noted that the packaging of data inside the storage product has many artifacts that are internal to the product and certainly not readable in raw form. Therefore, an offline command-line tool to dump and parse the contents could prove very helpful.

514) The argument above also holds true for message passing between shared libraries inside the storage product. While logs help capture the conversations, their entries may end up truncated. An offline tool to fully record, replay and interpret large messages would be helpful for troubleshooting.

515) Most storage products have embraced APIs in one form or another. Their use for protocols with external agents, for internal diagnostics and for manageability makes them valuable as online tools, and they merit the same if not better appreciation than scripts and offline tools.

Tuesday, February 26, 2019

We continue with the discussion on ledger parser
Checks and balances need to be performed as well by some readers. The entries may become spurious and there are ways a ghost entry can be made. Therefore, there is some trade-off in deciding whether the readers or the writers or both do the cleanup work.
The trouble with cleanup is that it requires modifications to the ledger, and while they can be done independently by designated agents, they don't give as much confidence as when they come from the source of truth, the online writer. Since cleanup is a read-write task, it involves some form of writer.
If all the writes were uniform and clean entries, that would do away with the need for designated agents and their lags and delays. At this end, the writer-only solution, we have traded off performance. At the other end, high-performance online writing, we are introducing all kinds of background processors, stages and cascading complexity into the analysis.
The savvy online writer may choose to use partitioned ledgers for clean and accurate differentiated entries, or quick translations to uniform entries by piggy-backing the information on the transactions, where it can keep a trace of entries for each transaction. The cleaner the data, the simpler the solution for downstream systems.
The elapsed time for analysis is the sum of all its parts. Therefore, all lags and delays from background operations are included. The streamlining of analysis operations requires the data to be clean so that the analysis operations are read-only transactions themselves.
If the operations themselves contribute to delays, streamlining the overall analysis requires persisting the results of those operations wherever the input does not change. Then the operations need not be run each time and the results can be re-used for other queries. If the stages are sequential, the overall analysis becomes streamlined.
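A small sketch of such result re-use, assuming each stage is a pure function of its input, could look like this:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Re-use stage results when the input does not change: each stage is wrapped so that
// its output is computed once per distinct input and served from the cache afterwards.
public class StageCache<I, O> {
    private final Map<I, O> results = new ConcurrentHashMap<>();
    private final Function<I, O> stage;

    public StageCache(Function<I, O> stage) {
        this.stage = stage;
    }

    // Run the stage only if this input has not been seen before.
    public O run(I input) {
        return results.computeIfAbsent(input, stage);
    }
}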
Unfortunately, the data from users cannot be confined to a size. Passing the results from one stage to another breaks down when there is a large amount of data to be processed in that stage and sequential processing does not scale. In such cases, partial results retrieved earlier help enable downstream systems to have something meaningful to do. Often, this is adequate even if all the data has not been processed.

Monday, February 25, 2019

Ledger Parser
Bookkeeping is an essential routine in software computing. A ledger facilitates book-keeping between components. The ledger may have entries that are written by one component and read by another. The syntax and semantics of the writes need to be the same for the reads. However, the writer and the reader may have very different loads and capabilities. Consequently, some short-hand may sneak into the writing and some parsing may be required on the reading side.
When the writer is busy, it takes extra effort to massage the data with calculations to come up with concise and informative entries. This is evident in the cases where the writes are part of online activities that the business depends on. On the other hand, it is easier for the reader to take on the chores of translations and routines that are expensive, because the reads are usually for analysis which does not have to impact online transactions.
This makes the ledger more like a table with different attributes so that all kinds of entries can be written. The writer merely captures the relevant entries without performing any calculations. The readers can use the same table and perform selection of rows or projection of columns to suit their needs. The table sometimes ends up being very generic and keeps expanding its columns.
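A rough sketch of such a generic ledger table, with hypothetical attribute names, shows the writer appending raw entries and the readers selecting rows and projecting columns:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// A generic ledger: the writer appends entries with whatever attributes it has on hand,
// and readers select rows and project columns. The attribute names are hypothetical.
public class Ledger {
    public record Entry(long timestamp, Map<String, String> attributes) {}

    private final List<Entry> entries = new ArrayList<>();

    // Writer path: append without validation or calculation.
    public void append(Map<String, String> attributes) {
        entries.add(new Entry(System.currentTimeMillis(), attributes));
    }

    // Reader path: selection of rows and projection of a single column.
    public List<String> select(Predicate<Entry> filter, String column) {
        return entries.stream()
                .filter(filter)
                .map(e -> e.attributes().get(column))
                .collect(Collectors.toList());
    }
}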
The problem is that if the writers don't do any validation or tidying, the ledger becomes bloated and dirty. On the other hand, the readers may find it more and more onerous to look back even a few records because there is more range to cover. This most often manifests as delays and lags between when an entry was recorded and when it made it into the reports.
Checks and balances need to be performed as well by some readers. The entries may become spurious and there are ways a ghost entry can be made. Therefore, there is some trade-off in deciding whether the readers or the writers or both do the cleanup work.
The trouble with cleanup is that it requires modifications to the ledger, and while they can be done independently by designated agents, they don't give as much confidence as when they come from the source of truth, the online writer. Since cleanup is a read-write task, it involves some form of writer.
If all the writes were uniform and clean entries, that would do away with the need for designated agents and their lags and delays. At this end, the writer-only solution, we have traded off performance. At the other end, high-performance online writing, we are introducing all kinds of background processors, stages and cascading complexity into the analysis.
The savvy online writer may choose to use partitioned ledgers for clean and accurate differentiated entries, or quick translations to uniform entries by piggy-backing the information on the transactions, where it can keep a trace of entries for each transaction.

Sunday, February 24, 2019

Today we continue discussing the best practice from storage engineering:

500) There are no containers for native support of decision-tree, classifier and outlier data in unstructured storage, but since they can be represented as key-values, they can be assigned to the objects themselves or maintained in dedicated metadata.
501) The instructions for setting up any web application on the object storage are easy to follow because they include the same steps. On the other hand, performance optimization for such a web application is decided on a case-by-case basis.
502) Application optimization is probably the only layer that truly remains with the user, even in a full-service stack using a storage product. Scaling, availability, backup, patching, installation, host maintenance and rack maintenance remain with the storage provider.
503) The use of http headers, attributes, protocol-specific syntax and semantics, REST conventions, OAuth and other such standards are well known and covered in their respective RFCs on the net. A Content Delivery Network can be provisioned straight from the object storage. Application optimization is about using both judiciously.
504) An out-of-box service can facilitate administrator-defined rules for choosing the type of optimizations to perform. Moreover, the rules need not be written in the form of declarative configuration; they can be dynamic, in the form of a module.
505) The application optimization layer also acts as a gateway when appropriate. Any implementation of a gateway has to maintain a registry of destination addresses. As http-accessible objects proliferate along with their geo-replications, this registry becomes granular at the object level while enabling rules to determine the site from which they need to be accessed. Finally, the gateway gathers access statistics and metrics which come in very useful for understanding the http accesses of specific content within the object storage.

Saturday, February 23, 2019

Today we continue the discussion on the best practice from storage engineering:

496) Tags can be used to make recommendations against the data to be searched. Tags point to groups, and the preferences of the group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keeping track of preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the participants in the selected group.
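A minimal sketch of that nested dictionary, with made-up participants, items and rankings, might be:

import java.util.HashMap;
import java.util.Map;

// Nested dictionary of preferences: outer key is the participant, inner key is the item,
// and the value is a ranking on a scale of 1 to 5. The entries below are made up.
public class Preferences {
    public static void main(String[] args) {
        Map<String, Map<String, Integer>> prefs = new HashMap<>();
        prefs.put("alice", Map.of("backup-policy", 5, "tiering", 3));
        prefs.put("bob", Map.of("backup-policy", 4, "replication", 2));

        // Total the group's rankings per item to build a ranked list of suggestions.
        Map<String, Integer> suggestions = new HashMap<>();
        prefs.values().forEach(row ->
                row.forEach((item, score) -> suggestions.merge(item, score, Integer::sum)));
        suggestions.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}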

497) A useful data structure for mining the logical data model is the decision tree. Its structure uses interior nodes drawn from a set (A1, ..., An) of categorical attributes, each leaf holds a class label from domain(C), and each edge carries a value from domain(Ai), where Ai is the attribute associated with the parent node. The tree behaves like a search tree: it maps the tuples in R to leaves, i.e. class labels. The advantage of using a decision tree is that it can work with heterogeneous data and the decision boundary is parallel to the axes.
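A bare-bones sketch of such a tree node, with attributes and labels simplified to strings for illustration:

import java.util.Map;

// Interior nodes test one categorical attribute and branch on its value; leaves carry the class label.
public class DecisionNode {
    String attribute;                    // null for a leaf
    String classLabel;                   // set only on leaves
    Map<String, DecisionNode> children;  // one edge per value of the attribute

    // Classify one tuple, represented as attribute -> value.
    String classify(Map<String, String> tuple) {
        if (attribute == null) return classLabel;
        DecisionNode next = children.get(tuple.get(attribute));
        return next == null ? null : next.classify(tuple);
    }
}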

498) Clustering is a technique for categorization and segmentation of tuples. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find a set of groups of rows in R such that the groups are cohesive but not coupled: the tuples within a group are similar to each other, and the tuples across groups are dissimilar. The constraint is that the number of clusters may be given and the clusters should be significant.

499) Outliers are the rows that are most dissimilar. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find rows in R which are dissimilar to most points in R. The objective is to maximize the dissimilarity function, with a constraint on the number of outliers, or on significant outliers if given.

500) There are no containers for native support of decision-tree, classifier and outlier data in unstructured storage, but since they can be represented as key-values, they can be assigned to the objects themselves or maintained in dedicated metadata.

Friday, February 22, 2019

Today we continue discussing the best practice from storage engineering:

493) To find similar people to form a group, we use some form of a similarity score. One way to calculate this score is to plot the items that the people have ranked in common and use them as axes in a chart. Then the people who are close together on the chart can form a group. These scores can then be used with tags. The same applies to resources.

494) To determine the closeness, a couple of mathematical formulas help. In this case, we could use the Euclidean distance or the Pearson coefficient. The Euclidean distance finds the distance between two points in a multidimensional space by taking the sum of the squares of the differences between the coordinates of the points and then calculating the square root of the result.
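A small sketch of that computation over the shared-item axes:

// Euclidean distance between two people plotted on the shared-item axes:
// square the coordinate differences, sum them, and take the square root.
double euclidean(double[] a, double[] b)
{
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return Math.sqrt(sum);
}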

495) The Pearson correlation coefficient is a measure of how highly correlated the two variables are. It is generally a value between -1 and 1, where -1 means that there is a perfect inverse correlation and 1 means there is a perfect correlation, while 0 means there is no correlation. It is computed with the numerator as the sum of the products of the two variables minus the product of their individual sums divided by the count, and the denominator as the square root of the product of the same quantity computed for each variable against itself.
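A small sketch of the same coefficient over two equal-length series:

// Pearson correlation coefficient:
// r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
double pearson(double[] x, double[] y)
{
    int n = x.length;
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double denom = Math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
    return denom == 0 ? 0 : (n * sxy - sx * sy) / denom;
}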

496) Tags can be used to make recommendations against the data to be searched. Tags point to groups, and the preferences of the group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps with keeping track of preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the participants in the selected group.

Thursday, February 21, 2019

Today we continue discussing the best practice from storage engineering:

489) Tags can generate more tags. Background processing and automation can work with tags to generate more tags. For example, a clustering operation on the existing data using similarity measures on existing tags will generate more tags.

490) Tags also work as friendly names for resources that are not visible or tracked at the billing level. For example, if a virtual machine has several network interface cards (NICs), then keeping track of only the models of the virtual machines may not be sufficient granularity for the tags. On the other hand, keeping track of every NIC model, albeit a software device, along with its identifiers may be too much to keep track of. Instead, tags could represent hierarchical information by masking different tags at lower levels. Thus hierarchical tags can be used to have a sliding scale of granularity on the associated resources. This way search can be expanded to include sub-resources.
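One simple way to encode such hierarchical tags is to fold the hierarchy into the tag key with a delimiter, so that a search can slide from coarse to fine granularity; the tag names below are made up for illustration:

import java.util.List;
import java.util.stream.Collectors;

// Hierarchical tag keys: the delimiter encodes the level, so a prefix query
// selects everything at or below a chosen granularity.
public class HierarchicalTags {
    public static void main(String[] args) {
        List<String> tags = List.of(
                "vm/model=standard-4",
                "vm/nic/0/model=virtio",
                "vm/nic/1/model=virtio");

        // Coarse query: everything under the vm/nic prefix.
        List<String> nicTags = tags.stream()
                .filter(t -> t.startsWith("vm/nic/"))
                .collect(Collectors.toList());
        System.out.println(nicTags);
    }
}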

491) We can assign tags only to resources that already exist. If we add a tag that has the same key as an existing tag on that resource, the new value overwrites the old value. We can edit tag keys and values, and we can remove tags from a resource at any time. We can set a tag's value to the empty string, but we can't set a tag's value to null. We can even control who can see these tags.

492) Tagging, unlike relational data, can come in very handy for NoSQL-like querying and batch processing. Since it does not involve the operational data on the resources for the cloud provider, it has no performance impact and is more suited for analytics, offline processing and reporting.

Wednesday, February 20, 2019

Today we continue discussing the best practice from storage engineering:

485) Tags don't have any semantic meaning to the functional aspects of the resource and are interpreted strictly as a string of characters. Also, tags are not automatically assigned to our resources.

486) Tags can easily be authored and managed by console, command line interface or API

487) Resources have their identifiers, but the metadata on the resources can even be added after the instance has been created. If we treat tags as friendly names for these data types, then we have more tags than before, thus expanding the options mentioned above.

488) Tags are also lines of search. When a user gives a search term or terms, very often she is trying to find one item that is not being found. The user has to improve the search terms or invoke a lot more options or dig through voluminous results. Instead, if the lines of search were available as intentions, then we can show results corresponding to them.

489) Tags can generate more tags. Background processing and automation can work with tags to generate more tags. For example, a clustering operation on the existing data using similarity measures on existing tags will generate more tags.

Tuesday, February 19, 2019

Today we continue discussing the best practice from storage engineering:

481) The nature of the query language determines the kind of resolving that the data virtualization needs to do. In addition, the type of storage that the virtualization layer spans also depends on the query language.

482) In order to explain the difference between data virtualization over structured and unstructured storage types, we look at metadata in structured storage. All data types used are registered. Whether they are system built-in types or user-defined types, the catalog helps with the resolution.

 483) A query describing the selection of entries with the help of predicates does not necessarily have to be bound to structured or unstructured query languages. Yet the convenience and universal appeal of one language may dominate another. Therefore, in such cases whether the query language is agnostic or predominantly biased, it can be modified or rewritten to suit the needs of the storage stacks described earlier.

484) Delegation doesn't have to be the only criterion for the virtualization layer. Both the administrator and the system may maintain rules and configurations with which to locate the store for the data. More importantly, the rules can be both static and dynamic. The former refers to rules that are declared ahead of the launch of the service, which the service merely loads in. The latter refers to evaluations that dynamically assign queries to a store based on classifiers and connection attributes.
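A rough sketch of such rule-based store selection, with static rules loaded at startup and dynamic rules registered as modules, could look like this (all names are illustrative):

import java.util.ArrayList;
import java.util.List;

// Rule-based store selection in a virtualization layer: static rules come from
// configuration at startup, dynamic rules are plugged in as modules that classify
// a query at run time.
public class StoreRouter {
    public interface Rule {
        // Returns the target store name, or null if the rule does not apply.
        String route(String query, String connectionAttributes);
    }

    private final List<Rule> rules = new ArrayList<>();
    private final String defaultStore;

    public StoreRouter(String defaultStore) { this.defaultStore = defaultStore; }

    public void register(Rule rule) { rules.add(rule); }

    public String route(String query, String connectionAttributes) {
        for (Rule rule : rules) {
            String store = rule.route(query, connectionAttributes);
            if (store != null) return store;
        }
        return defaultStore;
    }
}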

Monday, February 18, 2019

Today we continue discussing the best practice from storage engineering:


473) Storage products are also prone to increasing their test matrix with new devices such as solid-state drives and emerging trends such as IoT.

474) Storage products have to be limitless for their customers but they cannot say how they will be used. They will frequently run into usages where customers use them inappropriately and go against their internal limits such as the number of policies that can be applied to their organizational units.

475) There was a time when content addressable storage was popular. It involved generating a PEA file to save contents that could be looked up by their hash. The use of object storage made it easier to access the objects directly.

476) Data is increasingly being produced as fixed content. Emails and faxes are examples of these. The lifecycles of content from system, personal computing, network-centric and content-centric usages are progressively longer in duration.

477) Drop and create of user artifacts helps the user to clean up. This is not the case for, say, the system catalog. Still, the storage artifacts used on behalf of the user are the same as the storage artifacts used for the system itself. Creating and dropping such artifacts would be helpful even if they are internal.

478) The retention policy is typically 6 months for email, 3 years for financial data and 5 years for legal data. The retention period for object storage is user-defined.

479) Object Storage is touted as best for static content. Data that changes often is then said to be preferred in NoSQL or other unstructured storage. With object versioning, API and SDK, this is no longer the case.

480) Data transfers have never been considered a form of virtual storage since the data belongs to the source. Data in transit can live in queues, caches and object storage, which is good for vectorized execution.

Sunday, February 17, 2019

Today we continue discussing the best practice from storage engineering :

469) Storage products are tested for size and usage under all circumstances. These can be fine grained or aggregated and can be queried at different scopes and levels.

470) Storage product is generally the source of truth for all upstream data sources and workflows.

471) Storage products manage to be the source of truth even with different consistency models. They just need to meet their usages.

472) Storage products have evolved from purely disk-based solutions to software-defined stacks embracing compute and network. Consequently, they are better able to be tested, but the complexity increases.

473) Storage products are also prone to increasing their test matrix with new devices such as solid-state drives and emerging trends such as IoT.

474) Storage products have to be limitless for their customers but they cannot say how they will be used. They will frequently run into usages where customers use them inappropriately and go against their internal limits such as the number of policies that can be applied to their organizational units.


#codingexercise
Friends Pairing problem:
Given n friends, each one can remain single or can be paired up with some other friend. Each friend can be paired only once so ordering is irrelevant
The total number of ways in which the friends can be paired is given by:
int GetPairs(int n)
{
    // base cases: one friend has 1 arrangement, two friends have 2
    if (n <= 2) return n;
    // the nth friend either remains single or pairs with any of the other (n-1) friends
    return GetPairs(n-1) + GetPairs(n-2)*(n-1);
}

Saturday, February 16, 2019

Today we continue discussing the best practice from storage engineering :

466) Storage products have multiple paths of data entry, usually via protocols. These are each tested using their respective protocol tools.

467) Storage products are usually part of tiered storage. As such, data aging and validation need to be covered.

468) Storage products are tested with different batches of loads. They are also tested using continuous loads with varying rate over time

469) Storage products are tested for size and usage under all circumstances. These can be fine grained or aggregated and can be queried at different scopes and levels.

470) Storage product is generally the source of truth for all upstream data sources and workflows.

#codingexercise
The coin selection problem can scale to any constant number of coins that can be picked in each turn using the methods below.
int GetCoinsKWithDP(List<Integer> coins, int i, int j, int k)
{
    if (i > j) return 0;
    if (i == j) return coins.get(i);
    // k or fewer coins remain, so take them all
    if (j - i + 1 <= k) {
        int change = 0;
        for (int c = i; c <= j; c++) {
            change += coins.get(c);
        }
        return change;
    }
    List<Integer> options = new ArrayList<>();
    int change;

    // take 1..k coins from the left end
    change = 0;
    for (int left = 0; left < k; left++) {
        change += coins.get(i + left);
        options.add(change + GetCoinsKWithDP(coins, i + left + 1, j, k));
    }

    // take 1..k coins from the right end
    change = 0;
    for (int right = 0; right < k; right++) {
        change += coins.get(j - right);
        options.add(change + GetCoinsKWithDP(coins, i, j - right - 1, k));
    }

    // take some coins from the left and the rest from the right, up to k in total
    for (int left = 1; left < k; left++) {
        change = 0;
        for (int c = 0; c < left; c++) {
            change += coins.get(i + c);
        }
        for (int right = 1; left + right <= k; right++) {
            change += coins.get(j - right + 1);
            options.add(change + GetCoinsKWithDP(coins, i + left, j - right, k));
        }
    }
    return Collections.max(options);
}
The selections at each recursion level of GetCoinsKWithDP can be added to a list, and alternate additions can be attributed to the other player since the method remains the same for both.

Friday, February 15, 2019

Today we continue discussing the best practice from storage engineering:

460) The use of injectors, proxies and man-in-the-middle tools tests the networking aspect, but storage is more concerned with temporary and permanent outages, the specific numbers associated with minimum and maximum limits, and the inefficiencies when the limits are exceeded.

461) Most storage products have a networking aspect. Testing covers networking separately from the others. This means timeouts, downtime, resolutions and traversals up and down the networking layers on a host. It also includes location information.

462) The control and data path traces and execution cycles statistics are critical to capture and query with a tool so that the testing can determine if the compute is at fault. Most such tools provide data over the http.

463) Responsiveness and accuracy are verified not only with repetitions but also from validating against different sources of truth. The same is true for logs and read-only data

464) When data is abundant, reporting improves the interpretations. Most reports are well-structured beforehand and even used with templates for different representations.

465) Testing provides the added advantage of sending reports by mail for scheduled runs. These help human review of the bar for quality

#codingexercise
// search for val in the sorted string input between indices start and end inclusive
int binary_search(String input, int start, int end, char val)
{
    if (start > end) return -1;
    int mid = (start + end)/2;
    if (input.charAt(mid) == val) return mid;
    if (start == end) return -1;
    if (input.charAt(mid) < val)
        return binary_search(input, mid+1, end, val);
    else
        return binary_search(input, start, mid-1, val);
}


Thursday, February 14, 2019

Today we continue discussing the best practice from storage engineering:

455) The use of a virtual machine image as a storage artifact only highlights the use of large files in storage. Images are usually saved on the datastore in the datacenter, but nothing prevents the end user owning the machine from taking periodic backups of the VM image with tools like duplicity. These files can then be stashed in storage products like object storage. The ability of S3 to take multi-part uploads eases the use of large files.

456) The use of large files helps test most of the bookkeeping associated with the logic that depends on the size of the storage artifact. While performance optimizations remove redundant operations in different layers to streamline a use case, the unoptimized code path is better tested with large files.

457) In the next few sections, we cover some of the testing associated with a storage product. The use of a large number of small data files and a small number of large data files covers the most common cases of data ingested by a storage product. However, duplicates, order and attributes also matter. Latency and throughput are also measured for their data transfer.

458) Cluster based topology testing differs significantly from peer-to-peer networking-based topology testing. One represents the capability and the other represents distribution. The tests have to articulate different loads for each.

459) The testing of software layers is achieved with simulation of lower layers. However, integration testing is closer to real life scenarios. Specifically, the testing of data corruption, unavailability or loss is critical to the storage product

460) The use of injectors, proxies and man-in-the-middle tools tests the networking aspect, but storage is more concerned with temporary and permanent outages, the specific numbers associated with minimum and maximum limits, and the inefficiencies when the limits are exceeded.

#algorithm
MST-Prim(G, w, r)
// grows a minimum spanning tree from the root r
for each vertex v in G, set key[v] = infinity and parent[v] = nil
key[r] = 0
initialize a min-priority queue Q with all vertices, keyed by key[]
while the queue Q is not empty
       u = extract the vertex with the minimum key, i.e. the lightest edge connecting it to the tree
       for each adjacent vertex v of this vertex u
              if v is still in Q and weight(u,v) < key[v]
                     set key[v] to weight(u,v) and parent[v] to u


Print Fibonacci using tail recursion:
uint GetTailRecursiveFibonacci(uint n, uint a = 0, uint b = 1)
{
    if (n == 0)
        return a;
    if (n == 1)
        return b;
    return GetTailRecursiveFibonacci(n-1, b, a+b);
}


Wednesday, February 13, 2019

Today we continue discussing the best practice from storage engineering:

453) While container platforms for Platform-as-a-Service (PaaS) have enabled software to be deployed without any recognition of the host and to be frequently rotated from one host to another, end users' adoption of a PaaS platform depends on the production readiness of the applications and services. The push for PaaS adoption has made little or no change to the use and proliferation of virtual machines by individual users.

454) The cloud services provider can package services such as additional storage, a regular backup schedule, a patching schedule, system management, securing and billing at the time of request for each asset. However, such services depend on the cloud where they are requested. For a private cloud, a lot of the service is in-house, adding to the costs even if the inventory is free.

455) The use of a virtual machine image as a storage artifact only highlights the use of large files in storage. Images are usually saved on the datastore in the datacenter, but nothing prevents the end user owning the machine from taking periodic backups of the VM image with tools like duplicity. These files can then be stashed in storage products like object storage. The ability of S3 to take multi-part uploads eases the use of large files.

456) The use of large files helps test most of the bookkeeping associated with the logic that depends on the size of the storage artifact. While performance optimizations remove redundant operations in different layers to streamline a use case, the unoptimized code path is better tested with large files.


#codingexercise
int GetCount(uint n)
 {
 if ( n == 0) return 0;
 if (n == 1) return 1;
 if (n == 2) return 2;
 return GetCount(n-1)+GetCount(n-2);
 }

Tuesday, February 12, 2019

Today we continue discussing the best practice from storage engineering:

451) Many organizations use one or more public clouds to meet the demand for the compute resource by its employees. A large number of these requests are fine grained where customers request a handful of virtual machines for their private use. Usually not more than twenty percent of the customers have demands that are very large ranging to about a hundred or more virtual machines.

452) The virtual machines for the individual customers are sticky. Customers don't usually release their resource and even identify it by its name or IP address for their day-to-day work. They host applications, services and automations on their virtual machines and often cannot let go of their virtual machine unless files and programs have a migration path to another compute resource. Typically, they do not take this step of creating regular backups and moving the resource.

453) While container platforms for Platform-as-a-Service (PaaS) have enabled software to be deployed without any recognition of the host and to be frequently rotated from one host to another, end users' adoption of a PaaS platform depends on the production readiness of the applications and services. The push for PaaS adoption has made little or no change to the use and proliferation of virtual machines by individual users.

454) The cloud services provider can package services such as additional storage, a regular backup schedule, a patching schedule, system management, securing and billing at the time of request for each asset. However, such services depend on the cloud where they are requested. For a private cloud, a lot of the service is in-house, adding to the costs even if the inventory is free.

Monday, February 11, 2019

Today we continue discussing the best practice from storage engineering:

449) Object storage can serve as the storage for graph databases and object databases. Object storage then transforms from being a passive storage layer to one that actively builds metadata, maintains organizations and rebuilds indexes from object rather than from files.

450) File-systems have long been the destination for storing artifacts on disk, and while the file-system has evolved to stretch over clusters and not just remote servers, it remains inadequate as blob storage. Data writers have to self-organize and interpret their files while frequently relying on metadata stored separately from the files.

451) Files also tend to become binaries with proprietary interpretations. Files can only be bundled in an archive, and there is no object-oriented design over the data. If the storage were to support organizational units in terms of objects, without requiring hierarchical declarations and while supporting is-a or has-a relationships, it would become more usable than files. This modular storage enhances the use of object storage and does not compete with the usages of elastic file stores.

452) Such object-storage will find a niche usage in spatial databases, telecommunications and scientific computing requiring large scale use of elementary organizational units that are not necessarily related.  For example, spatial databases make use of polygons as a unit of organization and store large amounts of polygons

453) Traditional relational databases have long enjoyed acceptance for storing data that requires interpretation. However, the chores associated with converting data to a structured form amenable to querying can be relaxed with native support for rich non-hierarchical data organization from the storage layer and a transformation to a different class of unstructured storage.

int GetCoinsDP(List<Integer> coins, int i, int j)
{
    if (i > j) return 0;
    if (i == j) return coins.get(i);
    if (j == i + 1) return Math.max(coins.get(i), coins.get(j));
    return Math.max(
        coins.get(i) + GetCoinsDP(coins, i + 1, j),
        coins.get(j) + GetCoinsDP(coins, i, j - 1));
}
The selections at each recursion level of GetCoinsDP can be added to a list, and alternate additions can be attributed to the other player since the method remains the same for both.


Sunday, February 10, 2019

Today we continue discussing the best practice from storage engineering:

443) Sections of the file can be locked for multi-process access and even to map sections of a file on virtual memory systems. The latter is called memory mapping and it enables multiple processes to share the data. Each sharing process' virtual memory map points to the same page of physical memory - the page that holds a copy of the disk block.
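A minimal sketch of mapping a section of a file, using java.nio as one possible implementation; the file path is hypothetical:

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Map a section of a file so that the page holding the disk block is shared
// through the virtual memory map rather than copied per reader.
public class MapSection {
    public static void main(String[] args) throws Exception {
        Path path = Path.of("/tmp/ledger.dat");
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // map at most the first 4 KB of the file read-only
            long size = Math.min(channel.size(), 4096);
            MappedByteBuffer section = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            if (section.hasRemaining()) System.out.println("first byte: " + section.get(0));
        }
    }
}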

444) File structure is dependent on the file type. The internal file structure is operating-system dependent. Disk access is done in units of blocks. Since logical records vary in size, several of them are packed into a single physical block, for example when the record is sized in bytes. The logical record size, the physical block size and the packing technique determine how many logical records are in each physical block. There are three major allocation methods: contiguous, linked and indexed. Internal fragmentation is a common occurrence from the wasted bytes in the block size.

445) Access methods are either sequential or direct. The block number is relative to the beginning of the file. The use of a relative block number helps the program determine where the file should be placed and helps prevent users from accessing portions of the file system that may not be part of their file.

446) File system is broken into partitions. Each disk on the system contains at least one partition. Partitions are like separate devices or virtual disks. Each partition contains information about files within it and is referred to as the directory structure. The directory can be viewed as a symbol table that translates file names into their directory entries. Directories are represented as a tree structure. Each user has a current directory. A tree structure prohibits the sharing of a files or directories. An acyclic graph allows directories to have shared sub-directories and files. Sharing means there's one actual file and changes made by one user are visible to the other. Shared files can be implemented via a symbolic link which is resolved via the path name. Garbage collection may be necessary to avoid cycles.

447) Protection involves access lists and groups. Consistency is maintained via open and close operation wrapping.

448) File system is layered in the following manner:
1) application programs, 2) logical file system, 3) file-organization module, 4) basic file system, 5) I/O control and 6) devices. The last layer is the hardware. The I/O control consists of device drivers and interrupt handlers; the basic file system issues generic commands to the appropriate device driver. The file-organization module knows about files and their logical blocks. The logical file system uses the directory structure to inform the file-organization module. The application program is responsible for creating and deleting files.

#codingexercise: 
A player can draw a coin from either end of a sequence. Determine the winning strategy:
int GetCoinsDP(List<Integer> coins, int i, int j)
{
    if (i > j) return 0;
    if (i == j) return coins.get(i);
    if (j == i + 1) return Math.max(coins.get(i), coins.get(j));
    return Math.max(
        coins.get(i) + GetCoinsDP(coins, i + 1, j),
        coins.get(j) + GetCoinsDP(coins, i, j - 1));
}
The selections at each recursion level of GetCoinsDP can be added to a list, and alternate additions can be attributed to the other player since the method remains the same for both.



# potential trend with object storage: 
https://1drv.ms/w/s!Ashlm-Nw-wnWuQB7rNqURAxQf9hF

Saturday, February 9, 2019

Today we continue discussing the best practice from storage engineering :

441) File-Systems continue to be a good source of organizational information on storage systems. File attributes include name, type, location, size, protection, and time, date and user identification. Operations supported are creating a file, writing a file, reading a file, repositioning within a file, deleting a file, and truncating a file.

442) Data structures include two levels of internal tables: there is a per process table of all the files that each process has opened. This points to the location inside a file where data is to be read or written. This table is arranged by the file handles and has the name, permissions, access dates and pointer to disk block. The other table is a system wide table with open count, file pointer, and disk location of the file.

443) Sections of the file can be locked for multi-process access and even to map sections of a file on virtual memory systems. The latter is called memory mapping and it enables multiple processes to share the data. Each sharing process' virtual memory map points to the same page of physical memory - the page that holds a copy of the disk block.

444) File structure is dependent on the file type. The internal file structure is operating-system dependent. Disk access is done in units of blocks. Since logical records vary in size, several of them are packed into a single physical block, for example when the record is sized in bytes. The logical record size, the physical block size and the packing technique determine how many logical records are in each physical block. There are three major allocation methods: contiguous, linked and indexed. Internal fragmentation is a common occurrence from the wasted bytes in the block size.
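A worked example of that packing arithmetic, with illustrative sizes:

// With a 4096-byte block and 100-byte logical records, 40 records fit per block
// and the remaining 96 bytes per block are internal fragmentation.
public class BlockPacking {
    public static void main(String[] args) {
        int blockSize = 4096;
        int recordSize = 100;
        int recordsPerBlock = blockSize / recordSize;        // 40
        int internalFragmentation = blockSize % recordSize;  // 96 wasted bytes per block
        System.out.println(recordsPerBlock + " records per block, "
                + internalFragmentation + " bytes of internal fragmentation");
    }
}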

445) Access methods are either sequential or direct. The block number is relative to the beginning of the file. The use of a relative block number helps the program determine where the file should be placed and helps prevent users from accessing portions of the file system that may not be part of their file.

#codingexercise:
In a game where two players draw two coins at a time from either end of a sequence, determine the strategy to win:
int GetCoins2DP(List<Integer> coins, int i, int j)
{
    if (i > j) return 0;
    if (i == j) return coins.get(i);
    if (j == i + 1) return coins.get(i) + coins.get(j);
    return Math.max(Math.max(
        coins.get(i) + coins.get(i + 1) + GetCoins2DP(coins, i + 2, j),
        coins.get(j) + coins.get(j - 1) + GetCoins2DP(coins, i, j - 2)),
        coins.get(i) + coins.get(j) + GetCoins2DP(coins, i + 1, j - 1));
}
The selections at each recursion level of GetCoins2DP can be added to a list, and alternate additions to this list can be attributed to the other player since the method remains the same for both. This then helps determine whether to go first or second.