Sunday, November 18, 2018

Today we continue discussing best practices from storage engineering:

66) Serialization: There is nothing simpler than bytes and offsets to pack and persist any data structure, and the same holds true in storage engineering. We have referred to messages as a necessity for communication between layers and components. When these messages are written out, it is irrelevant whether the destination is local or remote; serialization proves useful in both cases. Consequently, serialization and deserialization are required for most entities.

67) Directories: The organization we expect users to maintain for their storage artifacts is the same one we use ourselves within the storage layer, so that we do not have to mix and match different entities. Folders help us organize key values into their own collections.

68) Replication strategy: We have referred to replication in storage organization and replication groups earlier, but there may be more than one strategy used for replication. The efficiency of replication is closely tied to the organization and the data transfer requirements. Simple file synchronization techniques include events and callbacks to indicate progress, preview the changes to be made, handle conflict resolution, and provide graceful error handling per unit of transfer.

69) Number of replications: Although replication groups are decided by the user and correspond to sites that keep their contents similar, every data and metadata unit of storage is also a candidate for replication, whose count does not need to be configured and can be system defined. A set of three copies is the norm for most such objects and their metadata.

70) Topology: Most storage products are deployed as single instances, usually comprising a cluster or a software-defined stack. However, the layout that the user chooses should remain as flexible as possible so that it can scale to their requirements. In this regard, each storage product/server/appliance must behave well with other instances in arrangements such as chaining or federation.
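The serialization point above can be illustrated with a minimal sketch, assuming a hypothetical length-prefixed wire format (a 4-byte big-endian length followed by a UTF-8 payload); a real storage layer would add versioning and checksums:

```python
import struct

def serialize(payload: str) -> bytes:
    # 4-byte big-endian length prefix, then the raw payload bytes
    data = payload.encode("utf-8")
    return struct.pack(">I", len(data)) + data

def deserialize(buf: bytes, offset: int = 0):
    # read the length at the given offset, then slice out the payload
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    payload = buf[start:start + length].decode("utf-8")
    return payload, start + length  # also return the next offset

wire = serialize("put key=value")
msg, next_off = deserialize(wire)
```

Because the reader gets back the next offset, messages can be packed back to back in one buffer, whether that buffer is destined for disk or the network.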
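The file-synchronization callbacks mentioned in the replication item above can be sketched as follows; the function and callback names are illustrative, not a real sync API:

```python
def synchronize(source, dest, on_progress, resolve_conflict, on_error):
    # build a preview of the changes to be made before applying any
    changes = []
    for name, data in source.items():
        if name in dest and dest[name] != data:
            changes.append(("conflict", name))
        elif name not in dest:
            changes.append(("copy", name))
    for i, (kind, name) in enumerate(changes):
        try:
            if kind == "conflict":
                dest[name] = resolve_conflict(source[name], dest[name])
            else:
                dest[name] = source[name]
        except Exception as exc:  # graceful error handling per unit
            on_error(name, exc)
        on_progress(i + 1, len(changes))  # progress event per transfer
    return changes

src = {"a": 1, "b": 2}
dst = {"b": 3}
events = []
plan = synchronize(src, dst,
                   on_progress=lambda done, total: events.append((done, total)),
                   resolve_conflict=lambda s, d: s,  # source wins
                   on_error=lambda name, exc: events.append(("err", name)))
```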

Saturday, November 17, 2018

Today we continue discussing best practices from storage engineering:

61) Diagnostic queries: As each layer and component of the storage server creates and maintains its own data structures during execution, it helps to query these data structures at runtime to diagnose and troubleshoot erroneous behavior. While some of the queries may be straightforward if the data structures already support some form of aggregation, others may be quite involved and include a number of steps. In all these cases, the queries run against a live system, insofar as read-only operations permit.
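A minimal sketch of such read-only diagnostic queries, assuming a hypothetical in-memory request table maintained by one of the layers:

```python
# hypothetical internal structure: in-flight and completed requests
requests = [
    {"id": 1, "layer": "cache", "state": "done", "ms": 3},
    {"id": 2, "layer": "disk", "state": "pending", "ms": 41},
    {"id": 3, "layer": "disk", "state": "done", "ms": 17},
]

# straightforward query: count requests per state
by_state = {}
for r in requests:
    by_state[r["state"]] = by_state.get(r["state"], 0) + 1

# a more involved query: slowest completed request per layer
slowest = {}
for r in requests:
    if r["state"] == "done":
        cur = slowest.get(r["layer"])
        if cur is None or r["ms"] > cur["ms"]:
            slowest[r["layer"]] = r
```

Both queries only read the structures, so they are safe to run against a live server.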

62) Performance counters: Subsystems and components frequently take a long time, and diagnostic queries alone cannot pinpoint the scope that takes the longest to execute. On the other hand, the code is perfectly clear about call sequences, so such code blocks are easy to identify in the source. Performance counters measure the elapsed time for the execution of these code blocks.
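A performance counter wrapped around a code block might be sketched like this; the counter registry and names are illustrative:

```python
import time
from contextlib import contextmanager

counters = {}  # hypothetical registry of elapsed-time counters

@contextmanager
def perf_counter(name):
    # accumulate wall-clock time spent inside the guarded block
    start = time.perf_counter()
    try:
        yield
    finally:
        counters[name] = counters.get(name, 0.0) + time.perf_counter() - start

with perf_counter("flush"):
    sum(range(1000))  # stand-in for the measured code block
```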

63) Statistics counters: In addition to the above-mentioned diagnostic tools, we need to aggregate over executions of certain code blocks. While performance counters measure elapsed time, these counters help with aggregations such as count, max, sum, and so on.
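Such a statistics counter can be sketched as a small accumulator; the metric name and values are illustrative:

```python
class StatsCounter:
    """Aggregates count, sum, and max over repeated recordings."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.max = None

    def record(self, value):
        self.count += 1
        self.total += value
        self.max = value if self.max is None else max(self.max, value)

io_bytes = StatsCounter()
for size in (512, 4096, 1024):  # stand-in for observed I/O sizes
    io_bytes.record(size)
```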

64) Locks: These primitives are often used for thread synchronization. If their use cannot be avoided, it is best to take as few of them as possible. Partitioning and coordination solve this in many cases; the storage server relies on the latter approach together with versioning.
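The versioning alternative to locks can be sketched as an optimistic compare-and-swap; the cell abstraction here is illustrative:

```python
class VersionedCell:
    """A value whose writes succeed only against the current version."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def try_write(self, new_value, expected_version):
        if self.version != expected_version:
            return False  # a concurrent writer got there first; retry
        self.value = new_value
        self.version += 1
        return True

cell = VersionedCell("a")
v0 = cell.read()[1]
ok1 = cell.try_write("b", v0)  # succeeds against the version it read
ok2 = cell.try_write("c", v0)  # stale version, rejected
```

The loser of the race retries with a fresh read instead of ever blocking on a lock.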

65) Parallelization: Generally there is no limit enforced on the number of parallel workers in the storage server or the number of partitions that each worker operates on. However, the scheduler that interleaves workers works best when there is one active task to perform in any timeslice. Therefore, the number of tasks is ideal when it is one more than the number of processors. A queue holds the tasks until their execution. This judicious use of task distribution improves performance in every layer.
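The N+1 heuristic with a queue of pending tasks might look like this sketch, assuming a simple thread pool:

```python
import os
import queue
import threading

num_workers = (os.cpu_count() or 1) + 1  # one more than the processors
tasks = queue.Queue()  # holds tasks until their execution
results = []
lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()
        if item is None:  # sentinel: shut this worker down
            break
        with lock:
            results.append(item * item)  # stand-in for real work
        tasks.task_done()

threads = [threading.Thread(target=worker) for _ in range(num_workers)]
for t in threads:
    t.start()
for i in range(10):
    tasks.put(i)
tasks.join()  # wait until every queued task is processed
for _ in threads:
    tasks.put(None)
for t in threads:
    t.join()
```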

Friday, November 16, 2018

Today we continue discussing best practices from storage engineering:

55) Https: Encryption not only helps secure data at rest but also secures data in transit. However, it comes with the onus of key and certificate management. Https is by default not just a mandate over the internet but also a requirement even between departments in the same organization.

56) Key management: We have emphasized that keys are needed for encryption purposes. This calls for the keys themselves to be kept secure. With the help of standardized key management interfaces, we can use external key managers such as KeySecure. Keys should be rotated periodically.

57) API security: It is almost inevitable for any storage service to have APIs. Every request made over the web must be secured. While there are many authentication protocols, including OAuth, each request is sufficiently secured if it carries an authorization and a digital signature. API keys are not always required.
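Signing each request can be sketched with an HMAC over a canonical form of the request; the shared secret and the canonical layout here are assumptions for illustration, not any particular service's scheme:

```python
import hashlib
import hmac

secret = b"shared-secret"  # placeholder; never hardcode secrets in practice

def sign(method, path, body):
    # canonical form: method, path, and body joined by newlines
    canonical = f"{method}\n{path}\n{body}".encode("utf-8")
    return hmac.new(secret, canonical, hashlib.sha256).hexdigest()

def verify(method, path, body, signature):
    # constant-time comparison avoids timing side channels
    return hmac.compare_digest(sign(method, path, body), signature)

sig = sign("PUT", "/objects/key1", "value1")
ok = verify("PUT", "/objects/key1", "value1", sig)
tampered = verify("PUT", "/objects/key1", "value2", sig)
```

The signature travels in the authorization header, so any tampering with the method, path, or body invalidates the request.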

58) Integration with an authentication provider: File system protocols have been integrated with Active Directory. This enables organizations to take advantage of authorizing domain users. Identity and access management for cloud services can also be leveraged.

59) Auditing: Auditing serves to detect unwanted access and maintain compliance with regulatory agencies. Most storage services enable auditing in each and every component in the control path. This is very much like the logging for components. In addition, the application exposes a way to retrieve the audits.

60) Offloading: Every bookkeeping, auxiliary, and routine activity that takes up system resources could be a candidate for hardware offloading, so long as it does not have significant conditional logic and is fairly isolated. This improves performance in the data path, especially when the activities can be consolidated globally.

#codingexercise
// Counts the leaves under each node; collects nodes whose right subtree
// has more than two leaves in excess of the left subtree.
int GetNodeWithHeavierRightLeaves(Node root, ref List<Node> result)
{
    if (root == null) return 0;
    if (root.left == null && root.right == null) return 1;
    int left = GetNodeWithHeavierRightLeaves(root.left, ref result);
    int right = GetNodeWithHeavierRightLeaves(root.right, ref result);
    if (right > left + 2)
    {
        result.Add(root);
    }
    return left + right;
}

Thursday, November 15, 2018

Today we continue discussing best practices from storage engineering:

50) Hardware arrangement: One of the most overlooked considerations has been the implications of the choice of hardware for storage servers. For example, a chassis with expansion bays for solid state drives in the front is going to be more popular than others and will set the storage servers up to take advantage of storage improvements.

51) SSD: Virtually all layers of a storage server can benefit from faster access on Solid State Drives. SSDs, like flash drives, have no seek time; both offer faster random read access than disks.

52) Faster connections: Networking between storage servers and components may not always be regulated if they are not on-premise. Consequently, it is better to set up direct connections and faster networks wherever possible.

53) Direct connections: These give better control over communication between two endpoints. A dedicated TCP connection comes with the benefits of congestion control and ordering, which translate to efficiency in data writes at the destination.

54) Virtual private network: Virtual private networks only add an IP header over the existing headers, so they may not improve the latency or bandwidth of the network, but they certainly secure it.


Wednesday, November 14, 2018

Today we continue discussing best practices from storage engineering:

46) Security: This is an integral part of every storage product. The artifacts from the user need to have proper access control lists; otherwise there may be undesirable access. The mechanism for access control has traditionally differed between operating systems, but most converge on role-based access control.

47) Performance: Storage operations need to support a high degree of concurrency and very fast execution, and these operations may even be benchmarked. Although local operations are definitely cheaper than remote operations, the remote ones are not necessarily a bottleneck in most modern cloud storage services.

48) Row-level security: Although storage objects have granular access control lists, there is nothing preventing the extension of security to individual key values with the help of tags and labels that can be universally designated.
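Extending access control down to individual key values with labels might be sketched as follows; the labels and the role policy are illustrative:

```python
# hypothetical key-value store where each record carries a label
records = {
    "k1": {"value": "public data", "label": "public"},
    "k2": {"value": "payroll", "label": "restricted"},
}
# hypothetical policy: which labels each role may read
allowed = {"analyst": {"public"}, "admin": {"public", "restricted"}}

def read(key, role):
    rec = records[key]
    if rec["label"] not in allowed.get(role, set()):
        return None  # access denied at the row level
    return rec["value"]
```

The object-level ACL still applies first; the label check simply adds a finer-grained gate per key value.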

49) Workload alignment: Public clouds pride themselves on the metrics they set and compete with each other on; most also tune them to their advantage. It is important, however, to align the benchmarks with the workloads on the system.


Tuesday, November 13, 2018

Today we continue discussing best practices from storage engineering:

41) Allocations: Although a storage organization unit such as a file, blob, or table seems like a single indivisible logical unit to the user, it translates to multiple physical-layer allocations. Files have a hierarchical organization, and low-level drivers translate them to file location and byte offset on disk. This has been the traditional architecture, driven primarily by hierarchy and naming. But storage units have more than names: they have tags and metadata, and designing a file system with alternate forms of organization that leverage tags allows different nomenclatures to be used simultaneously. This is an example where master data management can bring significant advantages, such as the use of attributes to look up files.
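Attribute-based lookup alongside hierarchical names can be sketched as a small tag index; the paths and tags below are illustrative:

```python
# hypothetical tag index kept next to the hierarchical namespace
files = {
    "/var/data/a.log": {"type": "log", "tenant": "a"},
    "/var/data/b.db": {"type": "db", "tenant": "b"},
    "/var/data/c.log": {"type": "log", "tenant": "b"},
}

def lookup_by_attributes(**attrs):
    # return every path whose tags match all requested attributes
    return sorted(
        path for path, tags in files.items()
        if all(tags.get(k) == v for k, v in attrs.items())
    )

logs = lookup_by_attributes(type="log")
tenant_b_logs = lookup_by_attributes(type="log", tenant="b")
```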

42) Catalogs: Physical organization does not always have to correlate directly with the way users save their content. A catalog is a great example of utilizing the existing organization to serve the various ways in which content is looked up or correlated. Moreover, custom tags can increase the ways in which files can be managed and maintained. While lookups have traditionally translated to queries, content indexers have provided an alternate way to look up data. Here we refer to the organization of metadata so that the storage architecture can be separated from the logical organization and lookups.

43) System metadata: Metadata is not specific only to the storage artifacts from the user. Every layer maintains entities and bookkeeping in the immediately lower layer, and these are often just as useful to query as some of the queries of the overall system. This metadata is internal and for system purposes only. Consequently, it is the source of truth for the artifacts in the system.

44) User metadata: We referred to metadata for user objects. However, such metadata is usually in the form of predetermined fields that the system exposes. In some cases, users can add more labels and tags, and this customization is referred to as user metadata. User metadata helps in cases outside the system where users want to group their content, which can then be used in classification and data mining.

45) User-defined functions, callbacks, and webhooks: Labels and tags are only as useful to the user as the queries they can be used with. If the system does not support intensive or involved logic, the user is left to implement their own. Such expressions may involve custom user-defined operators and callbacks. These can be executed on a subset of the user data or all of it, and they can also be executed where the results can be streamed.


Monday, November 12, 2018

Today we continue discussing best practices from storage engineering:

36) Containers: Docker containers are immensely popular for deployment, and every server benefits from their portability because it makes the server resilient to issues faced by the host.

37) Congestion control: When requests are queued to the storage server, it can use the same sliding-window technique that is used for congestion control in a TCP connection. Fundamentally there is no difference between giving each request a serial number in either case and handling requests within the bounds of the received and the processed, at a rate that keeps those bounds manageable.
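The sliding-window admission described above can be sketched as follows; the window size and serial-number scheme are illustrative:

```python
class SlidingWindow:
    """Admits at most `window` requests between the last acknowledged
    serial and the newest admitted one."""

    def __init__(self, window):
        self.window = window
        self.next_serial = 0
        self.acked = -1  # highest serial processed so far

    def admit(self):
        if self.next_serial - self.acked > self.window:
            return None  # window full; the caller should back off
        serial = self.next_serial
        self.next_serial += 1
        return serial

    def ack(self, serial):
        # processing completed; the window slides forward
        self.acked = max(self.acked, serial)

w = SlidingWindow(window=2)
a, b = w.admit(), w.admit()
blocked = w.admit()  # window full, admission denied
w.ack(a)
c = w.admit()  # window has slid forward, admission resumes
```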

38) Standard query operators: Many bookkeeping operations can be translated to aggregates over simple key-value collections. These aggregates do not necessarily have to be dedicated, customized logic. Instead, if there were a generic way to perform standard query operations, much of the accounting could simply become similar query patterns.
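Generic query operators over a key-value collection might be sketched like this; the usage table and operator names are illustrative:

```python
from functools import reduce

# hypothetical bookkeeping data: storage usage per tenant, in MB
usage = {"tenant-a": 120, "tenant-b": 45, "tenant-c": 300}

def where(pairs, pred):
    # filter: keep pairs matching the predicate
    return {k: v for k, v in pairs.items() if pred(k, v)}

def select(pairs, f):
    # projection: transform every value
    return {k: f(v) for k, v in pairs.items()}

def aggregate(pairs, f, seed):
    # fold: combine all values into one result
    return reduce(f, pairs.values(), seed)

heavy = where(usage, lambda k, v: v >= 100)
in_gb = select(heavy, lambda v: v / 1024)
total = aggregate(usage, lambda acc, v: acc + v, 0)
```

With three such operators, quota checks, billing rollups, and capacity reports all become variations on the same pattern.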

39) Queue: Most requests to the storage server are processed on a first-come, first-served basis. This naturally suits the use of a queue as the data structure to hold the requests. Queues may be distributed in order to handle large volumes of requests. With distributed queues, requests may be sent to the partitions where they can be served best. Semantically, all distributed queue processors behave the same in terms of handling a request; they simply get the requests relevant to their partition.
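Routing requests to partitioned queues by key hash can be sketched as follows; the partition count and hash choice are illustrative:

```python
import hashlib
from collections import deque

NUM_PARTITIONS = 4
partitions = [deque() for _ in range(NUM_PARTITIONS)]

def enqueue(key, request):
    # hash the key so every request for the same key lands on the
    # same partition, preserving first-come, first-served order there
    idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS
    partitions[idx].append(request)
    return idx

p1 = enqueue("object-1", "read")
p2 = enqueue("object-1", "write")  # same key, same partition, FIFO order
```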

40) Task schedulers: Queues are used not just with storage partitions; they are also used for prioritizing workloads. Background task processors usually have long-running jobs. These jobs may take several time slices, and even if a job blocks when executed, it may need to be interleaved with other jobs. The purpose of the task scheduler is to decide which job runs next on the processor. In order to facilitate retries and periodic execution, a crontab may be set up for the job.
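A priority-based scheduler that interleaves long jobs across time slices might be sketched like this; the job names and slice counts are illustrative:

```python
import heapq

class Scheduler:
    """Picks the highest-priority job each slice; long jobs are
    requeued so other jobs can interleave with them."""

    def __init__(self):
        self.heap = []
        self.counter = 0  # FIFO tie-break among equal priorities

    def submit(self, priority, name, slices):
        heapq.heappush(self.heap, (priority, self.counter, name, slices))
        self.counter += 1

    def run_slice(self):
        if not self.heap:
            return None
        priority, _, name, slices = heapq.heappop(self.heap)
        if slices > 1:  # job not done; requeue for a later slice
            self.submit(priority, name, slices - 1)
        return name

s = Scheduler()
s.submit(1, "compaction", slices=2)  # long background job
s.submit(0, "user-read", slices=1)   # short, higher priority
order = [s.run_slice() for _ in range(3)]
```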

41) Coordinator for nodes: Just like a scheduler, a coordinator hands out tasks to agents on remote nodes. This notion has been implemented in many forms, some as services over HTTP. In some cases, the web services have given way to a registry of tasks that all nodes can read and update states in for an individual job.
#codingexercise
// Counts the leaves under each node; collects nodes whose right subtree
// has more leaves than the left subtree.
int GetNodeWithRightImbalancedLeaves(Node root, ref List<Node> result)
{
    if (root == null) return 0;
    if (root.left == null && root.right == null) return 1;
    int left = GetNodeWithRightImbalancedLeaves(root.left, ref result);
    int right = GetNodeWithRightImbalancedLeaves(root.right, ref result);
    if (right > left)
    {
        result.Add(root);
    }
    return left + right;
}