Sunday, July 11, 2021

 A note about naming convention:

One of the hallmarks of good automation is the granularity of reusable actions. The code behind these actions, along with their associated artifacts, resources and even configurations, must be specific to the action. With many actions, it can become hard to locate any one of them, and a good naming convention overcomes this hurdle. Many developers are very particular about the style used in their code; there are even helper files such as stylecop.json that bring consistency to how code is written by different developers, so contributions look the same across the codebase. Similarly, a naming convention brings order to the madness of proliferating code snippets in automation recipes. When a playbook uses a consistent naming convention, it becomes more readable and easier to maintain and use. Top-notch automations will always bear readable and consistent naming.

There are quite a few conventions to choose from. There is Pascal case, where the first letter of every word is capitalized with no spaces or symbols between words, as in UserAccount, and camel case, which is similar except that the first word starts with a lowercase letter, as in userAccount. There is also snake case, where the words are separated by an underscore, as in user_account. Kebab case is like snake case but overcomes the difficulty of using underscores with certain systems by replacing them with hyphens, as in user-account. The Hungarian convention uses a lowercase prefix to indicate the intention or type before the Pascal-cased name, as in strUserAccount.
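As a quick illustration, the sketch below renders the same identifier in each of these conventions; the helper names and the example phrase are hypothetical and only serve to show the formatting rules.

```python
# Render the same identifier in several naming conventions.
# The helpers and the example phrase are illustrative only.

def pascal_case(words):
    return "".join(w.capitalize() for w in words)

def camel_case(words):
    return words[0].lower() + pascal_case(words[1:])

def snake_case(words):
    return "_".join(w.lower() for w in words)

def kebab_case(words):
    return "-".join(w.lower() for w in words)

def hungarian(prefix, words):
    # lowercase intention/type prefix followed by a Pascal-cased name
    return prefix + pascal_case(words)

words = ["user", "account"]
print(pascal_case(words))        # UserAccount
print(camel_case(words))         # userAccount
print(snake_case(words))         # user_account
print(kebab_case(words))         # user-account
print(hungarian("str", words))   # strUserAccount
```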

The use of different conventions is necessary for different purposes in the system. For example, Pascal case and camel case are widely popular for readability in languages such as Pascal, Java and .NET, while resources in private and public cloud environments commonly use kebab case.

Architecture for any automation system is enhanced by its ability to introduce entities into existing collections. When these collections have proper naming schemes with an appropriate prefix or suffix, the name alone gives enough information about the entity without having to look it up. This is a real saving in cost in addition to the convenience it brings. The use of naming conventions must be practiced diligently.
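For instance, if a collection adopted a hypothetical `<env>-<region>-<type>-<name>` kebab-case scheme, a reader or a script could recover the entity's context from the name alone, as in the sketch below; the scheme and field names are assumptions, not a prescribed standard.

```python
# Parse a resource name that follows a hypothetical
# <env>-<region>-<type>-<name> kebab-case scheme.

def parse_resource_name(resource_name: str) -> dict:
    env, region, rtype, *rest = resource_name.split("-")
    return {
        "environment": env,        # e.g. dev, stage, prod
        "region": region,          # e.g. useast, westeu
        "type": rtype,             # e.g. vm, db, queue
        "name": "-".join(rest),    # the remaining friendly name
    }

print(parse_resource_name("prod-useast-db-orders"))
# {'environment': 'prod', 'region': 'useast', 'type': 'db', 'name': 'orders'}
```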


Saturday, July 10, 2021

 The NuGet package resolution:

Introduction: This article introduces NuGet, an essential tool for modern development platforms. It is the mechanism through which developers create, share, and consume useful code, since the code is bundled along with its DLL and package information. A NuGet package is a single zip file with a .nupkg extension that contains the code as well as the package manifest. Organizations support sharing and publishing NuGet packages to a global, central, and public repository by the name nuget.org. This public repository can be complemented with a private repository. A package can be uploaded to either a public or a private host; when the package is downloaded, the corresponding source is specified. The flow of packages between creators, hosts, and consumers is discussed next.

The central repository has over 100,000 unique packages, and they are downloaded by developers every day. With such a large collection, package browsing, lookups, identification, versioning, and compatibility must be clearly called out, and this must be done in both the public and the private repository. When a package is downloaded, it lands in the local cache on the developer's system. When the application is built, these dependencies can be copied over to the target folder where the compiled code is dropped. This allows the assembly to be found and loaded locally, creating an isolation mechanism between applications that reference those packages on the same host. Isolating application dependencies or assemblies in this way is one way applications can safeguard that they will work on any host.

One of the frequently encountered routines with package consumption is that the application must always use the same compatible versions of the packages. When dependencies are updated to newer versions, they sometimes do not work well together. For the application to continue working, it must maintain compatibility between its dependencies, which can be set once and maintained as and when the packages are updated. The initial assembly compatibility is ironed out at build time in the form of compilation failures and their resolutions. Subsequent package updates must always target incremental, higher versions than the ones the application was initially built against. If a version increment causes a compatibility break, the application has the choice to remain at the current version of the assembly and wait for a subsequent version that fixes the break. The dependencies must be declared clearly in the project file used to contain and build the source. Version compatibility can be made more deterministic with the help of the versions associated with those packages as well as their substitution policies.

Packages might get updated for many reasons: the publisher may fix defects, and revisions may be recommended from the common vulnerabilities database. Code and binary analysis tools help with these recommendations for the packages to be updated. Different versions of a package can coexist side by side on the same host as long as they are located in different folders. Package versioning and automatic redirection of versions are techniques that help when some or all of the packages get updated for an application.
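As a conceptual illustration of keeping dependency versions compatible, the sketch below picks the highest available version that satisfies a declared range; it is a simplified model in Python, not the actual NuGet resolution algorithm, and the package versions are made up.

```python
# Simplified model of choosing the highest package version that
# still satisfies a declared compatibility range. This is only a
# sketch; the real NuGet resolver is far more involved.

def parse(version: str) -> tuple:
    return tuple(int(p) for p in version.split("."))

def resolve(available, minimum, below):
    """Pick the highest version v with minimum <= v < below."""
    candidates = [v for v in available
                  if parse(minimum) <= parse(v) < parse(below)]
    return max(candidates, key=parse) if candidates else None

available = ["11.0.2", "12.0.3", "13.0.1"]   # made-up versions
print(resolve(available, minimum="12.0.0", below="13.0.0"))  # 12.0.3
```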
The manifests of package dependencies with their versions, as well as the binding redirects in the application configuration file, enable this compatibility to be maintained across revisions so that the application will always compile and run on any host. Certain assemblies are part and parcel of the runtime that is required to execute the application. These system assemblies are resolved in the same way as application dependencies when they are specified by targets, except that their location is different from the local NuGet package cache. The system assemblies are bundled in different target frameworks such as .NET Core and .NET Framework. A target framework with a moniker such as 'net48' provides an exhaustive collection of system assemblies, which makes it universal for applications to run and provides all the features they might expect. A target framework with a moniker such as 'netcoreapp3.1' is more lightweight and geared towards the portability of applications and the latest features of the runtime; with fewer assemblies, it has a smaller footprint.

When dependencies are not found, package sources must be provided to download them. Once the packages are downloaded, they will be resolved and loaded. As discussed earlier, external and internal package sources might both be required within the development environment to resolve different assemblies. Certain organizations restrict the package sources to just one; otherwise, the point of origin of a package might become unclear and spurious packages might be introduced into the build and application binaries. They provide a workaround to target different package sources via proxying, but they rely on a single authoritative source to control the overall package sourcing. Transparency in package resolution from its source is a security consideration rather than a functionality consideration and is a matter of policy.

When the assemblies are downloaded, a different set of security considerations comes into play. The package cache on the local host is subject to safeguarding just like any other data asset. A developer can choose to clear the cache and download everything all over again during rebuilds to improve package health and hygiene; this download step, called a restore, is available as an option during builds. Since the destination of the packages on the local host where the code is built and run is always the file system, there might be issues locating them even after the dependencies are downloaded. The resolution of the path where the assemblies are found and loaded from must also be made deterministic. There are two ways to go about this: one technique is to mention the specific path for the assembly, and the second is to register it in the global assembly cache, which is unique to every host and is meant to hold system assemblies. Registering an assembly in the global assembly cache provides a way to resolve it independently of the file system, but it interferes with the application isolation policy.

Finally, assemblies loaded into the runtime might still be incorrect, and some troubleshooting might need to be done. This is not always easy, but there are techniques to help. The assembly loading process can be made more transparent by requiring the app domain to log how it finds and loads an assembly. Standard events can be bound, and the state of the app domain can be viewed.
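The sketch below models the idea of a deterministic probing order in Python: look next to the application first, then fall back to a shared cache directory. It is a conceptual illustration only, with hypothetical paths, and is not how the .NET loader is actually implemented.

```python
import os

# Conceptual probing order for locating a dependency: the application's
# own output folder first, then a shared package cache. Paths are
# hypothetical and for illustration only.

def locate_assembly(name: str, app_dir: str, cache_dir: str):
    for folder in (app_dir, cache_dir):
        candidate = os.path.join(folder, name)
        if os.path.isfile(candidate):
            return candidate             # first match wins, deterministically
    return None                          # caller must fetch from a package source

path = locate_assembly("Contoso.Utilities.dll",
                       app_dir=r"C:\app\bin",
                       cache_dir=r"C:\packages\cache")
print(path or "not found; restore from a configured package source")
```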
If the name and version of the assembly are not enough, the assembly can also be queried for its types by a technique known as reflection. The resolution of the assembly location and the order in which assemblies are loaded can be made transparent with the help of the assembly binding log viewer. The log viewer writes to a well-known location, which can also be customized. The level of logging can be set with the help of registry settings on the host, but care must be taken not to turn on logging for an extended period of time. Since the loading of assemblies is quite a common occurrence, the logging must be turned on only for the duration of the investigation; otherwise the logs will typically grow very large very quickly. With all these techniques together, a developer can not only consume packages from well-known sources but also publish packages for others to use in a more deterministic manner.
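As a rough analogue of the reflection technique described above, the Python sketch below loads a module by name and enumerates the types it exposes; it mirrors the idea of inspecting an assembly's contents rather than demonstrating the .NET reflection API itself.

```python
import importlib
import inspect

# Load a module by name and list the classes (types) it defines,
# roughly analogous to inspecting an assembly via reflection.

def list_types(module_name: str):
    module = importlib.import_module(module_name)
    return [name for name, obj in inspect.getmembers(module, inspect.isclass)
            if obj.__module__ == module.__name__]

print(list_types("json.decoder"))   # e.g. ['JSONDecodeError', 'JSONDecoder']
```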


Friday, July 9, 2021

 A note about automation continued...

The landscape of automation has also evolved. At one time, automations were bound to the hosts and to the programmability offered by the components on those hosts. In the Linux world, automation relies on shell scripts, often invoked over SSH. In the Windows world, PowerShell added SSH support only recently. Cross-platform support is still lacking, but organizations have inventory and core functionality deployed on both platforms. Fortunately, more and more automation now relies on microservice APIs for programmatic and shell-based access (think curl commands) to features that are not limited to the current host.
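For example, the same call that a curl command would make against a microservice API can be scripted with nothing more than the standard library; the endpoint below is hypothetical and only stands in for whatever API the automation targets.

```python
import json
import urllib.request

# Query a (hypothetical) microservice API instead of scripting against
# the local host; the equivalent of `curl https://automation.example.com/...`.

def get_job_status(job_id: str) -> dict:
    url = f"https://automation.example.com/api/v1/jobs/{job_id}"   # hypothetical endpoint
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)

# print(get_job_status("1234"))   # would print the job's JSON status
```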

Public cloud computing infrastructure hosts customer workloads as well as the ever-increasing portfolio of services offered by the cloud to its customers. These services from publishers both external and internal to the cloud require automation over the public cloud. They write and maintain this logic and bear the cost of keeping up with the improvements and features available for this automation logic. This article investigates the implementation platform for a global multi-tenant automation-as-a-service offering from the public cloud. 

Multi-tenancy and the software-as-a-service model are possible only with a cloud computing infrastructure. The automation logic for a cloud service differs significantly from that for a desktop. A cloud expects more conformance than desktop or enterprise automation, justifying the need for a managed program. As cloud service developers struggle to keep up with the speed of software development for cloud-savvy clients, they face automation as a necessary evil that draws their effort away from their mission. Even when organizations pay the cost upfront in the first version released with a dedicated staff, they realize that the cloud is evolving at a pace that rivals their own release timeframes. Some may be able to keep up with the investments year after year, but for most, this is better outsourced so that they spend less time rewriting with newer automation technologies or embracing the enhancements and features added to the cloud.


Thursday, July 8, 2021

 A note about Automation and infrastructure capabilities:

Infrastructure capabilities used to be expanded with pre-specified requirements, deterministic allocations based on back-of-the-envelope calculations, and step-by-step system development. The architecture used to involve service-oriented design and compositions within organizational units such as datacenters. The emerging trend, on the other hand, uses concurrent definition of requirements and solutions, rapid integration into software-defined components with automation, and heuristics-based evolution. Demand can be managed with services that are redirected to different systems, and load balancing can ensure that all systems are utilized in a deployment. With the move to zero-downtime maintenance, capabilities are serviced without any outages or notifications to the customer. Active and standby servers are rotated so that the incoming traffic is handled appropriately.

Data backup, migration, and replication have brought significant improvements to the practice of system engineering. Content can easily be distributed between sites, and near-real-time replication enables customers to find their data when they need it and not at some later point. This has had an impact on the lifecycles of the platform components, where newer hardware or upgraded components are much easier to bring in than ever before. With the ease and convenience of updating or upgrading parts or whole systems at lower or intermediary levels, the total cost of ownership of the infrastructure is reduced while the features and capabilities boost productivity. Security of the infrastructure also improves, as recent versions of hardware and software bring more robustness and compliance along with vulnerability mitigations.

The use of commodity servers and expansion based on a cluster-node design also allow infrastructure to scale out with little or no disruption to existing services. As long as the traffic from higher levels is consolidated towards known entry points, the expansion of the lower level can proceed independently. With increasing automation of the often-repeated chores to build, deploy and test the system, the overall time to readiness of the system is also reduced, bringing orders-of-magnitude scale-out capability within the same budgeted time.

Every unit of deployment is also carefully studied to create templates and designs that can be reused across components. Virtual machines and their groups are easily t-shirt sized to meet the varying needs of different workloads, and the steps taken to provision entire stacks on those leased resources also get executed automatically. This lets departments provisioning the infrastructure offer most of the resources customers demand in a self-service model. Behind the scenes, the resources, their deployments, provisioning, usage calculations, and billing are automated to deliver reports that enable smarter use of those resources. With dry runs and repeated walkthroughs of execution steps, even datacenters can be re-provisioned. This amount of flexibility resulting from templates and automation has proved critical for cloud scale.
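A minimal sketch of what t-shirt sizing might look like as a template follows; the size definitions are made up and not tied to any particular provider's SKUs.

```python
# Hypothetical t-shirt sizes for virtual machines; the numbers are
# illustrative and not tied to any particular cloud provider's SKUs.
VM_SIZES = {
    "small":  {"vcpus": 2,  "memory_gb": 8,   "disk_gb": 128},
    "medium": {"vcpus": 4,  "memory_gb": 16,  "disk_gb": 256},
    "large":  {"vcpus": 8,  "memory_gb": 32,  "disk_gb": 512},
    "xlarge": {"vcpus": 16, "memory_gb": 64,  "disk_gb": 1024},
}

def provisioning_request(workload: str, size: str, count: int) -> dict:
    """Build a self-service provisioning request from a template size."""
    spec = VM_SIZES[size]
    return {"workload": workload, "count": count, **spec}

print(provisioning_request("web-frontend", "medium", count=3))
```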

The container orchestration frameworks provide a good reference for automation and infrastructure design. 

Wednesday, July 7, 2021

Some more points for consideration from previous post

 Advantages of resource provisioning design:

1.      It enables deployment, scaling, load balancing, logging, and monitoring of all resources both built-in and client authored.

2.      There needs to be only one resource type in the control plane for a new adapter service that stands for the desired device configuration in terms of action and state. A controller can reconcile the device to its device configuration, and there needs to be only one controller per resource type. The commands, actions and state of a device may be quite involved, but the adapter service has a chance to consolidate them into a single logical resource type.

3.      The device statistics can be converted to metrics and offloaded for collection and analysis to a dedicated monitoring stack. The metrics are pushed from the other services. This is an opportunity to use an off-the-shelf solution. The data is written once but remains read-only for the analysis stacks. The read-write and read-only separation helps maintain the source of truth.

4.      Separation of the control plane and data plane also provides an opportunity to use other infrastructure at a later point in time, which allows all the services to be cloud-friendly and portable.

5.      One of the ways to enforce backward compatibility is to have specific versions for the APIs. In addition to having data syntax and semantics compatibility described via data contracts that can be verified offline with independent tools, it might be helpful to support multiple versions of the service from the start. Exposing resources via Representational State Transfer is a well-known design pattern.

6.      The reconciliation of the resource can be achieved with the help of a dedicated controller. A controller for the device control-plane resource object is a control loop that watches the state of the resource and then makes or requests changes as needed. Each controller tries to move the current state to the desired state by reading the spec field associated with the resource object. The control-plane resource object can be created, updated, or deleted just like any other resource with a Representational State Transfer API design. The controller can set additional states on the resource, such as marking it as finished. A minimal sketch of such a control loop follows this list.
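The sketch below illustrates only the watch-compare-act pattern of a reconciliation loop, assuming hypothetical read_spec/read_status/apply helpers supplied by the adapter service; it is not tied to any particular framework.

```python
import time

# Skeleton of a reconciliation control loop: read the desired state (spec),
# read the observed state (status), and act to close the gap. The helper
# functions are hypothetical placeholders an adapter service would supply.

def reconcile(resource_id, read_spec, read_status, apply_change,
              interval_seconds=30):
    while True:
        desired = read_spec(resource_id)        # what the client asked for
        observed = read_status(resource_id)     # what the device reports
        if observed != desired:
            apply_change(resource_id, desired)  # move toward the desired state
        time.sleep(interval_seconds)            # watch again on the next tick
```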

Tuesday, July 6, 2021

 Architecture trade-offs between Designer Workflows and Resource Provider Contracts

Introduction: Automations in the computing and storage industries often require infrastructure that supports the extensibility of logic that is specified by their clients. The infrastructure may support a limited number of predefined workflows that are applicable across clients, but they cannot rule out the customization that individual clients need. There are two ways to support the extensibility of any infrastructure platform. One technique involves the use of a dependency graph of workflows while the other technique involves the use of custom resources that are provisioned by resource providers external to the infrastructure. This article compares the two techniques for their pros and cons.

Description: When an Automation platform supports dependency elaboration via a flowchart, the user dictates the set of steps to be taken in the form of a controlled sequence including conditions on existing built-in workflows. By composing the steps in various modules, a client can write sophisticated logic to perform custom actions that can be re-used across their own modules. This extension allows them to make the infrastructure do tasks that were not available out of the box and the infrastructure records these specifications with the help of tasks and dependencies. The flowchart defined by the automation client gets persisted in the form of a dependency graph on tasks that can be either built-ins or client-defined.
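The sketch below shows one way such a dependency graph might be represented and ordered for execution; the task names are made up, and the topological sort is only the simplest possible model of what an automation platform would persist and schedule.

```python
from graphlib import TopologicalSorter   # Python 3.9+

# A client-defined flowchart persisted as a dependency graph of tasks.
# Keys are tasks; values are the tasks they depend on. Names are made up.
workflow = {
    "provision-vm": set(),
    "install-agent": {"provision-vm"},
    "configure-network": {"provision-vm"},
    "run-validation": {"install-agent", "configure-network"},
}

# An execution order that respects every dependency.
print(list(TopologicalSorter(workflow).static_order()))
# e.g. ['provision-vm', 'install-agent', 'configure-network', 'run-validation']
```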

The alternative to this approach is the representation of the result of these activities as resources to be provisioned. This technique generalizes automation as a system that provisions various resources and reconciles their states to the desired configuration defined by its clients. The representation of the resource allows the separation of control-plane and data-plane activities for the automation platform, enabling many more capabilities for the platform without affecting any of the tasks involved on the data plane. The control plane and the data plane refer to the provisioning of a resource versus its usage, respectively. A resource, once provisioned, can participate in various data activities without requiring any changes to its provisioning or configuration, which eliminates the need to interact with the automation or its client for the usages of the resource. Additionally, control-plane activities are subject to governance, management, programmability, and security enhancements that are not easy to specify and manage using logic that gets encapsulated by the automation clients in their customizations without hooks on the automation side.
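A declarative manifest for such a resource might look like the sketch below, with a client-authored spec (desired state) kept separate from a platform-maintained status (observed state); the field names are illustrative and not drawn from any specific platform.

```python
# Illustrative declarative manifest for a provisioned resource.
# The spec is client-authored desired state; the status is maintained
# by the platform's controller. Field names are made up.
device_config = {
    "kind": "DeviceConfiguration",
    "metadata": {"name": "edge-router-42", "version": 3},
    "spec": {                      # desired state (control-plane input)
        "firmware": "2.1.0",
        "ports": {"uplink": "enabled", "maintenance": "disabled"},
    },
    "status": {                    # observed state (reported by the controller)
        "firmware": "2.0.7",
        "ports": {"uplink": "enabled", "maintenance": "enabled"},
        "reconciled": False,
    },
}
```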

These two techniques have their own advantages. For example, the dependency-based orchestrations technique provides the following advantages:

1. Coordination and orchestration of activities across workflow boundaries are useful when workflows are componentized into multiple sub-workflows.

2. Orchestration of the workflows’ activities even when they have dependencies on external services, which is useful when some external service needs to be available for the activity to complete.

3. Enablement of "replay" when upstream artifacts change, which prevents rewriting logic when those artifacts change.

4. Generating regular and common names in the event catalog provides the ability to namespace and map existing and new associations and improves their discoverability.

On the other hand, the resource provisioning architecture supports:

1. A nomenclature and discovery of resources that can be translated with export and import for portability.

2. It provides an opportunity to offload all maintenance to the reconciliation logic in-built into the corresponding operators that the platform maintains.

3. Scope and actions become more granular with export-import capabilities.

4. It improves visibility of all resources in the control plane, which makes it easy to manage some or all of them with filters.

Eventually, both techniques require definitions and manifests to be declarative, which is wonderful for their testing and validation as well as for their visualization in viewers.

Conclusion: The universalization of logic via control and data plane separation enables the automation to be more platform-oriented and increase its portfolio of capabilities with little or no impact on the client's logic. This provides more opportunities for platform development than the dependency graph-based technique.


Monday, July 5, 2021

Kusto continued

 

Kusto query language and engine discussion continued from the previous article.

Let us now look at the Kusto query engine data ingestion path. Again, the admin node is the one to receive the command. This time the command is for data ingestion such as a log for a destination table. The admin node determines if the table has the right schema.

The admin node scans all nodes and finds an available data node to perform the processing and forwards the command to the target. The data node creates the data extent, copies the data, and sends back the extent information to the admin node. The admin node then adds the new shard reference to the table metadata and commits the new snapshot of the metadata. These extents now become sealed and immutable.

When data is deleted, the actions happen in reverse order: the reference tree is deleted before the actual data. The references are garbage collected, and the actual delete is deferred until the threshold of the container is reached. Since the deletion of the data is deferred, there is an opportunity to reclaim the reference tree. Queries involving deleted data use a previous snapshot of the metadata, which remains valid for the soon-to-be-deleted data.

The Kusto query engine has query parsing and optimization logic similar to that of SQL Server. First, it parses the incoming script into an abstract syntax tree, or AST for short. Then it performs a semantic pass over this tree: it checks the names, resolves them, verifies that the user has the permissions to access the relevant entities, and then checks the data types and references. After the semantic pass, the query engine builds an initial relational operator tree based on the AST. The query engine will further attempt to optimize the query by applying one or more predefined rewriting rules. These rewriting rules involve pushing down predicates, replacing table access with an extent union structure, splitting aggregation operations into the leaves, and using top operators that are replicated to each data extent. Together with this parsing and optimization logic, Kusto produces an abstract syntax tree that is suitable for the query to be executed on the cluster. Let us next look at executing such a query on continuously increasing tabular data such as usage data.
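As a toy illustration of the "split the aggregation into the leaves" rewrite, the sketch below computes a count per key within each data extent first and then merges the partial results; the extents and rows are made up, and this is only a model of the idea, not Kusto's implementation.

```python
from collections import Counter

# Toy model of splitting an aggregation into the leaves: each extent
# computes a partial count, and the partials are merged at the root.
# The extents and rows are fabricated for illustration.
extents = [
    [{"level": "error"}, {"level": "info"}, {"level": "error"}],
    [{"level": "info"}, {"level": "info"}],
]

def partial_count(extent):
    return Counter(row["level"] for row in extent)     # leaf-level aggregation

merged = sum((partial_count(e) for e in extents), Counter())  # root merge
print(merged)   # Counter({'info': 3, 'error': 2})
```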

Kusto has certain limitations. It was originally designed as an ad hoc query engine with immense support for fast querying and text data processing, but, as a big data platform, it does not really replace traditional databases like SQL Server. When we attempt to use Kusto as a replacement for SQL Server, we run into some limitations; these are mentioned below with their potential solutions. There are limits on query concurrency: because the cluster usually runs on a collection of eight-core VM nodes, it can execute up to 8 * 10 queries concurrently. The actual number can be determined by using the Kusto command that shows the cluster's query throttling policy. This limit of about 10 queries per core per node is necessary for the healthy operation of the Kusto query engine. There are limits on the node memory, which is set to a number that cannot be larger than the node's physical memory; if it is, the setting has no effect. This also implies that a query can take up almost all of the node's memory for its processing, but it cannot go beyond what is available as the node's physical memory. There are also limits on the memory per join or summarize operation, which protect queries from taking too much memory.

Finally, there is a limit on the result set size: the result cannot exceed 500,000 rows, and the data size itself cannot exceed 64 megabytes. If a script hits this limitation, it results in a query error with a partial query failure message. This can be overcome by summarizing the data so that only the interesting result is propagated, for example by using a take operator to see a small sample of the result and a project operator to output only the columns of interest. There are no limits on query complexity, but it is not advisable to have more than 5000 conditions in the where clause. Lastly, all these limitations are settings that can be adjusted to suit the workload.
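For example, a query that would otherwise return a huge raw result can be shaped with summarize, project, and take before it leaves the cluster; the table and column names below are made up, and the client call is shown only as a hedged sketch of how such a query might be submitted from Python.

```python
# A query shaped to stay under the result-set limits: aggregate first,
# keep only the needed columns, and sample the output. Table and column
# names are fabricated for illustration.
query = """
UsageLogs
| where Timestamp > ago(1d)
| summarize Requests = count() by CustomerId, bin(Timestamp, 1h)
| project CustomerId, Timestamp, Requests
| take 1000
"""

# Submitting the query is sketched here with the azure-kusto-data client;
# the cluster and database names are placeholders.
# from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
# kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(
#     "https://mycluster.kusto.windows.net")
# client = KustoClient(kcsb)
# response = client.execute("MyDatabase", query)
```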