Monday, January 10, 2022

Fault Injection Testing:

The stability and resiliency of software are critical to the smooth running of an application. Fault injection testing is the deliberate introduction of errors and faults into a system to observe its behavior. The goal is for the software to work correctly despite errors encountered from calls made to dependencies such as other APIs, system calls, and so on. By introducing intermittent failure conditions over time, the application behaves as realistically as it would in production, where hardware and software faults can occur randomly but the services must remain available and business continuity must be maintained.

The system needs to be resilient to the conditions that cause production disruptions. The dependencies might include infrastructure, platform, network, third-party software, or APIs. The risk of impact from a dependency failure may be direct or cascading. Fault injection methods are a way to increase coverage and validate software robustness and error handling, either at build time or at run time, with the intention of embracing failure as part of the development lifecycle. These methods assist service teams in designing for failure, continuously validating against it, accounting for known and unknown failure conditions, architecting for redundancy, and employing retry and back-off mechanisms. Together with the introduction of intermittent failures and continuous monitoring in the staging environment of service deployments, these methods promote near-total coverage of known and unknown faults that can impact the service in production. The purpose of the monitoring aspect during these experiments is to observe the fault and its recovery time, get an overview of symptoms in related components, and determine the thresholds and values with which alerts can be set.
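
As an illustration of the retry and back-off mechanisms mentioned above, here is a minimal Python sketch; the dependency call used in the usage comment is hypothetical:

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a flaky dependency call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # exponential backoff capped at max_delay, plus random jitter
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(delay + random.uniform(0, base_delay))

# Usage (fetch_from_dependency is a hypothetical flaky call):
# result = call_with_backoff(lambda: fetch_from_dependency("https://example-api/orders"))
```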

Fault engineering is equally applicable to software, protocols, and infrastructure. Software faults include error-handling code paths and in-process memory management, for which edge-case unit tests, integration tests, and stress and soak load tests are written. Protocol faults include vulnerabilities in communication interfaces such as command-line parameters or APIs. Examples of tests that mitigate these include fuzzing, which provides invalid, unexpected, or random data as input so that the protocol stability of a component can be assessed. Infrastructure faults include outages, networking failures, and hardware failures. The tests that mitigate these cause faults in the underlying infrastructure, such as shutting down virtual machines, crashing processes, expiring certificates, and others.
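
A minimal sketch of the fuzzing idea, feeding random and malformed inputs to a hypothetical parser to check that it fails gracefully:

```python
import random
import string

def random_payload(max_len=64):
    """Generate invalid, unexpected, or random input data."""
    alphabet = string.printable + "\x00\xff"
    return "".join(random.choice(alphabet) for _ in range(random.randint(0, max_len)))

def fuzz(parse, iterations=10_000):
    """Throw random payloads at a parser; anything other than a clean result
    or a clean rejection (ValueError) counts as a robustness defect."""
    defects = []
    for _ in range(iterations):
        payload = random_payload()
        try:
            parse(payload)
        except ValueError:
            pass  # rejecting bad input cleanly is acceptable behavior
        except Exception as error:  # unexpected crash
            defects.append((payload, error))
    return defects

# Usage with a hypothetical component under test:
# defects = fuzz(my_component.parse_command_line)
```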

One of the challenges with these methods is the signal-to-noise ratio of the errors. A fault is the hypothesized cause of an error. An error is an incorrect state in the system that can lead to a failure, and failures can in turn trigger further faults. Because they occur in a cycle, the fault-error-failure chain can produce many errors, from which the ones that must be fixed to improve system resilience and reliability need to be discerned. When these experiments are run for short durations, the number of errors to investigate is usually low. Leveraging automation to continuously validate what matters during the experiment allows even errors that are hard to find manually to be detected.

Such automation can even be introduced into the software release pipeline. This promotes a shift-left approach, where testing occurs as early in the development and project timeline as when the code is written. It follows the test-early-and-often principle, and the benefit is the possibility of troubleshooting the issues encountered via debugging.
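
To make the shift-left idea concrete, the following is a hedged sketch of a fault-injection unit test in Python. The service module, its OrderService class, its payments client, and the result fields are hypothetical names used only for illustration:

```python
from unittest import mock

# Hypothetical system under test: an OrderService that calls a payments API.
from myservice.orders import OrderService

def test_order_service_degrades_gracefully_when_payments_api_is_down():
    """Inject a dependency fault and assert the service fails safely."""
    service = OrderService()
    # Fault injection: make the payments client raise a connection error.
    with mock.patch.object(service.payments_client, "charge",
                           side_effect=ConnectionError("injected fault")):
        result = service.place_order(order_id="o-123", amount=42.0)
    # The service should not crash; it should queue the order for retry.
    assert result.status == "pending_retry"
```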

The outcomes of fault injection testing are the measurement and definition of a steady, healthy state for the system's interoperability; finding the difference between the baseline state and the anomalous state; and documenting the processes and observations so that the results can be identified and acted on.

Sunday, January 9, 2022

 

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing that included the most recent discussion on Azure DNS which is a full-fledged general availability service that provides similar Service Level Agreements as expected from others in the category.

Monitoring is a critical aspect for any service in the cloud both internal and customer facing. Metrics and alerts are part of the monitoring dashboard.

Each resource provides metrics to monitor specific aspects of its operations. These metrics can be viewed with the Azure Monitor service or explored and plotted with the Azure Monitor Metrics Explorer. The metrics include QueryVolume, RecordSetCount, and RecordSetCapacityUtilization. The last one is a percentage while the first two are counts. QueryVolume is the sum of all queries received over a period. It can be viewed by browsing the Metrics Explorer, scoping down to the resource, and selecting the metric with sum as the aggregation. RecordSetCount shows the number of record sets in Azure DNS for the DNS zone. All the record sets are counted, and the aggregation is the maximum of all the record sets. RecordSetCapacityUtilization shows the percent used of the record set capacity of a DNS zone. Each zone has a record set limit that defines the maximum number of record sets allowed for the zone. The aggregation type is maximum.
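
As a sketch, assuming the azure-monitor-query and azure-identity packages and a placeholder DNS zone resource ID, the QueryVolume metric could be pulled like this:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

# Placeholder resource ID of the DNS zone to monitor.
ZONE_RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Network/dnszones/<zone-name>"
)

client = MetricsQueryClient(DefaultAzureCredential())
response = client.query_resource(
    ZONE_RESOURCE_ID,
    metric_names=["QueryVolume"],
    timespan=timedelta(hours=24),
    granularity=timedelta(hours=1),
    aggregations=[MetricAggregationType.TOTAL],  # QueryVolume is aggregated as a sum
)

# Print one data point per hour for the last day.
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.total)
```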

Resource metrics can be used to raise alerts.  They can be configured from the monitor page in the Azure portal. It must be scoped to a resource which is the DNS zone in this case. The signal logic can be configured by selecting a signal and configuring the threshold and frequency of evaluation for the metric.
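
Such an alert can also be created programmatically. The following is a hedged sketch assuming the azure-mgmt-monitor and azure-identity packages; the model fields shown should be verified against the installed SDK version, and the resource IDs are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
)

SUBSCRIPTION_ID = "<subscription-id>"      # placeholder
RESOURCE_GROUP = "<resource-group>"        # placeholder
ZONE_ID = ("/subscriptions/<subscription-id>/resourceGroups/<rg>"
           "/providers/Microsoft.Network/dnszones/<zone-name>")

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Signal logic: alert when RecordSetCapacityUtilization exceeds 80 percent.
criteria = MetricAlertSingleResourceMultipleMetricCriteria(all_of=[
    MetricCriteria(
        name="capacity",
        metric_name="RecordSetCapacityUtilization",
        operator="GreaterThan",
        threshold=80,
        time_aggregation="Maximum",
    )
])

client.metric_alerts.create_or_update(
    RESOURCE_GROUP,
    "dns-recordset-capacity-alert",
    MetricAlertResource(
        location="global",
        description="RecordSet capacity above 80 percent",
        severity=2,
        enabled=True,
        scopes=[ZONE_ID],
        evaluation_frequency="PT5M",   # how often the signal is evaluated
        window_size="PT15M",           # aggregation window
        criteria=criteria,
    ),
)
```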

Continuous monitoring of APIs is also possible via synthetic monitoring. It provides proactive visibility into API issues before customers find the issues themselves. This is automated probing that validates the build-out of deployments, monitors a service or a mission-critical scenario independently of the service deployment cycle, and tests the availability of dependencies. It ensures end-to-end coverage of specific scenarios and can even validate the response body, not just the status code and headers. By utilizing all the properties of a web request and checking its response, as well as a sequence of requests, the monitoring logic begins to articulate the business concerns that must remain available. Synthetic monitoring is not just active monitoring of a service; it is a set of business assets that reduce the onus of business continuity assurance.
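
A minimal sketch of such a probe in Python, using only the standard library; the endpoints and the expected body field are hypothetical:

```python
import json
import urllib.request

def probe(url, expected_status=200, required_field="status", expected_value="healthy", timeout=10):
    """Synthetic probe: validate status code, headers, and response body, not just reachability."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        assert response.status == expected_status, f"unexpected status {response.status}"
        assert response.headers.get("Content-Type", "").startswith("application/json")
        body = json.load(response)
    # Validate the response body, not only the status code and headers.
    assert body.get(required_field) == expected_value, f"unexpected body: {body}"
    return body

# A mission-critical scenario can chain probes into a sequence of requests:
# probe("https://example-service/api/login-status")      # hypothetical URL
# probe("https://example-service/api/orders/health")     # hypothetical URL
```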

The steps to set up synthetic monitoring include onboarding, provisioning, and deployment. The onboarding is required to isolate all the data structures and definitions specific to the customer, referred to by an account ID. The provisioning is the setup of all the Azure resources that are necessary to execute the logic. The deployment of the logic involves both the code and the configuration. The code is a .NET assembly and the configuration is a predefined JSON. It can specify more than one region to deploy to, and the regions can be changed from deployment to deployment of the same logic.
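
The exact configuration schema is specific to the monitoring system being described, but as a purely illustrative sketch such a configuration might take a shape like the following; every field name here is hypothetical:

```python
# Purely illustrative; field names are hypothetical, not an actual schema.
synthetic_monitor_config = {
    "accountId": "contoso-web-team",                       # assigned at onboarding
    "monitorName": "checkout-availability",
    "logicAssembly": "Contoso.Synthetics.Checkout.dll",    # the .NET assembly with the probe logic
    "schedule": {"frequencyMinutes": 5},
    "regions": ["westus2", "eastus", "northeurope"],       # can change between deployments
    "alerting": {"failureThreshold": 3, "severity": 2},
}
```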

The use of active and passive monitoring completes the overall probes needed to ensure the smooth running of the services.

 

Saturday, January 8, 2022

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing that included the most recent discussion on Azure DNS which is a full-fledged general availability service that provides similar Service Level Agreements as expected from others in the category.

DNS servers used with Active Directory can be primary or secondary.
The primary stores all the records while the secondary gets the contents from the primary.
The contents of a zone file are stored hierarchically, and this structure can be replicated among all the DCs.
It is updated via LDAP operations or DDNS (Dynamic DNS, which must have AD integration). A common misconfiguration issue is the island issue, where the IP address of a DNS server changes and is updated only locally. To get a global update instead, the servers must point to a root server other than themselves. Delegation options are granted to DNS servers or DCs. The simple case is when DNS namespaces are delegated to DCs and a DC hosts a DNS zone. The records in a DNS server, as opposed to a DC, are autonomously managed. DNS servers need to allow DDNS by the DC, and the DC performs the DDNS updates so that direct changes to the DNS records in the server are prevented. Support and maintenance are minimal with DDNS. A standalone AD is used to create test or lab networks: a forest is created, a DC is assigned, the DNS service is installed, a DNS zone is added, and unresolved requests are forwarded to an existing corporate server. The primary DNS for all clients points to the DC. Background loading of DNS zones makes it even easier to load DNS zones while keeping the zone available for DNS updates and queries.
Active Directory has a feature whereby one or more IP addresses can be specified to which name resolutions that are not handled by the local DNS server are forwarded. The conditional forwarder definitions are also replicated via Active Directory. Together with the forward and reverse lookup zones in Active Directory, these can be set via the DNS MMC management console. The DNS servers are usually primary or secondary in nature. The primary stores all the records of the zone and the secondary gets the contents of its zone from the primary. Each update can flow from the primary to the secondary, or the secondary may pull the updates periodically or on demand. All updates must be made to the primary. Each type of server can resolve name queries that come from hosts for its zones. The contents of the zone file can also be stored in Active Directory in a hierarchical structure. The DNS structure can be replicated among all DCs of the domain, and each DC holds a writeable copy of the DNS data. The DNS objects stored in Active Directory can be updated on any DC via LDAP operations, or through DDNS against DCs that act as DNS servers when the DNS is integrated with Active Directory.
The DNS "island" issue sometimes occurs due to improper configuration. AD requires proper DNS resolution to replicate changes and when using integrated DNS, the DC replicates DNS changes through AD replication.  This is the classic chicken and egg problem. If the DC configured as name server points to itself and its IP address changes, the DNS records will successfully be updated locally but other DCs cannot resolve this DC's IP address unless they point to it. This causes replication fail and effectively renders the DC with the changed IP address an island to itself. This can be avoided when the forest root domain controllers that are the name servers are configured to point at root servers other than themselves.
Application partitions are user defined partitions that have a custom replication scope. Domain controllers can be configured to host any application partition irrespective of their domains so long as they are in the same forest. This decouples the DNS data and its replication from the domain context. You can now configure AD to replicate only the DNS data between the domain controllers running the DNS service within a domain or forest.
The other partitions are DomainDnsZones and ForestDnsZones. The system folder is the root level folder to store DNS data. The default partitions for Domain and Forest are created automatically.
Aging and scavenging: when DNS records build up, some of the entries become stale because the clients have changed their names or have moved. These are difficult to maintain as the number of hosts increases. Therefore, a process called scavenging is introduced in the Microsoft DNS server that scans all the records in a zone and removes those that have not been refreshed in a certain period. When the clients register themselves with dynamic DNS, their registrations are set to be renewed every 24 hours by default. Windows DNS stores this timestamp as an attribute of the DNS record, and it is used by scavenging. Manual record entries have timestamps set to zero, so they are excluded from scavenging.
"A "no-refresh interval" for the scavenging configuration option is used to limit the amount of unnecessary replication because it defines how often the DNS sever will accept the DNS registration refresh and update the DNS record.
In other words, it controls how often the DNS server will propagate a timestamp refresh from the client to the directory or filesystem. Another option, called the refresh interval, specifies how long the DNS server must wait following a refresh before a record becomes eligible for scavenging; this is typically seven days.
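
A small Python sketch of the aging logic described above; the interval values and the zero-timestamp convention are illustrative:

```python
from datetime import datetime, timedelta

NO_REFRESH_INTERVAL = timedelta(days=7)   # refreshes inside this window are not written back
REFRESH_INTERVAL = timedelta(days=7)      # after this additional window the record may be scavenged

def accept_refresh(record_timestamp, now):
    """Only propagate a timestamp refresh once the no-refresh interval has elapsed,
    which limits unnecessary replication."""
    return record_timestamp != datetime.min and now - record_timestamp >= NO_REFRESH_INTERVAL

def eligible_for_scavenging(record_timestamp, now):
    """Manually created records have a zero timestamp and are never scavenged."""
    if record_timestamp == datetime.min:   # stand-in for a zero timestamp
        return False
    return now - record_timestamp >= NO_REFRESH_INTERVAL + REFRESH_INTERVAL

# Example: a record last refreshed 20 days ago is eligible for scavenging.
# eligible_for_scavenging(datetime.utcnow() - timedelta(days=20), datetime.utcnow())  # -> True
```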

 

Friday, January 7, 2022

 This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing that included the most recent discussion on Azure DNS which is a full-fledged general availability service that provides similar Service Level Agreements as expected from others in the category. In this article, we discuss delegation. 

Azure DNS allows hosting a DNS zone and managing the DNS records for a domain in Azure. The domain must be delegated to Azure DNS from the parent domain so that the DNS queries for that domain can reach Azure DNS. Since Azure DNS isn't the domain registrar, delegation must be configured properly. A domain registrar is a company that can provide internet domain names. An internet domain is purchased for legal ownership. This domain registrar must delegate the domain to Azure DNS.

The domain name system is a hierarchy of domains which starts from the root domain that starts with a ‘.’ followed by the top-level domains including ‘com’, ‘net’, ‘org’, etc. The second level domains are ‘org.uk’, ‘co.jp’ and so on. The domains in the DNS hierarchy are hosted using separate DNS zones. A DNS zone is used to host the DNS records for a particular domain.

There are two types of DNS servers: 1) an authoritative DNS server, which hosts DNS zones and answers DNS queries for records in those zones only, and 2) a recursive DNS server, which doesn't host DNS zones but queries the authoritative servers for answers. Azure DNS is an authoritative DNS service.

DNS clients in PCs or mobile devices call a recursive DNS server for the DNS queries their applications need. When a recursive DNS server receives a query for a DNS record, it finds the nameserver for the named domain by starting at the root nameservers and then walking down the hierarchy by following the delegations. DNS maintains a special type of record called an NS record, which lets the parent zone point to the nameservers for a child zone. Setting up the NS records for the child zone in the parent zone is called delegating the domain. Each delegation has two copies of the NS records: one in the parent zone pointing to the child, and another in the child zone itself. The records in the child zone are called authoritative NS records, and they sit at the apex of the child zone.
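
As a sketch, assuming the third-party dnspython package, the authoritative NS records at the apex of a zone hosted in Azure DNS can be inspected like this; the zone name is a placeholder:

```python
import dns.resolver  # third-party "dnspython" package

ZONE = "contoso.com"  # placeholder zone delegated to Azure DNS

# Query the NS record set at the zone apex; for an Azure DNS zone these
# should be the Azure DNS name servers assigned to the zone.
answer = dns.resolver.resolve(ZONE, "NS")
for record in answer:
    print(record.target)

# Delegation is correct when the NS records in the parent (registrar) match
# the authoritative NS records returned above.
```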

The DNS records help with name resolution of services and resources. Azure DNS can manage DNS records for external services as well, and it supports private DNS domains, which allow us to use custom domain names with private virtual networks.

It supports record sets where we can use an alias record that is set to refer to an Azure resource. If the IP address of the underlying resource changes, the alias record set updates itself during DNS resolution.  

The DNS protocol prevents the assignment of a CNAME record at the zone apex. This restriction presents a problem when there are load-balanced applications behind a Traffic Manager whose profile requires the creation of a CNAME record. This can be mitigated with alias records, which can be created at the zone apex.

Azure DNS alias records are qualifications on a DNS record set. They can reference other Azure resources from within the DNS zone. For example, an alias record set can point to a public IP address resource instead of containing the address in an A record. This pointing is dynamic: when the IP address changes, the record set updates dynamically during name resolution. An alias record set can exist for the A, AAAA, and CNAME record types. A record set, also known as a resource record set, is the collection of DNS records in the zone that have the same name and are of the same type. An AAAA record is for an IPv6 address. The SOA and CNAME record types are exceptions: the DNS standard does not permit multiple records with the same name for these types, so these record sets can contain only a single record.
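
A hedged sketch of creating an A record set with the azure-mgmt-dns Python SDK; the resource names are placeholders, and the alias-record attribute name at the end is an assumption to verify against the installed SDK version:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.dns import DnsManagementClient

SUBSCRIPTION_ID = "<subscription-id>"    # placeholder
RESOURCE_GROUP = "<resource-group>"      # placeholder
ZONE_NAME = "contoso.com"                # placeholder

client = DnsManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Create (or update) an A record set named "www" with a 1-hour TTL.
client.record_sets.create_or_update(
    RESOURCE_GROUP,
    ZONE_NAME,
    "www",
    "A",
    {"ttl": 3600, "arecords": [{"ipv4_address": "203.0.113.10"}]},
)

# For an alias record set, the record data is replaced by a reference to an
# Azure resource such as a public IP address (attribute name assumed):
# {"ttl": 3600, "target_resource": {"id": public_ip_resource_id}}
```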

Azure DNS supports wildcard records. These get returned in response to any query with a matching name.

CAA records allow domain owners to specify which certificate authorities are authorized to issue certificates for their domain, which helps avoid the issuance of incorrect certificates in some cases. CNAME record sets can't coexist with other record sets of the same name. Also, CNAME record sets can't be created at the zone apex (name = '@'), which will always contain the NS and SOA record sets created with the zone. The NS record set at the zone apex is created and deleted automatically with the zone and contains the names of the Azure DNS name servers assigned to it; it can be modified to support co-hosting domains with more than one DNS provider.

Some of the validations that can be performed on these records include:

1. Parent Zone with conflicting child records fails

2. Delegation with no conflicts pass

3. Delegation with already configured zone passes

4. Delegation with different configured zone fails

5. Delegations with trailing dot in record set pass

6. Zone with intermediate delegation fails

7. Zone with root wild card fails

8. Zone with root Txt wild card fails

9. Zone with intermediate wild card fails

10. A record can be created in the zone

11. A record can be created in an empty zone

12. A record can be created with the same data

13. A record can be created with compatible existing records

14. An A record conflicting with another A record fails

15. A record with conflicting CNAME record fails

16. A record with conflicting wild card record fails

17. A CNAME record can be created

18. A CNAME record can be created in an empty zone

19. A CNAME record with same data succeeds

20. A CNAME record with conflicting record fails

21. A CNAME record with conflicting wild card fails

Thursday, January 6, 2022

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing that included the most recent discussion on Azure DNS which is a full-fledged general availability service that provides similar Service Level Agreements as expected from others in the category. In this article, we discuss delegation.

Azure DNS allows hosting a DNS zone and managing the DNS records for a domain in Azure. The domain must be delegated to Azure DNS from the parent domain so that the DNS queries for that domain can reach Azure DNS. Since Azure DNS isn't the domain registrar, delegation must be configured properly. A domain registrar is a company that can provide internet domain names. An internet domain is purchased for legal ownership. This domain registrar must delegate the domain to Azure DNS.

The domain name system is a hierarchy of domains which starts from the root domain that starts with a ‘.’ followed by the top-level domains including ‘com’, ‘net’, ‘org’, etc. The second level domains are ‘org.uk’, ‘co.jp’ and so on. The domains in the DNS hierarchy are hosted using separate DNS zones. A DNS zone is used to host the DNS records for a particular domain.

There are two types of DNS servers: 1) an authoritative DNS server, which hosts DNS zones and answers DNS queries for records in those zones only, and 2) a recursive DNS server, which doesn't host DNS zones but queries the authoritative servers for answers. Azure DNS is an authoritative DNS service.

DNS clients in PCs or mobile devices call a recursive DNS server for the DNS queries their applications need. When a recursive DNS server receives a query for a DNS record, it finds the nameserver for the named domain by starting at the root nameservers and then walking down the hierarchy by following the delegations. DNS maintains a special type of record called an NS record, which lets the parent zone point to the nameservers for a child zone. Setting up the NS records for the child zone in the parent zone is called delegating the domain. Each delegation has two copies of the NS records: one in the parent zone pointing to the child, and another in the child zone itself. The records in the child zone are called authoritative NS records, and they sit at the apex of the child zone.

The DNS records help with name resolution of services and resources. Azure DNS can manage DNS records for external services as well, and it supports private DNS domains, which allow us to use custom domain names with private virtual networks.

It supports record sets where we can use an alias record that is set to refer to an Azure resource. If the IP address of the underlying resource changes, the alias record set updates itself during DNS resolution. 

The DNS protocol prevents the assignment of a CNAME record at the zone apex. This restriction presents a problem when there are load-balanced applications behind a Traffic Manager whose profile requires the creation of a CNAME record. This can be mitigated with alias records, which can be created at the zone apex.


Wednesday, January 5, 2022

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing that included the most recent discussion on Azure Data Lake which is a full-fledged general availability service that provides similar Service Level Agreements as expected from others in the category. This article focuses on Azure DNS.  This is a hosting service for DNS domains that provides name resolution by using Microsoft Azure infrastructure. This lets us manage DNS records, but it is not something to buy a domain name with. For that, App Service domain works well. When the domains are available, they can be hosted in the Azure DNS for records management.

While Azure DNS has many features including activity logs, resource locking and Azure RBAC, the DNSSEC is not supported because HTTP/TLS is available instead. If it is required, then DNS zones can be hosted with third party DNS hosting providers.

The DNS records help with name resolution of services and resources. Azure DNS can manage DNS records for external services as well, and it supports private DNS domains, which allow us to use custom domain names with private virtual networks.

It supports record sets where we can use an alias record that is set to refer to an Azure resource. If the IP address of the underlying resource changes, the alias record set updates itself during DNS resolution. 

Each resource must be given a name. With records in the domain name server, the resource becomes reachable and resolvable from a virtual network. The following options are available to configure the DNS settings for a resource: 1) using only the host file that resolves the name locally, 2) using a private DNS zone to override the DNS resolution for a private endpoint, with the zone linked to the virtual network, and 3) using a DNS forwarder with a rule to use the DNS zone in another DNS server. It is preferable not to override a zone that resolves public endpoints, because if the connectivity to the public DNS goes down, those public endpoints will remain unreachable. That is why a different domain name with a suffix such as "core.windows.net" is recommended. Multiple zones with the same name for different virtual networks would need manual operations to merge the DNS records.
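
A hedged sketch of option 2, assuming the azure-mgmt-privatedns and azure-identity packages; the resource names are placeholders and the privatelink zone name is only an example of the recommended naming:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.privatedns import PrivateDnsManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<resource-group>"     # placeholder
VNET_ID = ("/subscriptions/<subscription-id>/resourceGroups/<rg>"
           "/providers/Microsoft.Network/virtualNetworks/<vnet-name>")

client = PrivateDnsManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Create the private DNS zone that overrides resolution for the private endpoint.
client.private_zones.begin_create_or_update(
    RESOURCE_GROUP, "privatelink.blob.core.windows.net", {"location": "global"}
).result()

# Link the zone to the virtual network so resources in it use the override.
client.virtual_network_links.begin_create_or_update(
    RESOURCE_GROUP,
    "privatelink.blob.core.windows.net",
    "storage-vnet-link",
    {
        "location": "global",
        "virtual_network": {"id": VNET_ID},
        "registration_enabled": False,  # resolution only, no auto-registration
    },
).result()
```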

A common problem with traditional DNS records is dangling records. The DNS records that haven’t been updated to reflect changes to IP addresses are called dangling records. With a traditional DNS zone record, the target IP or CNAME no longer exists. It requires manual updates which can be costly. A delay in updating DNS records can potentially cause an extended outage for the users. Alias records avoid this situation by tightly coupling the lifecycle of a DNS record with an Azure resource.

The DNS protocol prevents the assignment of a CNAME record at the zone apex. This restriction presents a problem when there are load-balanced applications behind a Traffic Manager whose profile requires the creation of a CNAME record. This can be mitigated with alias records, which can be created at the zone apex.

 

Tuesday, January 4, 2022

 

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing that included the most recent discussion on Azure Data Lake, which is a full-fledged general availability service that provides similar Service Level Agreements as expected from others in the category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.

This article talks about data ingestion from one location to another in Azure Data Lake Gen 2 using Azure Synapse Analytics. The Gen 2 location is a source data store and requires a corresponding storage account. Azure Synapse Analytics provides many features for data analysis and integration, but its pipelines are even more helpful for working with data.

In Azure Synapse Analytics, we create a linked service, which is a definition of connection information for another service. When we add Azure Synapse Analytics and Azure Data Lake Gen 2 as linked services, we enable the data to flow continuously over the connection without requiring additional routines. The Azure Synapse Analytics UX has a Manage tab where the option to create a linked service is provided under External Connections. The Azure Data Lake Storage Gen 2 connection requires one of the supported authentication types, such as an account key, a service principal, or a managed identity. The connection can be tested prior to use.

The pipeline definition in Azure Synapse describes the logical flow for an execution of a set of activities. We require a copy activity in the pipeline to ingest data from Azure Data Lake Gen 2 into a dedicated SQL pool. A pipeline option is available under the Orchestrate tab, which must be selected to associate activities with the pipeline. The Move and Transform option in the activities pane has a copy-data option that can be dragged onto the pipeline canvas. The copy activity must be configured with Azure Data Lake Storage Gen 2 as the new source data store. Delimited text must be specified as the format, along with the file path of the source data and whether the first row has a header.
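
As an illustrative sketch (not the exact JSON the Synapse UX generates), a copy activity of this shape could be expressed as a Python dictionary; the pipeline and dataset names are hypothetical, and property names should be checked against the pipeline the workspace actually publishes:

```python
# Illustrative only; the Synapse UX generates the real pipeline JSON.
copy_pipeline = {
    "name": "IngestAdlsToSqlPool",
    "properties": {
        "activities": [{
            "name": "CopyFromAdlsGen2",
            "type": "Copy",
            "inputs": [{"referenceName": "AdlsGen2DelimitedText", "type": "DatasetReference"}],
            "outputs": [{"referenceName": "DedicatedSqlPoolTable", "type": "DatasetReference"}],
            "typeProperties": {
                "source": {"type": "DelimitedTextSource"},  # delimited text; header handling set on the dataset
                "sink": {"type": "SqlDWSink"},               # dedicated SQL pool sink
            },
        }]
    },
}
```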

With the pipeline configured this way, a debug run can be executed before the artifacts are published, which verifies that everything is correct. Once the pipeline has run successfully, the publish-all option can be selected to publish the entities to the Synapse Analytics service. When the successfully-published message appears, we can move on to triggering and monitoring the pipeline.

A trigger can be manually invoked with the Trigger Now option. When this is done, the monitor tab will display the pipeline run along with links under the Actions column. The details of the copy operation can then be viewed. The data written to the dedicated SQL pool can then be verified to be correct.