Cluster computing

Sunday, June 26, 2022

Exception StackTrace associations for root cause analysis   

Problem statement: Given a method to collect root causes from the many error data points in logs, can associations between those root causes be determined?

Solution: There are two stages to solving this problem:  

Stage 1 – discover root cause and create a summary to capture it  

Stage 2 – use an association data mining algorithm on root causes.

Stage 1:  

The first stage involves a data pipeline that converts log entries to exception stack traces and hashes them into buckets. A sample is included. When the exception stack traces are collected from a batch of log entries, we can transform them into a vector representation, using the notable stack frames as features. Then we can generate a hidden weighted matrix for the neural network.

We use that hidden layer to determine the salience using the gradient descent method.       

All values lie within the [0, 1] co-occurrence probability range.

The quadratic form representing the embeddings reaches its minimum where Ax = b, and that solution is found using the conjugate gradient method.

We are given the input matrix A, the vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1.

The method proceeds as follows:

set i to 0

set residual to b - Ax

set search-direction to residual

set delta-new to dot-product(residual-transposed, residual)

set delta-0 to delta-new

while i < i-max and delta-new > epsilon^2 * delta-0 do:

    q = dot-product(A, search-direction)

    alpha = delta-new / dot-product(search-direction-transposed, q)

    x = x + alpha * search-direction

    if i is divisible by 50

        residual = b - Ax

    else

        residual = residual - alpha * q

    delta-old = delta-new

    delta-new = dot-product(residual-transposed, residual)

    beta = delta-new / delta-old

    search-direction = residual + beta * search-direction

    i = i + 1
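
The steps above can be sketched in Python as follows. This is a minimal illustration using plain lists, not the author's implementation; the helper names dot and matvec and the default values for i_max and epsilon are assumptions, and A is assumed to be symmetric and positive-definite, which the conjugate gradient method requires.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Product of a matrix (list of rows) and a vector."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, x0, i_max=1000, epsilon=1e-8):
    """Solve Ax = b, assuming A is symmetric and positive-definite."""
    x = list(x0)
    residual = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    direction = list(residual)
    delta_new = dot(residual, residual)
    delta_0 = delta_new
    i = 0
    while i < i_max and delta_new > epsilon ** 2 * delta_0:
        q = matvec(A, direction)
        alpha = delta_new / dot(direction, q)
        x = [xi + alpha * di for xi, di in zip(x, direction)]
        if i % 50 == 0:
            # Recompute the exact residual periodically to limit round-off drift.
            residual = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        else:
            residual = [ri - alpha * qi for ri, qi in zip(residual, q)]
        delta_old = delta_new
        delta_new = dot(residual, residual)
        beta = delta_new / delta_old
        direction = [ri + beta * di for ri, di in zip(residual, direction)]
        i += 1
    return x


# Small illustrative system (values chosen only for the example).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = conjugate_gradient(A, b, [0.0, 0.0])
```

For an n-by-n system the loop needs at most about n iterations in exact arithmetic, so i_max mainly guards against slow convergence on ill-conditioned inputs.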

Root cause capture – Exception stack traces that are captured from various sources and appear in the logs can be stack hashed. The root cause can be described by a specific stack trace, its associated point in time, the duration over which it appears, and the time a fix was introduced, if known.
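
Stack hashing can be sketched as below. This is only an illustration under stated assumptions: the traces are assumed to use Java-style frames of the form "at package.Class.method(File.java:42)", only the top few frames are treated as notable, and the names stack_hash, bucket_by_hash and top_frames are hypothetical.

```python
import hashlib
from collections import defaultdict


def stack_hash(stacktrace, top_frames=5):
    """Hash the top frames of a stack trace, ignoring file and line details."""
    frames = []
    for line in stacktrace.splitlines():
        line = line.strip()
        if line.startswith("at "):
            # Keep only the method name so that shifted line numbers
            # do not change the hash.
            frames.append(line[3:].split("(")[0])
    key = "\n".join(frames[:top_frames])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def bucket_by_hash(stacktraces):
    """Group stack traces whose notable frames are identical."""
    buckets = defaultdict(list)
    for trace in stacktraces:
        buckets[stack_hash(trace)].append(trace)
    return buckets
```

Each bucket key can then stand in for one root cause, with the timestamps of its members giving the first occurrence and the duration.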

Stage 2:

Association data mining determines whether two root causes occur together. The computation involves two computed columns, namely Support and Probability. Support defines the percentage of cases in which a rule must exist before it is considered valid. We require that a rule be found in at least 1 percent of cases.

Probability defines how likely an association must be before it is considered valid. We will consider any association with a probability of at least 10 percent.

Bayesian conditional probability and confidence can also be used. Association rules are formed from a pair of item-sets, the antecedent and the consequent, so named because we want to find the value of taking one item with another. Let I be a set of items and T be a set of transactions. An association A is defined as a subset of I whose items occur together in T. Support(S1) is the fraction of transactions in T that contain S1. Let S1 and S2 be subsets of I; then the association rule S1->S2 has Support(S1->S2) = Support(S1 union S2) and Confidence(S1->S2) = Support(S1 union S2) / Support(S1). A third metric, Lift, is defined as Confidence(S1->S2) / Support(S2). It is preferred because a popular S2 gives high confidence for any S1; lift corrects for that by exceeding 1.0 only when S1 and S2 occur together more often than they would if they were independent.
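
The three metrics can be computed directly from their definitions. The sketch below is illustrative: the function names, the sample root-cause labels, and the reading of "Probability" as the rule's confidence are assumptions, while the 1 percent and 10 percent thresholds come from the text above.

```python
MIN_SUPPORT = 0.01      # rule must appear in at least 1 percent of cases
MIN_PROBABILITY = 0.10  # assumed here to be a threshold on confidence


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(s1, s2, transactions):
    """Return (support, confidence, lift) for the rule s1 -> s2."""
    s_union = support(set(s1) | set(s2), transactions)
    s_1 = support(s1, transactions)
    s_2 = support(s2, transactions)
    if s_1 == 0 or s_2 == 0:
        # The rule never applies, so confidence and lift are reported as 0.
        return s_union, 0.0, 0.0
    confidence = s_union / s_1
    lift = confidence / s_2
    return s_union, confidence, lift


def is_valid_rule(s1, s2, transactions):
    """Apply the minimum support and probability thresholds."""
    sup, conf, _ = rule_metrics(s1, s2, transactions)
    return sup >= MIN_SUPPORT and conf >= MIN_PROBABILITY


# Each transaction is the set of root causes seen in one time window
# (hypothetical labels for illustration).
transactions = [
    frozenset({"NullRef", "Timeout"}),
    frozenset({"NullRef", "Timeout"}),
    frozenset({"NullRef"}),
    frozenset({"DiskFull"}),
]
metrics = rule_metrics({"NullRef"}, {"Timeout"}, transactions)
```

In this sample the rule NullRef->Timeout has a lift above 1.0, meaning the two root causes appear together more often than independence would predict.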

Certain databases allow the creation of association models that can be persisted and evaluated against each incoming request. Usually, a training/testing data split of 70/30% is used in this regard.
Sample: https://jsfiddle.net/g2snw4da/

Saturday, June 25, 2022

This is a continuation of a series of articles on hosting solutions and services on the Azure public cloud, with the most recent discussion on multitenancy. This article discusses the resource organization for multi-tenant resources.

Resource organization helps a multi-tenant solution with tenant isolation and scale. There are specific tradeoffs to consider with multi-tenant isolation and scale-out across multiple resources. Azure’s resource limits and quotas, and ways to scale the solution beyond these limits, are also discussed.

When a multi-tenant solution is deployed, a decision must be made whether the resources should be dedicated or shared. There are many categories of resources, each with its own options and trade-offs, and there is a range of options for tenant isolation. The considerations for the tenancy model of a multitenant solution provide more guidance for deciding on the isolation policy. Multitenancy approaches and service-specific guidelines both apply to the isolation policy.

The ability to scale must be planned for. There are limits and quotas to overcome, and they vary with resource types, SKUs, and subscriptions. Both scaling out and bin packing must be considered. Scaling, unlike tenant isolation, depends on growth. If the number of tenants is not going to increase rapidly, there is no need to over-engineer the scale-out strategy. But if rapid growth is planned, then a scale-out strategy should be thought through in advance.

When there is an automated deployment process and a need to scale across resources, the way to deploy resources and assign tenants to them must be decided. We must detect when the number of tenants assigned to a specific resource approaches its limit. When we plan to deploy new resources, we must decide whether they will be made ready just in time or ahead of time.
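
A first-fit assignment with threshold detection might look like the sketch below. It is an illustration only: the per-resource capacity, the warn_ratio of 80 percent, and the generated resource names are hypothetical, and a real deployment would read actual limits from the service in question.

```python
def assign_tenant(tenant_id, resources, capacity, warn_ratio=0.8):
    """Place a tenant on the first resource with room (first-fit bin packing).

    resources maps a resource name to the list of tenants assigned to it.
    Returns (resource_name, nearing_limit), where nearing_limit signals that
    overall usage has reached warn_ratio of the deployed capacity, so new
    resources should be provisioned ahead of time.
    """
    for name, tenants in resources.items():
        if len(tenants) < capacity:
            tenants.append(tenant_id)
            break
    else:
        # Every resource is full: deploy a new one just in time.
        name = f"resource-{len(resources) + 1}"
        resources[name] = [tenant_id]
    used = sum(len(tenants) for tenants in resources.values())
    total = capacity * len(resources)
    return name, used >= warn_ratio * total
```

Acting on the nearing_limit flag corresponds to the "ready ahead of time" choice, while relying on the fallback branch corresponds to "just in time".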

When assumptions are made in code and configuration, they can limit the ability to scale. There might be a need to scale out to multiple storage accounts, but the application tier might be assuming a single storage account for all tenants.

Azure resources are deployed and managed through a hierarchy. Most resources are deployed into resource groups which are contained in subscriptions. This hierarchy pertains to a tenant. When we deploy the resources, we have the option to isolate them at different levels. Different models can be used in different components of the same solution.

Resources that are shared across tenants can still provide isolation when a single instance serves the workloads of all the tenants. When we run a single instance of a resource, the service limits, subscription limits, and quotas apply. When these limits are encountered, the shared resources must be scaled out.

In all these cases, the application code must be fully aware of multitenancy and must restrict access to the data of a specific tenant.

Resources can also be dedicated to a single tenant, where a single copy of the application is provided to the tenant. A clear naming convention strategy, resource tags, or a tenant catalog database is needed to identify which resources belong to which tenant.

Friday, June 24, 2022

The previous article talked about the subdomains and custom domains for multitenants such that the requests can be routed to the respective tenant. This article talks about all other aspects of routing.

Mapping the request to a tenant is a necessity when the multitenant solution is hosted in different geographical regions. The physical infrastructure that hosts the tenant’s resources must receive the request.

Domain names identify the tenants. A request can be mapped to a tenant using the Host header, or another HTTP header that includes the original hostname for each request, but the following considerations need to be made. Will the users know which domain name to use to access the solution? Is the landing page or login page common to all tenants? What is required to verify access to a tenant? Is it just authorization tokens, or tenant-specific domain names as well?

The HTTP request properties include the URL path structure, the query string, and custom headers. The tenant information can be specified in any of these.
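
As a sketch of how tenant information might be read from these request properties, consider the function below. The header name X-Tenant-Id, the query parameter tenant, and the /tenants/<name>/ path layout are all hypothetical conventions chosen for illustration, and header lookup is assumed to use the exact key shown.

```python
from urllib.parse import parse_qs, urlsplit


def tenant_from_request(url, headers):
    """Look for a tenant identifier in a custom header, the query string,
    and the URL path, in that order. Returns None if none is present."""
    if "X-Tenant-Id" in headers:
        return headers["X-Tenant-Id"]
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if "tenant" in query:
        return query["tenant"][0]
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) >= 2 and segments[0] == "tenants":
        return segments[1]
    return None
```

Whichever property is used, the value only identifies the requested tenant; access to that tenant still has to be verified separately.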

The resolution of a tenant differs between subdomains and custom domain names. In a multitenant application, tenants might want to bring their own domain names. This might be important for branding or other business purposes. There might also be a technical reason: they might need to supply their own TLS certificates, which bear subject names. These custom domain names require more consideration than subdomain names.

Name resolution is one of the considerations. The name resolution to an IP address depends on whether there is a single instance or many instances of the multitenant application. For example, a CNAME record for the custom domain of a tenant might point to a multi-part subdomain of the multitenant solution provider. Since this provider might want to set up proper routing to multiple instances, they might have a CNAME record that maps that subdomain to the subdomain of an individual instance. They will also have an A record for that specific instance that points to its IP address. This chain of records resolves the requests for the custom domain to the IP address of one instance among the multiple instances deployed by the provider.

Host header resolution is also significant. All the web components need to know how to handle the requests that arrive with the provider’s domain name in their Host request header. Each tenant’s domain name might be a subdomain or a custom domain, and this adds operational overhead to the onboarding of tenants. Host headers can also be rewritten, for example by Azure Front Door, so that the web server receives a single Host header. Azure Front Door also propagates the original value of the Host header in an X-Forwarded-Host header so the multitenant application can properly resolve the tenant.
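
A minimal sketch of resolving the tenant from these headers follows. The provider domain, the custom-domain table, and the function name are hypothetical, and the code assumes X-Forwarded-Host is only ever set by a trusted reverse proxy; it does not handle IPv6 literal hosts.

```python
PROVIDER_DOMAIN = "app.example.com"               # hypothetical provider domain
CUSTOM_DOMAINS = {"shop.contoso.com": "contoso"}  # hypothetical onboarded domains


def resolve_tenant(headers):
    """Resolve the tenant from X-Forwarded-Host, falling back to Host."""
    host = headers.get("X-Forwarded-Host") or headers.get("Host", "")
    # Use the first forwarded value, drop any port, and normalize case.
    host = host.split(",")[0].strip().split(":")[0].lower()
    if host in CUSTOM_DOMAINS:
        return CUSTOM_DOMAINS[host]
    suffix = "." + PROVIDER_DOMAIN
    if host.endswith(suffix):
        # Subdomain form: <tenant>.<provider domain>
        return host[: -len(suffix)] or None
    return None
```

Keeping the custom-domain table separate from the subdomain rule reflects the extra onboarding step that custom domains need.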

Validation of custom domains is necessary when onboarding tenants. Without validation, tenants might accidentally or maliciously park a domain name; typos in custom domain names are common. Parking causes an error for others who want to use their own custom domain, with a message that the domain name is already in use. Onboarding domain names, especially within a self-service or automated process, requires a domain verification step. A CNAME record or a DNS TXT record might be added to reserve the domain name until the verification is completed.
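
One way to implement a TXT-based verification step is sketched below. The record prefix tenant-verification=, the HMAC-based token, and the function names are assumptions for illustration; the TXT values are passed in as a list, so the DNS lookup itself is left out.

```python
import hashlib
import hmac


def verification_token(tenant_id, domain, secret):
    """Derive a per-tenant, per-domain token the tenant must publish in DNS."""
    message = f"{tenant_id}:{domain.lower()}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_domain_verified(tenant_id, domain, secret, txt_records):
    """Check whether the expected token appears among the domain's TXT values."""
    expected = "tenant-verification=" + verification_token(tenant_id, domain, secret)
    return expected in txt_records
```

Because the token is bound to both the tenant and the domain, a second tenant cannot claim a domain by copying another tenant's record.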

Dangling DNS and subdomain takeover attacks are more likely to hit custom domains. This attack happens when customers disassociate their custom domain name from the service but do not delete the record from their DNS server. In this case, the DNS entry points to a non-existent resource and is vulnerable to a takeover. This can be avoided by requiring that the CNAME record for the tenant be deleted from the DNS server before the domain name can be removed from the tenant’s account.

Thursday, June 23, 2022

In a multitenant application, tenants might want to bring their own domain names. This might be important for branding or other business purposes. There might also be a technical reason: they might need to supply their own TLS certificates, which bear subject names. These custom domain names require more consideration than subdomain names.

Wednesday, June 22, 2022

Border Gateway Protocol:

This is a continuation of a series of articles on hosting solutions and services on the Azure public cloud, with the most recent discussion on multitenancy. This article discusses networking considerations in multitenant applications.

This protocol can be configured on a Windows Server with Routing and Remote Access Service Gateway in multitenant mode. It gives the ability to manage the tenants’ VM networks and their remote sites.

BGP is a dynamic routing protocol. It learns the routes between sites that are connected using site-to-site VPN connections, which eliminates the need for manual route configuration on routers. When configured as a multi-tenant BGP router to exchange tenant and Cloud Service Provider (CSP) subnet routes, the RAS gateway is deployed on a VM or a set of VMs for high availability. The single-tenant edge gateway deployment can be on a physical computer in a LAN deployment.

The PowerShell script to configure the multitenant mode looks like this:

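# Routing domain names for the two tenants
$foo_RoutingDomain = "FooTenant"
$bar_RoutingDomain = "BarTenant"

# Install Remote Access in multitenant mode
Install-RemoteAccess -MultiTenancy

# Enable the routing domain of each tenant
Enable-RemoteAccessRoutingDomain -Name $foo_RoutingDomain -Type All -PassThru
Enable-RemoteAccessRoutingDomain -Name $bar_RoutingDomain -Type All -PassThru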

There can be several modes of deployment between Enterprise sites and a Cloud Service Provider (CSP) datacenter. These involve dynamic routing information exchange between an Enterprise and the multiple gateways of the CSP. A few modes of deployment are enumerated below:

RAS VPN site-to-site gateway with BGP at the Enterprise site edge

Third-party gateway with BGP at the Enterprise site edge

Multiple Enterprise sites with third-party gateways

Separate termination points for BGP and VPN

The last mode of deployment supports internal BGP (iBGP) and external BGP (eBGP) segregation. The iBGP is used only with the separation of termination points for BGP and VPN. BGP is used for peering and maintains a routing table separate from those for internal networks. The route metrics are based on the shortest AS paths rather than distance or cost between hops. Unlike OSPF and other interior gateway protocols, which provide fault tolerance and redundancy within an Autonomous System, BGP handles multiple connections to external Autonomous Systems while allowing the existing router to handle the additional demands. It is a path-vector routing protocol.

BGP works by establishing neighbor relationships between routers called speakers; the neighbors are called peers. If the peers are all within the same AS, it is called internal BGP. If the peers are in separate autonomous systems, it is called external BGP. Initially, peers share full routing tables. Afterward, they share only the updates.
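
The path-vector idea can be illustrated with a heavily simplified route selection sketch. It is not how any particular BGP router is implemented: real route selection applies many more tie-breakers, and the dictionary layout, the default local preference of 100, and the function names are assumptions made for the example.

```python
def accept_route(route, local_as):
    """Path-vector loop prevention: reject a route whose AS path
    already contains our own AS number."""
    return local_as not in route["as_path"]


def best_route(routes, local_as):
    """Pick a route by highest local preference, then shortest AS path,
    then lowest MED. Returns None if no route is acceptable."""
    candidates = [r for r in routes if accept_route(r, local_as)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-r.get("local_pref", 100), len(r["as_path"]), r.get("med", 0)),
    )
```

With equal local preference, the route that crosses the fewest autonomous systems wins, which matches the shortest-AS-path metric described above.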

The features of the BGP Router using Windows Server include:

Independent deployment of just the BGP routing role service, without the Remote Access Service, which leads to improved router performance.

Collection of statistics using message counters and route counters. The Get-BGPStatistics cmdlet provides this information.

Equal-cost multipath routing support for redundant networks.

Hold time configuration – The BGP router supports configuration of the hold timer value according to the network requirements.

Internal BGP and external BGP segregation – The local and remote BGP routers are distinct, supporting iBGP and eBGP peering. The iBGP is used only with the fourth mode of deployment listed, which is the separation of termination points for BGP and VPN.

Latest RFC compliance – The implementation complies with RFC 4271 (BGP-4), which implies the product is interoperable with third-party vendors.

IPv4 and IPv6 peering – Both IPv4 and IPv6 peering are supported, while the BGP router itself is assigned an IPv4 address.

IPv4 and IPv6 advertisement capability, or Multiprotocol Network Layer Reachability Information (NLRI), is supported.

Mixed mode and passive mode peering are supported. The former refers to the BGP router serving as both the initiator and the responder. The latter mode is responsive only, so it helps with debugging and troubleshooting.

Route attribute rewrite capability is provided. The BGP routing policies Next-Hop, MED, Local-Pref and Community are supported.

Route filtering – The BGP router supports filtering ingress or egress route advertisements.

Tuesday, June 21, 2022

Identity Management in Multitenant applications:  

Tenants and their users are recognized by their identity in a multi-tenant application. Every user belongs to a tenant. A user signs in with her organizational credentials. She may have access to the data from her organization but not to the data from other tenants. She can register with the multitenant application/service and then after her account is created, she can assign roles to other members.  

Identity and access management provides built-in features to support all these scenarios, which simplifies the logic that the multi-tenant application must execute to log users in. Let us say there are two users, alice@contoso.com and bob@fabrikam.com, who want to log in to a multitenant SaaS application. Since they belong to different tenants, the application must map each user to the right tenant. Alice cannot have access to Fabrikam data in this case.

Azure AD can handle sign-in and authentication of different users, while the multitenant application is a single physical instance that recognizes and isolates the tenants to whom the users belong.

During authentication, such as when a user accesses a resource, the application must determine the user’s tenant. If the tenant is already onboarded to the application, then such a user does not need to create a profile. Users within an organization are part of the same tenant. Any indicator of tenancy that comes from the user, such as the domain name of the email address used to log in, cannot be trusted. The identity store must be used to resolve the tenant.

During authorization, the tenant determination for the identity involved must be repeated. Users must be assigned roles to access a resource. Role-based access control relies on role assignments made by the tenant administrator, or by the user if the resource is owned by the user. The role assignments are not made by the tenant provider.

The sign-up process is critical to multitenant applications because it allows a customer to sign up their organization for the application. It allows an AD administrator to consent, on behalf of the customer’s entire organization, to the use of the application. It collects credit card payment or other customer information. It performs any one-time per-tenant setup needed by the application.

Usually, a multitenant application has an AccountController class where the sign-in action returns a ChallengeResult. This enables the OpenID Connect middleware to redirect to the authentication endpoint, which is the default way to trigger authentication in ASP.NET Core. The sign-up action, which is different from the sign-in action, also returns a ChallengeResult, but it adds state information to the AuthenticationProperties in the ChallengeResult. This is relayed to the OpenID Connect state parameter, which round-trips during the authentication flow with Azure AD. When the user authenticates with Azure AD, the user is redirected back to the application, and the authentication ticket contains the state. The admin consent flow is triggered by adding a prompt to the query string in the authentication request. This prompt is needed only during sign-up, and the regular sign-in should not include it.

Since many users can map to the same tenant, the application database could include a tenants table with id and issuerValue attributes and a user table with Id, tenantId, ObjectId, DisplayName, and Email properties for taking the following actions. If the tenant’s issuer value is not in the database, the tenant has not signed up. If the user is signing up, the tenant is added to the database and the authenticated user is added to the corresponding table. Otherwise, the normal sign-in process is completed.
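
The decision logic around those two tables can be sketched as follows. This is an illustration rather than the application's actual code: the in-memory dictionaries stand in for the database tables, the function name is hypothetical, and adding a first-time user of an already onboarded tenant during a regular sign-in is an assumption.

```python
def complete_authentication(db, issuer, object_id, display_name, email, signing_up):
    """Resolve the tenant from the token issuer and record the user.

    db holds two tables: db["tenants"] keyed by issuer value and
    db["users"] keyed by (tenant id, object id).
    """
    tenant = db["tenants"].get(issuer)
    if tenant is None:
        if not signing_up:
            # The issuer is unknown, so the tenant has not signed up.
            raise PermissionError("tenant has not signed up")
        tenant = {"id": len(db["tenants"]) + 1, "issuerValue": issuer}
        db["tenants"][issuer] = tenant
    key = (tenant["id"], object_id)
    user = db["users"].get(key)
    if user is None:
        user = {
            "Id": len(db["users"]) + 1,
            "tenantId": tenant["id"],
            "ObjectId": object_id,
            "DisplayName": display_name,
            "Email": email,
        }
        db["users"][key] = user
    return user
```

The tenant is looked up by the issuer value from the identity provider, never by the domain of the email address the user typed.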

Application roles are used to assign permissions to users. The application defines the following roles: Administrator, Creator, and Reader. These roles imply permissions during authorization. There are three main options for assigning roles: Azure AD App Roles, Azure AD security groups, and an application role manager.
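
A role check of this kind might be sketched as below. The three role names come from the text, while the permission names attached to each role and the function name are hypothetical and would depend on the application.

```python
# Hypothetical permissions implied by each application role.
ROLE_PERMISSIONS = {
    "Administrator": {"read", "create", "update", "delete", "assign_roles"},
    "Creator": {"read", "create"},
    "Reader": {"read"},
}


def is_allowed(user_roles, permission):
    """Return True if any of the user's roles implies the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user_roles)
```

Whichever of the three assignment options is used, the check itself only needs the list of role names carried for the signed-in user.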