Monday, June 13, 2022

 This is a continuation of a series of articles on hosting solutions and services on the Azure public cloud, with the most recent discussion on Service Fabric as per the summary here. This article discusses architectural approaches for AI and ML in a multitenant solution. 

Both conventional and modern applications can be multitenant without sacrificing their focus on core business requirements. The infrastructure and architecture approaches change with multitenancy, but the overall purpose of the technology to meet business requirements does not. A modern multitenant application can offer AI/ML-based capabilities to any number of tenants. These tenants remain isolated from one another and cannot see each other’s data, but the curated model they use can be shared; it may have been developed from a comprehensive training set, and perhaps in a pipeline different from where the model is currently hosted.

When a multitenant application needs to consider the requirements for data and model for AI/ML purposes, it must consider the requirements around both training and deployment.  The compute resources required for training are significantly different from those required for deployment. For example, an AI/ML model written in TensorFlow might be authored with Keras, the high-level API that runs on a TensorFlow backend. Keras can help author the model and deploy it to an environment such as Colab, where the model can be trained on a GPU. Once the training is done, the model can be loaded and run anywhere else, including a browser. The power of TensorFlow is in its ability to load the model and make predictions in the browser itself. As with any ML example, the data is split into a 70% training set and a 30% test set. There is no order to the data, and the split is taken over a random shuffle.   With the model and training/test sets defined, it is now easy to evaluate the model and run the inference.  The model can also be saved and restored. Training executes faster when a GPU is added to the compute. 
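A minimal sketch of that workflow follows; the model, the synthetic data, and the file name are illustrative assumptions rather than the article's actual workload. It only shows authoring a small Keras model, training it on a random 70/30 split, evaluating it, and saving it so it can be restored elsewhere.

```python
# Sketch: author a Keras model, train on a 70/30 split, evaluate, save, restore.
import numpy as np
import tensorflow as tf

# Synthetic data stands in for the real feature set (illustrative only).
X = np.random.rand(1000, 10).astype("float32")
y = (X.sum(axis=1) > 5.0).astype("float32")

# Random shuffle, then a 70/30 train/test split.
idx = np.random.permutation(len(X))
split = int(0.7 * len(X))
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_test, y_test = X[idx[split:]], y[idx[split:]]

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training runs on a GPU automatically when one is available (e.g., in Colab).
history = model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)

loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
model.save("curated_model.h5")                              # save the trained model ...
restored = tf.keras.models.load_model("curated_model.h5")  # ... and restore it elsewhere
```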

When the model is trained, the training can be done in batches of a predefined size. The number of passes over the entire training dataset, called epochs, can also be set up front. These are model tuning parameters. Every model trades off speed against Mean Average Precision: generally, the higher the precision, the lower the speed. It is helpful to visualize the training with a live chart that is updated with the loss after each epoch. Usually there will be a downward trend in the loss, which is described as the model converging. 
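Continuing the sketch above, the per-epoch loss that Keras records in the training history can be plotted to watch for that downward, converging trend; the variable names assume the `history` object from the earlier example.

```python
import matplotlib.pyplot as plt

# history.history["loss"] holds one loss value per epoch from model.fit above.
plt.plot(history.history["loss"], label="training loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.title("Loss per epoch (a downward trend indicates convergence)")
plt.legend()
plt.show()
```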

Training the model might take a long time, say about 4 hours. When the test data has been evaluated, the model’s effectiveness can be measured using precision and recall: precision is the fraction of the model’s positive inferences that were indeed positive, while recall is the fraction of the actual positives that the model identified.
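As a quick illustration of those two metrics, the sketch below computes them directly from predicted and true labels; the arrays are made-up examples, not results from the article.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # actual labels (illustrative)
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])   # the model's positive/negative calls

tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted positive and actually positive
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted positive but actually negative
fn = np.sum((y_pred == 0) & (y_true == 1))  # actual positives the model missed

precision = tp / (tp + fp)   # how many predicted positives were truly positive
recall = tp / (tp + fn)      # how many true positives the model found
print(f"precision={precision:.2f}, recall={recall:.2f}")
```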

In a multitenant application, the tenancy model affects each stage of the AI/ML model lifecycle. The overall solution provides accurate results only when the model runs correctly.

One of the best practices around AI/ML models is to treat them as being just as sensitive as the raw data that trained them. Tenants must understand how their data is used to train the model, and how a model trained on other tenants’ data is used for inference on their workloads.

There are three common approaches for working with AI/ML models: 1. tenant-specific models, 2. shared models, and 3. tuned shared models. These mirror the resource-sharing options for tenants.


Sunday, June 12, 2022

 

Governance, Regulations and Compliance:

Cloud computing has proliferated VMs while security standards have been trying to catch up, moving from the resource-centric IT environments of the past to the more datacenter-oriented environments of today. Fortunately, this has evolved based on how clouds are offered: public, community, private, and hybrid. A public cloud has the highest risk due to lack of security control, multi-tenancy, data management concerns, limited SLAs, and a lack of common regulatory controls. A community cloud has moderate risk due to multi-tenancy; however, it has less risk than a public cloud because legal and regulatory compliance issues are shared. A private cloud has the least risk due to single ownership and strong shared mission goals along with common legal and regulatory requirements. A hybrid cloud has risk that depends upon the combined models: a combination of private/community is lowest risk, while a combination of public/community poses the greatest risk.  The scope also matters. A public cloud serves several hundreds of organizations. A community cloud works with the private networks of two or more organizations. A private cloud is entirely internal to the organization’s private network. A hybrid cloud can have a private/community cloud entirely within the organization’s private network with spillover capacity to a public/community cloud.

The information security governance framework is primarily a Plan, Do, Check, Act cycle of continuous improvement and comprises seven management processes: strategy and planning, policy portfolio management, risk management, management overview, communication and outreach, compliance and performance management, and awareness and training.  The management processes govern the implementation and operation of the functional processes, which vary based on the cloud environment.

Central to the implementation of the functional processes is the scheduled sweep of resources for GRC purposes. These sweeps involve tightening the configurations of virtual machines in all forms and flavors. They cover such things as network connectivity configurations and the System Security Services Daemon (SSSD).  When a user logs into a VM, checks such as whether the password has expired, whether the user is still an active employee, and whether a login can be granted at all are part of the server hardening requirements. Yet the boilerplate configurations on each virtual machine often escape the scrutiny that otherwise falls on the public cloud. When a public cloud is set up to run Active Directory, it is usually done as a managed service. The connectivity from the virtual machines depends on their configurations. The access provider, the id provider, and the change-password provider specified in the sssd configuration determine how the virtual machines enable accounts, as shown in the sample below. A careful scrutiny of this configuration can by itself eliminate several vulnerabilities and maintenance activities. The cost of workflows and implementations increases significantly when the ripples reach downstream systems or a later point in time. Therefore, early and proactive mitigation by compliance and governance processes is immensely beneficial. When done right, the configuration does not even need to change very often.
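As an illustration, a minimal sssd.conf domain section might look like the snippet below. The domain name and server values are placeholders and the exact options depend on the environment, but the id_provider, access_provider, and chpass_provider lines are the settings referred to above.

```ini
# /etc/sssd/sssd.conf -- illustrative domain section only (values are placeholders)
[sssd]
domains = example.com
services = nss, pam

[domain/example.com]
id_provider = ad            # where account identities are resolved from
access_provider = ad        # which accounts are allowed to log in
chpass_provider = ad        # where password changes are sent
ad_server = dc1.example.com
cache_credentials = True
```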

Saturday, June 11, 2022

 Identity Management in Multitenant applications:

Tenants and their users are recognized by their identity in a multi-tenant application. If we hide the complexity of the identity store behind a well-known product such as Azure AD, the best practices for authentication and identity management stand out.

Every user belongs to a tenant. This is the foremost challenge to address. A user signs in with her organizational credentials. She may have access to the data from her organization but not to the data from other tenants. She can register with the multitenant application/service and then after her account is created, she can assign roles to other members.

Identity and access management solutions provide built-in features to support all these scenarios, so they simplify the logic that the multi-tenant application must execute to sign users in. The mapping of these functionalities to Software-as-a-Service is now discussed. 

Let us say there are two users, alice@contoso.com and bob@fabrikam.com, who want to log in to a multitenant SaaS application. Since they belong to different tenants, the application must map each user to the right tenant. Alice cannot have access to Fabrikam data in this case.

Azure AD can handle sign-in and authentication of different users, and the multitenant application is the same physical instance that recognizes and isolates the tenants to whom the users belong. A tenant can be considered a group of users in a B2B scenario and has a one-to-one mapping with a user in a B2C scenario. The application itself may have several physical resources such as VMs or storage, but each tenant gets its own logical instance of the application. Application data is shared among the users within a tenant but not with other tenants. In a single-tenant architecture, by contrast, each tenant gets dedicated physical resources, and new instances of the app are created for additional tenants by scaling out the number of instances.

This approach of horizontal scaling, or scaling out, works best for a web application. More traffic can be handled when more server VMs or containers are added and put behind a load balancer. Each VM or container runs a separate instance of the web application.  Requests are routed to any instance, and the whole set functions as a single logical instance to the users. Scaling in or out does not affect the users, and if one instance goes down, it should not affect any tenant. 

During authentication, such as when a user accesses a resource, the application must determine the user’s tenant. If the tenant is already onboarded to the application, then such a user does not need to create a profile. Users within an organization are part of the same tenant.  Any indicator of tenancy that comes from the user, such as the domain name of the email address used to log in, cannot be trusted. The identity store must be used to resolve the tenant. 
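A minimal sketch of that resolution, assuming the sign-in has already produced validated claims from the identity provider (for Azure AD, the tid claim identifies the issuing directory): the claims dictionary and the tenant catalog below are illustrative stand-ins, not an actual API.

```python
# Resolve the application's tenant from validated token claims rather than
# from anything the user typed (such as an email domain).
# The claims dict is assumed to come from a library that has already verified
# the token's signature, issuer, and audience.

TENANT_CATALOG = {
    # Azure AD directory (tenant) id -> the application's tenant record (illustrative)
    "72f988bf-0000-0000-0000-000000000001": {"name": "Contoso", "plan": "standard"},
    "a1b2c3d4-0000-0000-0000-000000000002": {"name": "Fabrikam", "plan": "premium"},
}

def resolve_tenant(claims: dict) -> dict:
    """Map the directory id in the validated claims to an onboarded tenant."""
    directory_id = claims.get("tid")        # issued by Azure AD, not supplied by the user
    tenant = TENANT_CATALOG.get(directory_id)
    if tenant is None:
        raise PermissionError("Tenant is not onboarded to this application")
    return tenant

# Example: a validated claims payload for alice@contoso.com
claims = {"tid": "72f988bf-0000-0000-0000-000000000001",
          "preferred_username": "alice@contoso.com"}
print(resolve_tenant(claims)["name"])   # -> Contoso
```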

During authorization, the tenant determination for the identity involved must be repeated.  Users must be assigned roles to access a resource. Role-based access control relies on role assignments from the tenant administrator, or from the user if the resource is owned by the user. The role assignments are not made by the tenant provider.

When Azure AD is used for identity management, customers store their user profiles in Azure AD, even if they are an Office 365 tenant or a Dynamics CRM tenant. Once the profiles are in Azure AD, the multi-tenant application can look them up. If a customer with an on-premises Active Directory wants to use the application, Azure AD Connect can sync the data from on-premises to the cloud. This identity federation allows Azure AD to be a single store for cloud applications to use. 

Data partitioning and per-tenant configuration can be decided by the multi-tenant application independent of the identity and access management solution.

Reference: This document is a continuation of a series of articles regarding multi-tenancy, with the most recent article linked here


Friday, June 10, 2022

 This is a continuation of a series of articles on Microsoft Azure from an operational point of view that surveys the different services from the service portfolio of the Azure public cloud. The most recent article discussed Service Fabric. This document talks about core startup stack architecture.

A startup’s stack does not have the luxury of leveraging all the lessons learned by established companies. It needs to optimize deployment for speed, cost, and changing business needs. As a business matures, it establishes distinct areas of development, which suits a service-oriented or microservice architecture; but that type of architecture is not sustainable for a product that does not yet have commercial traction. 

A core startup stack is a simple, monolithic design that helps keep the focus on business needs. The design limits the time spent managing infrastructure while providing the ability to scale as the startup gains a customer base.

The components of the startup stack include the following:

1.       A persistence layer comprising the database, log aggregation, and Continuous Integration / Continuous Deployment

2.       A middle layer comprising the app servers and static content

3.       A content delivery network (CDN) for content distribution

The CDN might come across as premature for a startup stack that has yet to gain customers, but adding it later to an existing product comes with costs and inefficiencies that can be avoided by putting up a façade behind which we can iterate on APIs and architecture.

The app server is where the code needs to run. It should be easily deployable and require the least operational input. The app server should scale horizontally, but some manual intervention is required for scaling during the initial stages.

Traditionally, this was a web server on a virtual machine or bare metal, but Platform-as-a-Service and container orchestration frameworks have improved the infrastructure and lowered the operational overhead.

Serving static content from the app server wastes resources, but when we configure a CI/CD pipeline, the work to build and deploy static assets with each release is trivial. Most production web frameworks deploy static assets with CI/CD, so it is worthwhile to start out by aligning with this best practice.

The database is the inevitable store for all the business data. A relational database brings maturity to the data architecture, access, and practice. Some use cases need a document database or other forms of cloud storage.

Along with data, logs are equally important because they provide insights into the operations and assist with troubleshooting. If something goes wrong with the application, there is little time to diagnose the problem. Log aggregation and application tracing help the development team focus on the problems rather than poring over logs to retrieve the information.

A well-configured CI/CD pipeline enables repeated and rapid deployment of the software for incremental additions and testing. Quick and easy deployments are beneficial to iterative and agile software development. Frequent integration avoids divergent branches that require forward and reverse integrations. 


Thursday, June 9, 2022

 This is a continuation of a series of articles on Microsoft Azure from an operational point of view that surveys the different services from the service portfolio of the Azure public cloud. The most recent article discussed the Dataverse and solution layers. This document talks about automation using Power Automate and Microsoft Dataverse.  

Microsoft Dataverse is a data storage and management system for the various Power Applications so that they are easy to use with Power Query. The data is organized in tables, some of which are built-in and standard across applications, while others can be added on a case-by-case basis for applications. These tables enable applications to focus on their business needs while providing a world-class, secure, cloud-based storage option for data that is 1. easy to manage, 2. easy to secure, and 3. accessible via Dynamics 365, and that has rich metadata, logic, and validation and comes with productivity tools. Dynamics 365 applications are well-known for enabling businesses to quickly meet their business goals and customer scenarios, and Dataverse makes it easy to use the same data across different applications. It supports incremental and bulk loads of data, both on a scheduled and an on-demand basis.  

Solutions are used to transport applications and components from one environment to another or to add customizations to an existing application. A solution can comprise applications, site maps, tables, processes, resources, choices, and flows. Solutions are the mechanism for implementing application lifecycle management in Power Apps and Power Automate. There are two types of solutions (managed and unmanaged), and the lifecycle of a solution involves creating, updating, upgrading, and patching.   

Managed and unmanaged solutions can co-exist at different levels within a Microsoft Dataverse environment. They form two distinct layer levels. What the user sees as runtime behavior comes from the active customizations of an unmanaged layer, which in turn might be supported by a stack of one or more user-defined managed solutions and system solutions in the managed layer.  Managed solutions can also be merged. The solution layers feature enables one to see all the solution layers for a component.  

Power Automate enables low-code automation capabilities. It empowers even experienced developers to collaborate and create business solutions. Power Automate has seen rapid growth and a steady release of features that further enhance its capabilities. More recently, it provided integration for applications using Microsoft Dataverse for Teams, which is a subset of Microsoft Dataverse. While Dataverse provides out-of-box common tables, extended attributes, semantics, and an open ecosystem, Dataverse for Teams offers a suite of embedded Power Platform tools within Teams, and these tools include Power Automate. Collaboration tools like Teams have become the centerpiece for productivity automation, including process automations and chatbots. The Teams Admin Center provides governance rules for access controls. The ‘Solution Explorer’ view in the Teams UI lists all the flows created using Dataverse for Teams. A team may own the apps, bots, and flows, but they can be shared organization-wide. Power Automate’s UI is a designer for automation that provides a wide range of choices for steps to be executed in sequence.  Some of these automations involve sending messages to distribution groups or teams and formatting the content of the automation.

 

 

Wednesday, June 8, 2022

 

Decision Tree modeling on root cause analysis   

Problem statement: Given a method to collect root causes from many data points in error logs, can the relief time be determined? 

   

Solution: There are two stages to solving this problem:  

1.       Stage 1 – discover root cause and create a summary to capture it  

2.       Stage 2 – use a decision tree modeling to determine relief time.  

 

Stage 1:  

The first stage involves a data pipeline that converts log entries to exception stack traces and hashes them into buckets. Sample included.  When the exception stack traces are collected from a batch of log entries, we can transform them into a vector representation, using the notable stack traces as features. Then we can generate a hidden weighted matrix for the neural network.  

We use that hidden layer to determine the salience using the gradient descent method.       
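A minimal sketch of the first part of this pipeline, hashing exception stack traces into buckets and building a feature vector per log batch; the regular expressions, the normalization, and the bucket count are assumptions for illustration, not the article's actual pipeline.

```python
import hashlib
import re
from collections import Counter

NUM_BUCKETS = 256   # assumed size of the hash space

def normalize(stack_trace: str) -> str:
    """Strip volatile details (line numbers, addresses) so similar traces hash together."""
    trace = re.sub(r"line \d+", "line N", stack_trace)
    trace = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", trace)
    return trace

def bucket(stack_trace: str) -> int:
    """Hash a normalized stack trace into one of NUM_BUCKETS buckets."""
    digest = hashlib.sha1(normalize(stack_trace).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def to_vector(stack_traces: list[str]) -> list[float]:
    """Turn a batch of stack traces into a co-occurrence count vector over the buckets."""
    counts = Counter(bucket(trace) for trace in stack_traces)
    total = sum(counts.values()) or 1
    return [counts.get(i, 0) / total for i in range(NUM_BUCKETS)]  # values lie in [0, 1]
```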

   

All values are within the [0, 1] co-occurrence probability range.      

   

The solution to the quadratic form representing the embeddings is found by arriving at the minimum, which satisfies Ax = b, using the conjugate gradient method.    

We are given the input matrix A, the vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1.    

   

This method proceeds as follows:     

set i to 0
set residual to b - Ax
set search-direction to residual
set delta-new to the dot product of residual-transposed and residual
initialize delta-0 to delta-new

while i < i-max and delta-new > epsilon^2 * delta-0 do:
    q = A * search-direction
    alpha = delta-new / (search-direction-transposed * q)
    x = x + alpha * search-direction
    if i is divisible by 50:
        residual = b - Ax
    else:
        residual = residual - alpha * q
    delta-old = delta-new
    delta-new = dot product of residual-transposed and residual
    beta = delta-new / delta-old
    search-direction = residual + beta * search-direction
    i = i + 1

Root cause capture – Exception stack traces that are captured from various sources and appear in the logs can be stack-hashed. The root cause can be described by a specific stack trace, its associated point in time, the duration over which it appears, and the time the fix was introduced, if known.   
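To make the conjugate gradient steps listed above concrete, here is a small runnable sketch under the assumption that A is symmetric positive definite (the standard requirement for the method); the 2x2 system at the bottom is made up for illustration.

```python
import numpy as np

def conjugate_gradient(A, b, x, i_max=1000, epsilon=1e-8):
    """Solve Ax = b for symmetric positive definite A, following the steps above."""
    i = 0
    residual = b - A @ x
    search_direction = residual.copy()
    delta_new = residual @ residual
    delta_0 = delta_new
    while i < i_max and delta_new > epsilon**2 * delta_0:
        q = A @ search_direction
        alpha = delta_new / (search_direction @ q)
        x = x + alpha * search_direction
        if i % 50 == 0:
            residual = b - A @ x               # recompute to correct accumulated error
        else:
            residual = residual - alpha * q
        delta_old = delta_new
        delta_new = residual @ residual
        beta = delta_new / delta_old
        search_direction = residual + beta * search_direction
        i += 1
    return x

# Illustrative 2x2 system (not from the article)
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x0 = np.zeros(2)
print(conjugate_gradient(A, b, x0))   # approximately [0.0909, 0.6364]
```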

   

Stage 2: Decision tree modeling can help predict the relief time. It involves both a classification and a regression tree. A function divides the rows into two datasets based on the value of a specific column. The two lists of rows that are returned are such that one set matches the criteria for the split while the other does not. When the attribute to divide on is clear, this works well. 

To see how good an attribute is, the entropy of the whole group is calculated.  Then the group is divided by the possible values of each attribute and the entropy of the two new groups is calculated. To determine which attribute is best to divide on, the information gain is calculated, which is the difference between the current entropy and the weighted-average entropy of the two new groups. The algorithm calculates the information gain for every attribute and chooses the one with the highest information gain. 
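In formula form (standard definitions, stated here for reference rather than taken from the article), where S is the current group, S1 and S2 are the two groups produced by splitting on attribute A, and p_i is the proportion of rows in class i:

```latex
H(S) = -\sum_{i} p_i \log_2 p_i
\qquad
\operatorname{Gain}(S, A) = H(S) - \frac{|S_1|}{|S|} H(S_1) - \frac{|S_2|}{|S|} H(S_2)
```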

Each set is subdivided only if the recursion of the above step can proceed. The recursion terminates when a solid conclusion has been reached, which is a way of saying that the information gain from splitting a node is no more than zero. Otherwise, the branches keep dividing, creating a tree by calculating the best attribute for each new node. If a threshold for entropy is set, the decision tree is ‘pruned’.  

When working with a set of tuples, it is easier to reserve the last element of each tuple for the result at every recursion level. Text and numeric data do not have to be differentiated for this algorithm to run. The algorithm takes all the existing rows and assumes the last column is the target value. A training/testing dataset is prepared for each application, and usually a training/testing split of 70/30 is used in this regard.  
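A brief sketch of the splitting step described above, with the target value in the last column of each row; the tiny dataset is a made-up example of error attributes and relief times, not data from the article.

```python
import math
from collections import Counter

def divide_set(rows, column, value):
    """Split rows into two lists based on whether rows[column] matches value."""
    if isinstance(value, (int, float)):
        matches = lambda row: row[column] >= value
    else:
        matches = lambda row: row[column] == value
    set1 = [row for row in rows if matches(row)]
    set2 = [row for row in rows if not matches(row)]
    return set1, set2

def entropy(rows):
    """Shannon entropy of the target values (assumed to be in the last column)."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, column, value):
    """Current entropy minus the weighted-average entropy of the two new groups."""
    set1, set2 = divide_set(rows, column, value)
    p = len(set1) / len(rows)
    return entropy(rows) - p * entropy(set1) - (1 - p) * entropy(set2)

# Made-up rows: (error category, recurrence count, relief-time bucket as the target)
rows = [("timeout", 3, "hours"), ("timeout", 1, "minutes"),
        ("null-ref", 5, "hours"), ("null-ref", 1, "minutes")]
print(information_gain(rows, 1, 3))   # gain from splitting on recurrence count >= 3
```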

 

This is a continuation of a series of articles on Microsoft Azure from an operational point of view that surveys the different services from the service portfolio of the Azure public cloud. The most recent article discussed the Dataverse and solution layers. This document talks about managing the application lifecycle using Power Apps, Power Automate, and Microsoft Dataverse in the organization. 

The foremost scenario for an Application Lifecycle Management strategy is one that involves creating a new project.  The tasks involved include 1) determining the environments that are needed and establishing an appropriate governance model, 2) creating a solution and a publisher for that solution, 3) setting up the DevOps project that involves one or more pipelines to export and to deploy the solution, 4) creating a pipeline to export an unmanaged solution as a managed solution, 5) configuring and building applications within the solution, 6) adding any additional customizations, 7) creating a deployment pipeline, and 8) granting access to the application. With these steps, it becomes easy to get started with Dataverse solutions and applications.

The next scenario targets the legacy app makers and flow makers in Power Apps and Power Automate, respectively, who work in a Microsoft dataverse environment without a Dataverse database. The end goal, in this case, is a successful migration to a managed ALM model by creating apps and flows in a Dataverse solution. Initial app migration can target the default Dataverse environment but shipping the entities and data model require a robust DevOps with multiple environments each dedicated to the development, testing, and release of applications. It will require the same steps as in the creation of a new project but it requires the business process and environment strategy to be worked out first.