Saturday, April 1, 2023

 The Design of Security for Data:

One of the ways to secure data is to control access. These controls include access keys, shared access signatures, and Azure AD access. Access keys provide administrative access to an entire resource, such as a storage account; Microsoft recommends they be used only for administrative purposes. Shared access signatures are token-like credentials that grant granular access to resources within a storage account, and access is available to whoever or whatever holds the signature. Role-based access control can be used to control access to both the management and the data layer. Shared access signatures can also be generated in association with an Azure AD identity instead of being created with a storage account access key. Access to files from domain-joined devices is secured using Azure AD identities for authentication and authorization.
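As a rough illustration, a read-only, time-limited shared access signature for a single blob can be generated with the azure-storage-blob Python SDK; the account, container, and blob names below are placeholders, and in practice the account key would come from a secure store rather than source code.

from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

# Placeholder values; the account key should come from a Key Vault or environment variable.
account_name = "mystorageacct"
account_key = "<storage-account-key>"
container_name = "reports"
blob_name = "claims/2023/march.csv"

# Grant read-only access to a single blob for one hour.
sas_token = generate_blob_sas(
    account_name=account_name,
    container_name=container_name,
    blob_name=blob_name,
    account_key=account_key,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

blob_url = (
    f"https://{account_name}.blob.core.windows.net/"
    f"{container_name}/{blob_name}?{sas_token}"
)
print(blob_url)

Whoever holds this URL can read that one blob until the expiry passes, which is exactly the granular, time-bound access the signature is meant to provide.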

Another way to secure the data is to protect it using storage encryption, Azure Disk Encryption, and immutable storage. Storage encryption involves server-side encryption for data at rest and Transport Layer Security (TLS) for data in transit. By default, data is encrypted with a Microsoft-managed account encryption key, but security can be extended using customer-managed keys, which are usually stored in a Key Vault. Disk encryption encrypts the OS and data volumes to further protect the data.

Some controls are available from the databases and data storage products themselves. For example, a database inside a SQL Server is secured with the help of a SQL login. The SQL login is created inside the master database and then mapped to a database user in the database to be protected, and the data can then be accessed with that credential. Every database requires its own user mapping for the same SQL login. If authentication is to be based on Azure Active Directory, the mechanism varies slightly between a SQL managed instance and Azure SQL Database. Both require an Azure AD administrator for the product and then user accounts to work with the databases. When working with a SQL managed instance, we grant the admin account read access to the Azure AD tenant, map the user accounts to SQL logins in the master database, similar to the traditional SQL Server approach, and then create database users that are mapped one-to-one to the SQL logins. When working with Azure SQL Database, all we need is a database user mapped to the Azure AD identity, so the entries in the master database and the tenant read access for the admin account are avoided entirely.
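A minimal sketch of these two mappings with pyodbc follows; the server, database, login, and user names are placeholders, and the statements are standard T-SQL for Azure SQL.

import pyodbc

# Placeholder connection details; real credentials belong in a Key Vault or environment variables.
driver = "{ODBC Driver 18 for SQL Server}"
server = "tcp:myserver.database.windows.net,1433"

# 1. Create the SQL login in the master database.
master = pyodbc.connect(
    f"Driver={driver};Server={server};Database=master;"
    "Uid=sqladmin;Pwd=<admin-password>;Encrypt=yes;",
    autocommit=True,
)
master.cursor().execute("CREATE LOGIN report_login WITH PASSWORD = '<strong-password>'")

# 2. Map a database user to that login inside the database to be protected.
claims_db = pyodbc.connect(
    f"Driver={driver};Server={server};Database=ClaimsDb;"
    "Uid=sqladmin;Pwd=<admin-password>;Encrypt=yes;",
    autocommit=True,
)
claims_db.cursor().execute("CREATE USER report_user FOR LOGIN report_login")

# 3. For Azure AD authentication on Azure SQL Database, a contained user is enough; no login in
#    master is required. This statement must be run while connected as an Azure AD principal
#    (typically the Azure AD admin for the server).
claims_db.cursor().execute("CREATE USER [analyst@contoso.com] FROM EXTERNAL PROVIDER")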

Azure SQL services also provide ways to protect the data. These include Transparent Data Encryption, Always Encrypted, and Dynamic Data Masking. Transparent Data Encryption works much like the storage encryption discussed earlier: the database is stored on the Azure storage infrastructure, which guarantees encryption at rest, and TLS guarantees encryption in transit. It supports Bring Your Own Key, also known as TDE protectors, which are managed by the customer and kept in a Key Vault, helping meet compliance requirements around access and audit. A managed identity enables the Azure SQL Database service to access the Key Vault. With Always Encrypted, the data is encrypted within table columns when storing, say, credit card information; an encryption key applies to one or more columns, and the data is not in the clear even for administrators. When the data needs to be partially hidden even from a genuine user, dynamic data masking is used so that only partial information is displayed, say the last four digits of the credit card number.
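As a small illustration of dynamic data masking, the sketch below masks a card number column so that only the last four digits are shown to non-privileged users; the connection string, the Payments table, and the user names are assumptions for the example.

import pyodbc

# Placeholder connection; assumes a dbo.Payments table with a CardNumber column in ClaimsDb.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=tcp:myserver.database.windows.net,1433;"
    "Database=ClaimsDb;Uid=sqladmin;Pwd=<admin-password>;Encrypt=yes;",
    autocommit=True,
)
cursor = conn.cursor()

# Show only the last four digits of the card number to users without UNMASK permission.
cursor.execute(
    "ALTER TABLE dbo.Payments "
    "ALTER COLUMN CardNumber ADD MASKED WITH (FUNCTION = 'partial(0, \"XXXX-XXXX-XXXX-\", 4)')"
)

# A privileged reader can still be granted access to the clear text.
cursor.execute("GRANT UNMASK TO report_user")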

These are some of the ways to secure data on the Azure Data Platform.


Friday, March 31, 2023

 

This is a continuation of the discussion on Azure Data Platform.

GitHub Actions are highly versatile and composable workflows that can be used to respond to events in the code pipeline. These events can be a single commit, a pull request that needs to be built and tested, or the deployment of a code change that has been pushed to production. While automations for continuous integration and continuous deployment are well-established practices in DevOps, GitHub Actions goes beyond DevOps by recognizing events such as the creation of a ticket or an issue. The components of GitHub Actions are the workflow triggered for an event and the jobs that the workflow comprises. Each job runs inside its own virtual machine runner or inside a container and has one or more steps that either run a script or invoke a reusable action.

A workflow is authored in the .github/workflows/ directory and specified as a YAML file. It has a sample format like so:

name: initial-workflow-0001
run-name: Trying out GitHub Actions
on: [push]
jobs:
  check-variables:
    runs-on: ubuntu-latest
    environment: nonProd
    steps:
      - name: Initialize
        uses: actions/setup-node@v3
        with:
          node-version: '16'
      - name: Display Variables
        env:
          variable: 0001
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: |
          echo 000-"$variable"
          echo 001-"${{ env.variable }}"

Whenever a commit is added to the repository in which the workflows directory was created, the above workflow is triggered. It can be viewed in the Actions tab of the repository, where the named job can be selected and its steps expanded to see the activities performed.

With this introduction, let us check out a scenario for using it with the Azure Data Platform. Specifically, this scenario calls for promoting the Azure Data Factory to higher environments with the following steps:

1. A development data factory is created and configured with GitHub.

2. A feature/working branch is used for making changes to pipelines, etc.

3. When changes are ready to be reviewed, a pull request is created from the feature/working branch to the main collaboration branch.

4. Once the pull request is approved and the changes are merged into the main collaboration branch, the changes are published to the development factory.

5. When the changes are published, the data factory saves its ARM templates on the main publish branch (adf_publish by default). A file called ARMTemplateForFactory.json contains the parameter names used for resources like key vault, storage, linked services, etc. These names are used in the GitHub Actions workflow file to pass resource names for the different environments.

6. Once the GitHub Actions workflow file has been updated to reflect the parameter values for the upper environment and the changes are pushed back to the GitHub branch, the GitHub Actions workflow is started manually and the changes are pushed to the upper environment; a sketch of this deployment step follows the list.
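Here is a minimal sketch of that final deployment step using the Azure SDK for Python (azure-identity and azure-mgmt-resource); the subscription, resource group, factory, and parameter names are placeholders and would normally come from GitHub Secrets and from the parameters defined in ARMTemplateForFactory.json.

import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Placeholder identifiers for the upper (prod) environment.
subscription_id = "<subscription-id>"
resource_group = "rg-dataplatform-prod"

client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# ARM template exported by the data factory on the publish branch.
with open("ARMTemplateForFactory.json") as f:
    template = json.load(f)

# Parameter names must match those in ARMTemplateForFactory.json; these are illustrative.
parameters = {
    "factoryName": {"value": "adf-dataplatform-prod"},
    "keyVaultName": {"value": "kv-dataplatform-prod"},
}

poller = client.deployments.begin_create_or_update(
    resource_group,
    "adf-promotion",
    {
        "properties": {
            "mode": "Incremental",
            "template": template,
            "parameters": parameters,
        }
    },
)
poller.result()  # block until the deployment completes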

Some observations to make are as follows:

GitHub is configured on the development data factory only.

Integration Runtime names and types need to stay the same across all environments.

Self-hosted Integration Runtime must be online in the upper environments before deployment, or the deployment will fail.

Key Vault secret names are kept the same across environments; only the vault name is parameterized.

Resource naming needs to avoid spaces.

Secrets are stored in the GitHub Secrets section.

There are two environments – dev and prod in this scenario.

Dev branch is used as collaboration branch.

Feature1 branch is used as a working branch.

Main branch is used as publish branch.

 

Thursday, March 30, 2023

 This continues from the previous post:

In addition to data security, monitoring plays an important role in maintaining the health and hygiene of data.

Azure Monitor is a centralized management interface for monitoring workloads anywhere. The monitoring data comprises metrics and logs. With this information, there are built-in capabilities to respond to the monitoring data in several ways. Monitoring everything from the code through to the platform provides holistic insights.

The key monitoring capabilities include: Metrics Explorer to view and graph small, time-based metric data; Workbooks for visualization, reporting, and analysis of monitoring data; the Activity log for REST API write actions performed on Azure resources; Azure Monitor Logs for advanced, holistic analytics using the Kusto Query Language; monitoring insights for resource-specific monitoring solutions; and alerts and action groups for alerting, automation, and incident management.

Not all monitoring information can be treated the same way. For this reason, diagnostic settings are available that let us differentiate the treatment of certain types of monitoring information. A platform-monitoring diagnostic setting helps route platform logs and metrics, and multiple data categories are available so that they can be handled differently. Typical treatments include sending the data to a storage account for retention and analysis, sending it to a Log Analytics workspace for powerful analytics, or sending it to Event Hubs to stream it to external systems.
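A rough sketch of creating such a diagnostic setting with azure-mgmt-monitor is shown below; the resource ID, workspace ID, and metric category are placeholders, the exact categories depend on the resource type, and the dictionary body is assumed to be accepted in place of the SDK model objects.

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

# Placeholder identifiers.
subscription_id = "<subscription-id>"
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-data"
    "/providers/Microsoft.Storage/storageAccounts/mystorageacct"
)
workspace_id = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-monitor"
    "/providers/Microsoft.OperationalInsights/workspaces/law-central"
)

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

# Route platform metrics (and, for resources that emit them, logs) to a Log Analytics workspace.
client.diagnostic_settings.create_or_update(
    resource_id,
    "route-to-log-analytics",
    {
        "workspace_id": workspace_id,
        "metrics": [{"category": "Transaction", "enabled": True}],
        "logs": [],
    },
)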

One of the most interesting aspects of Azure Monitor is that it collects metrics from applications, the guest OS, Azure resources, Azure subscriptions, and the Azure tenant, covering the depth and breadth of the systems involved. Alerts and autoscale help determine the appropriate thresholds and actions that become part of the monitoring stack, so the data and the intelligence are together and easily navigated via the dashboard. Azure Dashboards provide a variety of eye-catching charts that illustrate the data to viewers better than raw query results. Workbooks provide a flexible canvas for data analysis and the creation of rich visual reports in the Azure Portal. The analysis is not restricted to just these two: Power BI remains a robust solution for analysis and interactive visualizations across a variety of data sources, and it can automatically import log data from Azure Monitor. Azure Event Hubs is a streaming platform and event ingestion service that permits real-time analytics as opposed to batch or storage-based analysis. APIs from Azure Monitor help with reading and writing data as well as configuring and retrieving alerts.
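For example, logs routed to a Log Analytics workspace can be queried with Kusto from Python using the azure-monitor-query package; the workspace ID and the query below are illustrative.

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Count diagnostic records per resource provider over the last hour (illustrative query).
kql = (
    "AzureDiagnostics "
    "| where TimeGenerated > ago(1h) "
    "| summarize count() by ResourceProvider"
)

response = client.query_workspace("<workspace-id>", kql, timespan=timedelta(hours=1))
for table in response.tables:
    for row in table.rows:
        print(row)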


Wednesday, March 29, 2023

 One of the ways to secure data is to control access. These controls include access keys, shared access signatures, and Azure AD access. Access keys provide administrative access to an entire resource, such as a storage account; Microsoft recommends they be used only for administrative purposes. Shared access signatures are token-like credentials that grant granular access to resources within a storage account, and access is available to whoever or whatever holds the signature. Role-based access control can be used to control access to both the management and the data layer. Shared access signatures can also be generated in association with an Azure AD identity instead of being created with a storage account access key. Access to files from domain-joined devices is secured using Azure AD identities for authentication and authorization.

Another way to secure the data is to protect it using storage encryption, Azure Disk Encryption, and immutable storage. Storage encryption involves server-side encryption for data at rest and Transport Layer Security (TLS) for data in transit. By default, data is encrypted with a Microsoft-managed account encryption key, but security can be extended through the use of customer-managed keys, which are usually stored in a Key Vault. Disk encryption encrypts the OS and data volumes to further protect the data.

A checklist helps with migrating sensitive data to the cloud and helps overcome the common pitfalls regardless of the source of the data. It serves merely as a blueprint for a smooth, secure transition.

Characterizing permitted use is the first step data teams need to take to address data protection for reporting. Modern privacy laws specify not only what constitutes sensitive data but also how the data can be used. Data obfuscation and redaction can help protect against exposure. In addition, data teams must classify the usages and the consumers. Once sensitive data is classified and purpose-based usage scenarios are addressed, role-based access control must be defined to protect future growth.

Devising a strategy for governance is the next step; it is meant to keep intruders out and to boost data protection by means of encryption and database management. Fine-grained access controls, such as attribute-based or purpose-based ones, also help in this regard.

Embracing a standard for defining data access policies can help to limit the explosion of mappings between users and the permissions for data access; this gains significance when a monolithic data management environment is migrated to the cloud. Failure to establish a standard for defining data access policies can lead to unauthorized data exposure.

Migrating to the cloud in a single stage, with all the data moved at once, must be avoided as it is operationally risky. It is critical to develop a plan for incremental migration that facilitates the development, testing, and deployment of a data protection framework that ensures proper governance. Decoupling data protection and security policies from the underlying platform allows organizations to tolerate subsequent migrations.

There are different types of sanitization, such as redaction, masking, obfuscation, encryption, tokenization, and format-preserving encryption. Among these, static protection, in which clear-text values are sanitized and stored in their modified form, and dynamic protection, in which clear-text data is transformed into ciphertext as it is accessed, are the most used.
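A small illustration of the static variants, masking and keyed tokenization, is sketched below in Python; the helper names and the secret key are hypothetical.

import hashlib
import hmac

TOKEN_KEY = b"<secret-key-from-a-vault>"  # hypothetical key; never hard-code in practice

def mask_card_number(card_number: str) -> str:
    """Redact all but the last four digits before storing or displaying."""
    return "XXXX-XXXX-XXXX-" + card_number[-4:]

def tokenize(value: str) -> str:
    """Replace a clear-text value with a deterministic keyed token (not reversible)."""
    return hmac.new(TOKEN_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"name": "Jane Doe", "card": "4111-1111-1111-1234"}
sanitized = {"name": tokenize(record["name"]), "card": mask_card_number(record["card"])}
print(sanitized)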

Finally, defining and implementing data protection policies brings several additional processes such as validation, monitoring, logging, reporting, and auditing. Having the right tools and processes in place when migrating sensitive data to the cloud will allay concerns about compliance and provide proof that can be submitted to oversight agencies.

Tuesday, March 28, 2023

 

Data pipelines are specific to organizational needs, so it is hard to come up with a tried and tested methodology that suits all, but standard practices continue to be applicable across domains. One such principle is to focus on virtualizing the different sources of data, say into a data lake, so that there is one or at most a few pipeline paths. Another principle is to be diligent about consistency and standardization to prevent unwieldy or numerous customizations. For example, if a patient risk score needs to be calculated, then general, source-agnostic scoring logic should be applied first, followed by an override for a specific source. Reuse can be boosted by managing configurations stored in a database. This avoids the pipeline-per-data-source antipattern.
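A toy sketch of that override idea follows; the field names, weights, and source identifiers are hypothetical, and in practice the configuration would live in a database rather than in code.

# General, source-agnostic scoring weights applied to every record.
BASE_WEIGHTS = {"age": 0.4, "chronic_conditions": 0.3, "smoker": 0.3}

# Source-specific overrides kept as configuration, not as separate pipelines.
SOURCE_OVERRIDES = {
    "hospital_feed": {"smoker": 0.5},  # hypothetical adjustment for this feed
}

def patient_risk_score(record: dict, source: str) -> float:
    # Merge the general weights with any override registered for this source.
    weights = {**BASE_WEIGHTS, **SOURCE_OVERRIDES.get(source, {})}
    return sum(weight * float(record.get(field, 0)) for field, weight in weights.items())

print(patient_risk_score({"age": 0.7, "chronic_conditions": 1, "smoker": 1}, "hospital_feed"))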

Pipelines also need to support scalability. One approach to scale involves an event-driven stack. Each step picks up its task from a messaging queue, sends its results to a queue, and the processing logic works on an event-by-event basis. Apache Kafka is a good option for this type of setup and works equally well for both stream processing and batch processing.
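A minimal sketch of one such step with the confluent-kafka Python client is shown below; the broker address, topic names, group id, and the placeholder processing logic are assumptions.

import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "risk-scoring-step",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["patient-events"])

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Each step consumes one event at a time, processes it, and emits its result downstream.
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    result = {"patient_id": event.get("patient_id"), "score": 0.42}  # placeholder processing
    producer.produce("risk-scores", value=json.dumps(result).encode("utf-8"))
    producer.flush()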

Another approach to support scalability involves the use of a data warehouse. These help to formalize extract-transform-load operations from diverse data sources and support many types of read-only analytical stacks.

Finally, on-premises solutions can be migrated to the cloud for scalability because of elasticity and higher rate limits, and there is transparency and pay-as-you-go pricing that appeals to the return on investment. Some apprehension about data security precedes many design decisions about cloud solutions, but security and compliance in the cloud are unparalleled and provide better opportunities for hardening.

Monitoring and alerting increase transparency and visibility into the application and are crucial to health checks and troubleshooting. A centralized dashboard for metrics and alerts tremendously improves the operations of the pipeline. It also helps with notifications.

There are so many technological stacks and services to use in the public cloud that some expertise is always missing on the team. Development teams must focus on skills and internal cultural change. Some sub-optimal practices happen when leadership does not prioritize cloud cost optimization. For example, developers ignore small administrative tasks that could significantly improve operating costs; architects select designs that are easier and faster to stand up but more expensive to run; algorithms and code are not streamlined and tightened to leverage cloud best practices; deployment automation, which could correctly size the resources deployed, is neglected or skipped altogether; and finance and procurement teams misplace their focus on the numbers in the cloud bill, creating tension between them and the IT/development teams. A non-committal mindset towards cloud technologies is a missed opportunity for business leaders because long-term engagements are more cost friendly.

 

Monday, March 27, 2023

Azure data platform continued

 

While the discussion on SQL and NoSQL stacks for data storage and querying has relied on the separation of transactional processing versus analytical processing, there is another angle to this dichotomy from a data science perspective. NoSQL storage, particularly key-value stores, is incredibly fast and efficient at ingesting data, but its queries are inefficient. SQL stores, on the other hand, are just the opposite: they are efficient at querying the data but ingest data slowly and inefficiently.

Organizations are required to have data stacks that deliver the results of heavy data processing within a limited time window. SQL databases are also overused for historical reasons. When organizations want the results of the processing within a fixed time window, the database's inefficiency at ingesting data delays the actual processing, which in turn creates an operational risk. The limitation comes from the data structure used in the relational database. Storing a 1 TB table requires the B+ tree to grow to six levels. If the memory used by the database server is on the order of 125 GB, only about 12% of the data can remain in cache. For every insertion, the database must read an average of three data blocks from disk to reach the leaf node, which causes the same number of blocks to be evicted from the cache. This dramatic increase in I/O is what makes these databases inefficient at ingesting data.

When the data is stored in NoSQL stores, querying is inefficient because a single horizontal data fragment could be spread across many files, which increases the time needed to read the data. This improves if there is an index associated with the data, but the volume of data is still quite large. If a single B+ tree is assumed to have 1024 blocks, each search needs to access log2(1024) = 10 blocks, and there could be many files, each with its own index; with 16 B+ trees, a total of 160 blocks would be read instead of the 10 blocks for one index in a database server. Cloud document stores can provide high throughput and low latency for queries by providing a single container for unlimited-sized data and charging based on reads and writes. To improve the performance of the cloud database, we can set the connection policy to direct mode, set the protocol to TCP, avoid startup latency on the first request, co-locate clients in the same Azure region, and increase the number of threads/tasks.
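The arithmetic above can be checked with a few lines, using the numbers from the text.

import math

# One index over 1024 blocks vs. sixteen fragment-level indexes.
blocks_per_index = 1024
reads_per_search = int(math.log2(blocks_per_index))    # 10 block reads per search
fragments = 16
print(reads_per_search, fragments * reads_per_search)  # 10 vs 160 block reads

# Roughly how much of a 1 TB table fits in a 125 GB cache.
print(125 / 1024)  # ~0.12, so most leaf blocks must come from disk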

Ingestion also depends on the data sources. Different data sources often require different connectors, but fortunately these come out of the box from cloud services. Many analytical stacks can easily connect to the storage via existing, readily available connectors, reducing the need for custom integration. Analysis services from the public cloud are rich, robust, and very flexible to work with.

When companies want to generate value and revenue from accumulated data assets, they are not looking to treat all data equally like some analytical systems do. Creating an efficient pipeline that can scale to different activities, and using appropriate storage and analytical systems for specific types of data, helps meet business goals with rapid development. Data collected in these sources for insurance technology software can be as varied as user information, claim details, social media or government data, demographic information, the current state of the user, medical history, income category or credit score, agent-customer interactions, and call center or support tickets. ML templates and BI tools might be different for each of these data categories, but the practice of using data pipelines and data engineering best practices brings a cloud-first approach that delivers on the performance requirements expected for those use cases.

Sunday, March 26, 2023

 Business process automations provide immense opportunities for improvement, more so when there is pressure both inside and outside an organization, as is the case in the insurance sector, where new ways of doing business continually challenge traditional ones. Some of these trends can be seen in examples: Lemonade has streamlined the claims experience using AI and chatbots, Ethos issues life insurance policies in minutes, Hippo issues home quotes in under a minute, telematics-based auto insurers like Metromile and Root offer usage-based insurance, Fabric offers value-added services in its niche market, Figo Pet runs a completely digital pet insurance platform, and Next Insurance simplifies small business insurance with a 100% online experience. In this sector, incumbent insurers have insufficient customer-centric strategies, and while the established players view the niche products from these newcomers as mere threats, they run the risk of increasing their debt with outdated processes and practices, which results in a lower benefit-to-cost ratio. The gap also widens between the strategies offered by the newcomers and those provided by the established players. The newcomers channel technology best practices and are not bound by industry traditions. Zero-touch processing, self-service, and chatbots are deployed to improve customer-centric offerings. Industry trends are also pushing customers more and more to digital channels. The net result is that the business is increasingly driven towards a digital-enabled, omnichannel, customer-aligned experience, and established businesses have the advantage when they strive for scale. This is evident from the increased engagement with digital leaders from inside and outside the industry and from the new and improved portfolio of digital offerings.

A digital transformation leader sees the opportunities and challenges of these business initiatives from a different point of view. They can be broken down into legacy and new business capabilities, with different approaches to tackle each. While established companies in this sector have already embraced the undisputed use of cloud technologies and have even met a few milestones on their cloud adoption roadmap, digital leaders increasingly face the challenges of migrating and modernizing legacy applications both for transactional processing and for analytical reporting. The journey towards the cloud has been anything but steady, due to the diverse technological stack involved in these applications. A single application can be upwards of a hundred thousand lines of code, and many can be considered boutique or esoteric in their use of third-party statistical software and libraries. What used to be streamlined and tuned analytical queries against data warehouses has continually evolved into the use of Big Data and stream analytics software, often with home-grown applications or internal-facing automations.