Friday, April 7, 2023

 

This is a continuation of the articles on Azure Data Platform and discusses data ingestion for Data Lakes.

ADLS Gen2, or the data lake for short, can hold petabytes of data stored as files and folders with file-level security and scale. ADF can load data into the data lake with high throughput and massive parallelization. A point-to-point copy can take several minutes for gigabytes of data over the internet, and the scale-out provided by ADF makes it all the more appealing to use. This article discusses some of the considerations behind the data ingestion that inevitably occurs when loading data into a data lake.

First, the prerequisites for data ingestion must be called out. An Azure subscription and a storage account with Azure Data Lake Storage Gen2 enabled are required at a minimum. The source of the data can be another cloud storage or an on-premises system. ADF can work with many types of data sources, but we will focus on ingesting files and folders of varying size and number, totaling up to petabytes.

Creating an ADF from the Azure portal is easy to follow with the steps outlined in the user interface. It has an Ingest tile that launches the Copy Data tool. In the properties window, the built-in copy task is available and can be run on demand or on a periodic basis. The data source must be specified, and this can point to an existing S3-compatible storage on-premises that provides the file system holding the files and folders to be copied. This requires an integration runtime, an access key, and a secret. There are three types of integration runtime: Azure, self-hosted, and Azure-SSIS. A self-hosted integration runtime can run copy activity between a cloud data store and a data store in a private network, and it dispatches transform activities against compute resources on-premises. The Azure integration runtime handles copy and data flow activities between cloud data stores, while the Azure-SSIS integration runtime is used to run SSIS packages. The self-hosted integration runtime makes only outbound connections over HTTPS to the internet, so it can sit behind the firewall and does not require a direct link to the cloud. It runs on the Windows operating system, and a single logical instance can be associated with multiple physical on-premises machines in active-active mode.

The transfer of data must be in binary mode, and it must recursively traverse the files and folders. The destination must be configured as the data lake by pointing to the Azure subscription and the storage account with the ADLS Gen2 option specified. The destination folder structure can preserve that of the origin, and a preview option confirms that the data will be copied correctly. ADF can also extract zip files in transit before they land in the data lake. The copy operation is tracked as a task, and the pipeline comprises the copy task and the monitor task. When the pipeline run completes successfully, its activity details can be viewed and the run rerun if necessary.
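
To make the shape of such a pipeline concrete, the sketch below shows, as a Python dictionary, roughly what a binary, recursive Copy activity that preserves the source hierarchy looks like. This is a minimal sketch only; the dataset names and store-settings types are illustrative assumptions, not the exact output of the Copy Data tool.

# A minimal sketch of a Copy activity definition expressed as a Python dict;
# dataset names and store-settings types are illustrative assumptions.
copy_activity = {
    "name": "CopyFilesToDataLake",
    "type": "Copy",
    "inputs": [{"referenceName": "SourceBinaryDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "LakeBinaryDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {
            "type": "BinarySource",
            # read every file under the source folder, including subfolders
            "storeSettings": {"type": "AmazonS3CompatibleReadSettings", "recursive": True},
        },
        "sink": {
            "type": "BinarySink",
            # keep the same folder structure in the destination ADLS Gen2 account
            "storeSettings": {"type": "AzureBlobFSWriteSettings", "copyBehavior": "PreserveHierarchy"},
        },
    },
}

Once such an activity is part of a pipeline, the monitor view reports the run details described next.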

All the activity runs have status, copy duration, throughput, data read, files read, data written, files written, peak connections for both read and write, the parallel copies used, the data integration units, the queue and the transfer durations to provide complete information on the activities performed for monitoring or troubleshooting.

Thursday, April 6, 2023

 

This is a continuation of the articles on Azure Data Platform and discusses Data Lakes:

Access Control Model:

This section covers the access control model in Azure Data Lake Storage Gen2, which supports the following mechanisms: shared key authorization, shared access signature (SAS) authorization, role-based access control (RBAC), attribute-based access control (ABAC), and access control lists (ACLs). RBAC and ACLs work together, while shared key and SAS authorization operate independently of them: RBAC and ACLs have no effect on access granted via a shared key or SAS, because both grant access without requiring the caller to have an identity. RBAC grants coarse-grained access to users, such as read or write access to all data. ABAC refines RBAC role assignments by adding conditions. ACLs grant fine-grained access, such as write access to a specific directory or file.

The security principals recognized by Azure are users, groups, service principals, and managed identities defined in Azure Active Directory. A permission set grants coarse-grained access, such as read or write access to all the data in a storage account or all the data in a container. Permission sets are granted through roles, and some well-known roles are Owner, Contributor, Reader, and Storage Account Contributor. All four can manage the storage account, and all except Reader can retrieve the account access keys and thereby reach the data; they cannot grant data access to other security principals directly, but they can issue shared keys and shared access signatures. The order of resolving an access grant or denial is RBAC first, ABAC next, and ACLs last, and it applies to all operations such as list, get, create, update, and delete.

Security groups are particularly useful for assigning ACLs. For example, if Azure Data Factory (ADF) ingests data into a folder named /LogData, a service engineering team uploads data into the container, and various users analyze the data, then we create a LogsWriter group and a LogsReader group to enable these activities. The POSIX permissions on the directory show rwx assigned to the LogsWriter group and r-x assigned to the LogsReader group. The service principal or managed service identity (MSI) for ADF is made a member of the LogsWriter group, while the service principal or MSI for Databricks is made a member of the LogsReader group. Groups facilitate adding or removing members without disturbing the assignments. They also help avoid exceeding the maximum number of role assignments per subscription and the maximum number of ACL entries per file or directory. The limits are 4,000 Azure role assignments in a subscription and 32 ACL entries per file or directory.
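
As a rough illustration of the LogsWriter/LogsReader setup above, the following sketch uses the azure-storage-file-datalake Python SDK to apply such ACLs. The account URL, container name, and group object IDs are placeholders, and the exact entries would depend on the environment.

# A minimal sketch, assuming placeholder account, container, and AAD group object IDs.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    "https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("logs")        # hypothetical container name
directory = fs.get_directory_client("LogData")

logs_writer_oid = "<LogsWriter-group-object-id>"   # placeholder AAD object IDs
logs_reader_oid = "<LogsReader-group-object-id>"

# rwx for LogsWriter, r-x for LogsReader, applied to existing children and
# added as default entries so new files inherit them.
acl = (
    f"group:{logs_writer_oid}:rwx,group:{logs_reader_oid}:r-x,"
    f"default:group:{logs_writer_oid}:rwx,default:group:{logs_reader_oid}:r-x"
)
directory.update_access_control_recursive(acl=acl)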

Premium tier:

ADLS Gen2 supports premium block blob storage accounts, which are ideal for big data analytics and for workloads that require low latency and a high number of operations, such as machine learning. The premium tier supports the hierarchical namespace, which accelerates big data analytics workloads and enables file-level access control lists, and it works with the Azure Blob Filesystem (ABFS) driver for Hadoop.


Wednesday, April 5, 2023

The topics discussed here are some key considerations for the use of Azure Data Lake Storage Gen2, which is a highly scalable and cost-effective data lake solution for big data analytics.

The first and foremost consideration is that a data lake is not appropriate where a data warehouse is suitable. They are complementary solutions that work together to help us derive key insights from our data. A data lake is a store for all types of data from various sources, beyond the structured-versus-unstructured differentiation. A retail company can store the past five years' worth of sales data in a data lake, process data from social media to extract new trends in consumption and intelligence from retail analytics solutions, and then generate a highly structured data set that is suitable to store in the warehouse. ADLS Gen2 offers faster performance and Hadoop-compatible access through the hierarchical namespace, lower cost, and security with fine-grained access controls and native AAD integration. It can become a hyperscale repository that stores petabytes of data and sustains hundreds of Gbps in throughput.

Other considerations include: what the data lake will store and for how long, what portion of the data is used in analytical workloads, who needs access to which parts of the data lake, what analytical workloads will run against it, what the transaction and analytics patterns look like, and what budget I am working with.

Organizing and managing data in the data lake is also a key concern for this single data store. Some users of the store claim end-to-end ownership of the pipeline, while others have a central team that manages, operates, and governs the data lake. This calls for different topologies, such as centralized or federated data lake strategies, implemented with single or multiple storage accounts, and with globally shared or regionally specific footprints.

When the customers of the data lake are both internal and external, the scenarios may be subject to different requirements from security and compliance perspectives, query patterns, access requirements and cost and billing. Storage accounts can be created in different subscriptions for development and production environments and subject to different SLAs. The account boundaries determine the management of logical sets of data in a unified or isolated manner. Subscription limits and quotas apply to resources used with the Azure data lake such as the VM cores and ADF instances. Managed identities and SPNs must have different privileges than those who are merely reading the data.

Backup and archive can be handled with scripts that create action and filter objects to apply blob tiering to block blobs matching certain criteria; the backup storage account itself is provisioned in a brand-new resource group, with names derived from input variables.
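
As one possible shape for such a script, the sketch below uses the azure-mgmt-storage Python SDK to define a lifecycle rule whose filter targets block blobs under a prefix and whose actions tier them down as they age. The subscription, resource group, account, and prefix are placeholders, and the rule values are illustrative only.

# A minimal sketch, assuming azure-mgmt-storage and placeholder names.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Lifecycle rule: tier aging block blobs under a prefix to cool, then archive.
policy = {
    "policy": {
        "rules": [
            {
                "enabled": True,
                "name": "archive-old-raw-data",   # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": ["raw/LogData"],   # hypothetical container/prefix
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {"daysAfterModificationGreaterThan": 90},
                            "tierToArchive": {"daysAfterModificationGreaterThan": 365},
                        }
                    },
                },
            }
        ]
    }
}

# The management policy name for a storage account is always "default".
client.management_policies.create_or_update(
    "<resource-group>", "<storage-account>", "default", policy
)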

 

Tuesday, April 4, 2023

 

The Data Lake is often used with Azure Data Factory to support copy operations. The lake allows files and folders to be downloaded directly, on the order of thousands of files per folder, but uploads are best done with the help of Azure Data Factory, which even supports uploading a zip file and extracting its contents in transit before they are stored in the Data Lake. ADF supports a variety of file formats, such as Avro, binary, and text to name a few, and it uses all available throughput by performing as many reads and writes in parallel as possible.

Data Lakes work best for partition pruning of time-series data which improves performance by reading only a subset of data.  The pipelines that ingest time-series data often place their files with a structured naming such as /DataSet/YYYY/MM/DD/HH/mm/datafile_YYYY_MM_DD_HH_mm.tsv.
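
With that layout, a consumer can prune partitions simply by listing only the folder for the window of interest. Below is a minimal sketch using the azure-storage-file-datalake SDK, with placeholder account and container names.

# A minimal sketch, assuming a /DataSet/YYYY/MM/DD/HH/ layout and placeholder names.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    "https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("dataset")   # hypothetical container

# Prune partitions by listing only the folder for the hour of interest
# instead of scanning the whole dataset.
hour = datetime.utcnow() - timedelta(hours=1)
prefix = hour.strftime("DataSet/%Y/%m/%d/%H")
for path in fs.get_paths(path=prefix, recursive=True):
    if not path.is_directory:
        print(path.name)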

The Hitchhiker’s Guide to Data Lake from Azure recommends that monitoring be set up for effective operations management of this cloud resource. This will help to make sure that it is available for use by any workloads which consume data contained within it. Key considerations include auditing the data lake in terms of frequent operations, having visibility into key performance indicators such as operations with high latency and understanding common errors.  All of the telemetry will be available via Azure Storage Logs which is easy to query with Kusto query language.

Common queries are also candidates for reporting via Azure Monitor dashboards. For example:

1. Frequent operations can be queried with:

StorageBlobLogs
| where TimeGenerated > ago(3d)
| summarize count() by OperationName
| sort by count_ desc
| render piechart

 

2. High latency operations can be queried with:

StorageBlobLogs
| where TimeGenerated > ago(3d)
| top 10 by DurationMs desc
| project TimeGenerated, OperationName, DurationMs, ServerLatencyMs, ClientLatencyMs = DurationMs - ServerLatencyMs

 

3. Operations causing the most errors can be queried with:

StorageBlobLogs
| where TimeGenerated > ago(3d) and StatusText !contains "Success"
| summarize count() by OperationName
| top 10 by count_ desc

 

and these can be reported on the dashboard.

Monday, April 3, 2023

The differentiation between a Healthcare and Insurance industry cloud and a general-purpose cloud solution for healthcare data.

Introduction: This article discusses whether Healthcare and InsuranceTech require different industry clouds or whether they can be hosted directly as general-purpose cloud solutions. Indeed, the cloud has all the resources that allow any infrastructure to be created for any company and purpose, regardless of size. This raises the question of what is special about Healthcare and InsuranceTech.

First, there is Fast Healthcare Interoperability Resources (FHIR), a healthcare data standard that serves both as an information model that lets us link data across systems and as a communication standard that lets us exchange data between systems. Healthcare IT systems often do not share the same data models; in fact, as more data becomes digitized, incompatibilities multiply and resolving them becomes more expensive and time-consuming.
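
To give a flavor of what that interoperability looks like in practice, the sketch below issues a standard FHIR search against a FHIR endpoint. The service URL and token are placeholders, and the Azure API for FHIR is only one possible host.

# A minimal sketch, assuming a FHIR server base URL and a bearer token are available.
import requests

fhir_base = "https://<your-fhir-service>.azurehealthcareapis.com"   # placeholder
token = "<azure-ad-access-token>"                                    # placeholder

response = requests.get(
    f"{fhir_base}/Patient",
    params={"family": "Smith", "_count": 10},
    headers={"Authorization": f"Bearer {token}",
             "Accept": "application/fhir+json"},
)
bundle = response.json()   # a FHIR Bundle of matching Patient resources
for entry in bundle.get("entry", []):
    print(entry["resource"]["id"])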

Second, there is the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and associated laws that establish requirements for the use, disclosure, and safeguarding of protected health information (PHI). HIPAA applies to doctors' offices, hospitals, health insurers, and other healthcare companies that create, receive, maintain, transmit, or access PHI. HIPAA and the Health Information Technology for Economic and Clinical Health (HITECH) Act include rules for 1. privacy, with safeguards that prevent the use or disclosure of PHI without authorization, 2. security, with administrative, technical, and physical safeguards, and 3. breach notification, for whenever a breach of unsecured PHI occurs.

Third, there is the consent management that involves managing data related to consent and privacy across configuration management for the consent store, data related to the permissions granted by the users and managed resources that include user data mappings and data related to the resources in the form of attributes.

Lastly, there must be an overt display of security and compliance controls spanning both observability and security. As an example of compliance, the Cybersecurity Maturity Model Certification attempts to prevent the theft of intellectual property and sensitive information from all industrial sectors due to malicious cyber activity. FedRAMP High and FedRAMP Moderate both pertain to account management, monitoring, and role-based access controls, but at different impact levels. HIPAA HITRUST 9.2 targets both privilege management and role-based access control.

The difference between Healthcare and InsuranceTech can be compared to the difference between customer-facing technologies and backend processing. On one hand, customer-facing data must be aggregated from various sources, and a service or cloud greatly abstracts and simplifies this handling; on the other hand, insurance companies increase competition by introducing innovations, analysis, and ultimately new options when processing these data. Many industry clouds from different cloud providers have agreed on the need for a dedicated API that can be consumed by various providers, clients, and their devices. Insurance IT companies seek to develop new business capabilities with rapid application development and new machine learning models, which mandate dedicated pipelines. Even the cloud resources the two use differ in their purpose: Healthcare benefits from the FHIR API, while InsuranceTech prefers data lakes and pipelines.

Sunday, April 2, 2023

 Some essentials for an integration pipeline include: 

1. Data ingestion: data is passed to blob storage and the metadata is queued in a job database for downstream processing.

2. Central components:
   - messaging queue: all components interact with it, which allows scalability.
   - config store: enables a stable pipeline with configuration-driven variability.
   - job store: keeps track of the various job executions.

3. Job poller/scheduler: picks up pending jobs and drops a message in the message queue, for example Kafka (see the sketch after this list).

4. Consumers: these must be horizontally scalable, with the data staged in tenant-specific databases.

5. Databases: these must be horizontally scalable, with the data transformed and stored in suitable multi-tenant databases.

6. Data warehouse/data lake: the data must be denormalized with dimensions, say along tenants, to support multitenant data sources.

7. Analytics stack: the data lake/warehouse is the source for the analysis stack, preferably U-SQL based to leverage existing skills.

8. ML stack: also leverages the data lake/warehouse, but with emphasis on separating training and test datasets and on the execution and feedback loop for a model.

9. Monitoring, performance, and security: these include telemetry, auditing, security and compliance, aging, and lifecycle management.
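
As referenced in item 3 above, here is a minimal sketch of the job poller/scheduler, assuming a SQL-based job store read via pyodbc and a Kafka topic; the table, column, connection, and topic names are hypothetical, not a prescribed implementation.

# A minimal sketch of the poller/scheduler, with placeholder names throughout.
import json
import time
import pyodbc
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="<broker>:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
conn = pyodbc.connect("<job-store-connection-string>")

while True:
    rows = conn.execute(
        "SELECT JobId, TenantId, BlobPath FROM Jobs WHERE Status = 'Pending'"
    ).fetchall()
    for job_id, tenant_id, blob_path in rows:
        # Drop a message on the queue so horizontally scalable consumers pick it up.
        producer.send("ingestion-jobs",
                      {"jobId": job_id, "tenantId": tenant_id, "blobPath": blob_path})
        conn.execute("UPDATE Jobs SET Status = 'Queued' WHERE JobId = ?", job_id)
    conn.commit()
    time.sleep(30)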

Saturday, April 1, 2023

Designing security for data:

One of the ways to secure data is to control access. This involves access keys, shared access signatures, and Azure AD access. Access keys provide administrative access to an entire resource, such as a storage account; Microsoft recommends they be used only for administrative purposes. Shared access signatures are like tokens that grant granular access to resources within a storage account, and access is provided to whoever or whatever holds the signature. Role-based access control can be used to control access to both the management and the data layer. Shared access signatures can also be generated in association with an Azure AD identity (a user delegation SAS) instead of being created with a storage account access key. Access to files from domain-joined devices is secured using Azure AD identities, which can be used for authentication and authorization.
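
For instance, a short-lived, read-only shared access signature for a single blob can be produced with the azure-storage-blob SDK as in the sketch below. The account, container, and blob names are placeholders, and in practice a user delegation key obtained through Azure AD is preferable to the account key.

# A minimal sketch, assuming azure-storage-blob and placeholder account details.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_blob_sas, BlobSasPermissions

sas_token = generate_blob_sas(
    account_name="<storage-account>",
    container_name="reports",                  # hypothetical container and blob
    blob_name="2023/04/01/sales.csv",
    account_key="<account-access-key>",
    permission=BlobSasPermissions(read=True),  # read-only, nothing broader
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)
url = (
    "https://<storage-account>.blob.core.windows.net/"
    f"reports/2023/04/01/sales.csv?{sas_token}"
)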

Another way to secure the data is to protect it using storage encryption, Azure Disk Encryption, and immutable storage. Storage encryption involves server-side encryption for data at rest and Transport Layer Security for data in transit. The default encryption uses an account encryption key that is Microsoft-managed, but security can be extended with customer-managed keys, which are usually stored in a Key Vault. Azure Disk Encryption encrypts the boot OS and data volumes to further protect the data.

Some controls are available from the databases and data storage products. For example, a database inside a SQL Server is secured with the help of a SQL login. The SQL login is created inside the master database and then mapped to a database user in the database to be protected, and the data can then be accessed with this credential. Every database requires its own user mapping for that same SQL login. If the authentication is to be based on Azure Active Directory, the mechanism varies slightly between a SQL managed instance and an Azure SQL database. Both require an Azure AD-based administrator for the product and then user accounts to work with the databases. When working with SQL managed instances, we give the admin account read access to the Azure AD tenant, map the user accounts to SQL logins in the master database similar to the traditional SQL Server approach, and then create database users mapped one-to-one to those SQL logins. When working with an Azure SQL database, all we need is a database user mapped to the Azure AD identity, so the entries in the master database and the read access for the admin account are avoided entirely.
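
A rough sketch of both approaches in T-SQL, executed here through pyodbc; the login, user, and connection strings are placeholders chosen for illustration.

# A minimal sketch, assuming pyodbc and placeholder connection strings;
# the login and user names are hypothetical.
import pyodbc

# Traditional SQL authentication: login in master, user mapping in each database.
with pyodbc.connect("<master-db-connection-string>", autocommit=True) as conn:
    conn.execute("CREATE LOGIN app_reader WITH PASSWORD = '<strong-password>'")

with pyodbc.connect("<user-db-connection-string>", autocommit=True) as conn:
    conn.execute("CREATE USER app_reader FOR LOGIN app_reader")
    conn.execute("ALTER ROLE db_datareader ADD MEMBER app_reader")

# Azure AD authentication on Azure SQL Database: a contained user is enough,
# with no entry in the master database.
with pyodbc.connect("<user-db-connection-string>", autocommit=True) as conn:
    conn.execute("CREATE USER [dataeng@contoso.com] FROM EXTERNAL PROVIDER")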

Azure SQL services also provide ways to protect the data. These include Transparent Data Encryption, Always Encrypted, and Dynamic Data Masking. Encryption for the database is like that for storage discussed earlier, in that the database is stored on the Azure storage infrastructure, which guarantees encryption at rest, while TLS guarantees encryption in transit. The encryption supports Bring Your Own Key (TDE protectors) managed by the customer and saved in a Key Vault, which helps meet compliance requirements around access and audit; a managed identity enables the Azure SQL Database service to access the Key Vault. With Always Encrypted, the data is encrypted within columns of the tables when storing, say, credit card information. An encryption key is used for one or more columns, and the data is not in the clear even for administrators. When the data needs to be hidden even from a genuine user, dynamic data masking is used so that only partial information is displayed, say the last four digits of the credit card number.
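
As an illustration of dynamic data masking, the partial() masking function below exposes only the last four digits of a card number; the table, column, and user names are hypothetical and reuse the connection style from the previous sketch.

# A minimal sketch, assuming a dbo.Payments table with a CardNumber column.
import pyodbc

with pyodbc.connect("<user-db-connection-string>", autocommit=True) as conn:
    conn.execute(
        "ALTER TABLE dbo.Payments ALTER COLUMN CardNumber "
        "ADD MASKED WITH (FUNCTION = 'partial(0,\"XXXX-XXXX-XXXX-\",4)')"
    )
    # Unprivileged users now see masked values; grant UNMASK to reveal clear text.
    conn.execute("GRANT UNMASK TO app_reader")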

These are some of the ways to secure data on the Azure Data Platform.