Wednesday, October 30, 2024

 This section explores dependency management and pipeline orchestration, which are aptly dubbed “undercurrents” in the book “Fundamentals of Data Engineering” by Matt Housley and Joe Reis.

Data orchestration is a process of dependency management, facilitated through automation. It involves scheduling, triggering, monitoring, and resource allocation. Data orchestrators differ from cron-based schedulers: they can act on events, webhooks, schedules, and even intra-workflow dependencies. Data orchestration provides a structured, automated, and efficient way to handle large-scale data from diverse sources.

Orchestration steers workflows toward efficiency and functionality, with an orchestrator serving as the tool that enables these workflows. Orchestrators typically trigger pipelines based on a schedule or a specific event. Event-driven pipelines are beneficial for handling unpredictable data or resource-intensive jobs.

The perks of having an orchestrator in your data engineering toolkit include workflow management, automation, error handling and recovery, monitoring and alerting, and resource optimization. Directed acyclic graphs (DAGs) bring order, control, and repeatability to data workflows, managing dependencies and ensuring a structured and predictable flow of data. They are pivotal for orchestrating and visualizing pipelines, making them indispensable for managing complex workflows, particularly within a team or in large-scale setups. A DAG serves as a clear roadmap that defines the order of tasks, and with this lens it becomes possible to organize the creation, scheduling, and monitoring of data pipelines.
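
To make this concrete, here is a minimal sketch of such a roadmap expressed as an Airflow-style DAG; the DAG id, task names, and callables are hypothetical placeholders rather than a prescribed design:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables standing in for real pipeline steps.
def extract():
    print("pull raw data from a source system")

def transform():
    print("clean and reshape the extracted data")

def load():
    print("write the transformed data to a target table")

with DAG(
    dag_id="example_roadmap",
    start_date=datetime(2024, 10, 1),
    schedule_interval="@daily",  # the trigger; could also be event-driven
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The DAG edges encode the dependencies: extract -> transform -> load.
    t_extract >> t_transform >> t_load

The edges, not the schedule, are what give the workflow its structure and predictability.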

Data orchestration tools have evolved significantly over the past few decades, with Apache Airflow and Luigi emerging as the most dominant. However, it is crucial to choose the right tool for the job, as each has its strengths and weaknesses. Data orchestrators, like the conductor of an orchestra, balance declarative and imperative approaches to provide flexibility while supporting software engineering best practices.

When selecting an orchestrator, factors such as scalability, code and configuration reusability, and the ability to handle complex logic and dependencies are important to consider. The orchestrator should be able to scale vertically or horizontally, ensuring that the process of orchestrating data is separate from the process of transforming data.

Orchestration is about connections, and platforms like Azure, Amazon, and Google offer hosted versions of popular tools like Airflow. Platform-embedded alternatives like Databricks Workflows provide more visibility into the tasks they orchestrate. Popular orchestrators have strong community support and continuous development, ensuring they remain up to date with the latest technologies and best practices. Support is crucial for closed-source and paid solutions, where solutions engineers can help resolve issues.

Observability is essential for understanding transformation flows, so ensure your orchestrator supports various methods of alerting your team. To implement an orchestration solution, you can build a solution, buy an off-the-shelf tool, self-host an open-source tool, or use a tool included with your cloud provider or data platform. Apache Airflow, developed at Airbnb, is a popular choice due to its ease of adoption, simple deployment, and ubiquity in the data space. However, it has flaws, such as being engineered to orchestrate, not to transform or ingest.

Open-source tools like Airflow and Prefect are popular orchestrators with paid, hosted services and support. Newer tools like Mage, Keboola, and Kestra are also innovating. Open-source tools offer community support and the ability to modify source code. However, they depend on support for continued development and may risk project abandonment or instability. A tool's history, support, and stability must be considered when choosing a solution.

SQL-centric orchestration is a crucial aspect of modern data engineering, relying on relational databases for data transformation. Tools like dbt, Delta Live Tables, Dataform, and SQLMesh act as orchestrators that evaluate dependencies, optimize, and execute commands against a database to produce the desired results. A potential limitation of this approach is the need for a mechanism to observe data across the different layers; without one, there is a disconnect between sources and cleaned data, which makes it harder to identify errors in downstream data.

Design patterns can significantly enhance the efficiency, reliability, and maintainability of data orchestration processes. Some orchestration solutions make these patterns easier to apply, such as building backfill logic into pipelines, ensuring idempotence, and supporting event-driven orchestration. These patterns help avoid one-off thinking and ensure consistent results in data engineering. Choosing a platform-specific data orchestrator can provide greater visibility between and within data workflows, which is valuable for ETL workloads.
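
As one illustration of the idempotence and backfill patterns, a pipeline can delete the target partition before re-inserting it, so re-running the same date yields the same result; the sketch below uses DuckDB purely as a stand-in engine, and the table and column names are hypothetical:

import duckdb

def backfill_partition(con, run_date: str) -> None:
    """Idempotent backfill: re-running the same run_date yields the same rows."""
    # Remove whatever a previous run wrote for this partition...
    con.execute("DELETE FROM daily_sales WHERE sale_date = ?", [run_date])
    # ...then rebuild it from the raw source in one shot.
    con.execute(
        """
        INSERT INTO daily_sales
        SELECT sale_date, SUM(amount) AS total_amount
        FROM raw_sales
        WHERE sale_date = ?
        GROUP BY sale_date
        """,
        [run_date],
    )

con = duckdb.connect()
con.execute("CREATE TABLE raw_sales (sale_date VARCHAR, amount DOUBLE)")
con.execute("CREATE TABLE daily_sales (sale_date VARCHAR, total_amount DOUBLE)")
con.execute("INSERT INTO raw_sales VALUES ('2024-10-01', 10.0), ('2024-10-01', 5.0)")
backfill_partition(con, "2024-10-01")
backfill_partition(con, "2024-10-01")  # safe to repeat: still one row for the date
print(con.execute("SELECT * FROM daily_sales").fetchall())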

Orchestrators are complex and can be difficult to develop against locally because of their trigger mechanics. To improve productivity, invest in tools that allow for fast feedback loops, quick error identification, and a developer-friendly local environment. Retry and fallback logic are essential for handling failures in a complex data stack, ensuring data integrity and system reliability. Idempotent pipelines set up scenarios for retrying operations, or for skipping them and alerting the proper parties. Parameterized execution allows for more malleability in orchestration, supporting multiple cases and the reuse of pipelines. Lineage refers to the path data travels through its lifecycle, and a robust lineage solution is crucial for debugging issues and extending pipelines. Column-level lineage is becoming an industry norm in SQL orchestration, and platform-integrated solutions like Databricks Unity Catalog and Delta Live Tables offer advanced lineage capabilities. Pipeline decomposition breaks pipelines into smaller tasks for better monitoring, error handling, and scalability. Building autonomous DAGs can mitigate dependencies and critical failures, making workflows easier to build and debug.
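
A minimal sketch of parameterized execution combined with retry logic, assuming an Airflow environment; the DAG id, parameter name, and callable are hypothetical:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def process_partition(**context):
    # The same pipeline is reused for different inputs via a runtime parameter.
    region = context["params"]["region"]
    print(f"processing partition for region={region}")

with DAG(
    dag_id="parameterized_pipeline",
    start_date=datetime(2024, 10, 1),
    schedule_interval=None,  # triggered manually or by an event
    catchup=False,
    params={"region": "us-east"},  # default value, overridable per run
    default_args={
        "retries": 3,  # retry transient failures...
        "retry_delay": timedelta(minutes=5),  # ...with a delay between attempts
    },
) as dag:
    PythonOperator(task_id="process_partition", python_callable=process_partition)

Combined with idempotent tasks, retries like these can recover from transient failures without corrupting results.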

The evolution of transformation tools, such as containerized infrastructure and hybrid tools like Prefect and Dagster, may change the landscape of data teams. These tools can save time and resources, enabling better observability and monitoring within data warehouses. Emerging tools like SQLMesh may challenge established players like dbt, while plug-and-play solutions like Databricks Workflows are becoming more appealing. These developments will enable data teams to deliver quality data in a timely and robust manner.


Tuesday, October 29, 2024

 The history of data engineering has evolved from big data frameworks like Hadoop and MapReduce to streamlined tools like Spark, Databricks, BigQuery, Redshift, Snowflake, Presto, Trino, and Athena. Cloud storage and transformation tools have made data more accessible, and lakehouses have offered a cost-efficient, unified option for managing data at scale. This evolution has led to a more accessible and efficient data management landscape.

Data transformation environments vary, with common environments being data warehouses, data lakes, and lakehouses. Data warehouses use SQL for transformation, while data lakes store large amounts of data economically. Lakehouses combine aspects of both, offering flexibility and cost-effectiveness. Databricks SQL is a serverless data warehouse that sits on the lakehouse platform. The choice between these environments depends on project needs, team expertise, and long-term data strategy.

Data staging is a crucial step in data transformation: data is often written in a temporary state to a suitable location, such as cloud storage or an intermediate table. Medallion architecture preserves data history and makes time travel possible. It comprises three distinct layers: Bronze for raw data, Silver for lightly transformed data, and Gold for "clean" data. Bronze data is raw and unfiltered; Silver data is filtered, cleaned, and adjusted; and Gold data is stakeholder-ready and sometimes aggregated. This approach can be used in a lake or a warehouse, breaking down each storage layer into discrete stages of data cleanliness.
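
A minimal PySpark sketch of the three layers, assuming an existing Spark session and hypothetical storage paths (in a Databricks/Delta setting the format would typically be "delta" rather than "parquet"):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion_sketch").getOrCreate()

# Bronze: raw, unfiltered data landed as-is from the source.
bronze = spark.read.json("/landing/orders/*.json")
bronze.write.mode("append").parquet("/lake/bronze/orders")

# Silver: filtered, cleaned, and lightly adjusted.
silver = (
    spark.read.parquet("/lake/bronze/orders")
    .dropDuplicates(["order_id"])
    .filter(F.col("order_id").isNotNull())
)
silver.write.mode("overwrite").parquet("/lake/silver/orders")

# Gold: stakeholder-ready and often aggregated.
gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("lifetime_value"))
gold.write.mode("overwrite").parquet("/lake/gold/customer_lifetime_value")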

Data transformation is largely shaped by the tools available, with Python being a popular choice in the digital era. The Pandas library sits at the core of Python's data-processing ecosystem and has evolved significantly. However, scaling Python to large datasets has been challenging, often requiring libraries like Dask and Ray. Python-based data processing is undergoing a renaissance, with Rust-backed engines emerging. To transform data in Python, choose a suitable library and framework, such as Pandas or newer options like Polars and DuckDB. SQL is declarative by nature, though it can be used in more imperative styles; it is limited by a lack of general-purpose functionality, so languages like Jinja/Python and JavaScript often complement SQL workflows. Rust is considered by some to be the future of data engineering, but Python retains a solid foothold thanks to its community support and library ecosystem.
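
As a small illustration, here is the same aggregation written with pandas and with Polars (the column names are hypothetical, and the group_by call assumes a recent Polars release):

import pandas as pd
import polars as pl

raw = {"customer": ["a", "a", "b"], "amount": [10.0, 5.0, 7.5]}

# pandas: eager, in-memory DataFrame API.
pandas_result = pd.DataFrame(raw).groupby("customer", as_index=False)["amount"].sum()

# Polars: expression-based API with a lazy engine that can optimize the query plan.
polars_result = (
    pl.DataFrame(raw)
    .lazy()
    .group_by("customer")
    .agg(pl.col("amount").sum())
    .collect()
)

print(pandas_result)
print(polars_result)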

Transformation frameworks are multilanguage engines for executing data transformations across machines or clusters, allowing transformations to be expressed in various languages like Python or SQL. Two popular engines are Hadoop and Spark. Hadoop, an open-source framework inspired by Google's MapReduce, gained traction in the mid-2000s at tech giants like Yahoo and Facebook. However, MapReduce was not well suited to real-time or iterative workloads, leading to the rise of Apache Spark in the early 2010s. Spark, a powerful open-source data processing framework, revolutionized big data analytics by offering speed, versatility, and integration with key technologies. Its key innovation is resilient distributed datasets (RDDs), enabling in-memory data processing and faster computations. With the rise of serverless data warehouses, standalone big data engines may no longer be necessary for many workloads, but query engines like BigQuery, Databricks SQL, and Redshift should not be disregarded. Recent advancements in in-memory computation may continue to expand data warehouses' transformation capabilities.
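
To make the RDD idea concrete, here is the classic word-count sketch; the cache() call is what keeps the dataset in memory for fast, iterative reuse:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_sketch").getOrCreate()
sc = spark.sparkContext

# A resilient distributed dataset built from an in-memory collection.
lines = sc.parallelize(["spark makes big data simple", "spark is fast"])

counts = (
    lines.flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

counts.cache()  # keep the RDD in memory for repeated use
print(counts.collect())  # e.g. [('spark', 2), ('makes', 1), ...]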

Data transformation is a crucial process that involves pattern mapping and understanding the different transformations that should be applied. Common patterns include:
• Enrichment: enhancing existing data with additional sources, such as adding demographic information to customer records.
• Joining: combining two or more datasets based on a common field, like a JOIN operation in SQL.
• Filtering: selecting only the necessary data points for analysis based on certain criteria, reducing the volume and improving the quality of the data.
• Structuring: translating data into a required format or structure, such as transforming JSON documents into tabular format or vice versa.
• Conversion: changing the data type of a particular column or field, especially when converting between semi-structured and structured data sources.
• Aggregation: summarizing and combining data to draw conclusions from large volumes of data, enabling insights that inform business decisions and create value from data assets.
• Anonymization: masking or obfuscating sensitive information within a dataset to protect privacy, such as hashing emails or removing personally identifiable information (PII) from records.
• Splitting: a form of denormalization, dividing a complex data column into multiple columns.
• Deduplication: removing redundant records to create a unique dataset, often through aggregation, filtering, or other methods.
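
A compact pandas sketch of a few of these patterns together (deduplication, filtering, joining/enrichment, anonymization via hashing, and aggregation); the data and column names are hypothetical:

import hashlib

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer_id": ["c1", "c2", "c2", "c3"],
    "amount": [10.0, 5.0, 5.0, -2.0],
    "email": ["a@x.com", "b@y.com", "b@y.com", "c@z.com"],
})
customers = pd.DataFrame({"customer_id": ["c1", "c2", "c3"],
                          "segment": ["gold", "silver", "gold"]})

deduped = orders.drop_duplicates(subset=["order_id"])               # deduplication
filtered = deduped[deduped["amount"] > 0]                           # filtering
enriched = filtered.merge(customers, on="customer_id", how="left")  # joining / enrichment

# Anonymization: hash the email rather than storing raw PII.
enriched["email_hash"] = enriched["email"].apply(
    lambda e: hashlib.sha256(e.encode()).hexdigest()
)
enriched = enriched.drop(columns=["email"])

# Aggregation: summarize per customer segment.
print(enriched.groupby("segment", as_index=False)["amount"].sum())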

Data update patterns are essential for transforming data in a target system. Overwrite is the simplest pattern: an existing source or table is dropped completely and rewritten with new data. Insert appends new data to an existing source without changing existing rows. Upsert, which updates matching rows and inserts new ones, is more complex still, with applications in change data capture, sessionization, and deduplication; platforms like Databricks provide MERGE functionality to simplify the process. Data deletion is often misunderstood, and comes in two main types: "hard" and "soft." Soft deletes preserve a historical record of an asset's status, while hard deletes eliminate those records, which can be problematic in data recovery cases.
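
A sketch of the upsert pattern using MERGE, expressed as Spark SQL against hypothetical Delta tables (the syntax shown is Delta Lake's MERGE INTO, which assumes a Delta-enabled Spark session; table and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge_sketch").getOrCreate()

# Upsert: update matching rows, insert new ones. Assumes the 'customers' and
# 'customer_updates' Delta tables already exist.
spark.sql("""
    MERGE INTO customers AS target
    USING customer_updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN
      UPDATE SET target.email = source.email, target.updated_at = source.updated_at
    WHEN NOT MATCHED THEN
      INSERT (customer_id, email, updated_at)
      VALUES (source.customer_id, source.email, source.updated_at)
""")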

When building a data transformation solution, consider several best practices, including staging, idempotency, normalization, and incrementality. Staging protects against data loss and ensures a low time to recovery (TTR) in case of failure. Idempotency ensures consistency and reliability: performing the same operation multiple times produces the same result, much like reproducibility. Normalization refines data into a clean, orderly format, while denormalization duplicates records and information for improved performance. Incrementality determines whether a pipeline is a simple INSERT OVERWRITE or a more complex UPSERT. Predefined patterns for building incremental workflows can be found in tools like dbt and Airflow. Near-real-time data transformation spans batch, micro-batch, and streaming approaches. Micro-batch approaches, like Apache Spark's PySpark and Spark SQL, are simpler to implement than true single-event transformations. Spark Structured Streaming is a popular streaming engine that efficiently handles incremental and continuous updates, achieving latencies as low as 100 milliseconds with exactly-once fault-tolerance guarantees. Continuous Processing, introduced in Spark 2.3, can reduce latencies to as little as 1 millisecond, further enhancing its capability for streaming data transformation.
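
A minimal Structured Streaming sketch, assuming a Kafka source (with the Kafka connector available) and hypothetical topic, servers, and paths:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming_sketch").getOrCreate()

# Micro-batch stream read from a Kafka topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
)

parsed = events.select(F.col("value").cast("string").alias("payload"))

# Incremental, continuous updates with checkpointing for fault tolerance.
query = (
    parsed.writeStream.format("parquet")
    .option("path", "/lake/bronze/orders_stream")
    .option("checkpointLocation", "/checkpoints/orders_stream")
    .trigger(processingTime="1 minute")  # micro-batch cadence
    .start()
)
query.awaitTermination()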

The modern data stack is experiencing a second renaissance due to new technologies and AI advancements. As a result, new tools and technologies are emerging to redefine data transformation. However, it's crucial to adhere to timeless strategies for managing data and creating cleaned assets. Supercharged tooling and automations can be both beneficial and challenging, but engineers must ensure well-planned and executed transformation systems with a high value-to-cost ratio.


Monday, October 28, 2024

 Comparisons of ADF with Apache Airflow

Choosing the right tool for a data transfer job is highly important. Previous articles introduced Azure Data Factory and Apache Airflow as cloud tools for large-scale, dependable transfers, along with a comparison to DEIS workflow. This section enumerates the differences between ADF and Apache Airflow.

Azure offers various services, each with its strengths and use cases. ADF is a fully managed, serverless data integration service that provides visual designer tools for configuring sources, destinations, and ETL processes. It can handle large-scale data pipelines and transformations. Apache Airflow, by contrast, is not a first-class citizen of the Azure cloud; it is an open-source scheduler for workflow management. While it is not a native Azure service, it can be deployed to Azure Kubernetes Service or Azure Container Instances.

Azure does have a native near-equivalent of Apache Airflow in the form of Azure Logic Apps, which provides workflow automation and app integration. But for more complex scenarios requiring code-based orchestration and dynamic workflows, developers might opt to deploy Airflow on Azure.

Airflow offers more flexibility with its Python-based DAGs, allowing for complex logic and dependencies. ADF’s strength lies in its no-code/low-code approach and its integration with other Azure services.

Apache Airflow is also better suited for event-driven workflows and streaming data, while ADF is better suited for batch processing and regular ETL tasks, with an emphasis on visual authoring and monitoring.

Best practices and guidance on the most efficient use of resources are continuously updated in the official documentation for both tools.

One often-overlooked feature is that Apache Airflow can be integrated with Azure Data Factory, which allows for the orchestration of complex workflows across various cloud services. This integration leverages the strengths of both platforms to create a robust data pipeline solution: the serverless architecture of ADF can handle large-scale data workloads, while Airflow’s scheduling capabilities manage the workflow orchestration.

Airflow’s Python-based workflows allow for dynamic pipeline generation, which can be tailored to specific needs. Both platforms also offer extensive connectivity options with various data sources and processing services.

The integration steps can be listed as follows:
1. Create a data factory with the appropriate storage and compute resources.
2. Configure the Airflow environment, ensuring that it has network access to Azure Data Factory.
3. Create custom Airflow operators (or use the Microsoft Azure provider's operators) to interact with the ADF APIs for triggering and monitoring pipelines.
4. Define workflows as Airflow DAGs that specify the sequence of tasks, including the operators that manage ADF activities.
5. Use the Airflow user interface to monitor workflow execution and manage any necessary interventions.

Setting up an Azure Active Directory (service principal) connection in Apache Airflow is a prerequisite.

This can be done with commands such as:

pip install apache-airflow-providers-microsoft-azure

airflow connections add azure_data_factory_default \
  --conn-type azure_data_factory \
  --conn-login <client_id> \
  --conn-password <client_secret> \
  --conn-extra '{"tenantId": "<tenant_id>", "subscriptionId": "<subscription_id>"}'

(The exact extra keys accepted by the Azure Data Factory connection vary by provider version; see the Microsoft Azure provider documentation.)

and validated with:

airflow connections get azure_data_factory_default

The integration can then be tested with:

from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.operators.data_factory import AzureDataFactoryRunPipelineOperator

default_args = {"retries": 1}

with DAG(
    "azure_data_factory_integration",
    start_date=datetime(2024, 10, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:

    run_pipeline = AzureDataFactoryRunPipelineOperator(
        task_id="run_pipeline",
        azure_data_factory_conn_id="azure_data_factory_default",
        pipeline_name="MyPipeline",  # name of the ADF pipeline to trigger
        parameters={"param1": "value1"},  # parameters passed to the ADF pipeline run
        resource_group_name="<resource_group>",  # may also be configured on the connection
        factory_name="<factory_name>",
        wait_for_termination=True,  # block until the ADF pipeline run finishes
    )

This completes the comparison of ADF with Airflow.

Previous articles: IaCResolutionsPart191.docx


Sunday, October 27, 2024

 Number of Ways to Split Array

You are given a 0-indexed integer array nums of length n.

nums contains a valid split at index i if the following are true:

• The sum of the first i + 1 elements is greater than or equal to the sum of the last n - i - 1 elements.

• There is at least one element to the right of i. That is, 0 <= i < n - 1.

Return the number of valid splits in nums.

Example 1:

Input: nums = [10,4,-8,7]

Output: 2

Explanation:

There are three ways of splitting nums into two non-empty parts:

- Split nums at index 0. Then, the first part is [10], and its sum is 10. The second part is [4,-8,7], and its sum is 3. Since 10 >= 3, i = 0 is a valid split.

- Split nums at index 1. Then, the first part is [10,4], and its sum is 14. The second part is [-8,7], and its sum is -1. Since 14 >= -1, i = 1 is a valid split.

- Split nums at index 2. Then, the first part is [10,4,-8], and its sum is 6. The second part is [7], and its sum is 7. Since 6 < 7, i = 2 is not a valid split.

Thus, the number of valid splits in nums is 2.

Example 2:

Input: nums = [2,3,1,0]

Output: 2

Explanation:

There are two valid splits in nums:

- Split nums at index 1. Then, the first part is [2,3], and its sum is 5. The second part is [1,0], and its sum is 1. Since 5 >= 1, i = 1 is a valid split.

- Split nums at index 2. Then, the first part is [2,3,1], and its sum is 6. The second part is [0], and its sum is 0. Since 6 >= 0, i = 2 is a valid split.

Constraints:

• 2 <= nums.length <= 10^5

• -10^5 <= nums[i] <= 10^5

Solution:

class Solution {
    public int waysToSplitArray(int[] nums) {
        if (nums == null || nums.length <= 1) return 0;
        // Use long for the running sums: with up to 10^5 elements of magnitude up to 10^5,
        // the totals can overflow an int.
        long sumSoFar = 0;
        long total = 0;
        int count = 0;
        for (int i = 0; i < nums.length; i++) {
            total += nums[i];
        }
        // A split at index i is valid when the prefix sum is >= the remaining suffix sum
        // and at least one element remains to the right (hence nums.length - 1).
        for (int i = 0; i < nums.length - 1; i++) {
            sumSoFar += nums[i];
            if (sumSoFar >= total - sumSoFar) {
                count += 1;
            }
        }
        return count;
    }
}

Test cases:

[0] => 0

[1] => 0

[0,1] => 0

[1,0] => 1

[0,0,1] => 0

[0,1,0] => 1

[0,1,1] => 1

[1,0,0] => 2

[1,0,1] => 2

[1,1,0] => 2

[1,1,1] => 1


Saturday, October 26, 2024

 Managing copilots: 

This section of the series on cloud infrastructure deployments focuses on the proliferation of copilots for different business purposes and internal processes. As with any flight, a copilot is of assistance only to the captain, who remains responsible for the flight's success. If the captain does not know where she is going, then even the copilot's immense assistance will not be enough. It is of secondary importance that the data a copilot uses might be prone to bias or shortcomings and might even lead to so-called hallucinations. Copilots are, after all, large language models that work entirely by treating data as vectors and leveraging classification, regression, and vector algebra to respond to queries. They do not build a knowledge graph and do not have the big picture of the business purpose they will be applied to. If that purpose is not managed properly, infrastructure engineers might find themselves maintaining many copilots for different use cases, diluting the benefits where one would have sufficed.

Consolidation of large language models and their applications to different datasets is only the tip of the iceberg of what these emerging technologies offer as instruments for data scientists. Machine learning pipelines and applications are as diverse and siloed as the datasets they operate on, and those datasets are not always present in data lakes or virtual warehouses. Consequently, a script or a prediction API written and hosted as a standalone application does not make the best use of infrastructure for customer convenience, whether in streamlining interactions or improving touch points. This is not to say that different models cannot be used, that resources should never proliferate, or that consolidation brings no cost savings; it is about the business justification for the number of copilots needed. When we work backwards from what the customer benefits from or experiences, one of the salient observations in favor of disciplined infrastructure management is that less is more. Hyperconvergence of infrastructure for various business purposes, when those initiatives are bought into by stakeholders with both business and technical representation, makes the investments more deliberate and fulfilling.

And cloud or infrastructure management is not restrictive of experimentation; it only argues against uncontrolled experimentation that places customers in a lab. As long as experiments and initiatives are specific in terms of duration, budget, and outcomes, infrastructure management can go the extra mile of cleaning up, decommissioning, and even repurposing, so that technical and business outcomes go hand in glove.

Processes are hard to establish when a technology is emerging, and processes are also extremely difficult to enforce as new standards in the workplace. The failure of Six Sigma and the adoption of agile methodologies are testament to that. However, best practices for copilot engineering are easier to align with cloud best practices and current software methodologies in most workplaces.

#codingexercise Codingexercise-10-26-2024.docx

Friday, October 25, 2024

 This is a summary of the book titled “The Good Drone: How social movements democratize surveillance” written by Austin Choi-FitzPatrick and published by MIT Press in 2020. The author asserts that drones democratize airspace. His comprehensive survey of civic use is intriguing and compels the reader to wonder whether “unmanned aircraft” will be for everyone. For any social scientist, these material technologies will be engrossing. Drones began as nonviolent tools; their use might be considered disruptive, but it is not. Malevolent drones are rare, if not exclusively military, and there are countermeasures against drones. Tools that accomplish social change must be “visible, accessible, affordable, useful and appropriate”.

Material technologies, such as drones, have become increasingly important in social movements and activism. They serve as tools to collect and disseminate information, sometimes making it costly for those in power to maintain the status quo. Social scientists often view new devices as weapons or threats to civil liberties, but these tools can also be used to influence society. Drones, which began as nonviolent tools, became practical and affordable once manufacturers made them easy to control. In 2012, drone use by intergovernmental organizations, governments, businesses, scientists, and civil society groups took off. The air is no longer a place where only governments and corporations rule and surveil the population, and some drone use is disruptive. Examples include a drone above Kruger National Park in South Africa, drones documenting crowds in Moscow, Kyiv, Bangkok, and Istanbul, and a drone in Aleppo, Syria, showcasing the effects of a brutal siege. Other drone use is emergent and nondisruptive, such as Australian biologists investigating the health of humpback whales and the Slavery from Space project searching for brick kilns in India.

A drone helped a Sinti-Roma settlement in Hungary show the world their plight. However, social scientists often focus more on the image itself than the drone. Drones can disrupt traditional photography by taking flight, potentially leading to near-instantaneous monitoring of places or events. They can also provide unfettered views of skyscrapers, prisons, and factory farms, providing democratic surveillance. New forms of data gathering may enable drones to document police actions, potentially allowing citizens to fly over military installations. Drones may also change our concept of private spaces, raising questions about privacy and surveillance.

Drones can be used for malevolent purposes, including surveillance, hunting, and causing harm to people or infrastructure. So, privacy remains a concern, with regulations focusing on police use, surveillance without permission, hunting, and drone control. Countermeasures include drone detection systems like DroneShield and SkySafe, as well as GPS transmitters and energy beams. Drones can also be difficult to fly in strong winds, fog, or dense rain, and warm surroundings can make them undetectable. As tools of resistance, drones can be used to estimate crowd sizes and monitor police officers. However, as drone prices decrease, it is crucial to consider the psychological, economic, political, and social consequences of drone use.


Thursday, October 24, 2024

 Chaos engineering

Administrators will find this section familiar. There are technical considerations that emphasize design and service-level objectives, as well as the scale of the solution, but drills, testing, and validation are essential to guarantee smooth operations. Chaos engineering is one such method of testing the reliability of an infrastructure solution. While the reliability of individual resources might be a concern for the public cloud provider, the reliability of the deployments falls on the IaC authors and deployers. In contrast with other infrastructure concerns such as security, whose mitigations rest on theory and design (Zero Trust and least-privilege principles), chaos engineering is all about trying things out with drills in controlled environments.
By the time a deployment of infrastructure is planned, business considerations have included the following:
1. Understanding what kind of solution is being created, such as business-to-business, business-to-consumer, or enterprise software.
2. Defining the tenants in terms of number and growth plans.
3. Defining the pricing model and ensuring it aligns with the tenants’ consumption of Azure resources.
4. Understanding whether the tenants need to be separated into different tiers and, based on the customer’s requirements, deciding on the tenancy model.
And a well-architected review would have addressed the key pillars of 1) Reliability, 2) Security, 3) Cost Optimization, 4) Operational Excellence, and 5) Performance Efficiency. Performance is usually driven by external factors and is closely tied to customer satisfaction. Continuous telemetry, and reacting to it, are essential to tuning performance.

Security and reliability are operational concerns. When exercising deployments to test reliability, it is often necessary to inject faults and bring down components to check how the remaining parts of the system behave. The idea of injecting failure also finds parallels in strengthening security in the form of penetration testing. The difference is that security testing is geared toward exploitation, while reliability testing is geared toward reducing the mean time between failures.

Component-down testing is quite drastic: it involves powering down a zone. There are many network connections to and from cloud resources, so it becomes hard to find an alternative for a component that is down. A multi-tiered approach is needed to build robustness against component-down scenarios. Mitigations often involve workarounds and diverting traffic to healthy redundancies.
Having multi-region deployments of components not only improves reliability but also draws business from the neighborhood of the new presence. A geographical presence for a public cloud is only somewhat different from that of a private cloud. A public cloud lists regions where the services it offers are hosted. A region may have three availability zones for redundancy and availability, and each zone may host a variety of cloud computing resources, small or big. Each availability zone may have one or more stadium-sized datacenters. When the infrastructure is established, the process of commissioning services in the new region is referred to as a buildout. Including appropriate buildouts increases the reliability of the system in the face of failures.
Component-down testing for chaos engineering differs from business continuity and disaster recovery planning in that one discovers problems in reliability and the other determines acceptable mitigation. One cannot do without the other.
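
As a sketch of what a component-down drill might look like in practice, the snippet below injects a fault, polls a health endpoint, and records the time to recovery. The inject_zone_down and restore_zone hooks and the health URL are hypothetical placeholders for whatever fault-injection tooling and probes the team actually uses:

import time
import urllib.request

HEALTH_URL = "https://myservice.example.com/health"  # hypothetical endpoint

def is_healthy(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def inject_zone_down() -> None:
    # Placeholder: call the team's chaos/fault-injection tooling to power down a zone.
    raise NotImplementedError

def restore_zone() -> None:
    # Placeholder: bring the zone back up after the drill.
    raise NotImplementedError

def run_drill(poll_seconds: int = 10, max_wait_seconds: int = 1800) -> float:
    """Measure time to recovery (in seconds) after a component-down fault."""
    inject_zone_down()
    start = time.monotonic()
    try:
        while time.monotonic() - start < max_wait_seconds:
            if is_healthy(HEALTH_URL):
                return time.monotonic() - start
            time.sleep(poll_seconds)
        raise TimeoutError("service did not recover within the drill window")
    finally:
        restore_zone()

Recording the measured recovery time for each drill gives the reliability baseline that business continuity and disaster recovery planning can then set acceptable mitigations against.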

For a complete picture of the reliability of infrastructure deployments, additional considerations become necessary. These include:

First, the automation must involve context switching between the platform and the task of deploying each service to the region. The platform coordinates these services and must maintain order, dependencies, and status across the tasks.
Second, the task for each service is itself complicated and requires definitions in terms of region-specific parameters to an otherwise region-agnostic service model.
Third, each service must declare its dependencies declaratively so that they can be validated and processed correctly. These dependencies may be between services, on external resources, or on the availability of an event from another activity.
Fourth, the service buildouts must be retryable on errors and exceptions; otherwise the platform will require a lot of manual intervention, which increases the cost.
Fifth, the communication between automated activities and manual interventions must be captured with the help of a ticket-tracking or incident-management system.
Sixth, the workflow and the state for each activity pertaining to the task must follow standard operating procedures that are defined independently of region and available for audit.
Seventh, the technologies for platform execution and for the deployments of the services might differ, requiring consolidation and coordination between the two. In such cases, the fewer the context switches between the two, the better.
Eighth, the platform itself must support templates, event publishing and subscription, a metadata store, and onboarding and bootstrapping processes that can be reused.
Ninth, the platform should support parameters that allow a region to be differentiated from others, or that improve customer satisfaction in terms of the features or services available.
Tenth, the progress of the buildout of new regions must be actively tracked, with the help of tiles for individual tasks and rows per service.