Thursday, October 31, 2024

 

A previous article discussed the ETL process, its evolution under recent paradigms, and the role of an orchestrator in data engineering. This section focuses on pipeline issues and troubleshooting. The return on investment in data engineering projects is often reduced by how fragile the system becomes and how much maintenance it requires. Systems do fail, but planning for failure means making them easier to maintain and extend, automating error handling, and learning from experience. The minimum viable product principle and the 80/20 principle are time-honored guides here.

Direct and indirect costs of ETL systems are significant: inefficient operations and long run times translate into high bills from providers, while indirect costs, such as constant triaging and broken data, can be even greater. Teams that win build efficient systems that free them to focus on feature development and data democratization. SaaS (Software as a Service) tooling can be a cost-effective way to get there, but broken data can still lead to loss of trust, revenue, and reputation. To minimize these costs, focus on maintainability, data quality, error handling, and improved workflows. Monitoring and benchmarking are essential for minimizing pipeline issues and expediting troubleshooting efforts, and proper monitoring and alerting improve the maintainability of data systems while lowering the costs associated with broken data. Observing data across ingestion, transformation, and storage, handling errors as they arise, and alerting the team when things break are crucial for ensuring good business decisions.

Data reliability and usefulness are assessed using metrics such as freshness, volume, and quality. Freshness measures the timeliness and relevance of data, ensuring accurate and recent information for analytics, decision-making, and other data-driven processes. Common freshness metrics include the gap between the most recent record timestamp and the current timestamp, the lag between source data and the derived dataset, refresh rate, and latency. Volume refers to the amount of data that must be processed, stored, and managed within a system. Quality involves ensuring data is accurate, consistent, and reliable throughout its lifecycle; example data quality metrics include uniqueness, completeness, and validity.
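
As a brief illustration, here is a minimal sketch of a freshness check in Python that compares the newest record timestamp against the current time; the DataFrame, column name, and six-hour threshold are hypothetical.

import pandas as pd
from datetime import datetime, timedelta, timezone

def freshness_lag(df: pd.DataFrame, ts_column: str = "event_ts") -> timedelta:
    """Return the gap between the newest record and the current time (UTC)."""
    latest = pd.to_datetime(df[ts_column], utc=True).max()
    return datetime.now(timezone.utc) - latest.to_pydatetime()

# Example: flag the dataset as stale if it lags by more than six hours.
# if freshness_lag(orders_df) > timedelta(hours=6): notify_team("orders are stale")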

Monitoring involves detecting errors in a timely fashion and implementing strict measures to improve data quality. Techniques to improve data quality include logging and monitoring, lineage, and visual representations of pipelines and systems. Lineage should be complete and granular, allowing for better insight and efficiency in triaging errors and improving productivity. Overall, implementing these metrics helps ensure data quality and reliability within an organization.

Anomaly detection systems analyze time series data to make statistical forecasts within a certain confidence interval. They can catch errors that originate outside the data systems themselves, such as a bug introduced by a payments processing team that decreases recorded purchases. Data diffs report how changes in code change the resulting data, helping accurate systems remain accurate and serving as an indicator of data quality; tools like Datafold and SQLMesh have data diffing functionality. Assertions are constraints placed on data outputs to validate source data. They are simpler than anomaly detection and can be found in libraries like Great Expectations (GX) or in systems with built-in assertion definitions.
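
As a minimal, library-agnostic sketch of such assertions (tools like Great Expectations formalize the same checks), consider a few uniqueness, completeness, and validity constraints; the column names are hypothetical.

import pandas as pd

def validate_orders(df: pd.DataFrame) -> list[str]:
    failures = []
    if df["order_id"].duplicated().any():       # uniqueness
        failures.append("order_id is not unique")
    if df["customer_id"].isna().any():          # completeness
        failures.append("customer_id has nulls")
    if not df["amount"].ge(0).all():            # validity
        failures.append("amount contains negative values")
    return failures

# failures = validate_orders(orders_df)
# if failures: raise ValueError(f"data quality checks failed: {failures}")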

Error handling is crucial for keeping data systems running and for recovering from the effects of failure, such as lost data or downtime. It involves automating responses to errors or boundary conditions to keep data systems functioning, or alerting the team in a timely, low-noise manner. Approaches include conditional logic, retry mechanisms, and pipeline decomposition. These methods help contain the impact of errors and ensure the smooth functioning of data systems.
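
A minimal sketch of a retry mechanism with exponential backoff around a flaky step follows; the extract_batch callable and its argument are hypothetical placeholders.

import logging
import time

def with_retries(fn, max_attempts: int = 3, base_delay: float = 2.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:                 # conditional logic on failure
            logging.warning("attempt %d failed: %s", attempt, exc)
            if attempt == max_attempts:
                raise                            # surface the error so alerting can fire
            time.sleep(base_delay ** attempt)    # exponential backoff between attempts

# result = with_retries(lambda: extract_batch("2024-10-31"))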

Graceful degradation and error isolation help maintain limited functionality even when part of a system fails. Error isolation is enabled through pipeline decomposition, which allows systems to fail in a contained manner, while graceful degradation ensures that only one part of the business notices an error.

Alerting should be a last line of defense, as receiving alerts is reactive. Isolating errors and building systems that degrade gracefully can reduce alarm fatigue and create a good developer experience for the team.

Recovery systems should be built for disasters, including lost data. Staged data, whether in Parquet-based formats like Delta Lake or through patterns like the medallion architecture, can be used for disaster recovery. Backfilling, the practice of simulating historical runs of a pipeline to create a complete dataset, can save time when something breaks.
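
A minimal sketch of a backfill driver is shown below: it replays a parameterized pipeline for each historical date, assuming a hypothetical run_pipeline entry point that rebuilds one day's partition per call.

from datetime import date, timedelta

def backfill(start: date, end: date, run_pipeline) -> None:
    day = start
    while day <= end:
        run_pipeline(execution_date=day)   # each run rebuilds a single day's data
        day += timedelta(days=1)

# backfill(date(2024, 10, 1), date(2024, 10, 31), run_pipeline)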

Improving workflows is crucial in data engineering, as it is an inherently collaborative job. Data engineering is a question of when things break, not if. Starting with systems that prioritize troubleshooting, adaptability, and recovery can reduce headaches down the line.

Understanding the motivations and workflows of the software teams you work with is crucial for fostering healthy relationships. By focusing on a team's goals and understanding how they work, you can craft a process that improves efficiency.

Structured, pragmatic approaches can ensure healthy relationships through Service-Level Agreements (SLAs), data contracts, APIs, compassion and empathy, and aligning incentives. SLAs can be used to define performance metrics, responsibilities, response and resolution times, and escalation procedures, improving the quality of data that is outside of your control. Data contracts, popularized by dbt, govern data ingested from external sources, providing a layer of standardization and consistency. APIs can be used to transmit an expected set of data, providing granular access control, scalability benefits, and versioning, which can be useful for compliance.
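
As one lightweight way to enforce a contract in code, the sketch below validates ingested records against a schema using pydantic; the field names and types are hypothetical.

from datetime import datetime
from pydantic import BaseModel, ValidationError

class OrderRecord(BaseModel):
    order_id: str
    customer_id: str
    amount: float
    created_at: datetime

def validate_payload(rows: list[dict]) -> list[OrderRecord]:
    validated = []
    for row in rows:
        try:
            validated.append(OrderRecord(**row))   # reject rows that break the contract
        except ValidationError as exc:
            raise ValueError(f"contract violation: {exc}") from exc
    return validated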

Compassion and empathy are essential in engineering, as in psychology: understanding coworkers' motivations, pain points, and workflows allows for effective communication and an appeal to their incentives. In the digital age, it is worth going the extra mile to understand coworkers and what drives them.

Setting key performance indicators (KPIs) around common incident management metrics can help justify the time and energy required to do the job right. These metrics include the number of incidents (N), time to detection (TTD), time to resolution (TTR), and data downtime (N × [TTD + TTR]).
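
As a quick worked example with illustrative numbers, four incidents with an average of two hours to detect and six hours to resolve add up to 32 hours of data downtime.

incidents = 4                 # N
ttd_hours = 2                 # average time to detection
ttr_hours = 6                 # average time to resolution
data_downtime = incidents * (ttd_hours + ttr_hours)
print(data_downtime)          # 32 hours of data downtime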

Continually iterating and adjusting processes in the wake of failures and enhancing good pipelines to become great are some ways to improve outcomes. Documentation is crucial for understanding how to fix errors and improve the quality of data pipelines. Postmortems are valuable for analyzing failures and learning from them, leading to fewer events that require recovery. Unit tests are essential for validating small pieces of code and ensuring they produce desired results. Continuous integration/continuous deployment (CI/CD) is a preventative practice to minimize future errors and ensure a consistent code base.
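
A minimal pytest-style sketch of a unit test for a small transformation follows; the deduplicate helper is a hypothetical example.

import pandas as pd

def deduplicate(df: pd.DataFrame, key: str) -> pd.DataFrame:
    return df.drop_duplicates(subset=[key], keep="last")

def test_deduplicate_keeps_one_row_per_key():
    df = pd.DataFrame({"id": [1, 1, 2], "value": ["a", "b", "c"]})
    result = deduplicate(df, key="id")
    assert len(result) == 2
    assert set(result["id"]) == {1, 2}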

Engineers should simplify and abstract complex code to improve collaboration and reduce errors. Building data systems as code, which can be rolled back and reverted to previous states, improves observability, disaster recovery, and collaboration. Tools that are difficult or impossible to version control or manipulate through code should be used with caution. With responsibilities defined, incentives aligned, and a monitoring and troubleshooting toolkit in place, engineers can automate and optimize data workflows. Balancing automation and practicality is essential in data engineering, ensuring robust, resilient systems ready for scaling.


Wednesday, October 30, 2024

 This section explores dependency management and pipeline orchestration, aptly dubbed “undercurrents” in the book “Fundamentals of Data Engineering” by Matt Housley and Joe Reis.

Data orchestration is a process of dependency management, facilitated through automation. It involves scheduling, triggering, monitoring, and resource allocation. Data orchestrators differ from simple cron-based schedulers: workflows can be triggered by events, webhooks, schedules, and even intra-workflow dependencies. Data orchestration provides a structured, automated, and efficient way to handle large-scale data from diverse sources.

Orchestration steers workflows toward efficiency and functionality, with the orchestrator serving as the tool that enables these workflows. Orchestrators typically trigger pipelines based on a schedule or a specific event. Event-driven pipelines are beneficial for handling unpredictable data or resource-intensive jobs.

The perks of having an orchestrator in your data engineering toolkit include workflow management, automation, error handling and recovery, monitoring and alerting, and resource optimization. Directed Acyclic Graphs (DAGs) bring order, control, and repeatability to data workflows, managing dependencies and ensuring a structured and predictable flow of data. They are pivotal for orchestrating and visualizing pipelines, making them indispensable for managing complex workflows, particularly within a team or in large-scale setups. A DAG serves as a clear roadmap defining the order of tasks; through this lens, the creation, scheduling, and monitoring of data pipelines can be organized.
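
As an illustration, here is a minimal sketch of such a roadmap as an Airflow DAG, assuming Airflow 2.x; the DAG name and task callables are hypothetical placeholders.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")

def transform():
    print("cleaning and joining")

def load():
    print("writing to the warehouse")

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3   # the DAG defines the order: extract, then transform, then load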

Data orchestration tools have evolved significantly over the past few decades, with Apache Airflow and Luigi emerging as the dominant early tools. However, it is crucial to choose the right tool for the job, as each tool has its strengths and weaknesses. Data orchestrators, like the conductor of an orchestra, balance declarative and imperative frameworks to provide flexibility and efficiency alongside software engineering best practices.

When selecting an orchestrator, factors such as scalability, code and configuration reusability, and the ability to handle complex logic and dependencies are important to consider. The orchestrator should be able to scale vertically or horizontally, ensuring that the process of orchestrating data is separate from the process of transforming data.

Orchestration is about connections, and platforms like Azure, Amazon, and Google offer hosted versions of popular tools like Airflow. Platform-embedded alternatives like Databricks Workflows provide more visibility into the tasks they orchestrate. Popular orchestrators have strong community support and continuous development, ensuring they remain up to date with the latest technologies and best practices. Support is crucial for both closed-source and paid solutions, as solutions engineers can help resolve issues.

Observability is essential for understanding transformation flows, so ensure your orchestrator supports various methods for alerting your team. To implement an orchestration solution, you can build a solution, buy an off-the-shelf tool, self-host an open-source tool, or use a tool included with your cloud provider or data platform. Apache Airflow, developed at Airbnb, is a popular choice due to its ease of adoption, simple deployment, and ubiquity in the data space. However, it has flaws, such as being engineered to orchestrate, not to transform or ingest.

Open-source tools like Airflow and Prefect are popular orchestrators with paid, hosted services and support. Newer tools like Mage, Keboola, and Kestra are also innovating. Open-source tools offer community support and the ability to modify source code. However, they depend on support for continued development and may risk project abandonment or instability. A tool's history, support, and stability must be considered when choosing a solution.

SQL-based orchestration is also a crucial aspect of modern data engineering, relying on relational databases for data transformation. Tools like dbt, Delta Live Tables, Dataform, and SQLMesh act as orchestrators by evaluating dependencies, optimizing, and executing commands against a database to produce the desired results. A potential limitation of this approach is the lack of a mechanism to observe data across different layers, leading to a disconnect between sources and cleaned data, which can make it harder to identify errors in downstream data.

Design patterns can significantly enhance the efficiency, reliability, and maintainability of data orchestration processes. Some orchestration solutions make these patterns easier, such as building backfill logic into pipelines, ensuring idempotence, and event-driven data orchestration. These patterns can help avoid one-time thinking and ensure consistent results in data engineering. Choosing a platform-specific data orchestrator can provide greater visibility between and within data workflows, making it essential for ETL workflows.

Orchestrators are complex and can be difficult to develop against locally because of their trigger mechanics. To improve the developer experience, invest in tools that allow for fast feedback loops, quick error identification, and a developer-friendly local environment. Retry and fallback logic are essential for handling failures in a complex data stack, ensuring data integrity and system reliability; idempotent pipelines make it safe to retry operations or to skip them and alert the proper parties. Parameterized execution allows for more malleability in orchestration, supporting multiple cases and the reuse of pipelines. Lineage refers to the path data travels through its lifecycle, and a robust lineage solution is crucial for debugging issues and extending pipelines; column-level lineage is becoming an industry norm in SQL orchestration, and platform-integrated solutions like Databricks Unity Catalog and Delta Live Tables offer advanced lineage capabilities. Pipeline decomposition breaks pipelines into smaller tasks for better monitoring, error handling, and scalability, and building autonomous DAGs can mitigate dependencies and critical failures, making workflows easier to build and debug.
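
A minimal sketch of a parameterized, idempotent task is shown below: each run replaces exactly one date partition, so retries and backfills converge to the same result. It assumes PySpark with dynamic partition overwrite; the paths are hypothetical.

from pyspark.sql import SparkSession, functions as F

def load_daily_orders(spark: SparkSession, execution_date: str) -> None:
    df = (
        spark.read.json(f"s3://raw-bucket/orders/{execution_date}/")  # hypothetical source
        .withColumn("ingest_date", F.lit(execution_date))
    )
    (
        df.write.mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")   # replace only this run's partition
        .partitionBy("ingest_date")
        .parquet("s3://lake/silver/orders/")           # hypothetical target path
    )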

The evolution of transformation tools, such as containerized infrastructure and hybrid tools like Prefect and Dagster, may change the landscape of data teams. These tools can save time and resources, enabling better observability and monitoring within data warehouses. Emerging tools like SQLMesh may challenge established players like dbt, while plug-and-play solutions like Databricks Workflows are becoming more appealing. These developments will enable data teams to deliver quality data in a timely and robust manner.


Tuesday, October 29, 2024

 The history of data engineering has evolved from big data frameworks like Hadoop and MapReduce to streamlined tools like Spark, Databricks, BigQuery, Redshift, Snowflake, Presto, Trino, and Athena. Cloud storage and transformation tools have made data more accessible, and lakehouses have offered a cost-efficient, unified option for managing data at scale. This evolution has led to a more accessible and efficient data management landscape.

Data transformation environments vary, with common environments being data warehouses, data lakes, and lakehouses. Data warehouses use SQL for transformation, while data lakes store large amounts of data economically. Lakehouses combine aspects of both, offering flexibility and cost-effectiveness. Databricks SQL is a serverless data warehouse that sits on the lakehouse platform. The choice between these environments depends on project needs, team expertise, and long-term data strategy.

Data staging is a crucial step in data transformation: data is often written in a temporary state to a suitable location, such as cloud storage or an intermediate table. The medallion architecture preserves data history and makes time travel possible. It comprises three distinct layers: Bronze for raw data, Silver for lightly transformed data, and Gold for "clean" data. Bronze data is raw and unfiltered; Silver data is filtered, cleaned, and adjusted; Gold data is stakeholder-ready and sometimes aggregated. This approach can be used in a lake or a warehouse, breaking each storage layer into discrete stages of data cleanliness.
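
A minimal sketch of the bronze, silver, and gold layers follows, assuming PySpark with Delta Lake available; the table names, paths, and columns are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw payload as-is.
raw = spark.read.json("s3://landing/orders/")
raw.write.format("delta").mode("append").saveAsTable("bronze.orders")

# Silver: light cleaning - casts, filters, deduplication.
silver = (
    spark.table("bronze.orders")
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: stakeholder-ready aggregate.
gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("lifetime_value"))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.customer_value")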

Data transformation is largely shaped by the tools available, with Python being a popular choice in the modern era. Python's data processing story, with the Pandas library at its core, has evolved significantly. However, scaling Python to large datasets has been challenging, often requiring libraries like Dask and Ray. Python-based data processing is now experiencing a renaissance, with Rust-backed engines emerging. To transform data in Python, choose a suitable library and framework, such as Pandas or newer entrants like Polars and DuckDB. SQL is a declarative language, though it can be written in a more imperative style; it is limited in general-purpose functionality, so languages like Jinja/Python and JavaScript often complement SQL workflows. Some consider Rust, a newer systems language, the future of data engineering, but Python retains a solid foothold thanks to its community support and library ecosystem.
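
For a feel of the options, the sketch below runs the same filter-and-aggregate in Pandas, Polars, and DuckDB; events.csv and its columns are hypothetical.

import pandas as pd
import polars as pl
import duckdb

# Pandas
pdf = pd.read_csv("events.csv")
pandas_out = pdf[pdf["amount"] > 0].groupby("customer_id")["amount"].sum()

# Polars (recent releases spell the method group_by; older ones use groupby)
pldf = pl.read_csv("events.csv")
polars_out = (
    pldf.filter(pl.col("amount") > 0)
    .group_by("customer_id")
    .agg(pl.col("amount").sum())
)

# DuckDB: SQL directly over the file
duckdb_out = duckdb.sql(
    "SELECT customer_id, SUM(amount) AS amount FROM 'events.csv' "
    "WHERE amount > 0 GROUP BY customer_id"
).df()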

Transformation frameworks are multilanguage engines for executing data transformations across machines or clusters, enabling transformations to be manipulated in various languages like Python or SQL. Two popular engines are Hadoop and Spark. Hadoop, an open-source framework, gained traction in the mid-2000s with tech giants like Yahoo, Facebook, and Google. However, its MapReduce was not well-suited for real-time or iterative workloads, leading to the rise of Apache Spark in the early 2010s. Spark, a powerful open-source data processing framework, revolutionized big data analytics by offering speed, versatility, and integration with key technologies. Its key innovation is resilient distributed datasets (RDDs), enabling in-memory data processing and faster computations. With the rise of serverless data warehouses, big data engines may no longer be necessary, but query engines like BigQuery, Databricks SQL, and Redshift should not be disregarded. Recent advancements in in-memory computation may continue to expand data warehouses' transformation capabilities.

Data transformation is a crucial process that involves mapping patterns and understanding which transformations should be applied:

• Enrichment enhances existing data with additional sources, such as adding demographic information to customer records.

• Joining combines two or more datasets based on a common field, like a JOIN operation in SQL.

• Filtering selects only the data points needed for analysis based on certain criteria, reducing the volume and improving the quality of the data.

• Structuring translates data into a required format or structure, such as transforming JSON documents into tabular format or vice versa.

• Conversion changes the data type of a particular column or field, especially when converting between semi-structured and structured data sources.

• Aggregation summarizes and combines data to draw conclusions from large volumes of data, enabling insights that inform business decisions and create value from data assets.

• Anonymization masks or obfuscates sensitive information within a dataset to protect privacy, for example by hashing emails or removing personally identifiable information (PII) from records.

• Splitting divides a complex data column into multiple columns, a form of denormalization.

• Deduplication removes redundant records to create a unique dataset, often through aggregation, filtering, or other methods.
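
A minimal Pandas sketch of a few of these patterns, joining, filtering, anonymization, and aggregation, is shown below; the columns and values are hypothetical.

import hashlib
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": ["a", "b", "a"],
                       "amount": [10.0, -5.0, 20.0]})
customers = pd.DataFrame({"customer_id": ["a", "b"],
                          "email": ["x@example.com", "y@example.com"]})

enriched = orders.merge(customers, on="customer_id", how="left")   # joining / enrichment
valid = enriched[enriched["amount"] > 0]                           # filtering
valid = valid.assign(                                              # anonymization by hashing
    email=valid["email"].map(lambda e: hashlib.sha256(e.encode()).hexdigest())
)
summary = valid.groupby("customer_id")["amount"].sum()             # aggregation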

Data update patterns determine how transformed data lands in a target system. Overwrite is the simplest form: an existing source or table is dropped completely and replaced with new data. Insert appends new data to an existing source without changing existing rows. Upsert is a more complex pattern, with applications in change data capture, sessionization, and deduplication; platforms like Databricks offer MERGE functionality to simplify the process. Data deletion is often misunderstood and comes in two main types, "hard" and "soft": soft deletes preserve a historical record of an asset's status, while hard deletes eliminate those records, which can be problematic for data recovery.
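
As a minimal sketch of an upsert, the statement below uses MERGE on Delta tables, as on a platform such as Databricks; the table names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    MERGE INTO silver.orders AS target
    USING staging.orders_updates AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")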

When building a data transformation solution, consider several best practices, including staging, idempotency, normalization, and incrementality. Staging protects against data loss and ensures a low time to recovery (TTR) in case of failure. Idempotency ensures consistency and reliability: running the same transformation multiple times produces the same result, much like reproducibility. Normalization refines data into a clean, orderly format, while denormalization duplicates records and information for improved performance. Incrementality determines whether a pipeline is a simple INSERT OVERWRITE or a more complex UPSERT; predefined patterns for building incremental workflows can be found in tools like dbt and Airflow. Transformations can run in batch, micro-batch, or streaming modes. Micro-batch approaches, like Apache Spark's PySpark and Spark SQL, are simpler to implement than true single-event transformations. Spark Structured Streaming is a popular streaming engine that efficiently handles incremental and continuous updates, achieving latencies as low as 100 milliseconds with exactly-once fault tolerance, and Continuous Processing, introduced in Spark 2.3, can reduce latencies to as little as 1 millisecond, further enhancing its capability for streaming data transformation.
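
A minimal sketch of a micro-batch Structured Streaming job follows, reading JSON events and appending them to a Delta path; the schema, paths, and trigger interval are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream.format("json")
    .schema("order_id STRING, amount DOUBLE, event_ts TIMESTAMP")
    .load("s3://landing/events/")
)

query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "s3://checkpoints/events/")   # checkpointing underpins exactly-once sinks
    .outputMode("append")
    .trigger(processingTime="1 minute")                         # micro-batch cadence
    .start("s3://lake/silver/events/")
)
# query.awaitTermination()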

The modern data stack is experiencing a second renaissance due to new technologies and AI advancements. As a result, new tools and technologies are emerging to redefine data transformation. However, it's crucial to adhere to timeless strategies for managing data and creating cleaned assets. Supercharged tooling and automations can be both beneficial and challenging, but engineers must ensure well-planned and executed transformation systems with a high value-to-cost ratio.


Monday, October 28, 2024

 Comparisons of ADF with Apache Airflow

Choosing the right tool for a data transfer job is highly important. Previous articles introduced Azure Data Factory and Apache Airflow as cloud tools for large-scale, dependable transfers, along with a comparison to DEIS workflow. This section enumerates the differences between ADF and Apache Airflow.

Azure offers various services, each with its strengths and use cases. ADF is a fully managed, serverless data integration service that provides visual designer tools for configuring sources, destinations, and ETL processes, and it can handle large-scale data pipelines and transformations. Apache Airflow, by contrast, is not a first-class citizen of the Azure cloud; it is an open-source scheduler for workflow management. While it is not a native Azure service, it can be deployed to Azure Kubernetes Service or Azure Container Instances.

Azure does have a native counterpart to Apache Airflow in the form of Azure Logic Apps, which provides workflow automation and app integration. But for more complex scenarios requiring code-based orchestration and dynamic workflows, developers might opt to deploy Airflow on Azure.

Airflow offers more flexibility with its Python-based DAGs, allowing for complex logic and dependencies. ADF's strength lies in its no-code/low-code approach and its integration with other Azure services.

Apache Airflow is also more suitable for event-driven workflows and streaming data, while ADF suits batch processing and regular ETL tasks with an emphasis on visual authoring and monitoring.

Best practices and the most efficient use of resources are continuously updated in the official documentation from both of their respective sources.

One often-overlooked feature is that Apache Airflow can be integrated with Azure Data Factory, which allows for the orchestration of complex workflows across various cloud services. This integration leverages the strengths of both platforms to create a robust data pipeline solution: the serverless architecture of ADF can handle large-scale data workloads, while Airflow's scheduling capabilities manage the workflow orchestration.

Airflow’s Python-based workflows allow for dynamic pipeline generation that can be tailored to specific needs. Both platforms also offer extensive connectivity options with various data sources and processing services.

The integration steps can be listed as follows:
1. Create a data factory with the appropriate storage and compute resources.
2. Configure the Airflow environment, ensuring that it has network access to Azure Data Factory.
3. Create custom Airflow operators to interact with the ADF APIs for triggering and monitoring pipelines.
4. Define workflows as Airflow DAGs describing the sequence of tasks, including the custom operators that manage ADF activities.
5. Use the user interface to monitor workflow execution and manage any necessary interventions.

Setting up an Azure Active Directory (service principal) connection for Apache Airflow is a prerequisite. Depending on the provider version, the connection extras may also need the subscription ID, resource group, and data factory name.

This can be done with commands such as:

pip install apache-airflow-providers-microsoft-azure

airflow connections add azure_ad_conn \
  --conn-type azure_data_factory \
  --conn-login <client_id> \
  --conn-password <client_secret> \
  --conn-extra '{"tenantId": "<tenant_id>"}'

and validated with:

airflow connections get azure_ad_conn

The integration can then be tested with:

from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.operators.data_factory import AzureDataFactoryRunPipelineOperator

default_args = {'owner': 'airflow', 'start_date': datetime(2024, 1, 1)}

with DAG('azure_data_factory_integration', schedule_interval='@daily', default_args=default_args, catchup=False) as dag:

    run_pipeline = AzureDataFactoryRunPipelineOperator(
        task_id='run_pipeline',
        azure_data_factory_conn_id='azure_ad_conn',  # the connection created above
        pipeline_name='MyPipeline',
        parameters={'param1': 'value1'},
        # resource_group_name and factory_name may also be required here or in the
        # connection extras, depending on the provider version
        wait_for_completion=True
    )

This completes the comparison of ADF with Airflow.

Previous articles: IaCResolutionsPart191.docx


Sunday, October 27, 2024

 Number of Ways to Split Array

You are given a 0-indexed integer array nums of length n.

nums contains a valid split at index i if the following are true:

• The sum of the first i + 1 elements is greater than or equal to the sum of the last n - i - 1 elements.

• There is at least one element to the right of i. That is, 0 <= i < n - 1.

Return the number of valid splits in nums.

Example 1:

Input: nums = [10,4,-8,7]

Output: 2

Explanation:

There are three ways of splitting nums into two non-empty parts:

- Split nums at index 0. Then, the first part is [10], and its sum is 10. The second part is [4,-8,7], and its sum is 3. Since 10 >= 3, i = 0 is a valid split.

- Split nums at index 1. Then, the first part is [10,4], and its sum is 14. The second part is [-8,7], and its sum is -1. Since 14 >= -1, i = 1 is a valid split.

- Split nums at index 2. Then, the first part is [10,4,-8], and its sum is 6. The second part is [7], and its sum is 7. Since 6 < 7, i = 2 is not a valid split.

Thus, the number of valid splits in nums is 2.

Example 2:

Input: nums = [2,3,1,0]

Output: 2

Explanation:

There are two valid splits in nums:

- Split nums at index 1. Then, the first part is [2,3], and its sum is 5. The second part is [1,0], and its sum is 1. Since 5 >= 1, i = 1 is a valid split.

- Split nums at index 2. Then, the first part is [2,3,1], and its sum is 6. The second part is [0], and its sum is 0. Since 6 >= 0, i = 2 is a valid split.

Constraints:

• 2 <= nums.length <= 10^5

• -10^5 <= nums[i] <= 10^5

Solution:

class Solution {
    public int waysToSplitArray(int[] nums) {
        if (nums == null || nums.length <= 1) return 0;
        // Use long running sums: with up to 10^5 elements of magnitude 10^5,
        // the totals can overflow an int.
        long sumSoFar = 0;
        long total = 0;
        int count = 0;
        for (int i = 0; i < nums.length; i++) {
            total += nums[i];
        }
        // A split at index i is valid when the prefix sum is at least the
        // suffix sum; the last index is excluded so the right part is non-empty.
        for (int i = 0; i < nums.length - 1; i++) {
            sumSoFar += nums[i];
            if (sumSoFar >= total - sumSoFar) {
                count += 1;
            }
        }
        return count;
    }
}

Test cases:

[0] => 0

[1] => 0

[0,1] => 0

[1,0] => 1

[0,0,1] => 0

[0,1,0] => 1

[0,1,1] => 1

[1,0,0] => 2

[1,0,1] => 2

[1,1,0] => 2

[1,1,1] => 1


Saturday, October 26, 2024

 Managing copilots: 

This section of the series on cloud infrastructure deployments focuses on the proliferation of copilots for different business purposes and internal processes. As with any flight, a copilot is of assistance only to the captain responsible for making the flight successful. If the captain does not know where she is going, then even the copilot's immense assistance will not be enough. It is of secondary importance that the data a copilot uses might be prone to bias or shortcomings and might even lead to so-called hallucinations. Copilots are, after all, large language models that work entirely by treating data as vectors and leveraging classification, regression, and vector algebra to respond to queries. They do not build a knowledge graph and do not have the big picture of the business purpose they will be applied to. If that purpose is not managed properly, infrastructure engineers might find themselves maintaining many copilots for different use cases, diluting the benefits, where one would have sufficed.

Consolidation of large language models and their applications to different datasets is only the tip of the iceberg that these emerging technologies offer as instruments for data scientists. Machine learning pipelines and applications are as diverse and siloed as the datasets they operate on, and those datasets are not always present in data lakes or virtual warehouses. Consequently, a script or a prediction API written and hosted as an application does not make the best use of infrastructure for customer convenience in terms of streamlining interactions and improving touch points. This is not to say that different models cannot be used, that resources must never proliferate, or that consolidation brings no cost savings; it is about the business justification for the number of copilots needed. When we work backwards from what the customer benefits from or experiences, one of the salient principles that works in favor of infrastructure management is that less is more. Hyperconvergence of infrastructure for various business purposes, when those initiatives are bought into by stakeholders with both business and technical representation, makes the investments more deliberate and fulfilling.

Cloud and infrastructure management do not restrict experimentation; the argument is against uncontrolled experimentation that places customers in a lab. As long as experimentation and initiatives are specific in terms of duration, budget, and outcomes, infrastructure management can go the extra mile of cleaning up, decommissioning, and even repurposing so that technical and business outcomes go hand in glove.

Processes are hard to establish when a technology is emerging, and processes are also extremely difficult to enforce as new standards in the workplace; the failure of Six Sigma and the adoption of agile methodologies are testament to that. However, best practices for copilot engineering are easier to align with cloud best practices and current software methodologies in most workplaces.

#codingexercise Codingexercise-10-26-2024.docx

Friday, October 25, 2024

This is a summary of the book titled “The Good Drone: How Social Movements Democratize Surveillance,” written by Austin Choi-Fitzpatrick and published by MIT Press in 2020. The author asserts that drones democratize airspace. His comprehensive survey of civic use is both intriguing and compelling, prompting the reader to wonder whether “unmanned aircraft” are going to be for everyone. For any social scientist, these material technologies will be engrossing. Drones began as nonviolent tools; they might be considered disruptive, but they are not. Malevolent drones are rare, if not exclusively military, and there are countermeasures against drones. Tools that accomplish social change must be “visible, accessible, affordable, useful and appropriate”.

Material technologies, such as drones, have become increasingly important in social movements and activism. They serve as tools to collect and disseminate information, sometimes making it costly for those in power to maintain the status quo. Social scientists often view new devices as weapons or threats to civil liberties, but these tools can also be used to influence society. Drones, which began as nonviolent tools, have become practical and affordable when manufacturers made them easy to control. In 2012, drone use by intergovernmental organizations, governments, businesses, scientists, and civil society groups took off. The air is no longer a place where governments and corporations rule and surveil the population, and some drone use is disruptive. Examples include a drone above the Kruger National Park in South Africa, documenting crowds in Moscow, Kyiv, Bangkok, and Istanbul, and a drone in Aleppo, Syria, showcasing the effects of a brutal siege. Other drone use is emergent and nondisruptive, such as Australian biologists investigating the health of humpback whales and the Slavery from Space project searching for brick kilns in India.

A drone helped a Sinti-Roma settlement in Hungary show the world their plight. However, social scientists often focus more on the image itself than the drone. Drones can disrupt traditional photography by taking flight, potentially leading to near-instantaneous monitoring of places or events. They can also provide unfettered views of skyscrapers, prisons, and factory farms, providing democratic surveillance. New forms of data gathering may enable drones to document police actions, potentially allowing citizens to fly over military installations. Drones may also change our concept of private spaces, raising questions about privacy and surveillance.

Drones can be used for malevolent purposes, including surveillance, hunting, and causing harm to people or infrastructure. So, privacy remains a concern, with regulations focusing on police use, surveillance without permission, hunting, and drone control. Countermeasures include drone detection systems like DroneShield and SkySafe, as well as GPS transmitters and energy beams. Drones can also be difficult to fly in strong winds, fog, or dense rain, and warm surroundings can make them undetectable. As tools of resistance, drones can be used to estimate crowd sizes and monitor police officers. However, as drone prices decrease, it is crucial to consider the psychological, economic, political, and social consequences of drone use.


Thursday, October 24, 2024

 Chaos engineering

Administrators will find this section familiar. There are technical considerations that emphasize design and service-level objectives, as well as the scale of the solution, but drills, testing, and validation are essential to guarantee smooth operations. Chaos engineering is one such method to test the reliability of an infrastructure solution. While the reliability of individual resources might be a concern for the public cloud provider, the reliability of the deployments falls on the IaC authors and deployers. In contrast to other infrastructure concerns such as security, which has mitigations in theory and design involving Zero Trust and least-privilege principles, chaos engineering is all about trying things out with drills and controlled environments.
By the time a deployment of infrastructure is planned, business considerations have included the following:
1. Understanding what kind of solution is being created, such as business-to-business, business-to-consumer, or enterprise software.
2. Defining the tenants in terms of number and growth plans.
3. Defining the pricing model and ensuring it aligns with the tenants’ consumption of Azure resources.
4. Understanding whether the tenants need to be separated into different tiers and, based on the customer’s requirements, deciding on the tenancy model.
And a well-architected review would have addressed the key pillars of 1) Reliability, 2) Security, 3) Cost Optimization, 4) Operational Excellence, and 5) Performance Efficiency. Performance is usually based on external factors and is very close to customer satisfaction. Continuous telemetry and reactiveness are essential to keeping performance tuned up.

Security and reliability are operational concerns. When exercising deployments to test reliability, it is often necessary to inject faults and bring down components to check how the remaining parts of the system behave. The idea of injecting failure also finds parallels in strengthening security in the form of penetration testing. The difference is that security testing is geared toward exploitation, while reliability testing is geared toward reducing the mean time between failures.

Component-down testing is quite drastic; it involves powering down a zone. There are many network connections to and from cloud resources, so it becomes hard to find an alternative for a component that is down. A multi-tiered approach is needed to build robustness against component-down scenarios. Mitigations often involve workarounds and diverting traffic to healthy redundancies.
Having multi-region deployments of components not only improves reliability but also draws business from the neighborhood of the new presence. A geographical presence for a public cloud is only somewhat different from that of a private cloud. A public cloud lists regions where the services it offers are hosted. A region may have three availability zones for redundancy and availability, and each zone may host a variety of cloud computing resources, small or big. Each availability zone may have one or more stadium-sized datacenters. When the infrastructure is established, the process of commissioning services in the new region can be referred to as buildouts. Including appropriate buildouts increases the reliability of the system in the face of failures.
Component-down testing for chaos engineering differs from business continuity and disaster recovery planning in that one discovers problems in reliability and the other determines acceptable mitigation. One cannot do without the other.

For a complete picture on reliability of infrastructure deployments, additional considerations become necessary. These include:

First, the automation must involve context switching between the platform and the task of deploying each service to the region. The platform coordinates these services and must maintain order, dependencies, and status across the tasks.
Second, the task for each service is itself complicated and requires definitions in terms of region-specific parameters applied to an otherwise region-agnostic service model.
Third, each service must manifest its dependencies declaratively so that they can be validated and processed correctly. These dependencies may be on other services, on external resources, or on the availability of an event from another activity.
Fourth, the service buildouts must be retryable on errors and exceptions; otherwise the platform will require a lot of manual intervention, which increases the cost.
Fifth, the communication between automated activities and manual interventions must be captured with the help of a ticket-tracking or incident-management system.
Sixth, the workflow and the state for each activity pertaining to the task must follow standard operating procedures that are defined independently of region and are available for audit.
Seventh, the technologies for platform execution and for the deployments of the services might differ, requiring consolidation and coordination between the two. In such cases, the fewer the context switches between the two, the better.
Eighth, the platform itself must support templates, event publishing and subscription, a metadata store, and onboarding and bootstrapping processes that can be reused.
Ninth, the platform should support parameters that enable a region to be differentiated from others or that improve customer satisfaction in terms of the features or services available.
Tenth, the progress of the buildout of new regions must be actively tracked with the help of tiles for individual tasks and rows per service.


Wednesday, October 23, 2024

This is a summary of the book titled “Palaces for the People,” written by Eric Klinenberg and published by Crown in 2019. The author explains “How Social Infrastructure Can Help Fight Inequality, Polarization, and the Decline of Civic Life.” The term social infrastructure refers to shared public spaces such as libraries, parks, and coffee shops. His examples include a hurricane shelter in Houston made out of a church and vacant lots in Chicago converted into an urban farm. He is upbeat about innovative ways to improve, expand, and maintain such structures and treats them as an indicator of a community's health and its ability to avoid crime. Libraries are a must in his inclusive list of such infrastructure. He lists threats such as commercial development that brings gentrification, the absence of such infrastructure, which leads to poorer health in already suffering communities, and shortcomings in disaster preparation. He aims this treatise exclusively at the United States.

 Social infrastructure is crucial for a community's health and well-being, as it includes buildings, public spaces, parks, libraries, and coffee shops. For example, public libraries offer a free, open place for people to socialize and volunteer organizations to meet. Libraries serve as bedrocks of civil society and offer responsibility and independence for young people. However, many libraries suffer from neglect due to reduced funding and services.

Social infrastructure can also play a role in preventing crime. Cities often focus on individual offenders, but experts suggest that focusing on the environments where crime flourishes, such as empty lots and abandoned buildings, may be more effective. Philadelphia researchers found a 39% drop in gun violence in areas around repaired structures and a 5% reduction in gun violence in vacant lots.

Research by landscape architect William Sullivan and environmental scientist Frances Kuo found that vegetation provides social benefits, such as lower crime rates in areas around buildings. Building prisons for the poor has been a main crime reduction policy, but the social costs have been as great as the economic expenses.

Commercial development can lead to increased property values and rents even as it displaces residents, and it tends to coincide with a decrease in crime rates. Yale professor Andrew Papachristos found a correlation between the number of new coffee shops and a reduction in local murder rates, regardless of the neighborhood's residents. However, street robbery rates declined in primarily white and Latino neighborhoods but tended to rise in gentrifying neighborhoods with primarily Black residents. This suggests that gentrification is not a viable anticrime strategy because of its social costs.

Social infrastructure is crucial for the health of people in poorer communities, as opioid addiction has reached epidemic levels in small and rural areas. Modern infrastructure, such as reliable power, clean water, fast transit, affordable food, and resilient structures, has improved public health more than any other modern intervention. In some areas, community activists and social entrepreneurs have turned vacant lots into urban agriculture, providing fresh, healthy food from farmer’s markets, fostering social ties, and reducing stress levels. A robust social infrastructure provides opportunities for the elderly to socialize and stay active, as seen in Singapore's high-rise complexes.

Social infrastructure is crucial in times of disaster, as societies worldwide invest trillions in hard infrastructure to cope with storms, floods, heat, drought, and migration. Religious groups and community organizations play a vital role in recovery from disasters, such as Hurricane Harvey and Hurricane Sandy. Policymakers are seeking creative ways to construct protective systems that double as social infrastructure, such as the Potomac Park Levee in Washington, DC, and Singapore's Marina Barrage and Reservoir project in Singapore. Grassroots organizations have devised innovative schemes for adapting to high waters, such as Bangladesh's "floating schools and libraries" program. Investors, including PayPal founder Peter Thiel, are backing floating-city concepts. The United States must rebuild its social infrastructure, as most significant infrastructure is a product of state funding or private philanthropy. Social infrastructure can determine the number of opportunities for meaningful social interactions and can make the difference between life and death during crises.

Tuesday, October 22, 2024

 

Count the number of array slices in an array of random integers such that the difference between the min and the max values in the slice is <= k

public static int getCountOfSlicesWithMinMaxDiffGEk(int[] A, int k) {
        // Counts contiguous slices where max - min <= k, using a sliding window
        // and monotonic queues that track the current window's max and min.
        int N = A.length;
        int[] maxQ = new int[N+1];      // decreasing values; front is the window max
        int[] posmaxQ = new int[N+1];   // positions of the entries in maxQ
        int[] minQ = new int[N+1];      // increasing values; front is the window min
        int[] posminQ = new int[N+1];   // positions of the entries in minQ
        int firstMax = 0, lastMax = -1;
        int firstMin = 0, lastMin = -1;
        int j = 0;
        long result = 0;                // long so the saturation check below is meaningful
        for (int i = 0; i < N; i++) {
            // Extend the window to the right while max - min stays within k.
            while (j < N) {
                while (lastMax >= firstMax && maxQ[lastMax] <= A[j]) {
                    lastMax -= 1;
                }
                lastMax += 1;
                maxQ[lastMax] = A[j];
                posmaxQ[lastMax] = j;

                while (lastMin >= firstMin && minQ[lastMin] >= A[j]) {
                    lastMin -= 1;
                }
                lastMin += 1;
                minQ[lastMin] = A[j];
                posminQ[lastMin] = j;

                if (maxQ[firstMax] - minQ[firstMin] <= k) {
                    j += 1;
                } else {
                    break;
                }
            }
            System.out.println("result="+result + " i=" + i + " j="+ j);
            result += (j-i);            // every slice starting at i and ending before j qualifies
            if (result >= Integer.MAX_VALUE) {
                result = Integer.MAX_VALUE;   // saturate rather than overflow
            }
            // Slide the window start past i, evicting i from the queue fronts if present.
            if (posminQ[firstMin] == i) {
                firstMin += 1;
            }
            if (posmaxQ[firstMax] == i) {
                firstMax += 1;
            }
        }
        return (int) result;
    }

A: 3,5,7,6,3 K=2

result=0 i=0 j=2

result=2 i=1 j=4

result=5 i=2 j=4

result=7 i=3 j=4

result=8 i=4 j=5

9

 

 

Monday, October 21, 2024

 Managing copilots:


Reference: Previous articles on Infrastructure management: https://1drv.ms/w/s!Ashlm-Nw-wnWhPZmOhdL7Y5aiDLb6g?e=gsr9g4


Sunday, October 20, 2024

This is a summary of the book titled “Who Built That?” written by Michelle Malkin and published by Simon and Schuster in 2015. The author collects biographies of highly prolific but lesser-known American inventors and uses them to argue against President Obama’s assertion in 2012 that business owners need help from government-funded programs. Although there are political ideas and interpretations, this collection of mini-biographies is an interesting read about free market capitalism. American inventors and investors have changed the world. The profit motive and US patent law serve them well as they become “tinkerpreneurs”. For example, Tony Maglica’s patented Maglite revolutionized flashlight design in 1978. Inventor Willis Carrier and marketer Irvine Lyle developed and sold air conditioning and refrigeration technologies. The Roebling family changed wire rope manufacturing and improved bridge building. Success stories for tinkerpreneurs often begin as an accumulation of individual efforts fostered by a free market society. Toilet paper, inexpensive razors, and other disposable products highlight this. Partnerships between inventors and industrialists illustrate perseverance paired with capitalism. Modern advances in prosthetics show how free enterprise supports invention. The only difference between then and now is a switch from a “first to invent” to a “first to file” mindset.

American innovation is threatened by changes to US patent laws in 2011 and "wealth shaming" by the political left. It is important to celebrate invention and remember the unsung heroes who helped make the United States a powerhouse of innovation. The work of US inventors in the 19th and 20th centuries testifies to the power of creativity, innovation, and the supportive environment backed by "American exceptionalism." Thanks to US patent laws and the free market economy, these creative, industrious inventors prospered from their work.

The Maglite, invented by Tony Maglica, revolutionized flashlight construction with its tough metal body, adjustable lighting, and superior design. Maglica attributes his success to America's strong patent laws, which provide the basis for defending against intellectual property theft.

The fathers of modern air conditioning, Willis Carrier, and Irvine Lyle began their journey to entrepreneurial success and technological innovation in 1902 by attempting to prevent multicolored printing jobs from bleeding in New York's summer heat. Their discoveries led to breakthroughs in the development of lifesaving pharmaceuticals, such as the polio vaccine, penicillin, and streptomycin.

The Roebling family, led by Johann (John) Augustus Roebling, were pioneers in innovation and entrepreneurship. Roebling immigrated to America in 1831 to escape Prussia's government control of engineering projects. He began his tinkerpreneurship by designing and patenting improvements for steam-powered machines. Roebling's first major "aha" moment came while working on a Pennsylvania canal, where he thought about replacing weak hemp ropes with stronger wire ropes. He was the first to create machines to make the rope uniform and sturdy, producing it with limited manpower. Roebling's first successful project using patented wire was a suspension aqueduct into Pittsburgh in 1845. He famously spanned Niagara Falls in 1855 and built the Covington-Cincinnati Bridge in 1867 with his son, Washington. The Roebling family stands as a testament to their nation's unprecedented ideas and ambition.

US Patent 4,286,311, filed by Tony Maglica in 1978, marked the beginning of a new era of heavy, rugged flashlights. Other small innovations, such as toilet paper, disposable razor blades, crown-type bottle caps, and fuller's earth, were created by inventors who recognized the need for practical solutions. These inventions created continual demand by keeping their products cheap and disposable.

Some of the most dramatic inventions emerged from collaboration, with inventors often forming partnerships with other inventors or with visionary industrialists who provided financial backing and marketing support. For example, Nikola Tesla developed the alternating current (AC) system used in America's light sockets with the help of inventor-industrialist George Westinghouse.

Glassmaking followed the same pattern as Carrier and Lyle's manufactured weather: inventor Michael Owens was backed by industrialist Edward Libbey. Together they developed and patented glassmaking machines, transforming the industry from a handcraft controlled by labor unions into safe, cheap, fully automated production. Libbey and Owens fought against obstructionists and anticapitalists, transforming the world's relationship with glass and combating dangerous child labor practices.

The American creative spirit, which inspired past inventors, continues to thrive in fields like prosthetics. Modern American tinkerpreneurs build on the work of past innovators such as A.A. Marks, Albert Winkley, and Edward Hanger. American "free enterprise" fosters invention, with companies like Bally Ribbon Company and BrainGate creating robotic limbs. However, the 2011 America Invents Act (AIA) is a special-interest boondoggle that enriches corporate lawyers, big business, and federal bureaucrats at the expense of independent inventors and innovators. The shift from "first to invent" to "first to file" patent laws favors multinational corporations and turns patent law against "small" inventors, the nation's most productive and creative members. Repealing the AIA is crucial, as opportunity and freedom are the keys to promoting innovation in the future.


#Codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWhNQ6C15Thp-sCFzgag?e=E4pnej


Saturday, October 19, 2024

This is a summary of the book titled “Healthcare Disrupted,” written by Jeff Elton and Anne O’Riordan and published by Wiley in 2016. The authors are business consultants who assert that new business models are inevitable and will be about curing people, not pushing pills. Although written a few years ago, prior to changes in government policy, their writing continues to be thought-provoking. With the industry set to reach $18 trillion in spending by 2030, costs will be driven more by the “fee-for-service” model than by anything else. The new health care model emphasizes health over disease and value over volume. Reformers focused on results-based healthcare say providers should earn based on outcomes. There are also challenges from patients’ behavior, which makes certain diseases like diabetes difficult to treat. Other emerging business models include “Lean Innovators”, “Around-the-Patient Innovators”, and “Value Innovators”. Lean Innovators, typically makers of generic drugs, also invent products. Around-the-Patient Innovators build apps, sensors, and other technology into their offerings. Value Innovators propose that treatment begin at patients’ homes, which is more effective for costly conditions like heart failure and diabetes.

The health care revolution is characterized by patients becoming active consumers and by a shift towards a value-based model as costs rise. The "fee-for-service" model, which incentivizes medical professionals to provide procedures, drugs, and devices in the most expensive settings, is causing health costs to soar. This flawed approach has led to reforms such as quality reports about physicians and the US Medicare system's plan to process half of its payments using performance measures by 2018. Accountable care organizations (ACOs) have also contributed to this shift. Health care companies operate on a value curve that progresses through four stages: "Simple product," "Enhanced product," "Integrated services," and "Living services." The final stage, "Living services," is the frontier of the value curve: it involves offering an array of services, with patient outcomes partially determining payment. This shift is challenging due to legal and regulatory constraints.

Pharmaceutical firms known as "Lean Innovators" combine generic drugs with innovative products to avoid patent-expiration problems and focus on niche products. They operate in area A of the value curve, selling products without delving far beyond that level. Lean Innovators are rooted in the generic drug industry, selling cheaper alternatives and embracing supply chain efficiency. Examples include Teva Pharmaceutical Industries, Allergan PLC, and Valeant Pharmaceuticals International. They typically grow through acquisitions and carry a lower cost of sales and R&D than big pharma companies. As a result, they can post EBITDA margins that exceed their big pharma competitors', with Allergan, Teva, and Valeant posting EBITDA margins of 35.6% in 2014.

Around-the-Patient Innovators are companies that address patients' lifestyle challenges rather than just selling a basic product. Companies such as Johnson & Johnson and Novartis invest in talent and research and aim to evolve with the changing healthcare market. They seek a broader value proposition and partner with companies like Apple, Google, and Qualcomm to fill gaps in their offerings. They operate in areas B and C of the value curve and must innovate nimbly while maintaining their legacy businesses. For example, Novartis' Entresto, a heart failure treatment, works best when patients monitor their blood pressure, change their diets, and maintain activity levels. By focusing on these aspects, Around-the-Patient Innovators can help improve patient outcomes and reduce the need for prescription drugs.

The healthcare industry is transitioning from a traditional model organized around specialty therapeutics, geographic regions, and care settings to a new model emphasizing health over disease and value over volume. Value Innovators are life sciences companies pushing into areas C and D of the value curve. Boston Scientific, for example, is focusing on treating congestive heart failure, a costly condition that requires high patient compliance. Medtronic, a device maker, focuses on data and patient monitoring to manage costs and keep chronically ill patients out of expensive facilities. However, achieving real-world success is challenging because treating the chronically ill involves a longer time horizon. The future of healthcare will see care becoming untethered from traditional locations, with patients playing a bigger role in decision-making. The industry will redefine medicines and care, moving from an intervention-based model to an ongoing mode of managing patients' health.

#codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWhNM_tgTe4304lDjcuw?e=rS1wVr

Friday, October 18, 2024

This is a continuation of a series of articles on IaC shortcomings and resolutions. In this section, we discuss ways to transfer data from one Azure Managed Instance for Apache Cassandra cluster in a virtual network to another cluster in a different network. The network separation for this resource type only serves to elaborate the steps needed to generalize the data transfer.

Data is organized in a Cassandra cluster as keyspaces and tables. The first, direct approach uses a command-line client such as cqlsh to interact with both clusters: export each table from the source as a CSV file and load it into the destination, as shown below.

Example:

Step 1. At the source server:

USE <keyspace>;
COPY <keyspace>.<table_name> TO 'path/to/file.csv' WITH HEADER = true;

Step 2. At the destination server:

USE <keyspace>;
CREATE TABLE <table_name> (
    column1 datatype1,
    column2 datatype2,
    ...
    PRIMARY KEY (column1)
);
COPY <keyspace>.<table_name> (column1, column2, ...) FROM 'path/to/file.csv' WITH HEADER = true;

The other option is to read the data from the source and, without creating a local artifact, write it directly to the destination. This involves running a copy activity in a Databricks notebook using Apache Spark:

Example:

from pyspark.sql import SparkSession

# Initialize the Spark session against the source cluster
spark = SparkSession.builder \
    .appName("Copy Cassandra Data") \
    .config("spark.cassandra.connection.host", "<source-cassandra-host>") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.cassandra.auth.username", "<source-username>") \
    .config("spark.cassandra.auth.password", "<source-password>") \
    .getOrCreate()

# List of keyspaces and tables to copy (every table is assumed to exist in every keyspace)
keyspaces = ["keyspace1", "keyspace2"]
tables = ["table1", "table2"]

for keyspace in keyspaces:
    for table in tables:
        # Read data from the source Cassandra cluster
        df = spark.read \
            .format("org.apache.spark.sql.cassandra") \
            .options(keyspace=keyspace, table=table) \
            .load()

        # Write data to the target Cassandra cluster, overriding the connection
        # settings per write; the target keyspace and table must already exist
        df.write \
            .format("org.apache.spark.sql.cassandra") \
            .options(**{
                "keyspace": keyspace,
                "table": table,
                "spark.cassandra.connection.host": "<target-cassandra-host>",
                "spark.cassandra.connection.port": "9042",
                "spark.cassandra.auth.username": "<target-username>",
                "spark.cassandra.auth.password": "<target-password>"
            }) \
            .mode("append") \
            .save()

# Stop the Spark session
spark.stop()

Note, however, that we started out with the source and destination in different networks. If the Databricks workspace is tethered to the same network as one of the servers, it will not be able to reach the other. One workaround is virtual network peering, but that usually affects other resources and is not always a possibility. Another option is to add private endpoints, but the source and destination might be connected to delegated subnets, ruling that option out. Consequently, we must include an additional step that stages the data at a third location both networks can access, such as a storage account reachable over public IP networking.

The export leg of that approach would appear as follows:

from pyspark.sql import SparkSession

# Set up the Spark session against the source cluster
spark = SparkSession.builder \
    .appName("Export Cassandra to Azure Storage") \
    .config("spark.cassandra.connection.host", "<cassandra-host>") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.cassandra.auth.username", "<username>") \
    .config("spark.cassandra.auth.password", "<password>") \
    .getOrCreate()

# Define the Azure Storage account details
storage_account_name = "<storage-account-name>"
storage_account_key = "<storage-account-key>"
container_name = "<container-name>"

# Configure access to the storage account
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net", storage_account_key)

# Define keyspaces and tables to export
keyspaces = ["keyspace1", "keyspace2"]
tables = ["table1", "table2"]

# Export each table to CSV and upload to Azure Storage
for keyspace in keyspaces:
    for table in tables:
        # Read data from Cassandra
        df = spark.read \
            .format("org.apache.spark.sql.cassandra") \
            .options(keyspace=keyspace, table=table) \
            .load()

        # Define the output path
        output_path = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/{keyspace}/{table}.csv"

        # Write data to CSV
        df.write \
            .csv(output_path, header=True, mode="overwrite")

# Stop the Spark session
spark.stop()
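
The import leg is not shown above. A minimal sketch of it, reusing the same placeholder names for the storage account and assuming the target cluster's keyspaces and tables have already been created, might look like the following:

from pyspark.sql import SparkSession

# Spark session configured against the target Cassandra cluster
spark = SparkSession.builder \
    .appName("Import Azure Storage to Cassandra") \
    .config("spark.cassandra.connection.host", "<target-cassandra-host>") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.cassandra.auth.username", "<target-username>") \
    .config("spark.cassandra.auth.password", "<target-password>") \
    .getOrCreate()

# Same intermediary storage account as in the export step
storage_account_name = "<storage-account-name>"
storage_account_key = "<storage-account-key>"
container_name = "<container-name>"
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net", storage_account_key)

keyspaces = ["keyspace1", "keyspace2"]
tables = ["table1", "table2"]

for keyspace in keyspaces:
    for table in tables:
        input_path = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/{keyspace}/{table}.csv"

        # Read the staged CSV; an explicit schema is preferable to inferSchema
        # when column types must match the target table exactly
        df = spark.read.csv(input_path, header=True, inferSchema=True)

        # Append into the pre-created target table
        df.write \
            .format("org.apache.spark.sql.cassandra") \
            .options(keyspace=keyspace, table=table) \
            .mode("append") \
            .save()

# Stop the Spark session
spark.stop()

Note that CSV does not preserve Cassandra-specific types such as collections, so a richer staging format like Parquet may be a better fit for such tables.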

Lastly, whether an agent or an intermediary stash is used for the data transfer, the size and the number of tables matter for the reliability of the transfer, especially if the connection or the execution can be interrupted. Choosing between the options therefore also requires making the copying logic robust, for example by retrying each table and tracking which ones have completed, as sketched below.
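
As a rough illustration under those assumptions, the per-table copy can be wrapped with retries and a simple failure list so an interrupted run can be resumed or investigated. Here copy_table is a hypothetical helper standing in for the per-table read/write body from the earlier examples, and spark is the session already created in the notebook:

import time

# Hypothetical per-table copy routine wrapping the read/write body shown earlier;
# assumes the Spark session 'spark' from the notebook is in scope
def copy_table(keyspace, table):
    df = spark.read \
        .format("org.apache.spark.sql.cassandra") \
        .options(keyspace=keyspace, table=table) \
        .load()
    df.write \
        .format("org.apache.spark.sql.cassandra") \
        .options(keyspace=keyspace, table=table) \
        .mode("append") \
        .save()

def copy_table_with_retry(keyspace, table, max_attempts=3, backoff_seconds=30):
    # Retry each table a few times with a growing pause between attempts
    for attempt in range(1, max_attempts + 1):
        try:
            copy_table(keyspace, table)
            print(f"Copied {keyspace}.{table} on attempt {attempt}")
            return True
        except Exception as exc:
            print(f"Attempt {attempt} for {keyspace}.{table} failed: {exc}")
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)
    return False

# Record failures so the run can be resumed for just those tables
keyspaces = ["keyspace1", "keyspace2"]
tables = ["table1", "table2"]
failed = [(k, t) for k in keyspaces for t in tables if not copy_table_with_retry(k, t)]
print("Tables that need to be re-run:", failed)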