Wednesday, November 6, 2024

Find all k-distant indices in an array.

You are given a 0-indexed integer array nums and two integers key and k. A k-distant index is an index i of nums for which there exists at least one index j such that |i - j| <= k and nums[j] == key.

Return a list of all k-distant indices sorted in increasing order.

Example 1:

Input: nums = [3,4,9,1,3,9,5], key = 9, k = 1

Output: [1,2,3,4,5,6]

Explanation: Here, nums[2] == key and nums[5] == key.

- For index 0, |0 - 2| > k and |0 - 5| > k, so there is no j where |0 - j| <= k and nums[j] == key. Thus, 0 is not a k-distant index.

- For index 1, |1 - 2| <= k and nums[2] == key, so 1 is a k-distant index.

- For index 2, |2 - 2| <= k and nums[2] == key, so 2 is a k-distant index.

- For index 3, |3 - 2| <= k and nums[2] == key, so 3 is a k-distant index.

- For index 4, |4 - 5| <= k and nums[5] == key, so 4 is a k-distant index.

- For index 5, |5 - 5| <= k and nums[5] == key, so 5 is a k-distant index.

- For index 6, |6 - 5| <= k and nums[5] == key, so 6 is a k-distant index.

Thus, we return [1,2,3,4,5,6] which is sorted in increasing order.

Example 2:

Input: nums = [2,2,2,2,2], key = 2, k = 2

Output: [0,1,2,3,4]

Explanation: For all indices i in nums, there exists some index j such that |i - j| <= k and nums[j] == key, so every index is a k-distant index.

Hence, we return [0,1,2,3,4].

Constraints:

• 1 <= nums.length <= 1000

• 1 <= nums[i] <= 1000

• key is an integer from the array nums.

• 1 <= k <= nums.length

import java.util.ArrayList;
import java.util.List;

class Solution {
    public List<Integer> findKDistantIndices(int[] nums, int key, int k) {
        List<Integer> indices = new ArrayList<Integer>();
        for (int i = 0; i < nums.length; i++) {
            if (nums[i] == key) {
                // every in-bounds index within distance k of this key position qualifies
                for (int j = i - k; j <= i + k; j++) {
                    if (j >= 0 && j < nums.length && !indices.contains(j)) {
                        indices.add(j);
                    }
                }
            }
        }
        // key positions are visited left to right, so indices are already in increasing order
        return indices;
    }
}

Spot checks:

nums = [1,0], key = 1, k = 1 => [0,1]

nums = [1,0,1], key = 1, k = 1 => [0,1,2]

nums = [1,0,1,0,0,0,1], key = 1, k = 1 => [0,1,2,3,5,6]

nums = [1,0,1,0,0,0,1], key = 1, k = 2 => [0,1,2,3,4,5,6]


Tuesday, November 5, 2024

 This is an extension of the use cases for a drone management software platform in the cloud, the DFCS.

Autonomous drone fleet movement has previously been discussed using self-organizing maps, as implemented in https://github.com/raja0034/som4drones, but sometimes the fleet movement must be controlled precisely, especially at high velocity, in large numbers, or at high density, when the potential for congestion or collision is high; at high velocity, for example, drones behave much like missiles in flight. When there is continuous connectivity during flight, a centralized algorithm that controls the movement of all units for safe arrival can be more convenient and cheaper for the units. One such algorithm is the virtual structure method.

 The virtual structure method for UAV swarm movement is a control strategy where the swarm is organized into a predefined formation, often resembling a geometric shape. Instead of relying on a single leader UAV, the swarm is controlled as if it were a single rigid body or virtual structure. Each UAV maintains its position relative to this virtual structure, ensuring cohesive and coordinated movement. The steps include:

Virtual Structure Definition: A virtual structure, such as a line, triangle, or more complex shape, is defined. This structure acts as a reference for the UAVs' positions.

Relative Positioning: Each UAV maintains its position relative to the virtual structure, rather than following a specific leader UAV. This means that if one UAV moves, the others adjust their positions to maintain the formation.

Coordination and Control: The UAVs use local communication and control algorithms to ensure they stay in their designated positions within the virtual structure. This can involve adjusting speed, direction, and altitude based on the positions of neighboring UAVs.

Fault Tolerance: Since the control does not rely on a single leader, the swarm can be more resilient to failures. If one UAV fails, the others can still maintain the formation.

 A sample implementation where each UAV follows a leader to maintain a formation might appear as follows:

import numpy as np

class UAV:
    def __init__(self, position):
        self.position = np.array(position, dtype=float)

class Swarm:
    def __init__(self, num_uavs, leader_position, formation_pattern):
        self.leader = UAV(leader_position)
        self.formation_pattern = formation_pattern  # offsets relative to the leader
        self.uavs = [UAV(self.leader.position + offset) for offset in formation_pattern]

    def update_leader_position(self, new_position):
        self.leader.position = np.array(new_position, dtype=float)
        self.update_uav_positions()

    def update_uav_positions(self):
        # each UAV keeps its fixed offset from the leader, so the formation moves rigidly
        for i, uav in enumerate(self.uavs):
            uav.position = self.leader.position + self.formation_pattern[i]

    def get_positions(self):
        return [uav.position for uav in self.uavs]

# Example usage
num_uavs = 5
leader_position = [0, 0, 0]  # initial leader position
formation_pattern = [np.array([i * 2, 0, 0]) for i in range(num_uavs)]  # line formation

swarm = Swarm(num_uavs, leader_position, formation_pattern)

# Update leader position
new_leader_position = [5, 5, 5]
swarm.update_leader_position(new_leader_position)

# Get updated positions of all UAVs
positions = swarm.get_positions()
print("Updated UAV positions:")
for pos in positions:
    print(pos)
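
The sample above is strictly leader-follower: if the leader UAV fails, the formation loses its reference. A closer match to the virtual structure method described earlier is to track a virtual reference point that no single UAV embodies. The sketch below illustrates that idea; the class name, formation offsets, and velocity parameters are illustrative assumptions rather than part of any published implementation.

import numpy as np

class VirtualStructureSwarm:
    """Minimal sketch: UAVs hold fixed offsets from a virtual reference point,
    so the formation moves as a rigid body without depending on a leader UAV."""

    def __init__(self, formation_pattern, reference_position=(0.0, 0.0, 0.0)):
        self.reference = np.array(reference_position, dtype=float)
        self.offsets = [np.array(o, dtype=float) for o in formation_pattern]

    def move_reference(self, velocity, dt=1.0):
        # Translate the virtual structure; every UAV keeps its offset from it.
        self.reference = self.reference + np.array(velocity, dtype=float) * dt
        return self.positions()

    def positions(self):
        return [self.reference + o for o in self.offsets]

# Example: a triangular formation stepped along the x-axis for two seconds
triangle = [(0, 0, 0), (-2, 2, 0), (-2, -2, 0)]
vs_swarm = VirtualStructureSwarm(triangle)
for pos in vs_swarm.move_reference(velocity=(1, 0, 0), dt=2.0):
    print(pos)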


Monday, November 4, 2024

 Chief executives are increasingly demanding that their technology investments, including data and AI, work harder and deliver more value to their organizations. Generative AI offers additional tools to achieve this, but it adds complexity to the challenge. CIOs must ensure their data infrastructure is robust enough to cope with the enormous data processing demands and governance challenges posed by these advances. Technology leaders see this challenge as an opportunity for AI to deliver considerable growth to their organizations, in terms of both the top and bottom lines. While many business leaders have stated publicly that it is especially important for AI projects to help reduce costs, they also say it is important that these projects enable new revenue generation. Gartner forecasts worldwide IT spending to grow by 4.3% in 2023 and 8.8% in 2024, with most of that growth concentrated in the software category that includes spending on data and AI. AI-driven efficiency gains promise business growth, with 81% of leaders expecting a gain greater than 25% and 33% believing it could exceed 50%. CDOs and CTOs echo this sentiment: “If we can automate our core processes with the help of self-learning algorithms, we’ll be able to move much faster and do more with the same amount of people.” and “Ultimately for us it will mean automation at scale and at speed.”

Organizations are increasingly prioritizing AI projects due to economic uncertainty and the increasing popularity of generative AI. Technology leaders are focusing on longer-range projects that will have a significant impact on the company rather than pursuing short-term projects. This is because of the rapid pace at which use cases and proofs of concept are coming at businesses, which makes it crucial to apply existing metrics and frameworks rather than creating new ones. The key is to ensure that the right projects are prioritized, considering the expected business impact, complexity, and cost of scaling.

Data infrastructure and AI systems are becoming increasingly intertwined due to the enormous demands placed on data collection, processing, storage, and analysis. As financial services companies like Razorpay grow, their data infrastructure needs to be modernized to accommodate the growing volume of payments and the need for efficient storage. Advances in AI capabilities, such as generative AI, have increased the urgency to modernize legacy data architectures. Generative AI and the LLMs that support it will multiply workload demands on data systems and make tasks more complex. The implications of generative AI for data architecture include feeding unstructured data into models, storing long-term data in ways conducive to AI consumption, and putting adequate security around models. Organizations supporting LLMs need a flexible, scalable, and efficient data infrastructure. Many have claimed success with the adoption of the Lakehouse architecture, which combines features of data warehouse and data lake architectures. This architecture helps scale responsibly, with a good balance of cost versus performance. As one data leader observed, “I can now organize my law enforcement data, I can organize my airline checkpoint data, I can organize my rail data, and I can organize my inspection data. And I can at the same time make correlations and glean understandings from all of that data, separate and together.”

Data silos are a significant challenge for data and technology executives, as they result from the disparate approaches taken by different parts of organizations to store and protect data. The proliferation of data, analytics, and AI systems has added complexity, resulting in a myriad of platforms, vast amounts of data duplication, and often separate governance models. Most organizations employ fewer than 10 data and AI systems, but the proliferation is most extensive in the largest ones. To simplify, organizations aim to consolidate the number of platforms they use and seamlessly connect data across the enterprise. Companies like Starbucks are centralizing data by building cloud-centric, domain-specific data hubs, while GM's data and analytics team is focusing on reusable technologies to simplify infrastructure and avoid duplication. Additionally, organizations need space to innovate, which can be achieved by separating the functions that manage data from those that involve greenfield exploration.

Reference: previous article


Sunday, November 3, 2024

 There are N points (numbered from 0 to N−1) on a plane. Each point is colored either red ('R') or green ('G'). The K-th point is located at coordinates (X[K], Y[K]) and its color is colors[K]. No point lies on coordinates (0, 0).

We want to draw a circle centered on coordinates (0, 0), such that the number of red points and green points inside the circle is equal. What is the maximum number of points that can lie inside such a circle? Note that it is always possible to draw a circle with no points inside.

Write a function that, given two arrays of integers X, Y and a string colors, returns an integer specifying the maximum number of points inside a circle containing an equal number of red points and green points.

Examples:

1. Given X = [4, 0, 2, −2], Y = [4, 1, 2, −3] and colors = "RGRR", your function should return 2. The circle contains points (0, 1) and (2, 2), but not points (−2, −3) and (4, 4).

class Solution {
    public int solution(int[] X, int[] Y, String colors) {
        // find the maximum squared distance of any point from the origin
        double max = 0;
        int count = 0;
        for (int i = 0; i < X.length; i++) {
            double dist = X[i] * X[i] + Y[i] * Y[i];
            if (dist > max) {
                max = dist;
            }
        }
        // shrink a candidate radius from just beyond the farthest point toward zero,
        // counting the red and green points enclosed at each step
        // (the fixed 0.1 step assumes no two distinct point distances fall between samples)
        for (double radius = Math.sqrt(max) + 1; radius > 0; radius -= 0.1) {
            int r = 0;
            int g = 0;
            for (int j = 0; j < colors.length(); j++) {
                if (Math.sqrt(X[j] * X[j] + Y[j] * Y[j]) > radius) {
                    continue;
                }
                if (colors.charAt(j) == 'R') {
                    r++;
                } else {
                    g++;
                }
            }
            // a balanced circle holds r red and r green points, i.e. 2r points in total
            if (r == g && r > 0) {
                count = Math.max(count, r * 2);
            }
        }
        return count;
    }
}

Compilation successful.

Example test: ([4, 0, 2, -2], [4, 1, 2, -3], 'RGRR')

OK

Example test: ([1, 1, -1, -1], [1, -1, 1, -1], 'RGRG')

OK

Example test: ([1, 0, 0], [0, 1, -1], 'GGR')

OK

Example test: ([5, -5, 5], [1, -1, -3], 'GRG')

OK

Example test: ([3000, -3000, 4100, -4100, -3000], [5000, -5000, 4100, -4100, 5000], 'RRGRG')

OK


Saturday, November 2, 2024

 A previous article talked about ETL, its modernization, new takes on old issues and their resolutions, and efficiency and scalability. This section talks about the bigger picture into which this fits.

In terms of infrastructure for data engineering projects, customers usually get started on a roadmap that progressively builds a more mature data function. One approach to drawing this roadmap, which experts observe repeated across deployment stamps, involves building the data stack in distinct stages, with a stack for every phase of the journey. While needs, level of sophistication, maturity of solutions, and budget determine the shape these stacks take, the four phases are more or less distinct and repeated across these endeavors: starters, growth, machine learning, and real-time. Customers begin with a starters stack, where the essential function is to collect the data, often by implementing a drain. A unified data layer at this stage significantly reduces engineering bottlenecks. The second stage is the growth stack, which solves the problem of proliferating data destinations and independent silos by centralizing data into a warehouse that also becomes a single source of truth for analytics. When this matures, customers want to move beyond historical analytics into predictive analytics. At this stage, a data lake and a machine learning toolset come in handy to leverage unstructured data and mitigate problems proactively. The next and final frontier addresses a limitation of this stack: it cannot deliver personalized experiences in real time.

In this way, organizations solve the point-to-point integration problem by implementing a unified, event-based integration layer in the starters stack. Then, when the needs become a little more sophisticated and downstream teams (and management) must answer harder questions and act on all of the data, they centralize both clickstream data and relational data to build a full picture of the customer and their journey. After solving these challenges by implementing a cloud data warehouse as the single source of truth for all customer data, and then using reverse ETL pipelines to activate that data, organizations gear up for the next stage. As the business grows, optimization requires moving from historical analytics to predictive analytics, including the need to incorporate unstructured data into the analysis. To accomplish that, organizations implement the ML stack, which includes a data lake (for unstructured data) and a basic machine learning toolset that can generate predictive outputs such as churn scores. Finally, these outputs are put to use by sending them through the warehouse and reverse ETL pipelines, making them available as data points in downstream systems, including the CRM, for customer touchpoints.

#codingexercise: CodingExercise-11-02-2024.docx

Thursday, October 31, 2024

 

A previous article discussed the ETL process and its evolution under recent paradigms, followed by a discussion of the role of an orchestrator in data engineering. This section focuses on pipeline issues and troubleshooting. The return on investment in data engineering projects is often reduced by how fragile the system becomes and the maintenance it requires. Systems do fail, but planning for failure means making them easier to maintain and extend, providing automation for handling errors, and learning from experience. The minimum viable product principle and the 80/20 principle are time-honored guides here.

Direct and indirect costs of ETL systems are significant, as they can lead to inefficient operations, long run times, and high bills from providers. Indirect costs, such as constant triaging and data failures, can be even more significant. Teams that win build efficient systems that allow them to focus on feature development and data democratization. SaaS (Software as a Service) offers a cost-effective solution, but data failures can still lead to loss of trust, revenue, and reputation. To minimize these costs, focus on maintainability, data quality, error handling, and improved workflows. Monitoring and benchmarking are essential for minimizing pipeline issues and expediting troubleshooting efforts. Proper monitoring and alerting can help improve the maintainability of data systems and lower the costs associated with broken data. Observing data across ingestion, transformation, and storage, handling errors as they arise, and alerting the team when things break are crucial for ensuring good business decisions.

Data reliability and usefulness are assessed using metrics such as freshness, volume, and quality. Freshness measures the timeliness and relevance of data, ensuring accurate and recent information for analytics, decision-making, and other data-driven processes. Common freshness metrics include the time elapsed between the most recent timestamp and the current timestamp, the lag between source data and the derived dataset, the refresh rate, and latency. Volume refers to the amount of data needed for processing, storage, and management within a system. Quality involves ensuring data is accurate, consistent, and reliable throughout its lifecycle. Examples of data quality metrics include uniqueness, completeness, and validity.
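
A minimal sketch of how these three kinds of metrics might be computed over a batch of records follows; the field names (event_time, order_id) and the expected-row threshold are assumptions made purely for illustration.

from datetime import datetime, timezone

def freshness_lag_seconds(records, time_field="event_time"):
    # Freshness: gap between the newest record (a timezone-aware datetime) and now.
    latest = max(r[time_field] for r in records)
    return (datetime.now(timezone.utc) - latest).total_seconds()

def volume_ok(records, expected_min_rows):
    # Volume: did this batch deliver at least the number of rows we expect?
    return len(records) >= expected_min_rows

def quality_report(records, key_field="order_id"):
    # Quality: simple completeness and uniqueness ratios on a key column.
    keys = [r.get(key_field) for r in records]
    non_null = [k for k in keys if k is not None]
    return {
        "completeness": len(non_null) / max(len(keys), 1),
        "uniqueness": len(set(non_null)) / max(len(non_null), 1),
    }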

Monitoring involves detecting errors in a timely fashion and implementing strict measures to improve data quality. Techniques to improve data quality include logging and monitoring, lineage, and visual representations of pipelines and systems. Lineage should be complete and granular, allowing for better insight and efficiency in triaging errors and improving productivity. Overall, implementing these metrics helps ensure data quality and reliability within an organization.

Anomaly detection systems analyze time series data to make statistical forecasts within a certain confidence interval. They can catch errors that originate outside the data systems themselves, such as a bug introduced by a payments processing team that decreases recorded purchases. Data diffs are systems that report on the data changes introduced by changes in code, helping accurate systems remain accurate, especially when used as an indicator of data quality. Tools like Datafold and SQLMesh have data diffing functionality. Assertions are constraints put on data outputs to validate source data. They are simpler than anomaly detection and can be found in libraries like Great Expectations (also known as GX Expectations) or in systems with built-in assertion definitions.
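
Rather than reproduce any particular library's API, the sketch below shows the general shape of an assertion layer in plain Python; a framework such as Great Expectations wraps the same idea in declarative expectation suites. The column names and bounds are invented for illustration.

def assert_non_null(rows, column):
    # Fail loudly if any row is missing the required column.
    nulls = [i for i, r in enumerate(rows) if r.get(column) is None]
    assert not nulls, f"{column} has {len(nulls)} null values, e.g. at rows {nulls[:5]}"

def assert_in_range(rows, column, low, high):
    # Fail loudly if any value falls outside the expected bounds.
    bad = [r[column] for r in rows if not (low <= r[column] <= high)]
    assert not bad, f"{column} has {len(bad)} values outside [{low}, {high}]"

# Example: validate a small batch before loading it downstream
batch = [{"amount": 12.5, "currency": "USD"}, {"amount": 7.0, "currency": "USD"}]
assert_non_null(batch, "currency")
assert_in_range(batch, "amount", 0, 10_000)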

Error handling is crucial for data systems, as is recovery from the effects of errors, such as lost data or downtime. Error handling involves automating responses to errors or boundary conditions to keep data systems functioning, or alerting the team in a timely and unobtrusive manner. Approaches include conditional logic, retry mechanisms, and pipeline decomposition. These methods help keep the impact of errors contained and ensure the smooth functioning of data systems.
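
A hedged sketch of the retry mechanism mentioned above follows; the attempt count, backoff schedule, and retriable exception types are illustrative choices, not prescriptions.

import time

def with_retries(task, max_attempts=3, base_delay=2.0, retriable=(IOError, TimeoutError)):
    """Run task(); on a retriable failure, wait with exponential backoff and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except retriable:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the error so the orchestrator can alert
            time.sleep(base_delay * 2 ** (attempt - 1))

# Example usage with a hypothetical extraction function:
# data = with_retries(lambda: fetch_from_api("https://example.com/orders"))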

Graceful degradation and error isolation are essential for maintaining limited functionality even when part of a system fails. Error isolation is enabled through pipeline decomposition, which allows systems to fail in a contained manner, while graceful degradation keeps the rest of the system working so that only one part of the business notices the error.

Alerting should be a last line of defense, as receiving alerts is reactive. Isolating errors and building systems that degrade gracefully can reduce alarm fatigue and create a good developer experience for the team.

Recovery systems should be built for disasters, including lost data. Staged data, stored in Parquet-based formats such as Delta Lake and organized with patterns like the medallion architecture, can be used for disaster recovery. Backfilling, the practice of simulating historical runs of a pipeline to create a complete dataset, can save time when something breaks.
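
Backfilling usually amounts to replaying the same parameterized pipeline over a range of historical dates. In the sketch below, run_pipeline is a placeholder for whatever the actual pipeline entry point is; the loop and date parameters are assumptions for illustration.

from datetime import date, timedelta

def backfill(run_pipeline, start: date, end: date):
    # Re-run the pipeline once per historical day, oldest first, so each
    # partition is rebuilt exactly as a normal daily run would have built it.
    day = start
    while day <= end:
        run_pipeline(run_date=day)
        day += timedelta(days=1)

# Example: rebuild October's partitions after a bug fix
# backfill(run_pipeline=my_daily_job, start=date(2024, 10, 1), end=date(2024, 10, 31))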

Improving workflows is crucial in data engineering, as it is an inherently collaborative job. Data engineering is a question of when things break, not if. Starting with systems that prioritize troubleshooting, adaptability, and recovery can reduce headaches down the line.

In the context of software teams, understanding their motivations and workflows is crucial for fostering healthy relationships and improving efficiency. By focusing on the team's goals and understanding their workflows, you can craft a process to improve efficiency.

Structured, pragmatic approaches can ensure healthy relationships through Service-Level Agreements (SLAs), data contracts, APIs, compassion and empathy, and aligning incentives. SLAs can be used to define performance metrics, responsibilities, response and resolution times, and escalation procedures, improving the quality of data that is outside of your control. Data contracts, popularized by dbt, govern data ingested from external sources, providing a layer of standardization and consistency. APIs can be used to transmit an expected set of data, providing granular access control, scalability benefits, and versioning, which can be useful for compliance.

Compassion and empathy are essential in engineering and psychology, as understanding coworkers' motivations, pain points, and workflows allows for effective communication and appeal to their incentives. In the digital age, it's essential to go the extra mile to understand coworkers and appeal to their incentives.

Setting key performance indicators (KPIs) around common incident management metrics can help justify the time and energy required to do the job right. These metrics include the number of incidents (N), time to detection (TTD), time to resolution (TTR), and data downtime, computed as N × (TTD + TTR). For example, four incidents with an average TTD of 2 hours and an average TTR of 3 hours amount to 4 × (2 + 3) = 20 hours of data downtime.

Continually iterating and adjusting processes in the wake of failures and enhancing good pipelines to become great are some ways to improve outcomes. Documentation is crucial for understanding how to fix errors and improve the quality of data pipelines. Postmortems are valuable for analyzing failures and learning from them, leading to fewer events that require recovery. Unit tests are essential for validating small pieces of code and ensuring they produce desired results. Continuous integration/continuous deployment (CI/CD) is a preventative practice to minimize future errors and ensure a consistent code base.
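
As an illustration of the unit-testing point, a small transformation can be covered directly in pytest style; the dedupe_orders function below is a hypothetical example, not taken from any specific pipeline.

def dedupe_orders(orders):
    """Keep the last record seen for each order_id."""
    return list({o["order_id"]: o for o in orders}.values())

def test_dedupe_orders_keeps_latest():
    raw = [
        {"order_id": 1, "status": "pending"},
        {"order_id": 1, "status": "shipped"},
        {"order_id": 2, "status": "pending"},
    ]
    result = dedupe_orders(raw)
    assert len(result) == 2
    assert {"order_id": 1, "status": "shipped"} in result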

Engineers should simplify and abstract complex code to improve collaboration and reduce errors. Building data systems as code, which can be rolled back and reverted to previous states, can improve observability, disaster recovery, and collaboration. Tools that are difficult or impossible to version control or manipulate through code should be used with caution. With responsibilities defined, incentives aligned, and a monitoring/troubleshooting toolkit in place, engineers can automate and optimize data workflows. Balancing automation and practicality is essential in data engineering, ensuring robust, resilient systems ready for scaling.


Wednesday, October 30, 2024

 This section explores dependency management and pipeline orchestration, which are aptly dubbed “undercurrents” in the book “Fundamentals of Data Engineering” by Matt Housley and Joe Reis.

Data orchestration is a process of dependency management, facilitated through automation. It involves scheduling, triggering, monitoring, and resource allocation. Data orchestrators are different from cron-based schedulers: they can be triggered by events, webhooks, schedules, and even intra-workflow dependencies. Data orchestration provides a structured, automated, and efficient way to handle large-scale data from diverse sources.

Orchestration steers workflows toward efficiency and functionality, with an orchestrator serving as the tool enabling these workflows. They typically trigger pipelines based on a schedule or a specific event. Event-driven pipelines are beneficial for handling unpredictable data or resource-intensive jobs.

The perks of having an orchestrator in your data engineering toolkit include workflow management, automation, error handling and recovery, monitoring and alerting, and resource optimization. Directed Acyclic Graphs (DAGs) bring order, control, and repeatability to data workflows, managing dependencies and ensuring a structured and predictable flow of data. They are pivotal for orchestrating and visualizing pipelines, making them indispensable for managing complex workflows, particularly within teams or large-scale setups. For example, a DAG serves as a clear roadmap defining the order of tasks; with this lens, it is possible to organize the creation, scheduling, and monitoring of data pipelines, as in the sketch below.
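
A minimal DAG sketch in the Airflow 2.x style, showing an ingest → transform → publish ordering; the DAG id, daily schedule, and task bodies are assumptions made for illustration.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("pull raw data from sources")

def transform():
    print("clean and model the data")

def publish():
    print("expose tables to downstream consumers")

with DAG(
    dag_id="daily_sales",            # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_publish = PythonOperator(task_id="publish", python_callable=publish)

    # dependencies form the DAG: ingest must finish before transform, then publish
    t_ingest >> t_transform >> t_publish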

Data orchestration tools have evolved significantly over the past two decades, with Apache Airflow and Luigi emerging as the most dominant. However, it is crucial to choose the right tool for the job, as each tool has its strengths and weaknesses. Data orchestrators, like the conductor of an orchestra, balance declarative and imperative frameworks to provide flexibility and efficiency alongside software engineering best practices.

When selecting an orchestrator, factors such as scalability, code and configuration reusability, and the ability to handle complex logic and dependencies are important to consider. The orchestrator should be able to scale vertically or horizontally, ensuring that the process of orchestrating data is separate from the process of transforming data.

Orchestration is about connections, and platforms like Azure, Amazon, and Google offer hosted versions of popular tools like Airflow. Platform-embedded alternatives like Databricks Workflows provide more visibility into the tasks they orchestrate. Popular orchestrators have strong community support and continuous development, ensuring they remain up to date with the latest technologies and best practices. Support is crucial for both closed-source and paid solutions, as solutions engineers can help resolve issues.

Observability is essential for understanding transformation flows and ensuring your orchestrator supports various methods for alerting your team. To implement an orchestration solution, you can build a solution, buy an off-the-shelf tool, self-host an open-source tool, or use a tool included with your cloud provider or data platform. Apache Airflow, developed by Airbnb, is a popular choice due to its ease of adoption, simple deployment, and ubiquity in the data space. However, it has flaws, such as being engineered to orchestrate, not transform or ingest.

Open-source tools like Airflow and Prefect are popular orchestrators with paid, hosted services and support. Newer tools like Mage, Keboola, and Kestra are also innovating. Open-source tools offer community support and the ability to modify source code. However, they depend on support for continued development and may risk project abandonment or instability. A tool's history, support, and stability must be considered when choosing a solution.

Data orchestration is a crucial aspect of modern data engineering, and much of it now involves relational databases for data transformation. Tools like dbt, Delta Live Tables, Dataform, and SQLMesh act as orchestrators that evaluate dependencies, optimize, and execute commands against a database to produce the desired results. A potential limitation, however, is the lack of a mechanism to observe data across the different layers, which leads to a disconnect between sources and cleaned data and can make it harder to identify errors in downstream data.

Design patterns can significantly enhance the efficiency, reliability, and maintainability of data orchestration processes. Some orchestration solutions make these patterns easier to adopt, such as building backfill logic into pipelines, ensuring idempotence, and supporting event-driven data orchestration; an idempotent write is sketched below. These patterns can help avoid one-time thinking and ensure consistent results in data engineering. Choosing a platform-specific data orchestrator can provide greater visibility between and within data workflows, making it essential for ETL workflows.
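
One common way to achieve idempotence is to delete and rewrite the partition for the run date inside a single transaction, so reruns overwrite their own output instead of appending duplicates. The sketch below assumes a DB-API style connection (for example sqlite3) and an invented daily_sales table.

def write_partition_idempotently(conn, run_date, rows):
    # Delete-then-insert for the run date inside one transaction, so re-running
    # the job for the same day replaces its output rather than duplicating it.
    with conn:  # commits on success, rolls back on error (DB-API / sqlite3 behavior)
        conn.execute("DELETE FROM daily_sales WHERE sale_date = ?", (str(run_date),))
        conn.executemany(
            "INSERT INTO daily_sales (sale_date, store_id, amount) VALUES (?, ?, ?)",
            [(str(run_date), r["store_id"], r["amount"]) for r in rows],
        )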

Orchestrators are complex and can be difficult to develop against locally because of their trigger mechanics. To improve the development experience, invest in tools that allow fast feedback loops, easy error identification, and a developer-friendly local environment. Retry and fallback logic are essential for handling failures in a complex data stack, ensuring data integrity and system reliability. Idempotent pipelines set up scenarios for retrying operations, or for skipping them and alerting the proper parties. Parameterized execution allows for more malleability in orchestration, letting a single pipeline cover multiple cases and be reused; a minimal example follows below. Lineage refers to the path traveled by data through its lifecycle, and a robust lineage solution is crucial for debugging issues and extending pipelines. Column-level lineage is becoming an industry norm in SQL orchestration, and platform-integrated orchestration solutions like Databricks Unity Catalog and Delta Live Tables offer advanced lineage capabilities. Pipeline decomposition breaks pipelines into smaller tasks for better monitoring, error handling, and scalability. Building autonomous DAGs can mitigate dependencies and critical failures, making it easier to build and debug workflows.
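
A minimal sketch of parameterized execution: the pipeline entry point takes its run window from the orchestrator (or the command line) rather than hard-coding dates. The argument names and the print placeholder are assumptions for illustration.

import argparse
from datetime import date

def run(start: date, end: date, full_refresh: bool = False):
    # The same pipeline body serves daily runs, backfills, and full rebuilds,
    # depending only on the parameters the orchestrator passes in.
    print(f"Processing {start}..{end} (full_refresh={full_refresh})")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--start", type=date.fromisoformat, required=True)
    parser.add_argument("--end", type=date.fromisoformat, required=True)
    parser.add_argument("--full-refresh", action="store_true")
    args = parser.parse_args()
    run(args.start, args.end, args.full_refresh)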

The evolution of transformation tools, such as containerized infrastructure and hybrid tools like Prefect and Dagster, may change the landscape of data teams. These tools can save time and resources, enabling better observability and monitoring within data warehouses. Emerging tools like SQLMesh may challenge established players like dbt, while plug-and-play solutions like Databricks Workflows are becoming more appealing. These developments will enable data teams to deliver quality data in a timely and robust manner.