Thursday, October 31, 2024

 

A previous article discussed the ETL process and its evolution under recent paradigms, followed by a discussion of the role of an orchestrator in data engineering. This section focuses on pipeline issues and troubleshooting. The return on investment in data engineering projects is often reduced by how fragile the system becomes and the maintenance it requires. Systems do fail, but planning for failure means making them easier to maintain and extend, providing automation for handling errors, and learning from experience. The minimum viable product principle and the 80/20 principle remain time-honored guides here.

Direct and indirect costs of ETL systems are significant: inefficient operations lead to long run times and high bills from providers, while indirect costs, such as constant triaging and broken data, can be even more significant. Teams that win build efficient systems that free them to focus on feature development and data democratization. SaaS (software as a service) tools can be a cost-effective way to get there, but broken data still leads to loss of trust, revenue, and reputation. To minimize these costs, focus on maintainability, data quality, error handling, and improved workflows. Monitoring and benchmarking are essential for minimizing pipeline issues and expediting troubleshooting efforts. Proper monitoring and alerting improve the maintainability of data systems and lower the costs associated with broken data. Observing data across ingestion, transformation, and storage, handling errors as they arise, and alerting the team when things break are crucial for ensuring good business decisions.

Data reliability and usefulness are assessed using metrics such as freshness, volume, and quality. Freshness measures the timeliness and relevance of data, ensuring accurate and recent information for analytics, decision-making, and other data-driven processes. Common metrics include the length of time between the most recent timestamp and the current timestamp, the lag between source data and the dataset, refresh rate, and latency. Volume refers to the amount of data needed for processing, storage, and management within a system. Quality involves ensuring data is accurate, consistent, and reliable throughout its lifecycle; example data quality metrics include uniqueness, completeness, and validity.
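As a minimal sketch of how freshness and volume might be measured (assuming a pandas DataFrame with an updated_at timestamp column; the column name and tolerance are illustrative):

import pandas as pd

def freshness_lag_minutes(df, ts_col="updated_at"):
    """Minutes between the most recent record and the current UTC time."""
    latest = pd.to_datetime(df[ts_col], utc=True).max()
    return (pd.Timestamp.now(tz="UTC") - latest).total_seconds() / 60

def volume_ok(df, expected_rows, tolerance=0.2):
    """Flag loads that deviate more than `tolerance` from the expected row count."""
    return abs(len(df) - expected_rows) <= tolerance * expected_rows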

Monitoring involves detecting errors in a timely fashion and implementing strict measures to improve data quality. Techniques to improve data quality include logging and monitoring, lineage, and visual representations of pipelines and systems. Lineage should be complete and granular, allowing for better insight and efficiency in triaging errors and improving productivity. Overall, implementing these metrics helps ensure data quality and reliability within an organization.

Anomaly detection systems analyze time series data to make statistical forecasts within a certain confidence interval. They can catch errors that originate outside your systems, such as a bug shipped by a payments-processing team that depresses recorded purchases. Data diffs report the data changes introduced by changes in code, helping accurate systems stay accurate and serving as an indicator of data quality; tools like Datafold and SQLMesh offer data diffing functionality. Assertions are constraints placed on data outputs to validate source data. They are simpler than anomaly detection and can be found in libraries like Great Expectations (GX) or in systems with built-in assertion definitions.
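A hand-rolled sketch of the assertion idea (this is not the Great Expectations API; the column names and thresholds are illustrative):

import pandas as pd

def assert_unique(df, col):
    """Fail the pipeline if a supposedly unique column contains duplicates."""
    dupes = int(df[col].duplicated().sum())
    if dupes:
        raise ValueError(f"{dupes} duplicate values found in {col}")

def assert_complete(df, col, max_null_rate=0.0):
    """Fail if the share of nulls in a column exceeds the allowed rate."""
    null_rate = df[col].isna().mean()
    if null_rate > max_null_rate:
        raise ValueError(f"null rate {null_rate:.2%} in {col} exceeds {max_null_rate:.2%}")

orders = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, None, 5.0]})
assert_unique(orders, "order_id")  # raises: 1 duplicate values found in order_id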

Error handling is crucial for data systems, enabling recovery from failures and their effects, such as lost data or downtime. It involves automating responses to errors or boundary conditions to keep data systems functioning, or alerting the team in a timely and discreet manner. Approaches include conditional logic, retry mechanisms, and pipeline decomposition. These methods keep the impact of errors contained and ensure the smooth functioning of data systems.
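A minimal sketch of a retry mechanism with exponential backoff (the wrapped load_to_warehouse step is hypothetical):

import logging
import time

def with_retries(fn, max_attempts=3, base_delay=2.0):
    """Retry a flaky step with exponential backoff before giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                logging.error("step failed after %d attempts: %s", attempt, exc)
                raise
            time.sleep(base_delay ** attempt)  # waits 2s, then 4s, then 8s, ...

# usage: with_retries(lambda: load_to_warehouse(batch))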

Graceful degradation and error isolation are essential for maintaining limited functionality when part of a system fails. Error isolation is enabled by pipeline decomposition, which allows systems to fail in a contained manner; graceful degradation then preserves limited functionality so that, ideally, only one part of the business notices an error.

Alerting should be a last line of defense, as receiving alerts is reactive. Isolating errors and building systems that degrade gracefully can reduce alarm fatigue and create a good developer experience for the team.

Recovery systems should be built for disasters, including lost data. Staged data, using Parquet-based formats like Delta Lake and patterns like the medallion architecture, can be used for disaster recovery. Backfilling, the practice of simulating historical runs of a pipeline to create a complete dataset, can save time when something breaks.
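A minimal backfill sketch (the run_pipeline callable and the dates are illustrative): replaying a date-partitioned pipeline over a historical window recreates the dataset after an outage or a logic change.

from datetime import date, timedelta

def backfill(run_pipeline, start, end):
    """Replay a date-partitioned pipeline for every day in [start, end]."""
    day = start
    while day <= end:
        run_pipeline(run_date=day)  # each run should be idempotent for its partition
        day += timedelta(days=1)

# usage: backfill(run_pipeline=my_daily_job, start=date(2024, 1, 1), end=date(2024, 1, 31))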

Improving workflows is crucial in data engineering, as it is an inherently collaborative job. Data engineering is a question of when things break, not if. Starting with systems that prioritize troubleshooting, adaptability, and recovery can reduce headaches down the line.

In the context of software teams, understanding motivations and workflows is crucial for fostering healthy relationships. By focusing on a team's goals and how it works, you can craft processes that improve efficiency.

Structured, pragmatic approaches can ensure healthy relationships through Service-Level Agreements (SLAs), data contracts, APIs, compassion and empathy, and aligning incentives. SLAs can be used to define performance metrics, responsibilities, response and resolution times, and escalation procedures, improving the quality of data that is outside of your control. Data contracts, popularized by dbt, govern data ingested from external sources, providing a layer of standardization and consistency. APIs can be used to transmit an expected set of data, providing granular access control, scalability benefits, and versioning, which can be useful for compliance.
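One possible sketch of a data contract enforced in code, using pydantic (the OrderEvent schema and its fields are illustrative assumptions, not a standard): records that violate the agreed schema are rejected at the boundary instead of silently corrupting downstream tables.

from pydantic import BaseModel, ValidationError

class OrderEvent(BaseModel):
    order_id: str
    amount: float
    currency: str = "USD"

def validate_batch(records):
    """Split a batch into rows that honor the contract and rows that do not."""
    valid, rejected = [], []
    for rec in records:
        try:
            valid.append(OrderEvent(**rec))
        except ValidationError as exc:
            rejected.append((rec, str(exc)))  # route to a dead-letter table for follow-up
    return valid, rejected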

Compassion and empathy are essential in engineering: understanding coworkers' motivations, pain points, and workflows allows for effective communication and appeals to their incentives. In a digital-first age, it is worth going the extra mile to understand coworkers.

Setting key performance indicators (KPIs) around common incident management metrics can help justify the time and energy required to do the job right. These metrics include the number of incidents (N), time to detection (TTD), time to resolution (TTR), and data downtime, computed as N × (TTD + TTR).
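A small illustration of the data downtime calculation (the incident numbers are made up); summing TTD + TTR per incident is equivalent to N × (average TTD + average TTR):

def data_downtime_hours(incidents):
    """incidents is a list of (ttd, ttr) pairs, in hours."""
    return sum(ttd + ttr for ttd, ttr in incidents)

# three incidents: (1 + 2) + (0.5 + 4) + (2 + 1) = 10.5 hours of data downtime
print(data_downtime_hours([(1, 2), (0.5, 4), (2, 1)]))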

Continually iterating and adjusting processes in the wake of failures and enhancing good pipelines to become great are some ways to improve outcomes. Documentation is crucial for understanding how to fix errors and improve the quality of data pipelines. Postmortems are valuable for analyzing failures and learning from them, leading to fewer events that require recovery. Unit tests are essential for validating small pieces of code and ensuring they produce desired results. Continuous integration/continuous deployment (CI/CD) is a preventative practice to minimize future errors and ensure a consistent code base.
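A minimal unit test sketch with pytest (the dedupe_orders transformation and its columns are illustrative), verifying a small piece of pipeline logic in isolation:

# test_dedupe.py -- run with `pytest`
import pandas as pd

def dedupe_orders(df):
    """Keep the most recent record per order_id."""
    return (df.sort_values("updated_at")
              .drop_duplicates(subset="order_id", keep="last")
              .reset_index(drop=True))

def test_dedupe_keeps_latest_record():
    df = pd.DataFrame({
        "order_id": [1, 1, 2],
        "amount": [10.0, 12.0, 5.0],
        "updated_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
    })
    out = dedupe_orders(df)
    assert len(out) == 2
    assert out.loc[out.order_id == 1, "amount"].item() == 12.0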

Engineers should simplify and abstract complex code to improve collaboration and reduce errors. Building data systems as code, which can be rolled back and reverted to previous states, improves observability, disaster recovery, and collaboration. Tools that are difficult or impossible to version control or manipulate through code should be used with caution. With responsibilities defined, incentives aligned, and a monitoring and troubleshooting toolkit in place, engineers can automate and optimize data workflows. Balancing automation and practicality is essential in data engineering, ensuring robust, resilient systems ready for scaling.


Wednesday, October 30, 2024

This section explores dependency management and pipeline orchestration, aptly dubbed “undercurrents” in the book “Fundamentals of Data Engineering” by Matt Housley and Joe Reis.

Data orchestration is dependency management facilitated through automation; it involves scheduling, triggering, monitoring, and resource allocation. Data orchestrators differ from cron-based schedulers: they can be triggered by events, webhooks, and schedules, and can even manage intra-workflow dependencies. Data orchestration provides a structured, automated, and efficient way to handle large-scale data from diverse sources.

Orchestration steers workflows toward efficiency and functionality, with the orchestrator serving as the tool that enables these workflows. Orchestrators typically trigger pipelines on a schedule or in response to a specific event. Event-driven pipelines are beneficial for handling unpredictable data or resource-intensive jobs.

The perks of having an orchestrator in your data engineering toolkit include workflow management, automation, error handling and recovery, monitoring and alerting, and resource optimization. Directed Acyclic Graphs (DAGs) bring order, control, and repeatability to data workflows, managing dependencies and ensuring a structured and predictable flow of data. They are pivotal for orchestrating and visualizing pipelines, making them indispensable for managing complex workflows, particularly within a team or at scale. A DAG serves as a clear roadmap defining the order of tasks, and with this lens it becomes possible to organize the creation, scheduling, and monitoring of data pipelines.
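A minimal Airflow sketch (the task names and schedule are illustrative) of how a DAG declares tasks and the order between them:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling from the source")

def transform():
    print("cleaning and joining")

def load():
    print("writing to the warehouse")

with DAG(
    dag_id="daily_sales",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_transform >> t_load  # the DAG encodes extract -> transform -> load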

Data orchestration tools have evolved significantly over the past two decades, with Apache Airflow and Luigi among the most dominant. However, it is crucial to choose the right tool for the job, as each has its strengths and weaknesses. Data orchestrators, like the conductor of an orchestra, balance declarative and imperative frameworks to provide flexibility while supporting software engineering best practices.

When selecting an orchestrator, factors such as scalability, code and configuration reusability, and the ability to handle complex logic and dependencies are important to consider. The orchestrator should be able to scale vertically or horizontally, ensuring that the process of orchestrating data is separate from the process of transforming data.

Orchestration is about connections, and platforms like Azure, Amazon, and Google offer hosted versions of popular tools like Airflow. Platform-embedded alternatives like Databricks Workflows provide more visibility into the tasks they orchestrate. Popular orchestrators have strong community support and continuous development, ensuring they remain up to date with the latest technologies and best practices. Support is crucial for both closed-source and paid solutions, as solutions engineers can help resolve issues.

Observability is essential for understanding transformation flows and ensuring your orchestrator supports various methods for alerting your team. To implement an orchestration solution, you can build a solution, buy an off-the-shelf tool, self-host an open-source tool, or use a tool included with your cloud provider or data platform. Apache Airflow, developed by Airbnb, is a popular choice due to its ease of adoption, simple deployment, and ubiquity in the data space. However, it has flaws, such as being engineered to orchestrate, not transform or ingest.

Open-source tools like Airflow and Prefect are popular orchestrators with paid, hosted services and support. Newer tools like Mage, Keboola, and Kestra are also innovating. Open-source tools offer community support and the ability to modify source code. However, they depend on support for continued development and may risk project abandonment or instability. A tool's history, support, and stability must be considered when choosing a solution.

Much of modern data orchestration now happens against relational databases: tools like dbt, Delta Live Tables, Dataform, and SQLMesh act as orchestrators that evaluate dependencies, optimize, and execute commands against a database to produce the desired results. A potential limitation, however, is the lack of a mechanism to observe data across different layers, which creates a disconnect between sources and cleaned data and makes it harder to identify errors in downstream data.

Design patterns can significantly enhance the efficiency, reliability, and maintainability of data orchestration processes. Some orchestration solutions make these patterns easier to adopt, such as building backfill logic into pipelines, ensuring idempotence, and supporting event-driven orchestration. These patterns help avoid one-time thinking and ensure consistent results in data engineering. Choosing a platform-specific data orchestrator can provide greater visibility between and within data workflows, which is valuable for ETL workloads.

Orchestrators are complex, and their trigger-driven behavior makes them difficult to develop against locally. To improve the experience, invest in tools that allow fast feedback loops, quick error identification, and a developer-friendly local environment. Retry and fallback logic are essential for handling failures in a complex data stack, ensuring data integrity and system reliability, and idempotent pipelines set up clean scenarios for retrying operations or skipping them and alerting the proper parties. Parameterized execution makes orchestration more malleable, allowing pipelines to cover multiple cases and be reused. Lineage refers to the path data travels through its lifecycle, and a robust lineage solution is crucial for debugging issues and extending pipelines; column-level lineage is becoming an industry norm in SQL orchestration, and platform-integrated solutions like Databricks Unity Catalog and Delta Live Tables offer advanced lineage capabilities. Pipeline decomposition breaks pipelines into smaller tasks for better monitoring, error handling, and scalability, and building autonomous DAGs can mitigate dependencies and critical failures, making workflows easier to build and debug.
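A minimal sketch of parameterized execution (the function, source names, and dates are illustrative): one entry point serves scheduled runs, reruns of a single day, and new sources without duplicating pipeline code.

def run_ingest(source, run_date, full_refresh=False):
    """One parameterized entry point reused across sources, dates, and backfills."""
    target = f"bronze.{source}"
    if full_refresh:
        print(f"rebuilding {target} from scratch")
    else:
        print(f"loading partition {run_date} into {target}")

run_ingest("stripe", "2024-10-30")                      # today's scheduled run
run_ingest("stripe", "2024-09-01")                      # rerun of a single historical day
run_ingest("zendesk", "2024-10-30", full_refresh=True)  # onboarding a new source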

The evolution of transformation tools, such as containerized infrastructure and hybrid tools like Prefect and Dagster, may change the landscape of data teams. These tools can save time and resources, enabling better observability and monitoring within data warehouses. Emerging tools like SQLMesh may challenge established players like dbt, while plug-and-play solutions like Databricks Workflows are becoming more appealing. These developments will enable data teams to deliver quality data in a timely and robust manner.


Tuesday, October 29, 2024

Data engineering has evolved from big data frameworks like Hadoop and MapReduce to more streamlined tools like Spark, Databricks, BigQuery, Redshift, Snowflake, Presto, Trino, and Athena. Cloud storage and transformation tools have made data more accessible, and lakehouses offer a cost-efficient, unified option for managing data at scale. This evolution has led to a more accessible and efficient data management landscape.

Data transformation environments vary, with common environments being data warehouses, data lakes, and lakehouses. Data warehouses use SQL for transformation, while data lakes store large amounts of data economically. Lakehouses combine aspects of both, offering flexibility and cost-effectiveness. Databricks SQL is a serverless data warehouse that sits on the lakehouse platform. The choice between these environments depends on project needs, team expertise, and long-term data strategy.

Data staging is a crucial step in data transformation: data is often written in a temporary state to a suitable location, such as cloud storage or an intermediate table. The medallion architecture preserves data history and makes time travel possible. It comprises three distinct layers: Bronze for raw data, Silver for lightly transformed data, and Gold for "clean" data. Bronze data is raw and unfiltered; Silver data is filtered, cleaned, and adjusted; and Gold data is stakeholder-ready and sometimes aggregated. This approach can be used in a lake or a warehouse, breaking storage into discrete stages of data cleanliness.
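A hedged PySpark sketch of the medallion flow (paths, table names, and columns are illustrative assumptions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land the raw events as-is so history is preserved
bronze = spark.read.json("s3://lake/raw/orders/2024-10-31/")
bronze.write.format("delta").mode("append").saveAsTable("bronze.orders")

# Silver: light cleaning -- types, filters, deduplication
silver = (spark.table("bronze.orders")
          .withColumn("amount", F.col("amount").cast("double"))
          .filter(F.col("order_id").isNotNull())
          .dropDuplicates(["order_id"]))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: stakeholder-ready aggregate
gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_revenue")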

Data transformation is largely shaped by the tools available, with Python being a popular choice. Pandas sits at the core of Python's data ecosystem, which has evolved significantly. However, scaling Python to large datasets has been challenging, often requiring libraries like Dask and Ray. Python-based data processing is undergoing a renaissance, with Rust-backed engines emerging. To transform data in Python, choose a suitable library and framework, such as Pandas or newer libraries like Polars and DuckDB. SQL is a declarative language, though it can be written in more imperative styles; it is limited by a narrower feature set, so languages like Jinja/Python and JavaScript often complement SQL workflows. Rust is considered by some to be the future of data engineering, but Python has a solid foothold thanks to its community support and library ecosystem.
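A small sketch of the same aggregation in pandas, Polars, and DuckDB (the DataFrame is illustrative; the Polars group_by naming assumes a recent version):

import pandas as pd
import polars as pl
import duckdb

df = pd.DataFrame({"region": ["EU", "EU", "US"], "amount": [10.0, 20.0, 5.0]})

# pandas: imperative, in-memory
by_region_pd = df.groupby("region", as_index=False)["amount"].sum()

# Polars: a Rust-backed DataFrame library with a similar feel
by_region_pl = pl.from_pandas(df).group_by("region").agg(pl.col("amount").sum())

# DuckDB: the same transformation expressed as SQL over the same DataFrame
by_region_sql = duckdb.query(
    "SELECT region, SUM(amount) AS amount FROM df GROUP BY region"
).to_df()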

Transformation frameworks are multilanguage engines for executing data transformations across machines or clusters, enabling transformations to be expressed in various languages like Python or SQL. Two popular engines are Hadoop and Spark. Hadoop, an open-source framework inspired by Google's MapReduce and GFS papers, gained traction in the mid-2000s at tech giants like Yahoo and Facebook. However, MapReduce was not well suited to real-time or iterative workloads, leading to the rise of Apache Spark in the early 2010s. Spark, a powerful open-source data processing framework, revolutionized big data analytics by offering speed, versatility, and integration with key technologies. Its key innovation is resilient distributed datasets (RDDs), enabling in-memory data processing and faster computations. With the rise of serverless data warehouses, big data engines may no longer be necessary for every workload, but query engines like BigQuery, Databricks SQL, and Redshift should not be disregarded. Recent advancements in in-memory computation may continue to expand data warehouses' transformation capabilities.

Data transformation is a crucial process that involves pattern mapping and understanding which transformations should be applied. Common patterns include the following (a short sketch of a few of them appears after the list):

• Enrichment: enhancing existing data with additional sources, such as adding demographic information to customer records.

• Joining: combining two or more datasets based on a common field, like a JOIN operation in SQL.

• Filtering: selecting only the data points needed for analysis based on certain criteria, reducing volume and improving quality.

• Structuring: translating data into a required format or structure, such as transforming JSON documents into tabular form or vice versa.

• Conversion: changing the data type of a particular column or field, especially when converting between semi-structured and structured data sources.

• Aggregation: summarizing and combining data to draw conclusions from large volumes, enabling insights that inform business decisions and create value from data assets.

• Anonymization: masking or obfuscating sensitive information within a dataset to protect privacy, for example by hashing emails or removing personally identifiable information (PII).

• Splitting: a form of denormalization, dividing a complex data column into multiple columns.

• Deduplication: removing redundant records to create a unique dataset, often through aggregation, filtering, or other methods.
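A few of these patterns sketched in pandas (the customers/orders tables and the filter threshold are illustrative):

import hashlib
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2],
                          "email": ["a@x.com", "b@y.com"],
                          "segment": ["smb", "ent"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 2],
                       "amount": [50.0, 20.0, 300.0]})

enriched = orders.merge(customers, on="customer_id", how="left")       # joining / enrichment
large = enriched[enriched["amount"] > 25]                              # filtering
by_segment = large.groupby("segment", as_index=False)["amount"].sum()  # aggregation
enriched["email"] = enriched["email"].map(                             # anonymization via hashing
    lambda e: hashlib.sha256(e.encode()).hexdigest())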

Data update patterns are essential for transforming data in a target system. Overwrite is the simplest: an existing source or table is dropped entirely and replaced with new data. Insert appends new data to an existing source without changing existing rows. Upsert is a more complex pattern, updating rows that already exist and inserting those that do not, with applications in change data capture, sessionization, and deduplication; platforms like Databricks provide MERGE functionality to simplify the process. Data deletion is often misunderstood and comes in two main types, "hard" and "soft": soft deletes preserve a historical record of an asset's status, while hard deletes eliminate those records, which can be problematic for data recovery.
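A hedged sketch of the upsert pattern with Delta Lake's MERGE API (the table names and join key are illustrative):

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
updates = spark.table("staging.orders_batch")  # new and changed rows

target = DeltaTable.forName(spark, "silver.orders")
(target.alias("t")
       .merge(updates.alias("s"), "t.order_id = s.order_id")
       .whenMatchedUpdateAll()      # rows that already exist are updated in place
       .whenNotMatchedInsertAll()   # new rows are appended
       .execute())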

When building a data transformation solution, consider several best practices, including staging, idempotency, normalization, and incrementality. Staging protects against data loss and ensures a low time to recovery (TTR) in case of failure. Idempotency means that running the same operation multiple times produces the same result, which supports consistency, reliability, and reproducibility. Normalization refines data to a clean, orderly format, while denormalization duplicates records and information for improved performance. Incrementality determines whether a pipeline is a simple INSERT OVERWRITE or a more complex UPSERT, and predefined patterns for building incremental workflows can be found in tools like dbt and Airflow. Real-time data transformation spans batch, micro-batch, and streaming approaches. Micro-batch approaches, like Apache Spark's PySpark and Spark SQL, are simpler to implement than true single-event transformations. Spark Structured Streaming is a popular streaming engine that efficiently handles incremental and continuous updates, achieving latencies as low as 100 milliseconds with exactly-once fault tolerance, and Continuous Processing, introduced in Spark 2.3, can reduce latencies to as little as 1 millisecond, further enhancing its streaming capability.
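A minimal micro-batch sketch with Spark Structured Streaming (the Kafka broker, topic, checkpoint path, and table name are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load())

cleaned = events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

query = (cleaned.writeStream
         .format("delta")
         .option("checkpointLocation", "s3://lake/_checkpoints/orders")  # enables exactly-once recovery
         .outputMode("append")
         .trigger(processingTime="1 minute")  # micro-batch cadence
         .toTable("bronze.orders_stream"))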

The modern data stack is experiencing a second renaissance due to new technologies and AI advancements. As a result, new tools and technologies are emerging to redefine data transformation. However, it's crucial to adhere to timeless strategies for managing data and creating cleaned assets. Supercharged tooling and automations can be both beneficial and challenging, but engineers must ensure well-planned and executed transformation systems with a high value-to-cost ratio.


Monday, October 28, 2024

 Comparisons of ADF with Apache Airflow

Choosing the right tool for a data transfer job is highly important. Previous articles introduced Azure Data Factory and Apache Airflow as cloud tools for large-scale, dependable transfers, along with a comparison to DEIS workflow. This section enumerates the differences between ADF and Apache Airflow.

Azure offers various services, each with its strengths and use cases. ADF is a fully managed, serverless data integration service that provides visual designer tools for configuring sources, destinations, and ETL processes, and it can handle large-scale data pipelines and transformations. Apache Airflow is not a first-class citizen of the Azure cloud; it is an open-source scheduler for workflow management. While it is not a native Azure service, it can be deployed to Azure Kubernetes Service or Azure Container Instances.

Azure does have a native equivalent of Apache Airflow in the form of Azure Logic Apps which provides workflow automation and app integration. But for more complex scenarios requiring code-based orchestration and dynamic workflows, developers might opt for deploying Airflow on Azure.

Airflow offers more flexibility with its Python-based DAGs, allowing for complex logic and dependencies. ADF’s strength lies in its no-code or low-code approach and its integration with other Azure services.

Apache Airflow is also more suitable for event-driven workflows and streaming data while ADF is suitable for batch processing and regular ETL tasks with an emphasis on visual authoring and monitoring.

Best practices and guidance on efficient use of resources are continuously updated in the official documentation for both tools.

One often-overlooked feature is that Apache Airflow can be integrated with Azure Data Factory, which allows for the orchestration of complex workflows across various cloud services. This integration leverages the strengths of both platforms to create a robust data pipeline solution: the serverless architecture of ADF can handle large-scale data workloads, while Airflow’s scheduling capabilities manage the workflow orchestration.

Airflow’s Python-based workflows allow for dynamic pipeline generation, which can be tailored to specific needs. Both platforms also offer extensive connectivity options with various data sources and processing services.

The integration steps can be listed as follows:

1. Create a data factory with the appropriate storage and compute resources.

2. Configure the Airflow environment, ensuring that it has network access to Azure Data Factory.

3. Create custom Airflow operators (or use the Azure provider's operators) to interact with the ADF APIs for triggering and monitoring pipelines.

4. Define workflows as Airflow DAGs that specify the sequence of tasks, including the operators that manage ADF activities.

5. Use the Airflow user interface to monitor workflow execution and manage any necessary interventions.

Setting up an Azure Active Directory (service principal) connection in Apache Airflow is a prerequisite.

This can be done with commands such as:

pip install apache-airflow[azure]

airflow connections add azure_ad_conn \

  --conn-type azure_data_explorer \

  --conn-login <client_id> \

  --conn-password <client_secret> \

  --conn-extra '{"tenantId": "<tenant_id>"}'

and validated with:

airflow connections get azure_ad_conn

The integration can then be tested with:

from datetime import datetime

from airflow import DAG

from airflow.providers.microsoft.azure.operators.data_factory import AzureDataFactoryRunPipelineOperator

default_args = {'owner': 'data-engineering', 'retries': 1}

with DAG(
    'azure_data_factory_integration',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:

    run_pipeline = AzureDataFactoryRunPipelineOperator(
        task_id='run_pipeline',
        # this connection must be of the Azure Data Factory type and carry the
        # tenant, subscription, resource group, and factory details
        azure_data_factory_conn_id='azure_data_factory_default',
        pipeline_name='MyPipeline',
        parameters={'param1': 'value1'},
        wait_for_termination=True,  # block until the ADF pipeline run finishes
    )

This completes the comparison of ADF with Airflow.

Previous articles: IaCResolutionsPart191.docx


Sunday, October 27, 2024

 Number of Ways to Split Array

You are given a 0-indexed integer array nums of length n.

nums contains a valid split at index i if the following are true:

• The sum of the first i + 1 elements is greater than or equal to the sum of the last n - i - 1 elements.

• There is at least one element to the right of i. That is, 0 <= i < n - 1.

Return the number of valid splits in nums.

Example 1:

Input: nums = [10,4,-8,7]

Output: 2

Explanation:

There are three ways of splitting nums into two non-empty parts:

- Split nums at index 0. Then, the first part is [10], and its sum is 10. The second part is [4,-8,7], and its sum is 3. Since 10 >= 3, i = 0 is a valid split.

- Split nums at index 1. Then, the first part is [10,4], and its sum is 14. The second part is [-8,7], and its sum is -1. Since 14 >= -1, i = 1 is a valid split.

- Split nums at index 2. Then, the first part is [10,4,-8], and its sum is 6. The second part is [7], and its sum is 7. Since 6 < 7, i = 2 is not a valid split.

Thus, the number of valid splits in nums is 2.

Example 2:

Input: nums = [2,3,1,0]

Output: 2

Explanation:

There are two valid splits in nums:

- Split nums at index 1. Then, the first part is [2,3], and its sum is 5. The second part is [1,0], and its sum is 1. Since 5 >= 1, i = 1 is a valid split.

- Split nums at index 2. Then, the first part is [2,3,1], and its sum is 6. The second part is [0], and its sum is 0. Since 6 >= 0, i = 2 is a valid split.

Constraints:

• 2 <= nums.length <= 10^5

• -10^5 <= nums[i] <= 10^5

Solution:

class Solution {
    public int waysToSplitArray(int[] nums) {
        if (nums == null || nums.length <= 1) return 0;
        // use long: sums can reach 10^5 * 10^5 = 10^10, which overflows int
        long total = 0;
        for (int num : nums) {
            total += num;
        }
        long sumSoFar = 0;
        int count = 0;
        // the split must leave at least one element on the right, hence i < nums.length - 1
        for (int i = 0; i < nums.length - 1; i++) {
            sumSoFar += nums[i];
            if (sumSoFar >= total - sumSoFar) {
                count += 1;
            }
        }
        return count;
    }
}

Test cases:

[0] => 0

[1] => 0

[0,1] => 0

[1,0] => 1

[0,0,1] => 0

[0,1,0] => 1

[0,1,1] => 1

[1,0,0] => 2

[1,0,1] => 2

[1,1,0] => 2

[1,1,1] => 1


Saturday, October 26, 2024

 Managing copilots: 

This section of the series on cloud infrastructure deployments focuses on the proliferation of copilots for different business purposes and internal processes. As with any flight, a copilot assists only the captain, who remains responsible for a successful flight. If the captain does not know where she is going, then even the copilot's immense assistance will not be enough. It is of secondary importance that the data a copilot uses might be prone to bias or shortcomings and might even lead to so-called hallucinations. Copilots are, after all, large language models that treat data as vectors and leverage classification, regression, and vector algebra to respond to queries. They do not build a knowledge graph and do not have the big picture of the business purpose they are applied to. If that purpose is not managed properly, infrastructure engineers might find themselves maintaining many copilots for different use cases, diluting the benefits where one would have sufficed.

Consolidation of large language models and their applications across different datasets is only the tip of the iceberg of what these emerging technologies offer data scientists. Machine learning pipelines and applications are as diverse and siloed as the datasets they operate on, and those datasets are not always present in data lakes or virtual warehouses. Consequently, a script or prediction API written and hosted as a standalone application does not make the best use of infrastructure for customer convenience, in terms of streamlining interactions and improving touch points. This is not to say that different models cannot be used, that resources must proliferate, or that consolidation brings no cost savings; it is about the business justification for the number of copilots needed. When we work backwards from what the customer benefits from or experiences, one of the salient principles that works in favor of infrastructure management is that less is more. Hyperconvergence of infrastructure for various business purposes, when those initiatives are bought into by stakeholders with both business and technical representation, makes the investments more deliberate and fulfilling.

Cloud and infrastructure management are not restrictive of experimentation; the argument is against uncontrolled experimentation that places customers in a lab. As long as experiments and initiatives are specific about duration, budget, and outcomes, infrastructure management can go the extra mile of cleaning up, decommissioning, and even repurposing, so that technical and business outcomes go hand in glove.

Processes are hard to establish when a technology is emerging, and they are also extremely difficult to enforce as new standards in the workplace. The failure of Six Sigma and the adoption of agile methodologies are testament to that. However, best practices for copilot engineering are easier to align with cloud best practices and current software methodologies in most workplaces.

#codingexercise Codingexercise-10-26-2024.docx

Friday, October 25, 2024

This is a summary of the book titled “The Good Drone: How social movements democratize surveillance” written by Austin Choi-FitzPatrick and published by MIT Press in 2020. The author asserts that drones democratize airspace. His comprehensive survey of civic use is both intriguing and compelling, prompting the reader to wonder whether “unmanned aircraft” will be for everyone. For any social scientist, these material technologies will be engrossing. Drones began as nonviolent tools. They might be considered disruptive, but they are not. Malevolent drones are rare, if not exclusively military, and there are countermeasures against drones. Tools that accomplish social change must be “visible, accessible, affordable, useful and appropriate”.

Material technologies, such as drones, have become increasingly important in social movements and activism. They serve as tools to collect and disseminate information, sometimes making it costly for those in power to maintain the status quo. Social scientists often view new devices as weapons or threats to civil liberties, but these tools can also be used to influence society. Drones, which began as nonviolent tools, became practical and affordable once manufacturers made them easy to control. In 2012, drone use by intergovernmental organizations, governments, businesses, scientists, and civil society groups took off. The air is no longer a place where only governments and corporations rule and surveil the population, and some drone use is disruptive. Examples include a drone above Kruger National Park in South Africa, drones documenting crowds in Moscow, Kyiv, Bangkok, and Istanbul, and a drone in Aleppo, Syria, showcasing the effects of a brutal siege. Other drone use is emergent and nondisruptive, such as Australian biologists investigating the health of humpback whales and the Slavery from Space project searching for brick kilns in India.

A drone helped a Sinti-Roma settlement in Hungary show the world their plight. However, social scientists often focus more on the image itself than the drone. Drones can disrupt traditional photography by taking flight, potentially leading to near-instantaneous monitoring of places or events. They can also provide unfettered views of skyscrapers, prisons, and factory farms, providing democratic surveillance. New forms of data gathering may enable drones to document police actions, potentially allowing citizens to fly over military installations. Drones may also change our concept of private spaces, raising questions about privacy and surveillance.

Drones can be used for malevolent purposes, including surveillance, hunting, and causing harm to people or infrastructure. So, privacy remains a concern, with regulations focusing on police use, surveillance without permission, hunting, and drone control. Countermeasures include drone detection systems like DroneShield and SkySafe, as well as GPS transmitters and energy beams. Drones can also be difficult to fly in strong winds, fog, or dense rain, and warm surroundings can make them undetectable. As tools of resistance, drones can be used to estimate crowd sizes and monitor police officers. However, as drone prices decrease, it is crucial to consider the psychological, economic, political, and social consequences of drone use.