Tuesday, October 29, 2024

Data engineering has evolved from big data frameworks like Hadoop and MapReduce to more streamlined tools like Spark, Databricks, BigQuery, Redshift, Snowflake, Presto, Trino, and Athena. Cloud storage and transformation tools have made data more accessible, and lakehouses offer a cost-efficient, unified option for managing data at scale. The result is a more accessible and efficient data management landscape.

Data transformation environments vary; the most common are data warehouses, data lakes, and lakehouses. Data warehouses favor SQL-based transformation over structured data, while data lakes store large volumes of raw data economically. Lakehouses combine aspects of both, offering flexibility and cost-effectiveness, and Databricks SQL is a serverless data warehouse that sits on top of the lakehouse platform. The choice between these environments depends on project needs, team expertise, and long-term data strategy.

Data staging is a crucial step in data transformation: data is written in a temporary, intermediate state to a suitable location, such as cloud storage or an intermediate table, before further processing. Medallion architecture preserves data history and makes time travel possible. It comprises three distinct layers: Bronze for raw data, Silver for lightly transformed data, and Gold for "clean" data. Bronze data is raw and unfiltered; Silver data is filtered, cleaned, and adjusted; Gold data is stakeholder-ready and sometimes aggregated. This approach can be used in a lake or a warehouse, breaking each storage layer into discrete stages of data cleanliness.
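To make the layering concrete, here is a minimal PySpark sketch of a medallion-style flow: raw events land in a Bronze location, are lightly cleaned into Silver, and are aggregated into Gold. The paths, column names, and the choice of Parquet as the storage format are illustrative assumptions, not prescriptions from the post.

```python
# A minimal medallion-architecture sketch with PySpark.
# Paths, columns, and the Parquet format are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw, unfiltered data staged exactly as it arrived.
bronze = spark.read.json("s3://example-bucket/raw/orders/")
bronze.write.mode("append").parquet("s3://example-bucket/bronze/orders/")

# Silver: filtered, cleaned, and lightly adjusted.
silver = (
    bronze.dropDuplicates(["order_id"])
    .filter(F.col("order_total").isNotNull())
    .withColumn("order_date", F.to_date("order_ts"))
)
silver.write.mode("overwrite").parquet("s3://example-bucket/silver/orders/")

# Gold: stakeholder-ready and aggregated for consumption.
gold = silver.groupBy("order_date").agg(F.sum("order_total").alias("daily_revenue"))
gold.write.mode("overwrite").parquet("s3://example-bucket/gold/daily_revenue/")
```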

Data transformation is largely shaped by the tools available, and Python has become a popular choice. Pandas sits at the core of Python's data-processing ecosystem and has evolved significantly, but scaling Python to large datasets has been challenging, often requiring distributed libraries like Dask and Ray. Python-based data processing is now undergoing a renaissance, with Rust emerging in the ecosystem. To transform data in Python, choose a suitable library and framework, such as Pandas or newer options like Polars and DuckDB. SQL is primarily a declarative language, though it can be written in a more imperative style; it is limited by its lack of general-purpose features, which is why Jinja templating, Python, and JavaScript often complement SQL workflows. Rust itself is sometimes touted as the future of data engineering, but Python retains a solid foothold thanks to its community support and library ecosystem.
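As a brief illustration of the imperative-versus-declarative split, the sketch below expresses the same aggregation once with Pandas and once as SQL executed through DuckDB's Python API. The sample data is invented, and the example assumes the pandas and duckdb packages are installed.

```python
# Imperative (Pandas) vs. declarative (SQL via DuckDB) over the same data.
import duckdb
import pandas as pd

orders = pd.DataFrame({"customer": ["a", "a", "b"], "amount": [10.0, 15.0, 7.5]})

# Pandas: imperative, step-by-step dataframe operations.
pandas_result = (
    orders.groupby("customer", as_index=False)["amount"].sum()
    .rename(columns={"amount": "total_amount"})
)

# DuckDB: the same logic expressed declaratively in SQL,
# scanning the local pandas DataFrame directly.
duckdb_result = duckdb.query(
    "SELECT customer, SUM(amount) AS total_amount FROM orders GROUP BY customer"
).to_df()
```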

Transformation frameworks are multilanguage engines for executing data transformations across machines or clusters, allowing transformations to be expressed in languages like Python or SQL. Two well-known engines are Hadoop and Spark. Hadoop, an open-source framework inspired by Google's MapReduce and GFS papers, gained traction in the mid-2000s at tech giants like Yahoo and Facebook. However, its MapReduce model was not well suited to real-time or iterative workloads, which led to the rise of Apache Spark in the early 2010s. Spark, a powerful open-source data processing framework, revolutionized big data analytics by offering speed, versatility, and integration with key technologies. Its key innovation is the resilient distributed dataset (RDD), which enables in-memory data processing and faster computation. With the rise of serverless data warehouses, dedicated big data engines may no longer be necessary for every workload, and the query engines behind BigQuery, Databricks SQL, and Redshift should not be disregarded. Recent advancements in in-memory computation may continue to expand data warehouses' transformation capabilities.
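A tiny PySpark sketch of the in-memory idea behind RDDs: once a dataset is cached, repeated computations reuse the in-memory partitions instead of rereading from storage. The log path and session name here are placeholders.

```python
# Sketch of Spark's in-memory RDD model; the file path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-cache-sketch").getOrCreate()
sc = spark.sparkContext

# Build an RDD from (hypothetical) log files and keep it in memory.
lines = sc.textFile("s3://example-bucket/logs/*.txt").cache()

# Both actions reuse the cached partitions rather than rescanning storage.
error_count = lines.filter(lambda line: "ERROR" in line).count()
total_count = lines.count()
print(error_count, total_count)
```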

Data transformation is a crucial process that involves pattern mapping: understanding which of the common transformations should be applied. The most frequent patterns are listed below, with a short sketch after the list.

- Enrichment: enhancing existing data with additional sources, such as adding demographic information to customer records.
- Joining: combining two or more datasets based on a common field, like a JOIN operation in SQL.
- Filtering: selecting only the data points needed for analysis based on certain criteria, reducing volume and improving quality.
- Structuring: translating data into a required format or structure, such as transforming JSON documents into tabular form or vice versa.
- Conversion: changing the data type of a particular column or field, especially when moving between semi-structured and structured sources.
- Aggregation: summarizing and combining data to draw conclusions from large volumes, enabling insights that inform business decisions and create value from data assets.
- Anonymization: masking or obfuscating sensitive information to protect privacy, for example by hashing emails or removing personally identifiable information (PII) from records.
- Splitting: a form of denormalization that divides a complex data column into multiple columns.
- Deduplication: removing redundant records to create a unique dataset, often through aggregation, filtering, or other methods.
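The following compact pandas sketch exercises several of these patterns on made-up data: deduplication, anonymization via email hashing, joining for enrichment, then filtering and aggregation. Column names and thresholds are invented for illustration.

```python
# Several transformation patterns in pandas on invented sample data.
import hashlib
import pandas as pd

customers = pd.DataFrame(
    {"customer_id": [1, 2, 2], "email": ["a@x.com", "b@y.com", "b@y.com"], "region": ["US", "EU", "EU"]}
)
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [20.0, 5.0, 12.0]})

# Deduplication: keep one record per customer.
customers = customers.drop_duplicates(subset="customer_id")

# Anonymization: replace raw emails with a one-way hash.
customers["email_hash"] = customers["email"].map(
    lambda e: hashlib.sha256(e.encode()).hexdigest()
)
customers = customers.drop(columns="email")

# Joining / enrichment: attach region to each order.
enriched = orders.merge(customers, on="customer_id", how="left")

# Filtering and aggregation: keep larger orders, then summarize by region.
summary = (
    enriched[enriched["amount"] >= 10]
    .groupby("region", as_index=False)["amount"].sum()
)
```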

Data update patterns govern how transformed data lands in a target system. Overwrite is the simplest: an existing source or table is dropped entirely and replaced with new data. Inserting appends new data to an existing source without changing existing rows. Upsert is more complex, updating rows that match on a key and inserting those that don't, with applications in change data capture, sessionization, and deduplication; platforms like Databricks provide MERGE functionality to simplify the process. Data deletion is often misunderstood and comes in two main types: "hard" and "soft." Soft deletes preserve a historical record of an asset's status, while hard deletes remove the records entirely, which can be problematic when data needs to be recovered.
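Here is a small pandas sketch of the upsert and soft-delete ideas: incoming rows win on key collisions, new keys are appended, and deletions are recorded with a flag rather than dropped. The table and column names are invented, and a warehouse would typically do this with MERGE instead.

```python
# Upsert and soft delete illustrated with pandas on invented data.
import pandas as pd

target = pd.DataFrame(
    {"id": [1, 2], "status": ["active", "active"], "is_deleted": [False, False]}
)
incoming = pd.DataFrame({"id": [2, 3], "status": ["churned", "active"]})
incoming["is_deleted"] = False

# Upsert: rows from `incoming` replace matches on id, new ids are appended.
merged = (
    pd.concat([target, incoming], ignore_index=True)
    .drop_duplicates(subset="id", keep="last")
    .sort_values("id")
)

# Soft delete: mark the record instead of removing it, preserving history.
merged.loc[merged["id"] == 1, "is_deleted"] = True
```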

When building a data transformation solution, consider several best practices: staging, idempotency, normalization, and incrementality. Staging protects against data loss and ensures a low time to recovery (TTR) in case of failure. Idempotency ensures consistency and reliability: running the same operation multiple times produces the same result, much like reproducibility. Normalization refines data into a clean, orderly format, while denormalization duplicates records and information for improved read performance. Incrementality determines whether a pipeline is a simple INSERT OVERWRITE or a more complex UPSERT; predefined patterns for building incremental workflows can be found in tools like dbt and Airflow. On the real-time side, transformations can be batch, micro-batch, or streaming. Micro-batch approaches, like those in Apache Spark's PySpark and Spark SQL, are simpler to implement than true single-event transformations. Spark Structured Streaming is a popular streaming engine that efficiently handles incremental and continuous updates, achieving latencies as low as 100 milliseconds with exactly-once fault-tolerance guarantees. Continuous Processing, introduced in Spark 2.3, can reduce latencies to as little as 1 millisecond, further extending its usefulness for streaming data transformation.
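A minimal Spark Structured Streaming sketch of a micro-batch, incremental aggregation follows. It uses Spark's built-in "rate" source as a stand-in for a real event stream, and the checkpoint path and windowing choices are illustrative assumptions.

```python
# Micro-batch streaming aggregation with Spark Structured Streaming.
# The "rate" source generates synthetic rows; the checkpoint path is hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# An unbounded stream of (timestamp, value) rows generated by Spark itself.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Incremental aggregation: Spark maintains state across micro-batches and,
# with checkpointing, provides exactly-once guarantees on supported sinks.
counts = events.groupBy(F.window("timestamp", "10 seconds")).count()

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/rate_counts")
    .start()
)
query.awaitTermination(30)  # run for roughly 30 seconds in this sketch
```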

The modern data stack is experiencing a second renaissance driven by new technologies and AI advancements, and new tools are emerging to redefine data transformation. Even so, it's crucial to adhere to the timeless strategies for managing data and creating cleaned assets. Supercharged tooling and automation can be both a benefit and a challenge: engineers must still ensure transformation systems are well planned and well executed, with a high value-to-cost ratio.

