Databricks is a unified data analytics platform that combines big data processing, machine learning, and collaborative analytics tools in a cloud-based environment. As a collaborative workspace for authoring data-driven workflows, it is usually adopted quickly across an organization, yet its infrastructure costs tend to balloon as usage ages and accumulates. This article explains that it need not be so, and walks through optimizations and best practices that reduce those costs.
One of the advantages of being a cloud resource is that Databricks workspaces can be spun up as many times and for as many purposes as needed. Given the large number of features, the mix of data engineering and analytics use cases, and compute- and storage-intensive workloads such as machine learning and ETL, it pays to segregate workloads into separate workspaces, even within a business division, and to manage workspace lifetimes deliberately.
Using Databricks primarily for Apache Spark, a powerful open-source distributed computing framework, is significantly different from using it primarily for Delta Lake, an open-source storage layer that brings ACID transactions to heterogeneous data sources. The former drives compute utilization costs, while the latter drives Databricks Unit (DBU) and network costs. Some of the problems commonly encountered include:
Data skew, where data is unevenly distributed over partitions, leads to bottlenecks and poor execution. This is often addressed by repartitioning, salting, or bucketing.
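As an illustration, a skewed aggregation key can be salted so that rows for a hot key spread across many partitions. The sketch below is minimal PySpark; the table name, column names, and salt count are assumptions to adapt to the actual data.

```python
# Minimal salting sketch: spread a hot aggregation key over several salts,
# aggregate partially, then roll up. Names and the salt count are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.table("events")   # hypothetical skewed table

num_salts = 16                        # assumption: tune to the observed skew
salted = events.withColumn("salt", (F.rand() * num_salts).cast("int"))

# Partial aggregation on (user_id, salt) keeps any single partition from
# holding every row of a hot key; the second groupBy produces final counts.
partial = salted.groupBy("user_id", "salt").agg(F.count("*").alias("cnt"))
result = partial.groupBy("user_id").agg(F.sum("cnt").alias("cnt"))
```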
Inefficient data formats increase overhead and prolong query completion. This is addressed with more efficient formats such as Parquet, which offers built-in compression, columnar storage, and predicate pushdown.
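A minimal sketch of such a conversion is shown below; the paths are hypothetical, and the Snappy codec is spelled out even though it is Spark's default for Parquet.

```python
# Convert raw CSV into compressed, columnar Parquet. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("header", True).csv("/mnt/raw/sales")

(raw.write
    .mode("overwrite")
    .option("compression", "snappy")   # Spark's default Parquet codec
    .parquet("/mnt/curated/sales"))
```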
Inadequate caching leads to repeated, costly disk access. Spark’s in-memory caching features can help speed up iterative algorithms.
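A minimal sketch, assuming a DataFrame that is reused across several actions (the table and column names are hypothetical):

```python
# Cache a DataFrame that several actions reuse; release it when done.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

features = spark.read.table("training_features").where(F.col("label").isNotNull())
features.cache()                      # kept in memory once an action materializes it

n_rows = features.count()                               # first action fills the cache
by_label = features.groupBy("label").count().collect()  # served from the cache
features.unpersist()                  # free memory when the iterations are done
```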
Large shuffles lead to network congestion, latency, and slower execution. This can be resolved with broadcast joins, filtering data early, and using partition-aware operations.
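For instance, a broadcast join ships a small dimension table to every executor so the large fact table never shuffles. The table and column names below are hypothetical.

```python
# Broadcast join sketch: the small lookup table is replicated to executors,
# avoiding a shuffle of the large table. Names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.read.table("page_views")      # large fact table
dims = spark.read.table("country_codes")    # small lookup table

joined = facts.join(broadcast(dims), on="country_code", how="left")
```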
Inefficient queries occur when query parameters and hints are not fully leveraged. Predicate pushdown, partition pruning, and query rewrites can resolve these.
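A minimal sketch of partition pruning and predicate pushdown, with hypothetical paths and columns: data is written partitioned by a column, and a filter on that column lets Spark skip whole directories while other filters are pushed into the Parquet scan.

```python
# Partition the data on write, then filter on the partition column on read.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.table("orders")
(orders.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("/mnt/curated/orders"))

recent = (spark.read.parquet("/mnt/curated/orders")
          .where(F.col("order_date") == "2024-01-01")   # prunes partitions
          .where(F.col("amount") > 100))                # pushed into the scan
recent.explain()   # PartitionFilters / PushedFilters appear in the plan
```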
Suboptimal resource allocation occurs when CPU, memory, or storage is constrained. Monitoring resource usage and adjusting resource limits accordingly mitigates this.
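The settings below are illustrative assumptions rather than recommendations; on Databricks they would normally be placed in the cluster's Spark configuration after observing the Spark UI and cluster metrics.

```python
# Illustrative resource settings; values are assumptions to validate against
# the workload. On Databricks these belong in the cluster's Spark config.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         .config("spark.sql.shuffle.partitions", "400")
         .getOrCreate())
```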
Garbage collection settings may be poorly tuned. Much like resources, these can be monitored and tuned.
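A sketch of GC tuning, assuming the executor GC logs show long pauses; the flags are standard JVM options but the specific values are assumptions, and on Databricks they would normally go into the cluster's Spark configuration.

```python
# Illustrative GC tuning via executor/driver JVM options. Flags and values
# are assumptions to check against the GC logs, not defaults to copy.
from pyspark.sql import SparkSession

gc_opts = "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc"

spark = (SparkSession.builder
         .config("spark.executor.extraJavaOptions", gc_opts)
         .config("spark.driver.extraJavaOptions", gc_opts)
         .getOrCreate())
```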
Outdated runtime versions and missing bug fixes also hurt performance. These are solved by patching and upgrading to a current release.
Similarly, the best practices can be enumerated
as:
Turning off compute that is not in use and enabling auto-termination (several of these settings appear together in the cluster configuration sketch after this list).
Sharing compute between different groups by consolidating it at the relevant scope and level.
Tracking costs against usage so that they can be better understood.
Auditing usage by users and service principals so that corrective action can be taken.
Leveraging spot instances, which provide compute at a discount.
Using Photon acceleration, which speeds up SQL queries and the Spark SQL API.
Applying built-in and custom mitigations for recurring problem patterns at the resource and component levels.
Lastly, turning off features that are not actively used, and using features for their recommended purpose, also helps significantly.
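Several of these practices can be captured directly in the cluster definition. The sketch below uses field names from the Databricks Clusters API; the concrete values (runtime version, node type, tags, worker count) are assumptions to adapt.

```python
# Illustrative cluster specification combining auto-termination, cost tags,
# spot instances with fallback, and Photon. Field names follow the Databricks
# Clusters API; the values themselves are assumptions.
import json

cluster_spec = {
    "cluster_name": "etl-shared-small",           # hypothetical name
    "spark_version": "14.3.x-scala2.12",          # assumption: pick a current LTS runtime
    "node_type_id": "i3.xlarge",                  # assumption: size to the workload
    "num_workers": 4,
    "autotermination_minutes": 30,                # turn off idle compute automatically
    "runtime_engine": "PHOTON",                   # Photon acceleration
    "custom_tags": {"team": "analytics", "cost_center": "1234"},  # cost tracking
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",     # discounted spot, on-demand fallback
        "first_on_demand": 1,                     # keep the driver on-demand
    },
}

# This JSON could be submitted to the Clusters API (clusters/create) or kept
# under source control as infrastructure-as-code.
print(json.dumps(cluster_spec, indent=2))
```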