Databricks is a unified data analytics platform that combines big data processing, machine learning, and collaborative analytics tools in a cloud-based environment. As a collaborative workspace for authoring data-driven workflows, it is usually adopted quickly across an organization, yet its infrastructure costs tend to balloon as usage ages and accumulates. This article explains that it need not be so, and walks through optimizations and best practices that reduce those costs.
One of the advantages of being a cloud resource is that Databricks workspaces can be spun up as many times and for as many purposes as needed. Given the large number of features, the mix of data engineering and analytics use cases, and compute- and storage-intensive workloads such as machine learning and ETL, it pays to segregate workloads into separate workspaces, even within a business division, and to manage workspace lifetimes deliberately.
Using Databricks primarily for Apache Spark, a powerful open-source distributed computing framework, is significantly different from using it primarily for Delta Lake, an open-source storage layer that brings ACID transactions to heterogeneous data sources. The former drives compute utilization costs, while the latter drives Databricks Unit (DBU) and network costs. Some of the problems commonly encountered include:
Data skew, where data is unevenly distributed over partitions, leads to bottlenecks and poor execution. This is often addressed by repartitioning, salting, or bucketing.
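As an illustration, a skewed aggregation key can be salted so that rows for a hot key spread across many partitions. The sketch below is minimal PySpark; the table name, column names, and salt count are assumptions to adapt to the actual data.

```python
# Minimal salting sketch: spread a hot aggregation key over several salts,
# aggregate partially, then roll up. Names and the salt count are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.table("events")   # hypothetical skewed table

num_salts = 16                        # assumption: tune to the observed skew
salted = events.withColumn("salt", (F.rand() * num_salts).cast("int"))

# Partial aggregation on (user_id, salt) keeps any single partition from
# holding every row of a hot key; the second groupBy produces final counts.
partial = salted.groupBy("user_id", "salt").agg(F.count("*").alias("cnt"))
result = partial.groupBy("user_id").agg(F.sum("cnt").alias("cnt"))
```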
Inefficient data formats increase overhead and prolong query completion. This is addressed with more efficient formats such as Parquet, which offers built-in compression, columnar storage, and predicate pushdown.
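A minimal sketch of such a conversion is shown below; the paths are hypothetical, and the Snappy codec is spelled out even though it is Spark's default for Parquet.

```python
# Convert raw CSV into compressed, columnar Parquet. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("header", True).csv("/mnt/raw/sales")

(raw.write
    .mode("overwrite")
    .option("compression", "snappy")   # Spark's default Parquet codec
    .parquet("/mnt/curated/sales"))
```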
Inadequate caching leads to repeated, costly disk access. Spark’s in-memory caching features can help speed up iterative algorithms.
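A minimal sketch, assuming a DataFrame that is reused across several actions (the table and column names are hypothetical):

```python
# Cache a DataFrame that several actions reuse; release it when done.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

features = spark.read.table("training_features").where(F.col("label").isNotNull())
features.cache()                      # kept in memory once an action materializes it

n_rows = features.count()                               # first action fills the cache
by_label = features.groupBy("label").count().collect()  # served from the cache
features.unpersist()                  # free memory when the iterations are done
```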
Large shuffles lead to network congestion, latency, and slower execution. This can be resolved with broadcast joins, filtering data early, and using partition-aware operations.
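For instance, a broadcast join ships a small dimension table to every executor so the large fact table never shuffles. The table and column names below are hypothetical.

```python
# Broadcast join sketch: the small lookup table is replicated to executors,
# avoiding a shuffle of the large table. Names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.read.table("page_views")      # large fact table
dims = spark.read.table("country_codes")    # small lookup table

joined = facts.join(broadcast(dims), on="country_code", how="left")
```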
Inefficient queries occur when query parameters and hints are not fully leveraged. Predicate pushdown, partition pruning, and query rewrites can resolve these.
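A minimal sketch of partition pruning and predicate pushdown, with hypothetical paths and columns: data is written partitioned by a column, and a filter on that column lets Spark skip whole directories while other filters are pushed into the Parquet scan.

```python
# Partition the data on write, then filter on the partition column on read.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.table("orders")
(orders.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("/mnt/curated/orders"))

recent = (spark.read.parquet("/mnt/curated/orders")
          .where(F.col("order_date") == "2024-01-01")   # prunes partitions
          .where(F.col("amount") > 100))                # pushed into the scan
recent.explain()   # PartitionFilters / PushedFilters appear in the plan
```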
Suboptimal resource allocation occurs when CPU, memory, or storage is constrained. Monitoring resource usage and adjusting resource limits accordingly mitigates this.
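The settings below are illustrative assumptions rather than recommendations; on Databricks they would normally be placed in the cluster's Spark configuration after observing the Spark UI and cluster metrics.

```python
# Illustrative resource settings; values are assumptions to validate against
# the workload. On Databricks these belong in the cluster's Spark config.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         .config("spark.sql.shuffle.partitions", "400")
         .getOrCreate())
```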
Garbage collection settings may be poorly tuned. Much like resources, these can be monitored and tuned.
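A sketch of GC tuning, assuming the executor GC logs show long pauses; the flags are standard JVM options but the specific values are assumptions, and on Databricks they would normally go into the cluster's Spark configuration.

```python
# Illustrative GC tuning via executor/driver JVM options. Flags and values
# are assumptions to check against the GC logs, not defaults to copy.
from pyspark.sql import SparkSession

gc_opts = "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc"

spark = (SparkSession.builder
         .config("spark.executor.extraJavaOptions", gc_opts)
         .config("spark.driver.extraJavaOptions", gc_opts)
         .getOrCreate())
```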
Outdated runtime versions and missing bug fixes also hurt performance. These are solved by patching and upgrading to a current release.
Similarly, the best practices can be enumerated
as:
Turning off compute that is not in use and enabling auto-termination (several of these settings appear together in the cluster configuration sketch after this list).
Sharing compute between different groups by consolidating it at the relevant scope and level.
Tracking costs against usage so that they can be better understood.
Auditing usage by users and service principals so that corrective action can be taken.
Leveraging spot instances, which provide compute at a discount.
Using Photon acceleration, which speeds up SQL queries and the Spark SQL API.
Applying built-in and custom mitigations for recurring problem patterns at the resource and component levels.
Lastly, turning off features that are not actively used, and using features for their recommended purpose, also helps significantly.
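Several of these practices can be captured directly in the cluster definition. The sketch below uses field names from the Databricks Clusters API; the concrete values (runtime version, node type, tags, worker count) are assumptions to adapt.

```python
# Illustrative cluster specification combining auto-termination, cost tags,
# spot instances with fallback, and Photon. Field names follow the Databricks
# Clusters API; the values themselves are assumptions.
import json

cluster_spec = {
    "cluster_name": "etl-shared-small",           # hypothetical name
    "spark_version": "14.3.x-scala2.12",          # assumption: pick a current LTS runtime
    "node_type_id": "i3.xlarge",                  # assumption: size to the workload
    "num_workers": 4,
    "autotermination_minutes": 30,                # turn off idle compute automatically
    "runtime_engine": "PHOTON",                   # Photon acceleration
    "custom_tags": {"team": "analytics", "cost_center": "1234"},  # cost tracking
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",     # discounted spot, on-demand fallback
        "first_on_demand": 1,                     # keep the driver on-demand
    },
}

# This JSON could be submitted to the Clusters API (clusters/create) or kept
# under source control as infrastructure-as-code.
print(json.dumps(cluster_spec, indent=2))
```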