The Databricks Unity Catalog offers centralized access
control, auditing, lineage, and data discovery capabilities across multiple
Databricks workspaces. It includes user management, metastore, clusters, SQL
warehouses, and a standards-compliant security model based on ANSI SQL. The
catalog also includes built-in auditing and lineage, allowing for user-level
audit logs and data discovery. The metadata store is a top-level container,
while the data catalog has a three-level namespace namely catalog.schema.table.
The catalog explorer allows for creation of tables and views, while the tables
of views and volumes provide governance for nontabular data. The catalog is
multi-cloud friendly, allowing for federation across multiple cloud vendors and
unified access. The idea here is that you can define once and secure anywhere.
Databricks Unity Catalog consists of a metastore and a
catalog. The metastore is the top-level logical container for metadata, storing
data assets like tables or models and defining the namespace hierarchy. It
handles access control policies and auditing. The catalog is the first-level
organizational unit within the metastore, grouping related data assets and
providing access controls. However, only one metastore per deployment is used.
Each Databricks region requires its own Unity Catalog metastore.
There is a Unity catalog quick start notebook
in Python. The key steps include creating a workspace with the Unity Catalog
meta store, creating a catalog, creating a managed schema, managing a table,
and using the Unity catalog in the Pandas API on Spark. The code starts with
creating a catalog, selecting show, and then creating a managed schema. The
next step involves creating and managing schemas, extending them, and granting
permissions. The table is managed using the schema created earlier, and the
table is shown and all available tables are shown. The final step involves
using the Pandas API on Spark, which can be found in the official documentation
for Databricks. This quick start is a great way to get a feel for the process
and to toggle back and forth with the key steps inside the code.
The Unity Catalog system employs object security best
practices, including access control lists (ACLs) for granting or restricting
access to specific users and groups on securable objects. ACLs provide
fine-grain control, ensuring access to sensitive data and objects. Less
privilege is used, limiting access to the minimum required, avoiding broad
groups like All Users unless necessary. Access is revoked once the purpose is
served, and policies are reviewed regularly for relevance. This technique
enhances data security and compliance, prevents unnecessary broad access, and
controls a blast radius in case of security breaches.
The Databricks Unity Catalog system offers best practices
for catalogs. First, create a separate catalog for loose coupling, managing
access and compliance at the catalog level. Align catalog boundaries with
business domains or applications, such as marketing analytics or HR. Customize
security policies and governance within the catalog to drill down into specific
domains. Create access control groups and roles specific to a catalog,
fine-tune read-write privileges, and customize settings like resource quotas
and scrum rules. These fine-grain policies provide the best of security and
functionality in catalogs.
To ensure security and manage external connections, limit
visibility by granting access only to specific users, groups, and roles, and
setting lease privileges. Limit access to only necessary users and groups using
granular access control lists or ACLs. Be aware of team activities and avoid
giving them unnecessary access to external resources. Tag connections
effectively for discovery using source categories or data classifications, and
discover connections by use case for organizational visibility. This approach
enhances security, prevents unintended data access, and simplifies external
connection discovery and management.
Databricks Unity Catalog Business Unit Best Practices
emphasize the importance of providing dedicated sandboxes for each business
unit, allowing independent development environments, and preventing
interference between different workflows. Centralizing shareable data into
production catalogs ensures consistency and reduces the need for duplicate
data. Discoverability is crucial, with meaningful naming conventions and
metadata best practices. Federated queries via Lakehouse architecture unify
data access across silos, governing securely via contracts and permissions.
This approach supports autonomy for units, increases productivity through
reuse, and maintains consistency with collaborative governance. This approach
supports autonomy, increases productivity, and maintains consistency.
In conclusion, the Unity catalog standard allows centralized
data governance and best practices for catalogs, connections, and business
units.
https://docs.databricks.com/en/data-governance/unity-catalog/enable-workspaces.html#enable-workspace
https://docs.databricks.com/en/data-governance/unity-catalog/create-metastore.html
No comments:
Post a Comment