Tuesday, April 4, 2023

 

Data Lake is often used with Azure Data Factory (ADF) to support Copy operations. Files and folders can be downloaded directly from the Lake, on the order of thousands of files per folder, but uploads are best done with Azure Data Factory, which even supports uploading a zip file and extracting its contents in transit before they are stored in the Data Lake. ADF supports a variety of file formats, such as Avro, binary and text to name a few, and uses all available throughput by performing as many reads and writes in parallel as possible.

Data Lakes work well for time-series data because partition pruning improves performance by reading only a subset of the data. The pipelines that ingest time-series data often place their files under a structured naming scheme such as /DataSet/YYYY/MM/DD/HH/mm/datafile_YYYY_MM_DD_HH_mm.tsv, so that a reader interested in a given time window only needs to open the matching folders.
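The naming convention above can be illustrated with a minimal Python sketch. The helper names partition_path and hour_prefixes are hypothetical and not part of any Azure SDK; the sketch only shows how a time window maps to a small set of folder prefixes, which is what lets a reader skip all other partitions.

```python
from datetime import datetime, timedelta
from typing import List


def partition_path(dataset: str, ts: datetime) -> str:
    """Build the file path for one minute of data, following the
    /DataSet/YYYY/MM/DD/HH/mm/ naming convention (hypothetical helper)."""
    folder = ts.strftime("%Y/%m/%d/%H/%M")
    stamp = ts.strftime("%Y_%m_%d_%H_%M")
    return f"/{dataset}/{folder}/datafile_{stamp}.tsv"


def hour_prefixes(dataset: str, start: datetime, end: datetime) -> List[str]:
    """Return the hour-level folder prefixes that overlap [start, end).

    Listing only these prefixes is the pruning step: every folder
    outside the time window is never read (hypothetical helper)."""
    # Round the start down to the beginning of its hour.
    current = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while current < end:
        prefixes.append(f"/{dataset}/" + current.strftime("%Y/%m/%d/%H/"))
        current += timedelta(hours=1)
    return prefixes
```

A writer would call partition_path when landing each file, and a reader would call hour_prefixes with its query window and list only the returned folders. Hour-level granularity is an assumption here; the same idea applies at the day or minute level.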

The Hitchhiker’s Guide to Data Lake from Azure recommends that monitoring be set up for effective operations management of this cloud resource. Monitoring helps to ensure that the data lake is available to any workloads that consume the data it contains. Key considerations include auditing the data lake in terms of frequent operations, having visibility into key performance indicators such as operations with high latency, and understanding common errors. All of this telemetry is available via Azure Storage logs, which are easy to query with the Kusto Query Language.

Common queries are also candidates for reporting via Azure Monitor dashboards. For example:

1. Frequent operations can be queried with:

StorageBlobLogs
| where TimeGenerated > ago(3d)
| summarize count() by OperationName
| sort by count_ desc
| render piechart

 

2. High latency operations can be queried with:

StorageBlobLogs
| where TimeGenerated > ago(3d)
| top 10 by DurationMs desc
| project TimeGenerated, OperationName, DurationMs, ServerLatencyMs, ClientLatencyMs = DurationMs - ServerLatencyMs

 

3. Operations causing the most errors can be queried with:

StorageBlobLogs
| where TimeGenerated > ago(3d) and StatusText !contains "Success"
| summarize count() by OperationName
| top 10 by count_ desc

 

The results of these queries can be reported on the dashboard.
