Data Lake is often used with Azure Data Factory to support
copy operations. Files and folders can be downloaded directly from the lake,
on the order of thousands of files per folder, but uploads are best done
through Azure Data Factory, which can even accept a zip file and extract its
contents in transit before they are stored in the Data Lake. ADF supports a
variety of file formats, such as Avro, binary and text, and uses all available
throughput by performing as many reads and writes in parallel as possible.
Data Lakes work well with partition pruning of time-series
data, which improves performance by reading only the subset of data that a query needs. Pipelines that ingest time-series data
often place their files under a structured naming scheme such as /DataSet/YYYY/MM/DD/HH/mm/datafile_YYYY_MM_DD_HH_mm.tsv.
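As a rough illustration of how such a layout enables pruning, the sketch below builds the file path for a timestamp and lists only the hour-level folders that overlap a time window. The function names partition_path and hourly_prefixes are hypothetical, and the sketch assumes the folder layout shown above with zero-padded fields; it does not call any Azure API.

```python
from datetime import datetime, timedelta


def partition_path(dataset: str, ts: datetime) -> str:
    """Build the file path for the minute that contains ts (assumed layout)."""
    folder = ts.strftime("%Y/%m/%d/%H/%M")
    name = ts.strftime("datafile_%Y_%m_%d_%H_%M.tsv")
    return f"/{dataset}/{folder}/{name}"


def hourly_prefixes(dataset: str, start: datetime, end: datetime) -> list[str]:
    """List the hour-level folders that overlap the window [start, end)."""
    # Round the start down to the top of its hour so a partial hour is included.
    current = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while current < end:
        prefixes.append(f"/{dataset}/{current.strftime('%Y/%m/%d/%H')}/")
        current += timedelta(hours=1)
    return prefixes
```

A reader that only needs data between 22:30 and 01:00 would then list three hourly folders instead of scanning the whole data set.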
The Hitchhiker’s Guide to Data Lake from Azure recommends
setting up monitoring for effective operations management of this cloud
resource. Monitoring helps ensure that the lake stays available to the
workloads that consume its data. Key considerations include auditing
the data lake for its most frequent operations, having visibility into key
performance indicators such as operations with high latency, and understanding
common errors. All of this telemetry is
available via Azure Storage logs, which are easy to query with the Kusto Query
Language.
Common queries are also candidates for reporting via
Azure Monitor dashboards. For example:
1. Frequent operations can be queried with:
StorageBlobLogs
| where TimeGenerated > ago(3d)
| summarize count() by OperationName
| sort by count_ desc
| render piechart
2. High latency operations can be queried with:
StorageBlobLogs
| where TimeGenerated > ago(3d)
| top 10 by DurationMs desc
| project TimeGenerated, OperationName, DurationMs, ServerLatencyMs, ClientLatencyMs = DurationMs - ServerLatencyMs
3. Operations causing the most errors can be queried with:
StorageBlobLogs
| where TimeGenerated > ago(3d) and StatusText !contains "Success"
| summarize count() by OperationName
| top 10 by count_ desc
The results of these queries can be reported on the
dashboard.