This continues the Kusto query language and engine discussion from the previous article.
Let us now look at the Kusto query engine's data ingestion path. Again, the admin node is the one to receive the command; this time the command is for data ingestion, such as appending log records to a destination table. The admin node first determines whether the table has the right schema. It then scans the cluster for an available data node to perform the processing and forwards the command to that target. The data node creates the data extent, copies the data into it, and sends the extent information back to the admin node. The admin node then adds the new shard reference to the table metadata and commits a new snapshot of the metadata. Once committed, these extents become sealed and immutable.
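As a hedged sketch of what such a command looks like, the inline ingestion below follows exactly this path; the Logs table name, its schema, and the sample rows are hypothetical.

// Create the destination table; the admin node records its schema.
.create table Logs (Timestamp: datetime, Level: string, Message: string)

// Ingest two records; a data node writes them into a new sealed extent,
// and the admin node commits the extent reference into the table metadata.
.ingest inline into table Logs <|
2024-05-01T10:00:00Z,Info,service started
2024-05-01T10:00:05Z,Error,connection refused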
When data is deleted, the actions happen in reverse order. First, the reference tree is removed, before the actual data. The references are garbage collected, and the physical delete is deferred until the container's threshold is reached. Because the deletion of the data is deferred, there is an opportunity to reclaim the reference tree before the data is gone. Queries that involve the deleted data keep using a previous snapshot of the metadata, which remains valid for the soon-to-be-deleted extents.
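As a hedged sketch of this path, dropping extents first detaches their references from the table metadata, while the underlying storage artifacts are reclaimed later; the Logs table and the retention filter are hypothetical, and the exact filter syntax may vary by cluster version.

// Detach all extents older than 30 days from the table metadata;
// the storage they occupy is garbage collected afterwards.
.drop extents <| .show table Logs extents | where CreatedOn < ago(30d)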
The Kusto query engine has query parsing and optimization logic similar to SQL Server's. First, it parses the incoming script into an abstract syntax tree, or AST for short. Then it performs a semantic pass over this tree: it resolves names, verifies that the user has permission to access the relevant entities, and checks data types and references. After the semantic pass, the query engine builds an initial relational operator tree based on the AST. The engine then attempts to optimize the query further by applying one or more predefined rewriting rules. These rules include pushing predicates down, replacing table access with an extent union structure, splitting aggregation operations down to the leaves, and replicating top operators to each data extent. Together, this parsing and optimization logic gives Kusto an abstract syntax tree that is suitable for executing the query on the cluster, as the sketch below illustrates.
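As a hedged illustration with hypothetical table and column names, the query below exercises these rewrites: the where predicate is pushed down into each extent scan, the summarize is split into per-extent partial aggregates that are merged afterwards, and the top operator is replicated to each extent so that only small candidate sets travel across the cluster.

Logs
| where Timestamp > ago(1d) and Level == "Error"  // predicate pushed down to each extent
| summarize ErrorCount = count() by Source        // aggregation split into the leaves
| top 10 by ErrorCount                            // top computed per extent, merged at the root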
Let us next look at executing such queries on continuously growing tabular data, such as usage data. Here Kusto has certain limitations. It was originally designed as an ad hoc query engine with strong support for fast querying and text-data processing, but, as a big data platform, it does not really replace traditional databases like SQL Server. When we attempt to use Kusto as a replacement for SQL Server, we run into some limitations; these are mentioned below with their potential solutions.

There are limits on query concurrency. Because the cluster usually runs on a collection of eight-core VM nodes, it can execute up to 8 * 10 queries concurrently. The actual number can be determined with the Kusto command that shows the cluster's query throttling policy. This throttling limit of ten queries per core per node is necessary for healthy operation of the Kusto query engine.
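As a hedged example, the throttling policy can be inspected with the management command below; the output shape may vary across cluster versions.

// Shows the cluster-level query throttling policy that caps concurrent queries.
.show cluster policy querythrottling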
There are limits on the node memory, which is set to a value that cannot be larger than the node's physical memory; if it is, the setting has no effect. This also implies that a query can take up almost all of a node's memory for its processing, but it cannot go beyond what is available as the node's physical memory. There are also limits on the memory that a join or a summarize operation may use, which protects queries from taking too much memory.
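To the best of my knowledge, these caps surface as client request properties that can be set per query; the property names below are my assumption of the relevant knobs, so verify them against the cluster documentation.

// Assumed request properties: raise the per-node query memory cap and the
// per-operator (join/summarize) cap for this query only; values are hypothetical.
set max_memory_consumption_per_query_per_node = 68719476736;
set maxmemoryconsumptionperiterator = 17179869184;
Logs | summarize count() by Source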
Finally, there is a limit on the result set size: the result cannot exceed 500,000 rows, and its data size cannot exceed 64 megabytes. If a script hits this limitation, the query fails with a partial query failure message. This can be overcome by summarizing the data before output so that only the interesting result is propagated, with techniques such as using a take operator to see a small sample of the result and using a project operator to output only the columns of interest.
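A hedged sketch of trimming a result below these limits follows; Logs and its columns are hypothetical.

Logs
| where Level == "Error"
| project Timestamp, Message  // keep only the columns of interest
| take 100                    // inspect a small sample of the result

Where the full result is genuinely needed, the truncation can also be lifted for a single query with a set notruncation; statement, at the cost of a potentially very large response.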
There are no limits on query complexity, but it is not advisable to have more than 5,000 conditions in the where clause; the sketch below shows one way to keep the predicate count down. Lastly, all these limitations are settings that can be adjusted to suit the workload.
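As a hedged aside on the complexity advice, a long chain of or-ed equality conditions can usually be collapsed into a single in operator; the names here are hypothetical.

// Instead of: Logs | where Source == "web01" or Source == "web02" or ...
Logs | where Source in ("web01", "web02", "web03")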