Monday, July 5, 2021

Kusto continued


This article continues the discussion of the Kusto query language and engine from the previous article.

Let us now look at the Kusto query engine's data-ingestion path. Again, the admin node is the one that receives the command; this time the command is for data ingestion, such as appending log records to a destination table. The admin node first verifies that the destination table has the right schema.
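As a concrete sketch of the schema check, assume a hypothetical table named MyLogs with made-up columns; the commands below create it and display the schema that incoming data is validated against:

// Create the destination table whose schema the admin node validates
// against on ingestion. Table and column names are hypothetical.
.create table MyLogs (Timestamp: datetime, Level: string, Message: string)

// Inspect the schema the admin node checks incoming data against.
.show table MyLogs schema as json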

The admin node scans all nodes and finds an available data node to perform the processing and forwards the command to the target. The data node creates the data extent, copies the data, and sends back the extent information to the admin node. The admin node then adds the new shard reference to the table metadata and commits the new snapshot of the metadata. These extents now become sealed and immutable.
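Continuing the sketch with the hypothetical MyLogs table, .ingest inline (intended for small tests rather than production pipelines) exercises this path end to end, and the extents command lists the sealed shards that the table metadata now references:

// Ingest a couple of test records; the data node packages them into a
// new, sealed, immutable extent.
.ingest inline into table MyLogs <|
2021-07-05T10:00:00Z,Info,Service started
2021-07-05T10:00:01Z,Warning,Low disk space

// List the extents (shards) now referenced by the table metadata.
.show table MyLogs extents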

When data is deleted, the actions happen in reverse order: the reference tree is deleted before the actual data. The references are garbage collected, and the physical delete is deferred until the container's threshold is reached. Because deletion of the data is deferred, there is an opportunity to reclaim the reference tree. Queries involving deleted data use a previous snapshot of the metadata, which remains valid for the soon-to-be-deleted data.
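Because the physical delete is deferred, a dropped table remains recoverable for a while. A sketch against the hypothetical MyLogs table; the version string is taken from the output of the .drop command, and the value shown here is only illustrative:

// Dropping the table removes only the metadata references; the sealed
// extents remain until garbage collection reclaims them.
.drop table MyLogs

// While the extents still exist, the drop can be reverted.
.undo drop table MyLogs version="v12.3"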

The Kusto query engine has query parsing and optimization logic similar to that of SQL Server. First, it parses the incoming script into an abstract syntax tree, or AST for short. Then it performs a semantic pass over this tree: it checks and resolves names, verifies that the user has permission to access the relevant entities, and checks data types and references. After the semantic pass, the query engine builds an initial relational-operator tree based on the AST. The engine then attempts to optimize the query further by applying one or more predefined rewriting rules. These rules include pushing down predicates, replacing table access with an extent union structure, splitting aggregation operations so that partial aggregates are computed at the leaves, and using top-n operators that are replicated to each data extent. With this parsing and optimization logic, Kusto arrives at a final tree that is suitable for executing the query on the cluster. Let us next look at executing such a query on continuously growing tabular data, such as usage data.
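A small query against a hypothetical MyUsage table shows where these rewrites apply:

MyUsage
| where Timestamp > ago(1d)           // predicate pushed down to each extent
| summarize Count = count() by UserId // aggregation split: partial counts at the leaves
| top 10 by Count                     // top-n replicated to each extent, merged at the root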

Kusto has certain limitations. It was originally designed as an ad hoc query engine with strong support for fast querying and text-data processing but, as a big data platform, it does not really replace traditional databases like SQL Server. When we attempt to use Kusto as a replacement for SQL Server, we run into some limitations; these are described below along with their potential mitigations.

There are limits on query concurrency. Because a cluster usually runs on a collection of eight-core VM nodes, it can execute up to 8 * 10 = 80 queries concurrently. The actual number can be determined with the Kusto command that shows the cluster's query throttling policy. This throttling limit of 10 concurrent queries per core is necessary for the healthy operation of the Kusto query engine.

There are limits on node memory, which is set to a number that cannot be larger than the node's physical memory; if it is, the setting has no effect. This also implies that a query can take up almost all of a node's memory for its processing, but it cannot go beyond what is available as the node's physical memory. There is also a limit on the memory for a join or summarize operation, which protects queries from taking too much memory.

Finally, there is a limit on the result-set size: the result cannot exceed 500,000 rows, and its data size cannot exceed 64 megabytes. If a script hits this limit, the query fails with a partial query failure message. This can be overcome by summarizing the data so that only the interesting result is propagated, by using the take operator to see a small sample of the result, and by using the project operator to output only the columns of interest.

There are no limits on query complexity, but it is not advisable to have more than 5,000 conditions in the where clause. Lastly, all these limitations are settings that can be adjusted to suit the workload.
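A sketch of inspecting and working within these limits, again using the hypothetical MyUsage table; the set statement at the end is one example of an adjustable setting:

// Show the cluster's query throttling policy (the concurrency limit).
.show cluster policy querythrottling

// Stay under the 500,000-row / 64 MB result truncation limit by projecting
// only the columns of interest and sampling the result.
MyUsage
| where Timestamp > ago(1h)
| project Timestamp, UserId
| take 100

// Or, when a larger result is genuinely needed, raise the limit for one
// request with a client request property.
set truncationmaxrecords = 1000000;
MyUsage | where Timestamp > ago(1h)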
