Predicates are expected to evaluate the same way
regardless of which layer they are implemented in. If we have a set of
predicates combined with an OR clause rather than an AND clause, then each
predicate produces its own result set, and the same records may appear in more
than one of them. When we filter on one predicate and also allow matches on
another, the two result sets must be merged into one before the result is
returned to the caller. Since the merged set may contain duplicates, the merge
may have to return only the distinct elements, which is easily done by
comparing the unique identifiers of the records in each result set.
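The merge described above can be sketched as follows; the `id` field standing in for the record's unique identifier is an assumption for illustration:

```python
def merge_distinct(result_sets):
    # Merge the results from each OR-branch predicate, keeping only one
    # copy of any record that matched more than one predicate.
    seen, merged = set(), []
    for results in result_sets:
        for record in results:
            if record["id"] not in seen:  # compare unique identifiers
                seen.add(record["id"])
                merged.append(record)
    return merged
```

The first occurrence of a record wins, so the merged order follows the order in which the predicates were evaluated.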
The result must be selected before
determining the section that is returned to the user. This section is
identified by a start and offset pair over the enumeration of the results. If
the queries remain the same over time and the requests vary only in their
paging parameters, then we can even cache the result and return only the paged
section. The API persists the predicate and its result set in a cache so that
subsequent paging-only calls yield the same responses. This can even be done as
part of predicate evaluation by simply passing the well-known limit and
offset parameters directly in the SQL query. In an enumerator we do this with
Skip and Take. An OData client performs client-driven paging using the $skip
and $top query options.
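The cache-then-page idea can be sketched as below; `PagedQueryCache` and its `evaluate` callback are hypothetical names, with the callback standing in for the layer that actually evaluates the predicate against the store:

```python
class PagedQueryCache:
    """Cache the full result of a predicate so that repeated calls
    varying only in their paging parameters reuse the same result set."""

    def __init__(self, evaluate):
        self._evaluate = evaluate  # predicate -> full result list
        self._cache = {}

    def page(self, predicate, skip, top):
        # Evaluate the predicate once, then serve any paged section
        # from the cached result (the Skip/Take of an enumerator).
        if predicate not in self._cache:
            self._cache[predicate] = self._evaluate(predicate)
        return self._cache[predicate][skip:skip + top]
```

A second call with the same predicate but different $skip/$top values never reaches the database again.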
When the technology involved merely wants to
expose the database to the web, as OData is popularly, albeit incorrectly, used
for, each SQL object is exposed directly over the web API as a resource. Some
queries are more difficult to write in OData than others. For example,
oDataClient.Resource.Where(x =>
x.Name.GetHashCode() % ParallelWorkersCount == WorkerIndex).ToList()
will not achieve the desired partitioning of a
lengthy list of resources for faster, more efficient parallel data access,
because GetHashCode() has no translation to an OData query option,
and must be rewritten as something like:
oDataClient.Resource.Where(x =>
x.Name.StartsWith("A")).ToList()
:
oDataClient.Resource.Where(x =>
x.Name.StartsWith("Z")).ToList()
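The A-through-Z rewrite amounts to generating one prefix-filtered query per partition. A sketch of the generated requests, reusing the Resource entity set and Name property from the example (the URL shape follows OData's startswith filter function):

```python
import string

def prefix_filters(entity_set="Resource"):
    # One request per leading letter; together the 26 queries cover
    # the partitions that GetHashCode() % N was meant to produce.
    return [
        f"/{entity_set}?$filter=startswith(Name,'{letter}')"
        for letter in string.ascii_uppercase
    ]
```

Each worker then takes a disjoint subset of these queries, giving the parallel data access the hash-based predicate could not express.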
The system query options involved here are $filter,
$select, $orderby, $count, $top, and $expand, the last of which helps with
joins. Although a great deal of parity can be achieved between SQL and OData
with the help of these query options, the REST interface is not a
replacement for the analytical queries possible with purely language-based
options such as those available from U-SQL, LINQ, or Kusto. Those have their
own place higher up the stack, in the business or application logic layer; at
the lower levels close to the database, where a web interface separates the
stored data from its access, these primitives present a challenge as
well as an opportunity.
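Assembling these system query options into a request URL is mechanical; a sketch, with the base address and entity set name purely illustrative:

```python
from urllib.parse import urlencode

def build_query(base, entity_set, **options):
    # Turn keyword arguments such as filter=... and top=... into the
    # $-prefixed OData system query options on the request URL.
    query = {f"${name}": value for name, value in options.items()}
    return f"{base}/{entity_set}?" + urlencode(query, safe="$")
```

For instance, build_query("http://host/service.svc", "Orders", filter="Total gt 100", top=10) yields a URL carrying both $filter and $top.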
Let us look at how an OData service is written. We
begin with a database, accessible with a connection string, that stores the
data for the entities in the form of tables. A web project with an entity
data model is then written to prepare a data model from the database. The web
project can be implemented with a SOAP-based WCF service or with REST-based
web APIs and Entity Framework. Each API is added by creating an association
between the entity and the API. Taking the WCF example further, since it
provides terminology for all parts of the service and is not obsolete, a type
is declared deriving from DataService, and in its InitializeService method the
config.SetEntitySetAccessRule is specified. Then the JSONPSupportBehavior
attribute is added to the service class so that end users can get the data
in a well-known format that makes it readable. The service definition, at say http://<odata-endpoint>/service.svc, can be requested in
JSON or XML format to allow clients to build applications using the objects
representing entities. The observation here is that this uses a data model which
is not limited to SQL databases, so the problem is isolated away from the
database and narrowed down to the operations over the data model. In fact,
OData has never been about merely exposing the database on the web. We choose
which entities are accessed over the web, and we can expand their reach with
the OASIS standard. OASIS is a global consortium that drives the development,
convergence, and adoption of web standards. Another observation is that we
need not even use Entity Framework for the data model. Some experts argue that
OData's main use case is the create, update, and delete of entities over the
web, and that querying should be facilitated by APIs from web services, where
rich programmability for writing queries already exists. While it is true that
language-based options can come in the compute layer formed by the web
services, the exposure remains a common theme of REST API design, whether the
REST API sits over a service or over a database. The filter predicate used in
those APIs will eventually be pushed down into the data persistence layer. In
our case, we chose the example of a GetHashCode() operator, which is
language-based rather than a database notion. As the partitioning example
above demonstrates, supporting such a hash on an entity involves adding a
computed column to its persistence. Once that is available, the predicate can
automatically be pushed into the database for maximum performance and
scalability.
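One caveat the pushdown relies on is a stable hash: .NET's GetHashCode() can differ across processes, so the persisted computed column needs a deterministic function. A sketch, assuming records are plain dictionaries and using CRC32 as the stand-in stable hash:

```python
import zlib

def stable_hash(name: str) -> int:
    # A persisted computed column must use a deterministic hash;
    # CRC32 serves here, unlike a process-specific GetHashCode().
    return zlib.crc32(name.encode("utf-8"))

def partition_filter(records, worker_index, worker_count):
    # Equivalent of Where(x => hash(x.Name) % count == index), which the
    # database can answer itself once the hash column is persisted.
    return [r for r in records
            if stable_hash(r["Name"]) % worker_count == worker_index]
```

Because the hash is deterministic, every worker sees the same partitioning no matter where or when it runs, and the partitions are disjoint and complete.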
Shaping the data to support simpler
queries and their execution is not purely a technical challenge. The boundary
between data and compute is complicated by claims to ownership,
responsibilities, and jurisdictions. In fact, clients writing OData
applications are often forced to work without any changes to the master data.
At that point, there are two options for these applications. The first
involves translating the queries into ones that work on the existing data,
such as the example shown above. The second involves scoping down the size of
the data retrieved, with techniques such as incremental update polling,
paging, and sorting, and then performing the complex query operations
in memory on that limited set of data. Both of these options are usually
sufficient to alleviate the problem encountered.
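The second option can be sketched as a loop that pages the data down and applies the complex predicate in memory; `fetch_page` is a hypothetical callback standing in for a paged OData read, assumed to return an empty list past the end of the data:

```python
def query_in_memory(fetch_page, predicate, page_size=100):
    # Scope down what is transferred by paging, then run the complex
    # predicate in memory on each limited batch of records.
    skip, matched = 0, []
    while True:
        page = fetch_page(skip=skip, top=page_size)
        if not page:
            break
        matched.extend(record for record in page if predicate(record))
        skip += page_size
    return matched
```

The predicate here can be arbitrarily complex since it never has to be translated into OData query options.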
The strategic problem, where the data is large and
the queries from OData clients are arbitrarily complex, can be resolved with
the help of a partition function and scatter-gather processing by the clients
themselves. This is comparable to the partition key that forms part of the URI
path qualifier in the REST interface to the CosmosDB store.
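The scatter-gather step can be sketched as follows; `fetch_partition` is a hypothetical callback standing in for one partitioned query, and the thread pool stands in for the parallel client workers:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(fetch_partition, partition_keys, max_workers=4):
    # Scatter one query per partition key across worker threads, then
    # gather and flatten the partial results in partition order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = pool.map(fetch_partition, partition_keys)
    return [record for part in parts for record in part]
```

The partition keys could be the A-through-Z name prefixes from the earlier rewrite, or the values of a partition function persisted alongside the data.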
OData also provides the ability to batch requests. The
HTTP specification must be followed when composing the batch and its
responses. A new batch handler is created and passed in when mapping the route
for the OData service, which enables batching and response consolidation.
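A batch is a single POST to the service's $batch endpoint whose multipart/mixed body wraps the individual HTTP requests. A minimal sketch of composing such a payload, following the multipart shape the OData convention prescribes:

```python
import uuid

def build_batch(requests):
    # Wrap each (method, url) pair as one application/http part of a
    # multipart/mixed body, the payload POSTed to the $batch endpoint.
    boundary = f"batch_{uuid.uuid4()}"
    parts = []
    for method, url in requests:
        parts.append(
            f"--{boundary}\r\n"
            "Content-Type: application/http\r\n"
            "Content-Transfer-Encoding: binary\r\n\r\n"
            f"{method} {url} HTTP/1.1\r\n\r\n"
        )
    body = "".join(parts) + f"--{boundary}--\r\n"
    content_type = f"multipart/mixed; boundary={boundary}"
    return content_type, body
```

The returned content type carries the boundary so the server can split the body back into its constituent requests and consolidate the responses.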