Tuesday, January 22, 2013

Query Processor optimizations

The query optimizer is a software module that transforms the internal representation of a query into an efficient query plan for executing it. A query plan is a dataflow diagram that pipes table data through a graph of operators such as sort, scan, or join.

These operations can be very tricky to perform when the tables are large. Consider a join of two tables where the number of records in both the left and the right table is in the order of millions. A join matches the records of two tables on a common satisfying condition, for example, records in both tables that have the same value in one of their columns. The query optimizer must choose a plan from what is called the plan space: the set of candidate plans for finding all such matches. It could use the left table to drive the join, probing the right table for matches against each of its records; it could swap the tables so that the right table drives the join instead; or it could even compute a Cartesian product of all the records from both tables and filter the result. Early optimizers kept this space small by considering only "left-deep" query plans (where the right-hand input of every join is a base table) and by "postponing Cartesian products" (ensuring that Cartesian products appear only after all the joins in a dataflow). Today most systems also consider "bushy" trees, which allow nested right-hand inputs, and early use of Cartesian products, both of which are useful in some cases. The size of the plan space is therefore an important consideration for the query optimizer. (The first sketch below illustrates the kind of physical choice involved.)

Another factor for consideration by the query optimizer is "selectivity estimation". The cardinality of a table or index is the simplest selectivity estimation statistic, but most systems go beyond that: they summarize and analyze such things as the distribution of values in attributes via histograms and other summary statistics. Since this involves visiting every value in each column, it is quite expensive and is typically done by sampling instead of a complete scan of all the data. These statistics come in useful for the selectivity estimation of a join, where instead of comparing the tables themselves, the optimizer compares their histograms (the second sketch below shows how). More sophisticated schemes exist as well, but their adoption has been hampered because a query optimizer, along with the rest of the software modules in a database server, has to meet "performance benchmarks", accepted industry standards for high performance, and the tests in those benchmarks mostly used statistically independent generated data instead of "real" data.

Another challenge for a query optimizer is that selectivity errors made early in a plan cascade up the plan tree, corrupting subsequent estimates and potentially causing a significant hit to performance.

While talking about factors affecting query optimization, it is also worthwhile to mention search algorithms. An optimizer can enumerate plans with one or another strategy, such as dynamic programming or "top-down" search, and each has a different effect on the optimizer's use of memory and processing cycles. Top-down search, for example, can lower the number of plans considered by an optimizer but increases memory consumption. (The third sketch below shows a stripped-down dynamic-programming enumerator.)
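To make the plan-space discussion concrete, here is a minimal Python sketch contrasting two physical strategies for the same join. The table contents, sizes, and column names are invented for illustration; this is a toy, not how any production optimizer or executor is written.

```python
# Toy illustration of the physical join choice discussed above.
# All table names, sizes, and contents are made up.

def nested_loop_join(outer, inner, key):
    """Match every outer record against every inner record:
    O(len(outer) * len(inner)) comparisons."""
    return [(o, i) for o in outer for i in inner if o[key] == i[key]]

def hash_join(build, probe, key):
    """Build a hash table on one input, then probe it with the other:
    roughly O(len(build) + len(probe))."""
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)
    return [(b, p) for p in probe for b in table.get(p[key], [])]

# Hypothetical inputs: a large "left" table and a small "right" table.
left  = [{"id": n, "v": n % 3} for n in range(1000)]
right = [{"id": n, "v": n % 5} for n in range(10)]

# With a nested-loop plan, which table drives the join changes the
# constant factors; a hash join built on the smaller input is usually
# cheaper still. Picking among such alternatives, based on estimated
# input sizes, is exactly the optimizer's job.
matches = hash_join(right, left, "v")
```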
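The second sketch is a toy version of histogram-based join selectivity estimation, under assumptions named in the comments: equi-width buckets, integer columns, and values assumed uniform within a bucket. The column data and bucket counts are invented; it also shows the sampling idea from above, building one histogram from a sample and scaling it up.

```python
import random
from collections import Counter

def histogram(values, num_buckets, lo, hi):
    """Equi-width histogram: bucket index -> count of values in it."""
    width = (hi - lo) / num_buckets
    counts = Counter()
    for v in values:
        counts[min(int((v - lo) / width), num_buckets - 1)] += 1
    return counts

def estimated_join_size(ha, hb, distinct_per_bucket):
    """'Join' two histograms instead of the data: within a bucket, assume
    values are uniform, so each (a, b) pair of rows matches with
    probability 1 / distinct_per_bucket."""
    return sum(ca * hb.get(b, 0) / distinct_per_bucket
               for b, ca in ha.items())

# Hypothetical integer join columns.
col_a = [random.randint(0, 99) for _ in range(100_000)]
col_b = [random.randint(0, 49) for _ in range(50_000)]

# As noted above, statistics are often built from a sample rather than a
# complete scan; scale the sample counts back up to the table size.
sample = random.sample(col_a, 1_000)
ha = Counter({b: c * len(col_a) // len(sample)
              for b, c in histogram(sample, 10, 0, 100).items()})
hb = histogram(col_b, 10, 0, 100)

# Each bucket spans 10 integers, so assume ~10 distinct values per bucket.
print(estimated_join_size(ha, hb, distinct_per_bucket=10))
```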
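The third sketch is a stripped-down dynamic-programming join enumerator in the spirit of the classic System R approach. The cardinalities and the uniform selectivity are made-up placeholders, and the cost model (sum of intermediate result sizes) is deliberately crude. Because it splits each set of tables arbitrarily, it enumerates bushy trees; restricting one side of every split to a single base table would recover the left-deep-only plan space mentioned above.

```python
from itertools import combinations

# Hypothetical base-table cardinalities and a crude uniform selectivity.
table_rows = {"A": 1_000, "B": 10_000, "C": 100}
join_selectivity = 0.01

def result_size(tables):
    """Estimated cardinality of joining all `tables` together."""
    size = 1
    for t in tables:
        size *= table_rows[t]
    return size * join_selectivity ** (len(tables) - 1)

def best_plan(all_tables):
    """Dynamic programming over subsets: best[subset] = (cost, plan).
    A plan's cost is the sum of its intermediate result sizes."""
    best = {frozenset([t]): (0.0, t) for t in all_tables}
    for k in range(2, len(all_tables) + 1):
        for subset in map(frozenset, combinations(all_tables, k)):
            for left_size in range(1, k):
                for left in map(frozenset, combinations(subset, left_size)):
                    right = subset - left
                    cost = (best[left][0] + best[right][0]
                            + result_size(subset))
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (
                            cost, f"({best[left][1]} JOIN {best[right][1]})")
    return best[frozenset(all_tables)]

# Picks ((A JOIN C) JOIN B): joining through the small table C first
# keeps the intermediate result cheap.
print(best_plan(["A", "B", "C"]))
```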
One more factor for consideration is parallelism, where the work of a query is shared across one or more CPUs to reduce the time taken for the computations. The amount of parallelism applied to a query is measured with a parameter called the degree of parallelism (DOP); the final sketch below shows the idea.
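Here, finally, is a minimal sketch of what DOP means operationally, assuming a simple partitioned scan; the predicate and the data are invented, and real systems partition work in far more sophisticated ways.

```python
from concurrent.futures import ProcessPoolExecutor

def count_matches(partition):
    """A made-up per-partition task: a filtered count over one chunk."""
    return sum(1 for row in partition if row % 7 == 0)

def parallel_count(rows, dop):
    """Split the input into `dop` partitions, run the task on each in
    parallel, and combine the partial results."""
    chunk = (len(rows) + dop - 1) // dop
    parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ProcessPoolExecutor(max_workers=dop) as pool:
        return sum(pool.map(count_matches, parts))

if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_count(data, dop=4))  # DOP = 4: a four-way partitioned scan
```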
