The query optimizer is the software module that transforms the internal
representation of a query into an efficient plan for executing it. A query
plan is a dataflow through which table data is piped, built from operators
such as scan, sort, and join. These operators can be expensive to execute
when the tables involved are very large.
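One way to picture this, as a minimal sketch rather than any particular engine's actual implementation, is a chain of Python generators: each operator pulls rows from the one below it, so the plan really is a dataflow. The table data and column names here are invented for illustration:

```python
# A minimal sketch of a query plan as a dataflow of operators.
# Each operator is a generator that pulls rows from its input.

def scan(table):
    """Leaf operator: yield each row of an in-memory table."""
    for row in table:
        yield row

def filter_rows(rows, predicate):
    """Selection operator: keep only rows matching the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def sort_rows(rows, key):
    """Sort operator: blocking, must consume its whole input first."""
    yield from sorted(rows, key=key)

employees = [{"name": "Ada", "dept": 1}, {"name": "Bo", "dept": 2}]
plan = sort_rows(filter_rows(scan(employees), lambda r: r["dept"] == 1),
                 key=lambda r: r["name"])
print(list(plan))  # [{'name': 'Ada', 'dept': 1}]
```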
Consider a join of two tables where both the left and the right table hold
records on the order of millions. A join matches the records of two tables on
a common condition, for example, records in both tables that have the same
value in one of their columns. The query optimizer must choose an appropriate
plan for finding all such matches. It could use the records of the left table
as the basis for the match, iterating over them and probing the right table,
or it could swap the two tables so that the right table drives the match and
the left is probed instead.
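A hash join makes that choice concrete. In the sketch below (illustrative only, with toy data), the optimizer-style decision is to build the in-memory hash table on the smaller input and stream the larger one past it; swapping the inputs changes the cost, not the result:

```python
# Hash equi-join: build a hash table on one input ("build side"),
# then probe it with the other ("probe side"). Picking the smaller
# relation as the build side is the classic optimizer decision.

def hash_join(left, right, key):
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)
    for row in probe:
        for match in table.get(row[key], []):
            yield (match, row)

orders = [{"cust": 1, "item": "pen"}, {"cust": 2, "item": "ink"}]
customers = [{"cust": 1, "name": "Ada"}]
print(list(hash_join(customers, orders, "cust")))
```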
Alternatively, the query optimizer could compute a Cartesian product of all
the records from both tables and filter it afterwards. For many years
optimizers considered only "left-deep" query plans and "postponed Cartesian
products," ensuring that any Cartesian product appeared only after all the
joins in a dataflow. Today's systems also consider "bushy" trees (joins whose
right-hand inputs are themselves joins) and early use of Cartesian products,
both of which are useful in some cases. Either way, the plan space is an
important consideration for the query optimizer.
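The effect of postponing Cartesian products on the plan space is easy to see by enumeration. In this sketch the tables and which pairs share a join predicate are invented; real optimizers track far more per plan:

```python
# How the left-deep plan space shrinks when Cartesian products
# are postponed: drop any join order that brings in a table with
# no predicate linking it to the tables already joined.

from itertools import permutations

tables = ["A", "B", "C", "D"]
joinable = {("A", "B"), ("B", "C"), ("C", "D")}  # pairs sharing a predicate

def connects(done, nxt):
    """True if `nxt` has a join predicate with an already-joined table."""
    return any((t, nxt) in joinable or (nxt, t) in joinable for t in done)

all_orders = list(permutations(tables))
no_early_cross = [p for p in all_orders
                  if all(connects(p[:i], p[i]) for i in range(1, len(p)))]

print(len(all_orders), "left-deep orders in total")        # 24
print(len(no_early_cross), "without a Cartesian product")  # 8
```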
Another factor the query optimizer must consider is selectivity estimation.
The cardinality of a table or index is the simplest selectivity statistic.
Most systems, however, go beyond that: they summarize the distribution of
values in each column via histograms and other summary statistics. Since
building these summaries involves visiting every value in a column, it is
quite expensive and is typically done over a sample rather than a complete
scan of the data. These summaries are also useful for estimating the
selectivity of a join: instead of comparing the tables themselves, the
optimizer compares, in effect joins, their histograms.
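A sampled equi-width histogram, sketched below with made-up data and bucket counts, shows the idea: estimate how many rows satisfy a predicate without touching the full column. Two such histograms could likewise be combined bucket by bucket to estimate a join's output size:

```python
# Equi-width histogram built from a sample rather than a full scan.
# Data, bucket count, and sample size are all illustrative.

import random

def build_histogram(column, lo, hi, buckets=10, sample_size=1000):
    sample = random.sample(column, min(sample_size, len(column)))
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in sample:
        counts[min(int((v - lo) / width), buckets - 1)] += 1
    scale = len(column) / len(sample)  # scale sample counts to the table
    return [c * scale for c in counts], lo, width

def estimate_less_than(hist, value):
    counts, lo, width = hist
    total = 0.0
    for i, c in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        if b_hi <= value:
            total += c
        elif b_lo < value:  # predicate cuts this bucket; assume uniformity
            total += c * (value - b_lo) / width
    return total

column = [random.randint(0, 99) for _ in range(100_000)]
hist = build_histogram(column, 0, 100)
print("estimated rows with value < 25:", round(estimate_less_than(hist, 25)))
print("actual:", sum(v < 25 for v in column))
```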
There are more sophisticated schemes as well, but their adoption has been
hampered by performance benchmarks, the accepted industry standards that a
query optimizer, along with the rest of the modules in a database server, has
to do well on; the tests in these benchmarks have mostly used statistically
independent generated data rather than "real" data, which favors the simpler
estimators. Another challenge for a query optimizer is that selectivity
errors made early in building a plan cascade through the rest of the plan
tree, compounding into larger mistakes in subsequent estimates and a
significant hit to performance.
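The compounding is easy to see with a back-of-the-envelope sketch; the error factor and join counts below are invented purely for illustration:

```python
# If each per-join selectivity estimate is off by a factor of 2, the
# cardinality estimate at the root of an n-join plan can be off by
# 2**n: the optimizer ends up costing the upper joins against inputs
# that bear little relation to reality.

error_per_estimate = 2.0  # illustrative: each estimate off by 2x
for joins in range(1, 6):
    print(f"{joins} joins -> root estimate off by up to "
          f"{error_per_estimate ** joins:.0f}x")
```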
While discussing factors that affect query optimization, it is also
worthwhile to mention search algorithms.
An optimizer typically adopts one of two strategies: bottom-up dynamic
programming or "top-down" search. Each involves trade-offs in the optimizer's
use of memory and processing cycles; top-down search, for example, can lower
the number of plans considered, but it increases memory consumption.
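A bottom-up dynamic program over subsets of tables, sketched here with invented cardinalities and a deliberately crude cost model, is the classic form of this search; a top-down search would explore the same recurrence from the root with memoization instead:

```python
# Dynamic programming over subsets of tables, restricted to
# left-deep plans. Cost model (sum of intermediate result sizes),
# cardinalities, and the flat selectivity are toy values.

from itertools import combinations

card = {"A": 1000, "B": 100, "C": 10}  # table cardinalities
sel = 0.01                             # flat join selectivity

def result_size(subset):
    size = 1.0
    for t in subset:
        size *= card[t]
    return size * sel ** (len(subset) - 1)

best = {frozenset([t]): (0.0, t) for t in card}  # subset -> (cost, plan)
for n in range(2, len(card) + 1):
    for subset in map(frozenset, combinations(card, n)):
        for right in subset:                     # peel one table off
            left = subset - {right}
            cost = best[left][0] + result_size(subset)
            if subset not in best or cost < best[subset][0]:
                best[subset] = (cost, (best[left][1], right))

cost, plan = best[frozenset(card)]
print("best left-deep plan:", plan, "estimated cost:", cost)
```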
Another factor for consideration is parallelism, where the workload is shared
across multiple CPUs to reduce the time taken for the computation. The amount
of parallelism is expressed by a parameter called the degree of parallelism
(DOP).
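As a minimal sketch (the partitioning, the per-partition work, and the dop value are all invented for illustration), the degree of parallelism maps naturally onto the size of a worker pool:

```python
# Scanning partitions of a table in parallel; `dop` bounds how many
# CPUs work on the query at once.

from concurrent.futures import ProcessPoolExecutor

def scan_partition(rows):
    """Per-partition work: count rows matching a filter."""
    return sum(1 for r in rows if r % 7 == 0)

if __name__ == "__main__":
    partitions = [range(i * 250_000, (i + 1) * 250_000) for i in range(8)]
    dop = 4  # degree of parallelism
    with ProcessPoolExecutor(max_workers=dop) as pool:
        total = sum(pool.map(scan_partition, partitions))
    print("matching rows:", total)
```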