Evaluating Histograms for Selectivity Estimations
Abstract
Histograms have a long history in the field of statistics, and have been used in the past few decades to capture the distribution of data in database management systems. In a database system, they can give the query optimizer information about whether the data is skewed, or if it follows a more uniform distribution. This will in turn help the query optimizer in selecting the best execution plan for a given query.
This thesis looks at seven different histogram types, and looks at the accuracy and complexity of each of them when used for selectivity estimations. We have also implemented four of these in MySQL, and tested them on two different test suites. The results show that the optimizer benefits from better knowledge of the underlying data, as several of the queries tested runs many times faster. However, histograms are very often a trade-off between simplicity, accuracy and storage requirements. The more exotic histogram types are very often the ones with the highest accuracy, but are also the most difficult types to construct and maintain. The V-optimal histogram is a good example of this; it has overall one of the best accuracies, but the construction complexity makes it very difficult to create for large data sets.
One histogram type does not work optimally for all data distributions. At a low number of buckets, the Serial histogram is often the one with the highest accuracy. But at a high number of buckets, it varies which histogram is the most accurate. It is thus important to find the correct histogram type for the specific use case, and pointing out a single histogram as a "winner" is very difficult.