lab3实现的是基于代价的查询优化器
优化图片链接simple-db-hw-2021/controlflow.png at master · MIT-DB-Class/simple-db-hw-2021 · GitHub
简单总结一下查询优化器的构成:
1.Parser.Java在初始化时会收集并构造所有表格的统计信息,并存到statsMap中。当有查询请求发送到Parser中时,会调用parseQuery方法去处理
2.parseQuery方法会把解析器解析后的结果去构造出一个LogicalPlan实例,然后调用LogicalPlan实例的physicalPlan方法去执行,然后返回的是结果记录的迭代器,也就是我们在lab2中做的东西都会在physicalPlan中会被调用。
Exercise 1: IntHistogram.javaYou will need to implement some way to record table statistics for selectivity estimation. We have provided a skeleton class, IntHistogram that will do this. Our intent is that you calculate histograms using the bucket-based method described above, but you are free to use some other method so long as it provides reasonable selectivity estimates.
We have provided a class StringHistogram that uses IntHistogram to compute selecitivites for String predicates. You may modify StringHistogram if you want to implement a better estimator, though you should not need to in order to complete this lab.
我们在这个实验只需要实现IntHistogram,而StringHistogram会将字符串转换为int类型然后调用IntHistogram。首先,是构造器与属性部分。构造器给出直方图的数据范围(最大值最小值),桶的数量。有了这些信息,就可以构造出一个空的直方图。具体实现也是一些很简单的数学计算。
Exercise 2: TableStats.javaYou should fill in the following methods and classes in TableStats:
Implement the TableStats constructor: once you have implemented a method for tracking statistics such as histograms, you should implement the TableStats constructor, adding code to scan the table (possibly multiple times) to build the statistics you need.Implement estimateSelectivity(int field, Predicate.Op op, Field constant): Using your statistics (e.g., an IntHistogram or StringHistogram depending on the type of the field), estimate the selectivity of predicate field op constant on the table.Implement estimateScanCost(): This method estimates the cost of sequentially scanning the file, given that the cost to read a page is costPerPageIO. You can assume that there are no seeks and that no pages are in the buffer pool. This method may use costs or sizes you computed in the constructor.Implement estimateTableCardinality(double selectivityFactor): This method returns the number of tuples in the relation, given that a predicate with selectivity selectivityFactor is applied. This method may use costs or sizes you computed in the constructor.
exercise2要做的是根据给定的tableid,扫描出所有记录,并对每一个字段建立一个直方图。
- 扫描全表,确定每个字段的最大最小值;并且根据最大最小值、表的桶数,构造出每个字段的直方图。再次扫描全表,并在扫描过程中填充数据。
表的扫描代价 = 表的页数 * 单个页的IO代价
Exercise 3: Join Cost EstimationThe class JoinOptimizer.java includes all of the methods for ordering and computing costs of joins. In this exercise, you will write the methods for estimating the selectivity and cost of a join, specifically:
Implement estimateJoinCost(LogicalJoinNode j, int card1, int card2, double cost1, double cost2): This method estimates the cost of join j, given that the left input is of cardinality card1, the right input of cardinality card2, that the cost to scan the left input is cost1, and that the cost to access the right input is card2. You can assume the join is an NL join, and apply the formula mentioned earlier.Implement estimateJoinCardinality(LogicalJoinNode j, int card1, int card2, boolean t1pkey, boolean t2pkey): This method estimates the number of tuples output by join j, given that the left input is size card1, the right input is size card2, and the flags t1pkey and t2pkey that indicate whether the left and right (respectively) field is unique (a primary key).
exercise3要做的是估计连接查询的代价
其实这应该是四个exercise最容易的一个,就是看懂了连接查询的公式,然后写一下就好了,以下是公式:
scancost(t1) + scancost(t2) + joincost(t1 join t2) +
scancost(t3) + joincost((t1 join t2) join t3) +
…
对于等值连接,如果两表中参与连接的两个字段中有一个是主键,则产生的基数要小于等于非主键字段记录数,因为主键是唯一的,也就是说非主键的记录每一条都只能最多连接一个主键记录数。这里我思考了一下,主键唯一,为什么不能选主键记录数当作基数呢?原因是如果非主键字段的记录数远远小于主键字段的记录数,那这个基数可以是十分不准确了。
如果等值连接中两个都是主键,那么应该取记录数最小的字段取做基数了,相通了上面的东西,这个就很容易明白了。
如果等值连接中两个都不是主键,那么基数是难以估计的,可以很大也可以很小,一般会用启发式程序去估算基数。所以这也启示我们,建表时如果字段没有重复值要声明Unique,选择性的估计才好做。
数据库在选择索引时,也是会估计基数,然后计算出选择性,使用选择性可以衡量一个字段的不重复记录数有多少,如果一个字段的选择性很低接近0,那么就没必要用索引了,因为会有大量重复的数据,导致我们不断的去回表;如果一个字段的选择性很高,接近于1,说明该字段的记录数是很多不重复,那样通过索引我们可以加快查询的速度。
exercise3我们完成了连接查询的成本估计与基数估计,而exercise4我们要做的是根据在多表连接的情况下,去选择一个最优的连接顺序,来实现对连接查询的优化。有了这个连接顺序就可以生成执行计划了。
总体的思想是列举出所有的连接顺序,计算出每种连接顺序的代价,然后选择代价最小的连接顺序去执行。但是,如何列举是个问题。举个例子,对于两表连接,连接顺序有2 * 1种可能;对于三表连接,有3 * 2 * 1 = 6种可能。可以发现,按照枚举的方式去弄,有n!种方案。当n = 7时,方案数有655280种;当n = 10时,方案数可以达到176亿。可以看到,这个缺点特别明显,就是当连接的表数一多,我们的方案数回很多,时间复杂度很高。
实验总结lab3虽然不算很难,但很多懵懂的概念,留坑之后读书或者翻翻原视频深挖一下



