
MIT6.830 Lab3 Query Optimization Lab Report


1. Lab Overview

lab3 implements a cost-based query optimizer. Here is the outline the handout gives for the lab:

Recall that the main idea of a cost-based optimizer is to:

  • Use statistics about tables to estimate “costs” of different
    query plans. Typically, the cost of a plan is related to the cardinalities of
    (number of tuples produced by) intermediate joins and selections, as well as the
    selectivity of filter and join predicates.
  • Use these statistics to order joins and selections in an
    optimal way, and to select the best implementation for join
    algorithms from amongst several alternatives.

In this lab, you will implement code to perform both of these
functions.

We can use table statistics to estimate the costs of different query plans. Typically, a plan's cost is related to the cardinalities of (the number of tuples produced by) intermediate joins and selections, as well as the selectivity of the filter and join predicates.

With these statistics, we can order joins and selections optimally, and pick the best plan to execute from several alternatives.

An overview of the optimizer's structure:

The key at the bottom explains the symbols; you
will implement the components with double-borders. The classes and
methods will be explained in more detail in the text that follows (you may wish to refer back
to this diagram), but
the basic operation is as follows:

  1. Parser.java constructs a set of table statistics (stored in the
    statsMap container) when it is initialized. It then waits for a
    query to be input, and calls the method parseQuery on that query.
  2. parseQuery first constructs a LogicalPlan that
    represents the parsed query. parseQuery then calls the method physicalPlan on the
    LogicalPlan instance it has constructed. The physicalPlan method returns a
    DBIterator object that can be used to actually run the query.

In the exercises to come, you will implement the methods that help
physicalPlan devise an optimal plan.

A quick summary of the optimizer's structure:

1. Parser.java collects and builds statistics for all tables when it is initialized and stores them in statsMap. When a query is sent to the Parser, the parseQuery method is called to handle it.

2. parseQuery uses the parser's output to build a LogicalPlan instance, then calls that instance's physicalPlan method and gets back an iterator over the result records; in other words, everything we built in lab2 ends up being invoked inside physicalPlan.

To verify the handout's description I skimmed the LogicalPlan code. It has many methods for adding operations, and physicalPlan is the core: it translates the logical plan into a concrete query plan, uses JoinOptimizer to find the best join order, then creates the operator instances and executes according to the plan. Part of the code follows:

In short, lab2 made sure ordinary SQL statements can execute; in lab3 the question is how to make SQL run faster, what the best join order is, and so on.

My take: overall, lab3's query optimization splits into two phases:

  • Phase 1: collect table statistics; estimation is only possible once we have them;
  • Phase 2: use the statistics to estimate costs and find the optimal execution plan.

lab3 has 4 exercises: the first two cover phase 1 and the last two cover phase 2 (a clean division of labor).

Beyond that, the outline section of the handout gives a lot of very useful information on how to gather the statistics, how to compute costs, and so on; it is effectively the playbook for the whole lab.

2. Lab Process

Exercise 1: IntHistogram.java

To estimate the cost of a query plan we first need data. How is that data obtained from a table, and in what form is it collected? The answer here is the histogram.

Briefly, a histogram represents the statistics of a single field: it divides the field's value domain into several equal-sized intervals and counts the records falling into each. Each interval can be viewed as a bucket, the interval's range as the bucket's width, and its record count as the bucket's height, which makes the picture very intuitive:

If that is hard to picture, the figure in Database System Concepts (the "sailboat book") is easier to follow. For a table of personal records, the histogram of the age field looks like this:

Exercise 1 asks us to build such a histogram from the given data, and then use its statistics to estimate the selectivity of a given value. The handout describes it as follows:

You will need to implement
some way to record table statistics for selectivity estimation. We have
provided a skeleton class, IntHistogram that will do this. Our
intent is that you calculate histograms using the bucket-based method described
above, but you are free to use some other method so long as it provides
reasonable selectivity estimates.

We have provided a class StringHistogram that uses
IntHistogram to compute selectivities for String
predicates. You may modify StringHistogram if you want to
implement a better estimator, though you should not need to in order to
complete this lab.

After completing this exercise, you should be able to pass the
IntHistogramTest unit test (you are not required to pass this test if you
choose not to implement histogram-based selectivity estimation).

For this lab we only need to implement IntHistogram; StringHistogram converts strings to ints and then delegates to IntHistogram. First, the constructor and the fields. The constructor is given the histogram's value range (min and max) and the number of buckets; with that, an empty histogram can be built. The implementation is simple arithmetic:

public class IntHistogram {

    private int[] buckets;
    private int min;
    private int max;
    private double width;
    private int ntups;

    public IntHistogram(int buckets, int min, int max) {
    	// some code goes here
        this.buckets = new int[buckets];
        this.min = min;
        this.max = max;
        this.ntups = 0;
        this.width = Math.max(1, (1. + max - min) / this.buckets.length);
    }

To keep the estimates accurate, I later tweaked the width computation to guarantee a width of at least 1; otherwise some values could get a selectivity greater than 1 (selectivity normally lies in [0, 1]: 0 means none of the records match, i.e. very low cost, while 1 means every record is selected, the maximum cost).
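A small sketch of why the clamp matters, with made-up numbers: a 5-tuple table over the range [1, 5] split into 10 buckets would get a raw width of 0.5, and a bucket holding 3 duplicates would then yield an impossible selectivity above 1.

```java
// Hypothetical numbers illustrating the width clamp; not SimpleDB code.
public class WidthClampSketch {
    // the equality estimate from the outline: (h / w) / ntups
    static double eqSelectivity(double h, double width, double ntups) {
        return (h / width) / ntups;
    }

    public static void main(String[] args) {
        double rawWidth = (1.0 + 5 - 1) / 10;   // 0.5: range [1, 5], 10 buckets
        double clamped = Math.max(1, rawWidth); // 1.0 after the clamp
        // 5-tuple table where 3 tuples share one value, all in the same bucket (h = 3)
        System.out.println(eqSelectivity(3, rawWidth, 5)); // 1.2, an impossible selectivity
        System.out.println(eqSelectivity(3, clamped, 5));  // 0.6, back inside [0, 1]
    }
}
```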

Next comes populating the histogram with data, which is also simple:

    private int getIndex(int v) {
//        if (v < min || v > max) throw new IllegalArgumentException("illegal value");
        return Math.min((int) ((v - min) / width), buckets.length - 1);
    }

    
    public void addValue(int v) {
    	// some code goes here
        if (v < min || v > max) return;
        int index = getIndex(v);
        ntups ++;
        buckets[index] ++;
    }

Now for the meat of this exercise: using the histogram's statistics to estimate the selectivity of a predicate for a given value. The outline explains the estimation in detail:

  • To estimate the selectivity of an equality expression,
    f=const, compute the bucket that contains value const.
    Suppose the width (range of values) of the bucket is w, the height (number of
    tuples) is h,
    and the number of tuples in the table is ntups. Then, assuming
    values are uniformly distributed throughout the bucket, the selectivity of
    the
    expression is roughly (h / w) / ntups, since (h/w)
    represents the expected number of tuples in the bin with value
    const.

  • To estimate the selectivity of a range expression f>const,
    compute the bucket b that const is in, with width w_b and height
    h_b.

    Then, b contains a fraction b_f = h_b / ntups of the
    total tuples. Assuming tuples are uniformly distributed throughout b,
    the fraction b_part of b that is > const is
    (b_right - const) / w_b, where b_right is the right endpoint of
    b’s bucket. Thus, bucket b contributes (b_f x
    b_part)
    selectivity to the predicate. In addition, buckets
    b+1…NumB-1 contribute all of their
    selectivity (which can be computed using a formula similar to
    b_f above). Summing the selectivity contributions of all the
    buckets will yield the overall selectivity of the expression.
    Figure 2 illustrates this process

To summarize:

  1. For an equality predicate f = const, first locate the bucket containing const, then compute selectivity = (bucket height / bucket width) / record count, i.e. (h / w) / ntups. Here (h / w) is the expected number of records with f = const; dividing by the total record count gives the predicate's selectivity.
  2. For inequality predicates we use the same idea: estimate the number of matching records, then divide by the total. The histogram below illustrates how to compute f > const:

First compute the first piece: the selectivity contributed by (const, b_right]:

b_f = h_b / ntups

b_part = (b_right - const) / w_b

b_selectivity = b_f x b_part

Then the total selectivity (where cnt is the total number of records from b_right to the end):

selectivity = b_selectivity + 1.0 * cnt / ntups;
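To make the formulas concrete, here is a tiny self-contained sketch of the two estimates (illustrative numbers only, not SimpleDB's IntHistogram): 10 buckets over [1, 100], filled with one tuple per value, so every bucket holds exactly 10 tuples.

```java
// Hypothetical standalone sketch of the bucket arithmetic described above.
public class HistogramSketch {
    static final int MIN = 1, MAX = 100, NBUCKETS = 10;
    static final double WIDTH = (1.0 + MAX - MIN) / NBUCKETS; // 10.0
    static final int[] buckets = new int[NBUCKETS];
    static int ntups = 0;

    static {
        // uniform data: one tuple per value in [1, 100]
        for (int v = MIN; v <= MAX; v++) { buckets[index(v)]++; ntups++; }
    }

    static int index(int v) {
        return Math.min((int) ((v - MIN) / WIDTH), NBUCKETS - 1);
    }

    // selectivity of f = v:  (h / w) / ntups
    static double selEquals(int v) {
        return (buckets[index(v)] / WIDTH) / ntups;
    }

    // selectivity of f > v:  b_f * b_part, plus every full bucket above
    static double selGreater(int v) {
        int idx = index(v);
        double bRight = MIN + (idx + 1) * WIDTH;   // right edge of v's bucket
        double bF = (double) buckets[idx] / ntups; // bucket's share of all tuples
        double bPart = (bRight - v) / WIDTH;       // fraction of the bucket above v
        double sel = bF * bPart;
        for (int i = idx + 1; i < NBUCKETS; i++) sel += (double) buckets[i] / ntups;
        return sel;
    }

    public static void main(String[] args) {
        System.out.println(selEquals(55));  // (10 / 10) / 100 = 0.01
        System.out.println(selGreater(55)); // 0.1 * 0.6 + 0.4 = 0.46 (true answer: 45/100 = 0.45)
    }
}
```

The estimate 0.46 versus the true 0.45 shows the kind of small error the uniformity assumption introduces within a bucket.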

The remaining inequality operators follow the same idea. Once it is clear what selectivity is for, the whole exercise becomes easier. The implementation:

public double estimateSelectivity(Predicate.Op op, int v) {

    	// some code goes here
        int index = getIndex(v);
        if (op.equals(Predicate.Op.EQUALS)) {
            //(high / width) / ntups
            if (v < min || v > max) return 0.0; // getIndex clamps out-of-range values, so test v itself
            double selectivity = (buckets[index] / width) / ntups;
            return selectivity;
        }
        if (op.equals(Predicate.Op.NOT_EQUALS)) {
            return 1 - estimateSelectivity(Predicate.Op.EQUALS, v);
        }
        if (op.equals(Predicate.Op.GREATER_THAN)) {
            if (v <= min) return 1.0;
            if (v >= max) return 0.0;

            int cnt = 0;
            for(int i = index + 1; i < buckets.length; i++) {
                cnt += buckets[i];
            }
            //b_f = h_b / ntups
            //b_part = (b_right - const) / w_b
            //b_selectivity = b_f x b_part
            double b_f = 1.0 * buckets[index] / ntups;
            double b_part = (min + (index + 1) * width - v) / width; // b_right = min + (index+1)*width
            double selectivity = b_f * b_part + 1.0 * cnt / ntups;
            return selectivity;
        }
        if (op.equals(Predicate.Op.GREATER_THAN_OR_EQ)) {
            if (v < min) return 1.0;
            if (v > max) return 0.0;
            return estimateSelectivity(Predicate.Op.GREATER_THAN, v - 1);
        }
        if (op.equals(Predicate.Op.LESS_THAN)) {
            if (v <= min) return 0.0;
            if (v >= max) return 1.0;
            return 1.0 - estimateSelectivity(Predicate.Op.EQUALS, v) - estimateSelectivity(Predicate.Op.GREATER_THAN, v);
        }
        if (op.equals(Predicate.Op.LESS_THAN_OR_EQ)) {
            if (v <= min) return 0.0;
            if (v >= max) return 1.0;
            return 1.0 - estimateSelectivity(Predicate.Op.GREATER_THAN, v);
        }
        return -1.0;
    }

Finally, we also need the table's average selectivity:

    public double avgSelectivity()
    {
        // some code goes here
        int sum = 0;
        for (int b : buckets) sum += b;
        return 1.0 * sum / ntups;
    }

The final test run:

Exercise 2: TableStats.java

Exercise 2 is about scanning all tuples of a given tableid and building a histogram for every field. The outline's guidance:

  • Construct a histogram for every attribute in the table. A simple
    approach is to use a fixed number of buckets NumB,
    with
    each bucket representing the number of records in a fixed range of the
    domain of the attribute of the histogram. For example, if a field
    f ranges from 1 to 100, and there are 10 buckets, then bucket 1 might
    contain the count of the number of records between 1 and 10, bucket
    2 a count of the number of records between 11 and 20, and so on.

  • Scan the table again, selecting out all of fields of all of the
    tuples and using them to populate the counts of the buckets
    in each histogram.

To summarize:

  1. Scan the whole table once to find each field's minimum and maximum; with these and the bucket count, construct a histogram for every field.
  2. Scan the table a second time, populating the histograms along the way.

Those two steps are all there is; then it is just writing the code and passing the tests:

public TableStats(int tableid, int ioCostPerPage)  {
        // For this function, you'll have to get the
        // DbFile for the table in question,
        // then scan through its tuples and calculate
        // the values that you need.
        // You should try to do this reasonably efficiently, but you don't
        // necessarily have to (for example) do everything
        // in a single scan of the table.
        // some code goes here
        HeapFile dbFile = (HeapFile) Database.getCatalog().getDatabaseFile(tableid);
        this.numPages = dbFile.numPages();
        this.ioCostPerPage = ioCostPerPage;
        DbFileIterator child = dbFile.iterator(null);
        this.td = dbFile.getTupleDesc();
        Map<Integer, Integer> minMap = new HashMap<>();
        Map<Integer, Integer> maxMap = new HashMap<>();
        try {
            child.open();
            while (child.hasNext()) {
                numTuples ++;
                Tuple tuple = child.next();
                for (int i = 0; i < td.numFields(); i++) {
                    if (td.getFieldType(i).equals(Type.INT_TYPE)) {
                        // for INT fields, first track each field's min and max
                        IntField field = (IntField) tuple.getField(i);
                        Integer minVal = minMap.getOrDefault(i, Integer.MAX_VALUE);
                        minMap.put(i, Math.min(minVal, field.getValue()));
                        Integer maxVal = maxMap.getOrDefault(i, Integer.MIN_VALUE);
                        maxMap.put(i, Math.max(maxVal, field.getValue()));
                    } else {
                        // STRING fields can go straight into a StringHistogram
                        StringHistogram histogram = this.strHistograms.getOrDefault(i, new StringHistogram(NUM_HIST_BINS));
                        StringField field = (StringField) tuple.getField(i);
                        histogram.addValue(field.getValue());
                        this.strHistograms.put(i, histogram);
                    }
                }
            }
            // build the int histograms from the min/max values
            for (int i = 0; i < td.numFields(); i++) {
                if (minMap.get(i) != null) {
                    Integer min = minMap.get(i);
                    Integer max = maxMap.get(i);
                    this.intHistograms.put(i, new IntHistogram(NUM_HIST_BINS, min, max));
                }
            }
            // scan the table a second time to populate the int histograms
            child.rewind();
            System.out.println("Filling data to the Histograms");
            while (child.hasNext()) {
                Tuple tuple = child.next();
                for (int i = 0; i < td.numFields(); i++) {
                    if (td.getFieldType(i).equals(Type.INT_TYPE)) {
                        IntField f = (IntField) tuple.getField(i);
                        IntHistogram intHis = this.intHistograms.get(i);
                        if (intHis == null) throw new IllegalArgumentException("missing histogram for field " + i);
                        intHis.addValue(f.getValue());
                        this.intHistograms.put(i, intHis);
                    }
                }
            }
            System.out.println("Finish Fill");
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

Estimating the I/O cost:

    public double estimateScanCost() {
        // some code goes here
        // the constructor scans the full table twice, hence the factor of 2
        // (we could cache the tuples in a list and iterate over that instead,
        // trading memory for speed, but the gain would probably be small)
        return numPages * ioCostPerPage * 2;
    }

Estimating the result cardinality from a selectivity factor:

    
    public int estimateTableCardinality(double selectivityFactor) {
        // some code goes here
        return (int) (numTuples * selectivityFactor);
    }

Then the average selectivity, mostly delegating to what we wrote in exercise 1:

    public double avgSelectivity(int field, Predicate.Op op) {
        // some code goes here
        if (td.getFieldType(field).equals(Type.INT_TYPE)) {
            IntHistogram intHis = intHistograms.get(field);
            return intHis.avgSelectivity();
        } else {
            StringHistogram strHis = (StringHistogram)strHistograms.get(field);
            return strHis.avgSelectivity();
        }
    }

Finally, estimating the selectivity of a predicate on a given field, again using the exercise 1 machinery:

    public double estimateSelectivity(int field, Predicate.Op op, Field constant) {
        // some code goes here
        if (td.getFieldType(field).equals(Type.INT_TYPE)) {
            IntHistogram intHis = intHistograms.get(field);
            IntField intField = (IntField)constant;
            return intHis.estimateSelectivity(op, intField.getValue());
        } else {
            StringHistogram strHis = (StringHistogram)strHistograms.get(field);
            StringField strField = (StringField)constant;
            return strHis.estimateSelectivity(op, strField.getValue());
        }
    }

The final test run:

Exercise 3: Join Cost Estimation

Exercise 3 is about estimating the cost of a join. From the handout:

The class JoinOptimizer.java includes all of the methods
for ordering and computing costs of joins. In this exercise, you
will write the methods for estimating the selectivity and cost of
a join, specifically:

  • Implement
    estimateJoinCost(LogicalJoinNode j, int card1, int card2, double
    cost1, double cost2)
    : This method estimates the cost of
    join j, given that the left input is of cardinality card1, the
    right input of cardinality card2, that the cost to scan the left
    input is cost1, and that the cost to access the right input is
    cost2. You can assume the join is an NL join, and apply
    the formula mentioned earlier.
  • Implement estimateJoinCardinality(LogicalJoinNode j, int
    card1, int card2, boolean t1pkey, boolean t2pkey)
    : This
    method estimates the number of tuples output by join j, given that
    the left input is size card1, the right input is size card2, and
    the flags t1pkey and t2pkey that indicate whether the left and
    right (respectively) field is unique (a primary key).

After implementing these methods, you should be able to pass the unit
tests estimateJoinCostTest and estimateJoinCardinality in JoinOptimizerTest.java.

This is probably the easiest of the four exercises: understand the join-cost formula, then write it down. The formula:

scancost(t1) + scancost(t2) + joincost(t1 join t2) +
scancost(t3) + joincost((t1 join t2) join t3) + ...

A note on cost-based estimation. A query's cost generally splits into I/O cost and CPU cost. I/O cost is the time spent on disk I/O while scanning tables for records; once we have the records, the CPU must check whether each one satisfies the query's predicates, and that time is the CPU cost. For a two-table join, we first scan one table and filter out some records, then match each surviving record against the second table; the first table is called the driving (outer) table t1 and the second the driven (inner) table t2. In a two-table nested-loops join the outer table is scanned once, producing card1 records, while the inner table is scanned card1 times; that gives the total I/O cost. If t2 holds card2 records, the CPU cost is card1 * card2. The total cost is therefore:

I/O cost of t1 + (rows of t1) x (I/O cost of t2)    (I/O cost)

+ (rows of t1) x (rows of t2)    (CPU cost)

Of course, real databases tune these costs with various parameters, but the overall formula looks like this.
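As a sanity check of the formula with made-up numbers (scanning t1 costs 100 and yields 1000 tuples; scanning t2 costs 50 and yields 200 tuples), a minimal sketch:

```java
// Hypothetical numbers; cost = cost1 + card1 * cost2 (I/O) + card1 * card2 (CPU).
public class JoinCostSketch {
    static double nestedLoopCost(double cost1, int card1, double cost2, int card2) {
        return cost1 + card1 * cost2 + (double) card1 * card2;
    }

    public static void main(String[] args) {
        // 100 + 1000 * 50 + 1000 * 200 = 250100.0
        System.out.println(nestedLoopCost(100, 1000, 50, 200));
    }
}
```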

The code that follows from the formula:

    public double estimateJoinCost(LogicalJoinNode j, int card1, int card2,
            double cost1, double cost2) {
        if (j instanceof LogicalSubplanJoinNode) {
            // A LogicalSubplanJoinNode represents a subquery.
            // You do not need to implement proper support for these for Lab 3.
            return card1 + cost1 + cost2;
        } else {
            // Insert your code here.
            // HINT: You may need to use the variable "j" if you implemented
            // a join algorithm that's more complicated than a basic
            // nested-loops join.
            double cost = cost1 + card1 * cost2 + card1 * card2;
            return cost;
        }
    }

With that done, the exercise's other task is to estimate the cardinality produced by a join, i.e. the number of result records. Selectivity can then be computed as cardinality / total record count.

Here is the handout's discussion of cardinality estimation:

Finally, observe that the cost for the join plan p above
includes expressions of the form joincost((t1 join t2) join
t3)
. To evaluate this expression, you need some way to estimate
the size (ntups) of t1 join t2. This join
cardinality estimation
problem is harder than the filter selectivity
estimation problem. In this lab, you aren’t required to do anything
fancy for this, though one of the optional exercises in Section 2.4
includes a histogram-based method for join selectivity estimation.

While implementing your simple solution, you should keep in mind the following:

  • For equality joins, when one of the attributes is a primary key, the number of tuples produced by the join cannot
    be larger than the cardinality of the non-primary key attribute.

  • For equality joins when there is no primary key, it’s hard to say much about what the size of the output
    is – it could be the size of the product of the cardinalities of the tables (if both tables have the
    same value for all tuples) – or it could be 0. It’s fine to make up a simple heuristic (say,
    the size of the larger of the two tables).


  • For range scans, it is similarly hard to say anything accurate about sizes.
    The size of the output should be proportional to
    the sizes of the inputs. It is fine to assume that a fixed fraction
    of the cross-product is emitted by range scans (say, 30%). In general, the cost of a range
    join should be larger than the cost of a non-primary key equality join of two tables
    of the same size.


To summarize:

  1. For an equality join, if one of the two join fields is a primary key, the resulting cardinality is at most the cardinality of the non-key side, because the key is unique: each non-key record can match at most one key record. I thought about this for a while: since the key is unique, why not use the key side's cardinality? Because if the non-key side has far fewer records than the key side, that estimate could be wildly off.
  2. If both join fields are primary keys, take the smaller of the two cardinalities; once the point above is clear, this follows directly.
  3. If neither field is a key, the cardinality is hard to estimate: it can be very large or very small, and heuristics are generally used. A practical lesson: when a column has no duplicate values, declare it UNIQUE at table creation so that selectivity estimation has something to work with.

When choosing an index, a database also estimates cardinality and derives a selectivity, which measures how many distinct values a field has. If a field's selectivity is close to 0, an index on it is hardly worthwhile: the mass of duplicates would force constant lookups back into the table. If it is close to 1, the field's values are mostly distinct and an index can speed up queries considerably. 《MySQL是怎样运行的》 and 《MySQL技术内幕》 both cover this in detail for anyone interested.

The treatment of join cardinality estimation in Database System Concepts is also worth a look:

With the theory above, the code can be written:

    
    public static int estimateTableJoinCardinality(Predicate.Op joinOp,
                                                   String table1Alias, String table2Alias, String field1PureName,
                                                   String field2PureName, int card1, int card2, boolean t1pkey,
                                                   boolean t2pkey, Map<String, TableStats> stats,
                                                   Map<String, Integer> tableAliasToId) {
        int card = 1;
        // some code goes here
        switch (joinOp) {
            case EQUALS:
                if (t1pkey && !t2pkey) {
                    card = card2;
                } else if (!t1pkey && t2pkey) {
                    card = card1;
                } else if (t1pkey && t2pkey) {
                    card = Math.min(card1, card2);
                } else {
                    card = Math.max(card1, card2);
                }
                break;
            case NOT_EQUALS:
                // cross product minus the estimated equality matches
                if (t1pkey && !t2pkey) {
                    card = card1 * card2 - card2;
                } else if (!t1pkey && t2pkey) {
                    card = card1 * card2 - card1;
                } else if (t1pkey && t2pkey) {
                    card = card1 * card2 - Math.min(card1, card2);
                } else {
                    card = card1 * card2 - Math.max(card1, card2);
                }
                break;
            default:
                // treat every other operator as a range predicate
                card = (int) (0.3 * card1 * card2);
        }
        return card <= 0 ? 1 : card;
    }

Finally, the tests (note that exercise 3 only needs to pass the following two cases):

Exercise 4: Join Ordering

Exercise 3 gave us cost and cardinality estimates for a single join; in exercise 4 we must choose the optimal join order for a multi-table join, which is where the optimization actually happens. Once the order is chosen, the execution plan can be generated.

The overall idea is to enumerate every join order, compute the cost of each, and execute the cheapest one. The problem is how to enumerate. Considering only left-deep orders there are already n! permutations: 2 for two tables, 3 * 2 * 1 = 6 for three. Counting tree shapes as well, the number of plans is (2(n-1))!/(n-1)!, which for n = 7 is already 665,280 and for n = 10 about 17.6 billion. The drawback is obvious: as soon as many tables participate in the join, the number of plans explodes and the time complexity becomes enormous.
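Those plan counts are easy to verify: (2(n-1))!/(n-1)! equals the product n * (n+1) * ... * (2n-2). A throwaway sketch:

```java
// Counts join plans over n tables, including binary tree shapes:
// (2(n-1))! / (n-1)!  =  n * (n+1) * ... * (2n-2)
public class JoinOrderCount {
    static long count(int n) {
        long r = 1;
        for (long k = n; k <= 2L * (n - 1); k++) r *= k;
        return r;
    }

    public static void main(String[] args) {
        System.out.println(count(2));  // 2
        System.out.println(count(7));  // 665280
        System.out.println(count(10)); // 17643225600, about 17.6 billion
    }
}
```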

So this lab generates plans with dynamic programming. The idea: first find the optimal join order for a subset of the tables and fix it, then join in the remaining tables; the overall optimum can be assembled from the optima of subsets. For example, with 5 tables to join: first find the cheapest two-table join among the 5 and fix that pair's order; then join its result with the remaining three tables, again choosing the cheapest order; at that point the order of all 5 tables is settled. The permutation problem becomes a subset problem: for tables A B C D E we consider splits such as (A, B)(C, D, E), (B, A)(C, D, E), and so on. A set of n relations has at most 2^n subsets, so even for n = 10 that is only 1024, a huge improvement.

Here is the dynamic programming pseudocode from the handout:

1. j = set of join nodes
2. for (i in 1...|j|):
3.     for s in {all length i subsets of j}
4.       bestPlan = {}
5.       for s' in {all length i-1 subsets of s}
6.            subplan = optjoin(s')
7.            plan = best way to join (s-s') to subplan
8.            if (cost(plan) < cost(bestPlan))
9.               bestPlan = plan
10.      optjoin(s) = bestPlan
11. return optjoin(j)

The pseudocode in Database System Concepts may be easier to understand:

A comparison of the two approaches from the sailboat book:

The core idea: split the set into two subsets, find the best join order within each subset, then combine the two optima to determine the overall order.

To implement the algorithm, exercise 4 provides two very handy methods:

1. enumerateSubsets(List<T> v, int size): given a collection and a subset size, returns all subsets of that size.

2. computeCostAndCardOfSubplan: computes the cost and cardinality of a subplan.

Based on the idea above and these two helpers, the implementation:

    public List<LogicalJoinNode> orderJoins(
            Map<String, TableStats> stats,
            Map<String, Double> filterSelectivities, boolean explain)
            throws ParsingException {

        // some code goes here
        CostCard bestCostCard = new CostCard();
        PlanCache planCache = new PlanCache();
        int size = joins.size();
        for (int i = 1; i <= size; i++) {
            // enumerate all subsets of the current size
            Set<Set<LogicalJoinNode>> subsets = enumerateSubsets(joins, i);
            for (Set<LogicalJoinNode> set : subsets) {
                double bestCostSoFar = Double.MAX_VALUE;
                bestCostCard = new CostCard();
                for (LogicalJoinNode logicalJoinNode : set) {
                    // best plan for this subset: treat logicalJoinNode as the last join,
                    // looking up the optimum for (set - logicalJoinNode) in planCache
                    CostCard costCard = computeCostAndCardOfSubplan(stats, filterSelectivities, logicalJoinNode, set, bestCostSoFar, planCache);
                    if (costCard == null) continue;
                    bestCostSoFar = costCard.cost;
                    bestCostCard = costCard;
                }
                if (bestCostSoFar != Double.MAX_VALUE) {
                    planCache.addPlan(set, bestCostCard.cost, bestCostCard.card, bestCostCard.plan);
                }
            }
        }
        if (explain) printJoins(bestCostCard.plan, planCache, stats, filterSelectivities);
        return bestCostCard.plan;
        // 4s862ms  ==>  2s558ms
    }

We can compare the speed before and after. With the default join order (no optimization): 5s645ms.

With the optimized, best join order: 2s841ms.

Roughly twice as fast, which shows how much query optimization matters.

3. Code Optimization

Beyond the exercises above, lab3 suggests a few optimizations, one of which concerns the subset enumeration used earlier. The provided implementation is:

    public <T> Set<Set<T>> enumerateSubsets(List<T> v, int size) {
        Set<Set<T>> els = new HashSet<>();
        els.add(new HashSet<>());
        // Iterator it;
        // long start = System.currentTimeMillis();

        for (int i = 0; i < size; i++) {
            Set<Set<T>> newels = new HashSet<>();
            for (Set<T> s : els) {
                for (T t : v) {
                    Set<T> news = new HashSet<>(s);
                    if (news.add(t))
                        newels.add(news);
                }
            }
            els = newels;
        }
        return els;

    }

This is quite wasteful: it allocates a huge number of objects, constantly creating new HashSet instances. My optimization enumerates subsets with backtracking, which avoids most of those allocations:

    public <T> Set<Set<T>> enumerateSubsets(List<T> v, int size) {
        Set<Set<T>> els = new HashSet<>();
        dfs(v, size, 0, els, new ArrayDeque<>());
        return els;
    }

    private <T> void dfs(List<T> list, int size, int begin, Set<Set<T>> res, Deque<T> path) {
        if (path.size() == size) {
            res.add(new HashSet<>(path));
            return; // stop here, otherwise we keep recursing past the target size
        }
        for (int i = begin; i < list.size(); i++) {
            path.addLast(list.get(i));
            dfs(list, size, i + 1, res, path);
            path.removeLast();
        }
    }
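A quick way to check the backtracking version is that it must produce exactly C(n, size) subsets. A standalone copy of the method above, run on a small input:

```java
import java.util.*;

// Standalone copy of the backtracking subset enumeration, for a sanity check:
// 4 elements taken 2 at a time must yield C(4, 2) = 6 subsets.
public class SubsetSketch {
    static <T> Set<Set<T>> enumerateSubsets(List<T> v, int size) {
        Set<Set<T>> els = new HashSet<>();
        dfs(v, size, 0, els, new ArrayDeque<>());
        return els;
    }

    static <T> void dfs(List<T> list, int size, int begin, Set<Set<T>> res, Deque<T> path) {
        if (path.size() == size) {
            res.add(new HashSet<>(path)); // snapshot the current combination
            return;
        }
        for (int i = begin; i < list.size(); i++) {
            path.addLast(list.get(i));
            dfs(list, size, i + 1, res, path);
            path.removeLast(); // backtrack
        }
    }

    public static void main(String[] args) {
        System.out.println(enumerateSubsets(Arrays.asList("A", "B", "C", "D"), 2).size()); // 6
    }
}
```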

Checking the running time and the join order: the join order matches the earlier optimized result, so the algorithm is correct. The running time, however, did not improve by much.

Still, the overall benefit remains: it avoids creating masses of temporary objects. Complexity of the algorithm:

Time complexity: O(2^n)

Space complexity: O(n)

4. Lab Summary

Overall, lab3 is not that hard (personally I found it a bit easier than lab2), but it teaches a lot. Reading the handout at first left me completely lost, since many of the concepts were new to me: histogram statistics, cardinality, selectivity. Someone in the discussion group suggested reading the sailboat book (Database System Concepts), and working through it alongside the handout did make things much easier to understand. Going exercise by exercise, I had no overall picture at the start, and in particular could not make sense of the optimizer architecture diagram. After finishing the whole lab and going back to those database books, the big picture became clearer and I now understand why each piece exists, though it is still nowhere near what MySQL and other real databases do. I have not yet studied subquery optimization and will need to read more of the sailboat book for that. Estimating the cardinality of a join when neither side has a primary key is also left undone; the sailboat book handles it with heuristic algorithms, so that is another area to study. In short, there is an enormous amount to databases; I will try to learn as much as possible while doing the labs and accumulate knowledge gradually. Tomorrow I will read a bit more about query optimization and then start lab4, which is said to cover transactions, something I have never looked at in detail either. Hoping to get a lot out of it.

Lab dates: 2021.10.09-2021.10.14

Report written: 2021.10.15
