ES中BKD VS doc value_大数据系统

ES中BKD VS doc value

基于bkd的数字范围查询性能很好，但是由于BKD-Tree内的docId非有序，不能采用类似skipList的向后跳的方式，如果跟其他查询做交集，必须先构造BitSet，这一步可能非常耗时。Lucene中通过IndexOrDocValuesQuery对一些场景做了优化。

      @Override
      public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException {
        final ScorerSupplier indexScorerSupplier = indexWeight.scorerSupplier(context);
        final ScorerSupplier dvScorerSupplier = dvWeight.scorerSupplier(context);
        if (indexScorerSupplier == null || dvScorerSupplier == null) {
          return null;
        }
        return new ScorerSupplier() {
          @Override
          public Scorer get(long leadCost) throws IOException {
            // At equal costs, doc values tend to be worse than points since they
            // still need to perform one comparison per document while points can
            // do much better than that given how values are organized. So we give
            // an arbitrary 8x penalty to doc values.
            final long threshold = cost() >>> 3;
            if (threshold <= leadCost) {
              return indexScorerSupplier.get(leadCost);
            } else {
              return dvScorerSupplier.get(leadCost);
            }
          }

          @Override
          public long cost() {
            return indexScorerSupplier.cost();
          }
        };
      }

 
  public abstract long cost();

Something that is interesting to notice here is that this query planning optimization does not only depend on the fields that are used and their cardinalities, it goes further and estimates the total number of matches for each node of the query tree in order to make good decisions. This means that taking a query and slightly changing the range of values might completely change how the query is executed under the hood.

计算cost的方法

  private long computeCost() {
    OptionalLong minRequiredCost = Stream.concat(
        subs.get(Occur.MUST).stream(),
        subs.get(Occur.FILTER).stream())
        .mapToLong(ScorerSupplier::cost)
        .min();
    if (minRequiredCost.isPresent() && minShouldMatch == 0) {
      return minRequiredCost.getAsLong();
    } else {
      final Collection optionalScorers = subs.get(Occur.SHOULD);
      final long shouldCost = MinShouldMatchSumScorer.cost(
          optionalScorers.stream().mapToLong(ScorerSupplier::cost),
          optionalScorers.size(), minShouldMatch);
      return Math.min(minRequiredCost.orElse(Long.MAX_VALUE), shouldCost);
    }
  }

参考文档

1. 工作中组内遇到的 elasticsearch 使用上的踩坑总结 - AIQ

2. https://www.elastic.co/cn/blog/better-query-planning-for-range-queries-in-elasticsearch

3. [LUCENE-7055] Better execution path for costly queries - ASF JIRA

ES中BKD VS doc value

大数据系统相关栏目本月热门文章