记一次Java性能优化及分析

hugegraph core性能优化 - 提升插入速度35% 背景

HugeGraph是一个Java语言编写的图数据库，也是开源的项目，本文记录了一次HugeGraph早期优化性能的过程分析与结果。

HugeGraph之前使用Cassandra后端时，插入数据基本上90%时间花在写数据库操作上。随着后端存储RocksDB性能的不断优化（目前已达到 12w edges/s），HugeGraph在插入数据时core本身所占时间比重越来越大，因此core本身性能的优化又一次提上日程。

如何优化性能？大体步骤

找一个性能分析工具监控程序运行状态（这里用的是hugegraph本身性能工具watched）；
利用watched监控插入数据整个过程的状态，拿到各个方法的调用次数与时间等信息；
分析性能数据，找出时间比重大的方法，进行代码分析；
找到可改进的代码进行优化，重新运行并验证结果。

哪些地方容易成为瓶颈代码？

大循环中
同步加锁
频繁调用
申请较大内存操作
大集合的增加、删除、查找等操作
反射

哪些代码可以优化？

日志打印中构造的无用message
错误检查中构造的无用message
频繁申请过量内存而未充分使用
频繁被调用的非final|private小方法
频繁调用lambda函数
其它，如冗余代码、逻辑本身问题等

优化结果环境

Mac 10.11
Java 8
HugeGraph v0.5
RocksDB 5.8.6（uncommitted）
PerfExample3 thread=1 times=10000 multiple=1

PerfExample3中的性能分析基准代码如下（每次插入100个顶点，循环1万次）：

    protected void testInsert(GraphManager graph, int times, int multiple) {
        final int TIMES = times * multiple;
        final int BATCH = 100;
        long total = 0;
        // Insert in order
        for (int i = 0; i < TIMES; i++) {
            for (int j = 0; j < BATCH; j++) {
                String name = String.format("p-%08d", total++);
                Vertex v = graph.addVertex(T.label, "person", "name", name,
                                           "city", "Hongkong", "age", 18);
                this.vertices.add(v.id());
            }
            graph.tx().commit();
        }
    }

优化前后结果对比

	第1次	第2次	第3次	第4次	第5次	平均
优化前 (vertices/s)	44232	44772	43631	44849	45071	44511
优化后 (vertices/s)	59262	60771	59276	60125	60602	60007

优化后平均性能提升34.81%，具体优化代码见http://github.com/hugegraph/hugegraph/3991111/files

优化过程分析准备工作 1、埋点

首先使用性能工具watched对需要监控的方法进行埋点，比如想对addVertex方法进行监控，则使用watched工具中的@Watched注解对其进行装饰即可：

  @Watched(prefix = "graph")
  public Vertex addVertex(Object... keyValues) {
      ...
  }

2、开启监控模式

默认情况下watched是不进行监控的（本身影响性能），需要在程序启动时打开监控模式。监控哪些类或包都可以指定，下面我们对整个com.baidu.hugegraph包下的代码都进行监控：

import com.baidu.hugegraph.perf.PerfUtil;

public static void profile() {
    PerfUtil.instance().profilePackage("com.baidu.hugegraph");
}

3、收集监控结果

在程序结束时，需要收集并打印监控结果信息，结果可以通过json格式返回，也可以通过echarts环形辐射图格式返回，这里我们使用echarts格式：

LOG.info("option = {}", PerfUtil.instance().toECharts());

分析过程首次运行后的监控结果

从图中看出，占比最大的是graph.addVertex方法，占总时间的67.05%，被调用10w次共花去时间4045ms（分析过程中没有设置times=10000，而是设置times=1000，下同），剩下基本是tx.commit方法，占30.63%，被调用3000次共花去1848ms。

在tx.commit方法下的调用比较简单，先从它开始分析，从图中可以看出，它下面有4层调用分别如下：

tx.prepareCommit()：准备提交的内容（1000次 23.9%）；
bin.writeVertex()：系列化顶点对象（10w次 18.49%）；
bin.formatProperty()系列化顶点里面的属性（30w次 14.61%）；
bin.formatPropertyName()系列化顶点里面的属性名称（30w次 2.21%）；kryo.toKryo()：系列化顶点里面的属性值（30w次 8.3% 500ms）。

每一层调用几乎都要损失几个百分点的性能（本身损失加上多次循环调用下一层方法），经过分析前3层的代码没有找到什么可优化的地方，不过最外面一层的kryo.toKryo()很值得关注，其代码如下：

    public static byte[] toKryo(Object value) {
        try (OutputStream bos = new ByteArrayOutputStream();
             Output output = new Output(bos)) {
            kryo().writeObject(output , value);
            return output.toBytes();
        } catch (IOException e) {
            throw new BackendException("Failed to serialize: %s", e, value);
        }
    }

第1次优化

开始以为kryo系列化本身就是需要花去那么多时间，但是经过分析Output类发现，其内部会默认构造一个4k大小的buffer用于缓冲write数据，但是一个属性值大部分场景下数据量并没有那么多，大概几十个字节~上百字节，所以这是一个颇浪费的操作，于是改为分配256字节再试了一下Output output = new Output(bos, 256)，重新运行并获得如下监控结果：

从图中可以看出kryo.toKryo()占比减少到3.77%，时间减少到228ms，均较低一半多。

第2次优化

继续分析，我们将目光转移到占比最大的graph.addVertex方法，其外层方法如下：

vertex.property：设置顶点属性（30w次 37.41%）；
element.addProperty：调用父类增加一个属性（30w次 25.09% 1516ms）；
vertex.onUpdateProperty：属性更新事件（30w次 8.79% 531ms）。

其中vertex.onUpdateProperty是外层占比较大的方法，因为这个方法里面最重点的操作setProperty仅仅占了1.92%，没有干其它的事情却损失了4.79个百分点，值得分析。其代码如下：

    public  HugeProperty addProperty(PropertyKey pkey, V value,
                                           boolean notify) {
        HugeProperty prop = null;
        switch (pkey.cardinality()) {
            case SINGLE:
                prop = this.newProperty(pkey, value);
                if (notify) {
                    
                    this.onUpdateProperty(pkey.cardinality(), prop);
                }
                this.setProperty(prop);
                break;
            ...
        }
        ...
    }

    protected  void onUpdateProperty(Cardinality cardinality,
                                        HugeProperty prop) {
        if (prop != null) {
            assert prop instanceof HugeVertexProperty;
            // Use addVertexProperty() to update
            this.tx().addVertexProperty((HugeVertexProperty) prop);
        }
    }

    public  void addVertexProperty(HugeVertexProperty prop) {
        // NOTE: this method can also be used to update property

        HugeVertex vertex = prop.element();
        E.checkState(vertex != null,
                     "No owner for updating property '%s'", prop.key());

        // Add property in memory for new created vertex
        if (vertex.fresh()) {
            // The owner will do property update
            vertex.setProperty(prop);
            return;
        }
        ...
    }

从代码中发现，调用addVertexProperty方法后又调回了vertex.setProperty方法（在创建顶点场景时），这属于冗余的调用，逻辑上只需要在addProperty方法中调用vertex.setProperty方法即可。也就是说这里可以预先判断是否为创建顶点场景，若是则无需冗余调用，只有更新顶点属性时才调用tx.addVertexProperty方法，于是修改代码如下：

    protected  void onUpdateProperty(Cardinality cardinality,
                                        HugeProperty prop) {
        if (prop != null && !this.fresh()) { // 这里增加了对非新创建顶点的判断
            assert prop instanceof HugeVertexProperty;
            // Use addVertexProperty() to update
            this.tx().addVertexProperty((HugeVertexProperty) prop);
        }
    }

重新运行并获得如下监控结果：

从图中可看到vertex.onUpdateProperty方法占比降为1.46%，具体如下：

vertex.property：设置顶点属性（30w次 37.41% -> 33.77%）；
element.addProperty：调用父类增加一个属性（30w次 25.09% -> 14.19%，1516ms -> 772ms）；
vertex.onUpdateProperty：属性更新事件（30w次 8.79% -> 1.46%，531ms -> 79ms）。

第3次优化

继续分析，我们将目光转移到vertex.onUpdateProperty方法旁边的vertex.newProperty方法，该方法仅创建了一个属性对象就占比4.48%用时244ms，值得分析。

代码如下：

    protected  HugeVertexProperty newProperty(PropertyKey pkey, V val) {
        return new HugeVertexProperty<>(this, pkey, val);
    }

    public HugeVertexProperty(HugeElement owner, PropertyKey key, V value) {
        super(owner, key, value);
    }

    public HugeProperty(HugeElement owner, PropertyKey pkey, V value) {
        E.checkArgument(owner != null, "Property owner can't be null");
        E.checkArgument(pkey != null, "Property key can't be null");
        E.checkArgument(value != null, "Property value can't be null");

        this.owner = owner;
        this.pkey = pkey;
        this.value = pkey.validValue(value);

        E.checkArgument(this.value != null,
                        "Invalid property value '%s' for key '%s', " +
                        "expect a value of type %s, actual type %s",
                        value, pkey.name(),
                        pkey.clazz().getSimpleName(),
                        value.getClass().getSimpleName());
    }

方法vertex.newProperty最终调用HugeProperty构造函数，这里面并没有太复杂的内容，仅仅是4次参数检查和3次赋值操作。一开始分析了validValue()方法占用的时间，发现其实并不多。然后试着把最后一段检查去掉，发现时间占比立马降下来了，也就是说E.checkArgument()那一段用时不少。经过详细分析发现，通过反射方式调用Class的getSimpleName()方法非常耗时。而在大部分场景下，参数V value的类型是正确的，这是也就无需构造那些错误信息，但这里浪费了（即使30w次正确的调用，也准备好了用于错误信息的类名字符串）。于是将代码改正如下：

    public HugeProperty(HugeElement owner, PropertyKey pkey, V value) {
        E.checkArgument(owner != null, "Property owner can't be null");
        E.checkArgument(pkey != null, "Property key can't be null");
        E.checkArgument(value != null, "Property value can't be null");

        this.owner = owner;
        this.pkey = pkey;
        this.value = pkey.validValue(value);

        if (this.value == null) { // 只有当发生错误时才进行错误信息的准备并抛出异常
            E.checkArgument(false,
                            "Invalid property value '%s' for key '%s', " +
                            "expect a value of type %s, actual type %s",
                            value, pkey.name(),
                            pkey.clazz().getSimpleName(),
                            value.getClass().getSimpleName());
        }
    }

重新运行并获得如下监控结果：

从图中可看到vertex.newProperty方法占比降为1.69%，具体如下：

vertex.property：设置顶点属性（30w次 33.77% -> 25.27%）；
element.addProperty：调用父类增加一个属性（30w次 14.19% -> 12.1%，772ms -> 644ms）；
vertex.onUpdateProperty：属性更新事件（30w次 1.46% -> 1.43%，79ms -> 76ms）;
vertex.newProperty：新建顶点属性方法（30w次 4.48% -> 1.69%，244ms -> 90ms）。

其它小优化

其它一些小优化影响不是那么明显，这里就不一一分析了。具体优化代码见http://github.com/hugegraph/hugegraph/3991111/files

总结

这些代码优化本身并没有太困难的地方，而且每一次优化对整体的性能影响并不是很明显，不过多个被频繁调用的小优化累加在一起，还是能达到预想不到的效果：最终提升性能35%。core代码已经过多次优化了，不太容易找到提升几倍性能的优化点，提升35%也是不错的。

本文的重点更是在于梳理出：如何分析并找到性能瓶颈。至于如何优化则是千变万化的了。

记一次Java性能优化及分析

Java相关栏目本月热门文章