
Flink 1.14.4 standalone issues


Official configuration reference: Configuration | Apache Flink

1. The TaskManager (TM) process stops after running for a while

报错信息:org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Task did not exit gracefully within 180 + seconds.
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
        at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1791) [flink-dist_2.11-1.14.4.jar:1.14.4]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]

Cause: task cancellation timed out.

Fix: in the TM config file ${FLINK_HOME}/conf/flink-conf.yaml:

# disable the task-cancellation watchdog
task.cancellation.timeout: 0

Parameter description (from the official docs): Timeout in milliseconds after which a task cancellation times out and leads to a fatal TaskManager error. A value of 0 deactivates the watch dog. Notice that a task cancellation is different from both a task failure and a clean shutdown. Task cancellation timeout only applies to task cancellation and does not apply to task closing/clean-up caused by a task failure or a clean shutdown.
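Note that with the watchdog disabled, a task stuck in user code during cancel will never trigger the fatal-error path, so the slot can stay occupied indefinitely. A less drastic alternative, sketched below, is to keep the watchdog but give cancellation more headroom; the 10-minute value is an assumption to tune per job:

```yaml
# flink-conf.yaml — keep the watchdog, but allow cancellation up to 10 minutes
task.cancellation.timeout: 600000
```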

2. JARs uploaded through the web UI are all lost after the standalone cluster restarts

Cause: by default the files are saved under /tmp, which gets cleaned up.

Fix: in the JM config file ${FLINK_HOME}/conf/flink-conf.yaml:

web.upload.dir: /usr/local/flink/upload
web.tmpdir: /usr/local/flink/tmpdir
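The directories must exist and be writable by the Flink user before the JobManager restarts. A minimal sketch (the base path defaults to a scratch directory here for illustration; point FLINK_DATA at /usr/local/flink in production):

```shell
# Create the persistent upload/tmp directories from the config above.
# FLINK_DATA is an assumed variable, not something Flink reads itself.
FLINK_DATA=${FLINK_DATA:-$PWD/flink-data}
mkdir -p "$FLINK_DATA/upload" "$FLINK_DATA/tmpdir"
# List them so a missing/unwritable path fails loudly here, not at JM startup.
ls -d "$FLINK_DATA/upload" "$FLINK_DATA/tmpdir"
```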

3. stop-cluster.sh cannot stop the standalone cluster

Cause: the PID files are saved under /tmp by default; once /tmp is cleaned up, the script cannot find the PIDs and so cannot kill the processes.

Fix: in the JM config file ${FLINK_HOME}/conf/flink-conf.yaml:

env.pid.dir: /usr/local/flink/piddir
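A sketch of why the failure happens, assuming Flink 1.14's naming scheme (config.sh resolves env.pid.dir into FLINK_PID_DIR, default /tmp, and the JM pid file is named flink-<user>-standalonesession.pid):

```shell
# Reproduce the stop script's pid lookup to see whether it can find anything.
FLINK_PID_DIR=${FLINK_PID_DIR:-/tmp}
pid_file="$FLINK_PID_DIR/flink-$(whoami)-standalonesession.pid"
if [ -f "$pid_file" ]; then
  echo "stop-cluster.sh would kill: $(cat "$pid_file")"
else
  echo "no pid file at $pid_file -> stop-cluster.sh has nothing to stop"
fi
```

If the pid file is gone, the fallback is to find and kill the JVMs by hand (e.g. via jps) and then restart with env.pid.dir pointing at a durable directory.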

4. A value stored in ZooKeeper was too long; the ZooKeeper ensemble went down, taking all TMs down with it. ZooKeeper error:

Unexpected exception causing shutdown while sock still open
java.io.IOException: Unreasonable length = 1970218037

    at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:95)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:249)


A related Q&A exchange on the same error:

"Zookeeper server went down in HA cluster. Please reply if there is any workaround."

"You can attempt to increase your jute.maxbuffer Java System Property on the ZK servers to a value higher than 2-3 GB (in bytes) to overcome this. It appears a very large record was somehow placed into your ZK by an application, which appears to have then caused this issue."

Fix: raise ZooKeeper's jute.maxbuffer system property to an appropriate length; it should be set consistently on all ZK servers (and generally on the clients as well).
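A minimal sketch, assuming the standard zookeeper-env.sh mechanism on every ZK server; the 4 MB value is an assumption — size it to your largest expected znode, not to gigabytes. The Flink JVMs (ZK clients) may need the same property passed via env.java.opts:

```shell
# conf/zookeeper-env.sh on every ZK server — raise jute.maxbuffer (bytes).
# 4194304 (4 MB) is an assumed value; the stock default is about 1 MB.
export SERVER_JVMFLAGS="-Djute.maxbuffer=4194304 ${SERVER_JVMFLAGS:-}"
```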

5. java.lang.OutOfMemoryError: Metaspace. Full error:

java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
        at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_291]
        at java.lang.ClassLoader.defineClass(ClassLoader.java:756) ~[?:1.8.0_291]
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) ~[?:1.8.0_291]
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[?:1.8.0_291]
        at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[?:1.8.0_291]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[?:1.8.0_291]
        at java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[?:1.8.0_291]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_291]
        at java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[?:1.8.0_291]

Cause: the exact root cause has not been found yet; still under observation. Online reports point to two possibilities: blocking user code, or backpressure. (The error message above also names the two usual suspects: an undersized metaspace, or a class-loading leak that shows up after several job (re-)submissions.)

Short-term fix: in the TM config file ${FLINK_HOME}/conf/flink-conf.yaml, raise the metaspace size (default 256m):

taskmanager.memory.jvm-metaspace.size: 512m
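If the same error ever appears on the JobManager side, there is an analogous option; showing both here as a config sketch (the 512m values are assumptions to tune against your workload):

```yaml
# flink-conf.yaml — raise metaspace from the 256m default
taskmanager.memory.jvm-metaspace.size: 512m
# JM-side counterpart, in case the JobManager hits the same OOM
jobmanager.memory.jvm-metaspace.size: 512m
```

Note this only buys headroom; if the OOM recurs after repeated job (re-)submissions, the message above suggests investigating a class-loading leak in user code or its dependencies rather than raising the limit further.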

Reprint notice: please credit the source, www.mshxw.com.
Original URL: https://www.mshxw.com/it/879989.html
Copyright (c) 2021-2022 MSHXW.COM
