Cromwell supports not only local job scheduling but also cluster and cloud job-management systems; a little configuration is enough to run computation at scale.
As with the Cromwell command-line configuration discussed earlier, running Cromwell on a cluster also requires a configuration file. The project provides example configuration files for the common cluster/cloud job-management systems (https://github.com/broadinstitute/cromwell/tree/develop/cromwell.example.backends); in essence they all embed the scheduler's own submission commands into the backend configuration. Below we walk through two widely used setups, SGE and Docker.
The SGE example is not a complete configuration file: it must be added to the backend section of https://github.com/broadinstitute/cromwell/blob/develop/cromwell.example.backends/cromwell.examples.conf
# This is an example of how you can use the Sun Grid Engine backend
# for Cromwell. *This is not a complete configuration file!* The
# content here should be copy pasted into the backend -> providers section
# of cromwell.example.backends/cromwell.examples.conf in the root of the repository.
# You should uncomment lines that you want to define, and read carefully to customize
# the file. If you have any questions, please open an issue at
# https://www.github.com/broadinstitute/cromwell/issues
# documentation:
# https://cromwell.readthedocs.io/en/stable/backends/SGE
backend {
  # Select the default provider; the name must match one defined under providers
  default = SGE
  providers {
    # The SGE provider
    SGE {
      # Every scheduler backend builds on this generic actor factory
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      # The backend-specific configuration
      config {
        # Limits the number of concurrent jobs; mainly relevant when Cromwell runs in server mode
        concurrent-job-limit = 5

        # If an 'exit-code-timeout-seconds' value is specified:
        # - check-alive will be run at this interval for every job
        # - if a job is found to be not alive, and no RC file appears after this interval
        # - Then it will be marked as Failed.
        # Warning: If set, Cromwell will run 'check-alive' for every job at this interval
        # How often (in seconds) check-alive polls each job's status; 120 s here
        exit-code-timeout-seconds = 120

        # Runtime attributes: these must match the `runtime` block of your tasks/workflows
        # and the variables interpolated into the submit command below.
        # This also means you can edit this list to expose whatever runtime attributes your tasks need.
        runtime-attributes = """
        Int cpu = 1
        Float? memory_gb
        String? sge_queue
        String? sge_project
        """

        # submit/kill/check-alive wrap the scheduler's own commands, with Cromwell variables
        # interpolated. job_name, cwd, out, err, script and job_id are Cromwell built-ins;
        # any other variable must first be declared in runtime-attributes.
        submit = """
        qsub \
        -terse \
        -V \
        -b y \
        -N ${job_name} \
        -wd ${cwd} \
        -o ${out}.qsub \
        -e ${err}.qsub \
        -pe smp ${cpu} \
        ${"-l mem_free=" + memory_gb + "g"} \
        ${"-q " + sge_queue} \
        ${"-P " + sge_project} \
        /usr/bin/env bash ${script}
        """
        kill = "qdel ${job_id}"
        check-alive = "qstat -j ${job_id}"
        job-id-regex = "(\\d+)"
      }
    }
  }
}
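A task that targets this backend declares the matching attributes in its runtime block. A minimal sketch; the queue and project names (all.q, my_project) are hypothetical placeholders, not values from the configuration above:

```wdl
version 1.0

task CountLines {
  input {
    File infile
  }
  command <<<
    wc -l < ~{infile}
  >>>
  output {
    Int n = read_int(stdout())
  }
  runtime {
    cpu: 2                     # matches "Int cpu" in runtime-attributes
    memory_gb: 4.0             # matches "Float? memory_gb"
    sge_queue: "all.q"         # hypothetical queue name
    sge_project: "my_project"  # hypothetical project name
  }
}
```

Attributes that a task leaves out fall back to the declared defaults (cpu = 1) or simply stay unset if optional, in which case the corresponding ${...} fragment is dropped from the qsub command.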
Docker
The Docker example is likewise a fragment for the backend -> providers section. The working directory inside the container, docker_cwd, is derived from dockerRoot, which defaults to /cromwell-executions.
backend {
  default = Docker
  providers {
    # Example backend that _only_ runs workflows that specify docker for every command.
    Docker {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        run-in-background = true
        runtime-attributes = "String docker"
        # The embedded docker run command.
        # docker_cwd is derived from dockerRoot (default /cromwell-executions) and maps to
        # ./cromwell-executions under the current directory (${cwd})
        submit-docker = "docker run --rm -v ${cwd}:${docker_cwd} -i ${docker} /bin/bash < ${docker_script}"
      }
    }
  }
}
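With this backend every task must set docker in its runtime block, since that is the only declared runtime attribute. A minimal sketch; the image tag ubuntu:20.04 is just an example, any image pullable on the execution node works:

```wdl
version 1.0

task SayHello {
  command <<<
    echo "hello from the container"
  >>>
  output {
    String msg = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:20.04"  # example image; required by this backend
  }
}
```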
SGE + docker
SGE + Docker may be the most common combination in bioinformatics pipelines today, but the official examples cover SGE and Docker only separately; there is no SGE + Docker configuration. I put together the complete configuration below, which can be used directly on the command line.
# cromwell.sge.docker.config
# A complete configuration file
include required(classpath("application"))
backend {
  default = SGE_Docker
  providers {
    SGE_Docker {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # Limits the number of concurrent jobs
        concurrent-job-limit = 500

        # If an 'exit-code-timeout-seconds' value is specified:
        # - check-alive will be run at this interval for every job
        # - if a job is found to be not alive, and no RC file appears after this interval
        # - Then it will be marked as Failed.
        # Warning: If set, Cromwell will run 'check-alive' for every job at this interval
        # exit-code-timeout-seconds = 120

        # `script-epilogue` configures a shell command to run after the execution of every command block.
        #
        # If this value is not set explicitly, the default value is `sync`, equivalent to:
        # script-epilogue = "sync"
        #
        # To turn off the default `sync` behavior set this value to an empty string:
        # script-epilogue = ""
        script-epilogue = "sync && sleep 8"

        # Runtime attributes: these must match the `runtime` block of your tasks/workflows
        # and the variables interpolated into the submit-docker command below.
        # This also means you can edit this list to expose whatever runtime attributes your tasks need.
        runtime-attributes = """
        String docker
        String? root = "/"
        Int? cpu = 1
        Int? memory_gb = 2
        String? sge_queue
        """

        # submit-docker/kill/check-alive wrap the scheduler's own commands, with Cromwell
        # variables interpolated. job_name, cwd, out, err, script and job_id are Cromwell
        # built-ins; any other variable must first be declared in runtime-attributes.
        submit-docker = """
        qsub \
        -terse \
        -V \
        -b y \
        -N ${job_name} \
        -wd ${cwd} \
        -o ${out}.qsub \
        -e ${err}.qsub \
        -l vf=${memory_gb}G \
        ${"-pe smp " + cpu} \
        ${"-q " + sge_queue} \
        docker run --rm --user $(id -u):$(id -g) -a STDERR -v ${root}:${root} ${docker} /usr/bin/env bash ${docker_script}
        """
        # By default the Docker backend mounts -v ${cwd}:${docker_cwd}, where docker_cwd is
        # derived from dockerRoot (default /cromwell-executions) and corresponds to
        # ./cromwell-executions under the current directory (${cwd}), e.g.
        # -v /your/current_work_path/cromwell-executions/TestHelloWorld/3888ab6f-4dcf-4d03-8d84-17ee5623c2bb/call-HelloWorld/shard-0:/cromwell-executions/TestHelloWorld/3888ab6f-4dcf-4d03-8d84-17ee5623c2bb/call-HelloWorld/shard-0
        # The trouble is that an output file given as an absolute path then resolves
        # differently inside and outside the container, which causes confusion. I therefore
        # drop -v ${cwd}:${docker_cwd} and introduce a root attribute that mounts the same
        # path inside and outside the container, so files can be read and written
        # consistently. Note that the higher root sits in the directory tree (/ is the
        # simplest choice), the broader the permissions and the higher the risk.
        kill = "qdel ${job_id}"
        check-alive = "qstat -j ${job_id}"
        job-id-regex = "(\\d+)"
        kill-docker = "qdel ${job_id}"
        check-alive-docker = "qstat -j ${job_id}"
        job-id-regex-docker = "(\\d+)"
      }
    }
  }
}
docker.hash-lookup.enabled = false
Pass the file cromwell.sge.docker.config above on the command line and Cromwell will schedule jobs through SGE + Docker:
java -Dconfig.file=cromwell.sge.docker.config -jar cromwell.jar run ...
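A task running under this backend combines a container image with the SGE resource attributes, reusing the TestHelloWorld/HelloWorld naming from the mount-path example above. A minimal sketch; the image tag and queue name are hypothetical placeholders:

```wdl
version 1.0

workflow TestHelloWorld {
  call HelloWorld
}

task HelloWorld {
  command <<<
    echo "Hello World"
  >>>
  output {
    String out = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:20.04"  # example image; required by this backend
    cpu: 2                  # passed to qsub as -pe smp 2
    memory_gb: 4            # passed to qsub as -l vf=4G
    sge_queue: "all.q"      # hypothetical queue name
    # root defaults to "/" as declared in runtime-attributes
  }
}
```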
Other scheduler configuration files, including cloud providers, are listed below. Note: if the compute node does not already have the required Docker image, Cromwell can pull it automatically; if it is not on Docker Hub either, you need to build it yourself with docker build.
- AWS(AWS.conf): Amazon Web Services (https://cromwell.readthedocs.io/en/stable/tutorials/AwsBatch101/)
- BCS(BCS.conf): Alibaba Cloud Batch Compute (BCS) backend (https://cromwell.readthedocs.io/en/stable/backends/BCS/)
- TES(TES.conf): a backend that submits jobs to a server with the protocol defined by GA4GH (https://cromwell.readthedocs.io/en/stable/backends/TES/)
- PAPIv2(PAPIv2.conf): Google Pipelines API backend (version 2!) (https://cromwell.readthedocs.io/en/stable/backends/Google/)
- Docker(Docker.conf): an example backend that only runs workflows with docker in every command
- Singularity(singularity.conf): run Singularity containers locally (documentation)
- Singularity+Slurm(singularity.slurm.conf): An example using Singularity with SLURM (documentation)
- TESK(TESK.conf) is the same, but intended for Kubernetes. See the TES docs at the bottom.
- udocker(udocker.conf): to interact with udocker locally documentation
- udocker+Slurm(udocker.slurm.conf): to interact with udocker on SLURM (documentation)
- HtCondor(HtCondor.conf): a workload manager at UW-Madison (https://cromwell.readthedocs.io/en/stable/backends/HTcondor/)
- LSF(LSF.conf): the Platform Load Sharing Facility backend (https://cromwell.readthedocs.io/en/stable/backends/LSF/)
- SGE(SGE.conf): a backend for Sungrid Engine (https://cromwell.readthedocs.io/en/stable/backends/SGE)
- slurm(slurm.conf): SLURM workload manager (https://cromwell.readthedocs.io/en/stable/backends/SLURM/)
- LocalExample: What you should use if you want to define a new backend provider (documentation)



