Cromwell supports not only local job scheduling but also cluster and cloud job-management systems; a little configuration is enough to run computation at scale.
As with the Cromwell command-line configuration discussed earlier, running Cromwell on a cluster also requires a configuration file. The project provides example configuration files for the common cluster/cloud job-management systems (https://github.com/broadinstitute/cromwell/tree/develop/cromwell.example.backends); in essence they all embed the scheduler's own submission commands into the backend configuration. Below we walk through two widely used setups, SGE and Docker.
The SGE example is not a complete configuration file: it must be added to the backend section of https://github.com/broadinstitute/cromwell/blob/develop/cromwell.example.backends/cromwell.examples.conf
# This is an example of how you can use the Sun Grid Engine backend
# for Cromwell. *This is not a complete configuration file!* The
# content here should be copy pasted into the backend -> providers section
# of cromwell.example.backends/cromwell.examples.conf in the root of the repository.
# You should uncomment lines that you want to define, and read carefully to customize
# the file. If you have any questions, please open an issue at
# https://www.github.com/broadinstitute/cromwell/issues
# documentation:
# https://cromwell.readthedocs.io/en/stable/backends/SGE
backend {
  # Select the default provider; the name must match one defined under providers
  default = SGE
  providers {
    # The SGE provider
    SGE {
      # Every scheduler backend builds on this generic actor factory
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      # The backend-specific configuration
      config {
        # Limits the number of concurrent jobs; mainly relevant when Cromwell runs in server mode
        concurrent-job-limit = 5

        # If an 'exit-code-timeout-seconds' value is specified:
        # - check-alive will be run at this interval for every job
        # - if a job is found to be not alive, and no RC file appears after this interval
        # - Then it will be marked as Failed.
        # Warning: If set, Cromwell will run 'check-alive' for every job at this interval
        # How often (in seconds) check-alive polls each job's status; 120 s here
        exit-code-timeout-seconds = 120

        # Runtime attributes: these must match the `runtime` block of your tasks/workflows
        # and the variables interpolated into the submit command below.
        # This also means you can edit this list to expose whatever runtime attributes your tasks need.
        runtime-attributes = """
        Int cpu = 1
        Float? memory_gb
        String? sge_queue
        String? sge_project
        """

        # submit/kill/check-alive wrap the scheduler's own commands, with Cromwell variables
        # interpolated. job_name, cwd, out, err, script and job_id are Cromwell built-ins;
        # any other variable must first be declared in runtime-attributes.
        submit = """
        qsub \
        -terse \
        -V \
        -b y \
        -N ${job_name} \
        -wd ${cwd} \
        -o ${out}.qsub \
        -e ${err}.qsub \
        -pe smp ${cpu} \
        ${"-l mem_free=" + memory_gb + "g"} \
        ${"-q " + sge_queue} \
        ${"-P " + sge_project} \
        /usr/bin/env bash ${script}
        """
        kill = "qdel ${job_id}"
        check-alive = "qstat -j ${job_id}"
        job-id-regex = "(\\d+)"
      }
    }
  }
}
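A task that targets this backend declares the matching attributes in its runtime block. A minimal sketch; the queue and project names (all.q, my_project) are hypothetical placeholders, not values from the configuration above:

```wdl
version 1.0

task CountLines {
  input {
    File infile
  }
  command <<<
    wc -l < ~{infile}
  >>>
  output {
    Int n = read_int(stdout())
  }
  runtime {
    cpu: 2                     # matches "Int cpu" in runtime-attributes
    memory_gb: 4.0             # matches "Float? memory_gb"
    sge_queue: "all.q"         # hypothetical queue name
    sge_project: "my_project"  # hypothetical project name
  }
}
```

Attributes that a task leaves out fall back to the declared defaults (cpu = 1) or simply stay unset if optional, in which case the corresponding ${...} fragment is dropped from the qsub command.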
Docker
The Docker example is likewise a fragment for the backend -> providers section. The working directory inside the container, docker_cwd, is derived from dockerRoot, which defaults to /cromwell-executions.
backend {
  default = Docker
  providers {
    # Example backend that _only_ runs workflows that specify docker for every command.
    Docker {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        run-in-background = true
        runtime-attributes = "String docker"
        # The embedded docker run command.
        # docker_cwd is derived from dockerRoot (default /cromwell-executions) and maps to
        # ./cromwell-executions under the current directory (${cwd})
        submit-docker = "docker run --rm -v ${cwd}:${docker_cwd} -i ${docker} /bin/bash < ${docker_script}"
      }
    }
  }
}
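With this backend every task must set docker in its runtime block, since that is the only declared runtime attribute. A minimal sketch; the image tag ubuntu:20.04 is just an example, any image pullable on the execution node works:

```wdl
version 1.0

task SayHello {
  command <<<
    echo "hello from the container"
  >>>
  output {
    String msg = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:20.04"  # example image; required by this backend
  }
}
```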
SGE + docker
SGE + Docker may be the most common combination in bioinformatics pipelines today, but the official examples cover SGE and Docker only separately; there is no SGE + Docker configuration. I put together the complete configuration below, which can be used directly on the command line.
# cromwell.sge.docker.config
# A complete configuration file
include required(classpath("application"))
backend {
  default = SGE_Docker
  providers {
    SGE_Docker {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # Limits the number of concurrent jobs
        concurrent-job-limit = 500

        # If an 'exit-code-timeout-seconds' value is specified:
        # - check-alive will be run at this interval for every job
        # - if a job is found to be not alive, and no RC file appears after this interval
        # - Then it will be marked as Failed.
        # Warning: If set, Cromwell will run 'check-alive' for every job at this interval
        # exit-code-timeout-seconds = 120

        # `script-epilogue` configures a shell command to run after the execution of every command block.
        #
        # If this value is not set explicitly, the default value is `sync`, equivalent to:
        # script-epilogue = "sync"
        #
        # To turn off the default `sync` behavior set this value to an empty string:
        # script-epilogue = ""
        script-epilogue = "sync && sleep 8"

        # Runtime attributes: these must match the `runtime` block of your tasks/workflows
        # and the variables interpolated into the submit-docker command below.
        # This also means you can edit this list to expose whatever runtime attributes your tasks need.
        runtime-attributes = """
        String docker
        String? root = "/"
        Int? cpu = 1
        Int? memory_gb = 2
        String? sge_queue
        """

        # submit-docker/kill/check-alive wrap the scheduler's own commands, with Cromwell
        # variables interpolated. job_name, cwd, out, err, script and job_id are Cromwell
        # built-ins; any other variable must first be declared in runtime-attributes.
        submit-docker = """
        qsub \
        -terse \
        -V \
        -b y \
        -N ${job_name} \
        -wd ${cwd} \
        -o ${out}.qsub \
        -e ${err}.qsub \
        -l vf=${memory_gb}G \
        ${"-pe smp " + cpu} \
        ${"-q " + sge_queue} \
        docker run --rm --user $(id -u):$(id -g) -a STDERR -v ${root}:${root} ${docker} /usr/bin/env bash ${docker_script}
        """
        # By default the Docker backend mounts -v ${cwd}:${docker_cwd}, where docker_cwd is
        # derived from dockerRoot (default /cromwell-executions) and corresponds to
        # ./cromwell-executions under the current directory (${cwd}), e.g.
        # -v /your/current_work_path/cromwell-executions/TestHelloWorld/3888ab6f-4dcf-4d03-8d84-17ee5623c2bb/call-HelloWorld/shard-0:/cromwell-executions/TestHelloWorld/3888ab6f-4dcf-4d03-8d84-17ee5623c2bb/call-HelloWorld/shard-0
        # The trouble is that an output file given as an absolute path then resolves
        # differently inside and outside the container, which causes confusion. I therefore
        # drop -v ${cwd}:${docker_cwd} and introduce a root attribute that mounts the same
        # path inside and outside the container, so files can be read and written
        # consistently. Note that the higher root sits in the directory tree (/ is the
        # simplest choice), the broader the permissions and the higher the risk.
        kill = "qdel ${job_id}"
        check-alive = "qstat -j ${job_id}"
        job-id-regex = "(\\d+)"
        kill-docker = "qdel ${job_id}"
        check-alive-docker = "qstat -j ${job_id}"
        job-id-regex-docker = "(\\d+)"
      }
    }
  }
}
docker.hash-lookup.enabled = false
Pass the file cromwell.sge.docker.config above on the command line and Cromwell will schedule jobs through SGE + Docker:
java -Dconfig.file=cromwell.sge.docker.config -jar cromwell.jar run ...
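A task running under this backend combines a container image with the SGE resource attributes, reusing the TestHelloWorld/HelloWorld naming from the mount-path example above. A minimal sketch; the image tag and queue name are hypothetical placeholders:

```wdl
version 1.0

workflow TestHelloWorld {
  call HelloWorld
}

task HelloWorld {
  command <<<
    echo "Hello World"
  >>>
  output {
    String out = read_string(stdout())
  }
  runtime {
    docker: "ubuntu:20.04"  # example image; required by this backend
    cpu: 2                  # passed to qsub as -pe smp 2
    memory_gb: 4            # passed to qsub as -l vf=4G
    sge_queue: "all.q"      # hypothetical queue name
    # root defaults to "/" as declared in runtime-attributes
  }
}
```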
Other scheduler configuration files, including cloud providers, are listed below. Note: if the compute node does not already have the required Docker image, Cromwell can pull it automatically; if it is not on Docker Hub either, you need to build it yourself with docker build.
- AWS(AWS.conf): Amazon Web Services (https://cromwell.readthedocs.io/en/stable/tutorials/AwsBatch101/)
- BCS(BCS.conf): Alibaba Cloud Batch Compute (BCS) backend (https://cromwell.readthedocs.io/en/stable/backends/BCS/)
- TES(TES.conf): a backend that submits jobs to a server with the protocol defined by GA4GH (https://cromwell.readthedocs.io/en/stable/backends/TES/)
- PAPIv2(PAPIv2.conf): Google Pipelines API backend (version 2!) (https://cromwell.readthedocs.io/en/stable/backends/Google/)
- Docker(Docker.conf): an example backend that only runs workflows with docker in every command
- Singularity(singularity.conf): run Singularity containers locally (documentation)
- Singularity+Slurm(singularity.slurm.conf): An example using Singularity with SLURM (documentation)
- TESK(TESK.conf) is the same, but intended for Kubernetes. See the TES docs at the bottom.
- udocker(udocker.conf): to interact with udocker locally documentation
- udocker+Slurm(udocker.slurm.conf): to interact with udocker on SLURM (documentation)
- HtCondor(HtCondor.conf): a workload manager at UW-Madison (https://cromwell.readthedocs.io/en/stable/backends/HTcondor/)
- LSF(LSF.conf): the Platform Load Sharing Facility backend (https://cromwell.readthedocs.io/en/stable/backends/LSF/)
- SGE(SGE.conf): a backend for Sungrid Engine (https://cromwell.readthedocs.io/en/stable/backends/SGE)
- slurm(slurm.conf): SLURM workload manager (https://cromwell.readthedocs.io/en/stable/backends/SLURM/)
- LocalExample: What you should use if you want to define a new backend provider (documentation)



