I tried running the distributed MNIST example from the tf-operator examples on Kubeflow. The official GitHub instructions are fairly terse, so here is a detailed record of the process.
URL:https://github.com/kubeflow/tf-operator/tree/master/examples/tensorflow/distribution_strategy/keras-API
Workflow
- Download the code to the server
- Write the training code and create a Dockerfile
FROM tensorflow/tensorflow:2.1.0-gpu-py3
RUN pip install tensorflow_datasets==2.1.0
# For COPY, the first path is outside the container, the second is inside;
# the working directory inside the container must be under this path
COPY multi_worker_strategy-with-keras.py /
# Run the Python script on container start
ENTRYPOINT ["python", "/multi_worker_strategy-with-keras.py", "--saved_model_dir", "/train/saved_model/", "--checkpoint_dir", "/train/checkpoint"]
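The entrypoint passes `--saved_model_dir` and `--checkpoint_dir`, which land on the shared `/train` volume. In multi-worker training all workers must call the save APIs, but only the chief (worker 0, by convention) should write to the final directory; the others should write to throwaway paths. A minimal sketch of that decision, using only the standard library and assuming the `TF_CONFIG` layout that tf-operator injects; the helper names `is_chief` and `write_filepath` are my own, not from the official example:

```python
import json
import os


def is_chief():
    """Return True when this worker should write the final artifacts.

    tf-operator injects a TF_CONFIG env var into every worker pod, e.g.
    {"cluster": {"worker": [...]}, "task": {"type": "worker", "index": 0}}.
    By convention worker 0 acts as the chief.
    """
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    # With no TF_CONFIG (single-machine run) treat the process as chief.
    return task.get("type", "worker") == "worker" and task.get("index", 0) == 0


def write_filepath(base_dir, task_index):
    """Non-chief workers write to a throwaway subdirectory so they do not
    clobber the chief's files on the shared /train volume."""
    if is_chief():
        return base_dir
    return os.path.join(base_dir, f"workertemp_{task_index}")
```

With this, the script can pass `write_filepath(args.checkpoint_dir, task_index)` wherever it saves, and clean up the `workertemp_*` directories afterwards.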
- Build the image
docker build -f Dockerfile -t kubeflow/multi_worker_strategy:v1.0 .
- List the images
docker images
- Create a PV (optional)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-pv
  labels:
    app: nfs
spec:
  storageClassName: nfs  # an NFS server must be set up beforehand; tutorials are easy to find online
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /nfs/kubeflow  # exported NFS path
    server: master  # NFS server IP, or a hostname defined in /etc/hosts
- Create a PVC (optional)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  labels:
    app: test-pvc
spec:
  storageClassName: nfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi
  selector:
    matchLabels:
      app: nfs  # matches the labels on the PV above
- Check the PV and PVC
kubectl get pv
kubectl get pvc
- Write the TFJob YAML
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
spec:
  runPolicy:
    cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow  # watch out: I could not find the logs of the kubeflow/multi_worker_strategy container for ages, until I realized the container is named "tensorflow" here
              image: kubeflow/multi_worker_strategy:v1.0
              volumeMounts:
                - mountPath: /train
                  name: training
              resources:
                limits:
                  aliyun.com/gpu-mem: 2  # I am using Alibaba Cloud's shared-GPU solution, where GPU memory is allocated in GB; with NVIDIA's device plugin this would instead be nvidia.com/gpu: 1, allocated per card
          volumes:
            - name: training
              persistentVolumeClaim:
                claimName: test-pvc
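For each worker pod declared in `tfReplicaSpecs` above, tf-operator injects a `TF_CONFIG` environment variable describing the whole cluster, which is how `MultiWorkerMirroredStrategy` in the training script discovers its peers. A rough sketch of what it contains for this 2-replica job and how a script could inspect it; the exact service names and port below are my assumption based on tf-operator's usual `<job-name>-worker-<index>` naming and default port 2222, worth verifying with `kubectl get svc`:

```python
import json
import os

# Roughly what tf-operator injects for worker 0 of the 2-replica TFJob above
# (service naming and port 2222 assumed, not taken from the official docs):
example_tf_config = {
    "cluster": {
        "worker": [
            "multi-worker-worker-0.default.svc:2222",
            "multi-worker-worker-1.default.svc:2222",
        ]
    },
    "task": {"type": "worker", "index": 0},
}


def cluster_summary(tf_config):
    """Return (number of workers, this worker's own address)."""
    workers = tf_config["cluster"]["worker"]
    index = tf_config["task"]["index"]
    return len(workers), workers[index]


# In the real pod you would read it from the environment instead:
# tf_config = json.loads(os.environ["TF_CONFIG"])
```

Printing `cluster_summary` near the top of the training script makes the `kubectl logs` output much easier to interpret when something goes wrong.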
- Create and delete the job
kubectl create -f multi_worker_tfjob.yaml
kubectl delete -f multi_worker_tfjob.yaml
- View the pod logs
kubectl logs [pod name]
Errors (a pitfall!)
Symptom:
The container kept requesting GPU memory over and over, eventually causing a GPU-memory OOM.
Cause:
The official example code places no limit on TensorFlow's GPU usage. Add the lines below (or some other GPU-limiting code):
import os
import tensorflow as tf

# restrict this process to GPU 0
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# cap this process at roughly one third of the GPU's memory
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
Important!
After changing the code, you MUST rebuild the image!
After changing the code, you MUST rebuild the image!
After changing the code, you MUST rebuild the image!
Otherwise your changes will have no effect.



