By SuKai, August 3, 2021
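Earlier posts introduced two ways to improve GPU utilization in a Kubernetes cluster: (1) GPU virtualization, which lets workloads share a GPU; (2) elastic scheduling, which dynamically creates and destroys the Jupyter Pods that occupy GPU resources. This post shows how to implement elastic GPU scheduling with tkestack/elastic-jupyter-operator, which is open-sourced by Tencent.
How elastic scheduling works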
Jupyter Enterprise Gateway
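Jupyter Enterprise Gateway is a pluggable framework that supports multi-user and multi-cluster environments. It lets Jupyter Notebook launch remote kernels in a distributed cluster. Remote kernels are created on demand and destroyed when idle, so they no longer hold scarce GPU resources permanently.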
tkestack/elastic-jupyter-operator
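When using Jupyter Enterprise Gateway, remote kernels have to be registered with the Gateway through kernel configurations, and remote kernel instances have to be started. elastic-jupyter-operator automates this process: it manages kernels dynamically, generates the kernel configuration for the Gateway, and adds a new KernelLauncher mechanism that manages the lifecycle of Kernel Pods. kubeflow-launcher creates the Jupyter kernel Pod in Kubernetes; when the Kernel is idle, the Kernel's CR is deleted, which reclaims the resources the Kernel was using.
Deployment and usage
Deploy elastic-jupyter-operator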
kubectl apply -f ./hack/enterprise_gateway/prepare.yaml
make deploy
Create the Gateway CR
apiVersion: kubeflow.tkestack.io/v1alpha1
kind: JupyterGateway
metadata:
  name: jupytergateway-elastic-tensorflow
spec:
  cullIdleTimeout: 10
  cullInterval: 10
  logLevel: DEBUG
  image: ccr.ccs.tencentyun.com/kubeflow-oteam/enterprise-gateway:dev
  # Use the kernel which is defined in JupyterKernelSpec CR.
  defaultKernel: python-tensorflow
  kernels:
  - python-tensorflow
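The two culling fields in the Gateway CR can be read as a periodic idle check. The following is a minimal Python sketch of that decision, not the operator's or Enterprise Gateway's actual code. It assumes both values are in seconds (as with Jupyter's `cull_idle_timeout` and `cull_interval` options); `KernelInfo` and `kernels_to_cull` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KernelInfo:
    """Hypothetical record of one running kernel (illustration only)."""
    kernel_id: str
    last_activity: float  # time of the last activity, in seconds


def kernels_to_cull(kernels: List[KernelInfo], now: float,
                    cull_idle_timeout: float) -> List[str]:
    """Return the ids of kernels idle for longer than cull_idle_timeout."""
    return [k.kernel_id for k in kernels
            if now - k.last_activity > cull_idle_timeout]


# Assumed semantics: the check runs every cullInterval seconds and selects
# any kernel that has been idle for more than cullIdleTimeout seconds.
kernels = [KernelInfo("recently-used", last_activity=95.0),
           KernelInfo("idle", last_activity=80.0)]
culled = kernels_to_cull(kernels, now=100.0, cull_idle_timeout=10)
print(culled)
```

With values as small as 10, an idle kernel is reclaimed quickly, which is convenient for a demo; a real deployment would probably choose a longer timeout so that short pauses do not cost users their kernel state.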
Create the KernelSpec CR and KernelTemplate CR
apiVersion: kubeflow.tkestack.io/v1alpha1
kind: JupyterKernelSpec
metadata:
  name: python-tensorflow
spec:
  language: Python
  displayName: "Elastic TensorFlow Kernel on Kubernetes"
  image: elyra/kernel-tf-py:2.5.1
  # Use the template defined in JupyterKernelTemplate CR.
  template:
    namespace: default
    name: jupyterkerneltemplate-tensorflow
  command:
  # Use the default scripts to launch the kernel.
  - "kubeflow-launcher"
  - "--verbose"
  - "--RemoteProcessProxy.kernel-id"
  - "{kernel_id}"
  - "--RemoteProcessProxy.port-range"
  - "{port_range}"
  - "--RemoteProcessProxy.response-address"
  - "{response_address}"
---
apiVersion: kubeflow.tkestack.io/v1alpha1
kind: JupyterKernelTemplate
metadata:
  name: jupyterkerneltemplate-tensorflow
spec:
  template:
    metadata:
      app: enterprise-gateway-tensorflow
      component: kernel-tensorflow
    spec:
      restartPolicy: Always
      containers:
      - name: kernel
        env:
        - name: "SUKAI"
          value: "sukai"
        resources:
          limits:
            cpu: "8"
            memory: 4Gi
            aliyun.com/gpu-mem: 1
          requests:
            cpu: "2"
            memory: 4Gi
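The `command` list in the KernelSpec contains placeholders in braces (`{kernel_id}`, `{port_range}`, `{response_address}`) that are filled in when a kernel is launched. The Python sketch below illustrates the idea with plain string substitution; it is an assumption about the mechanism rather than the Gateway's actual code. `render_command` is a hypothetical helper, and all three values passed to it are made up for the example.

```python
from typing import List

# argv copied from the JupyterKernelSpec command list.
COMMAND = [
    "kubeflow-launcher",
    "--verbose",
    "--RemoteProcessProxy.kernel-id", "{kernel_id}",
    "--RemoteProcessProxy.port-range", "{port_range}",
    "--RemoteProcessProxy.response-address", "{response_address}",
]


def render_command(argv: List[str], **values: str) -> List[str]:
    """Replace each {placeholder} in argv with the matching keyword value."""
    return [arg.format(**values) for arg in argv]


rendered = render_command(
    COMMAND,
    kernel_id="example-kernel-id",      # hypothetical value
    port_range="0..0",                  # hypothetical value
    response_address="10.0.0.1:8877",   # hypothetical value
)
print(" ".join(rendered))
```

Arguments without braces pass through unchanged, so the same template can be reused for every kernel launch with only the three values differing.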
Create the Notebook CR
apiVersion: kubeflow.tkestack.io/v1alpha1
kind: JupyterNotebook
metadata:
  name: jupyternotebook-elastic-tensorlfow
spec:
  gateway:
    name: jupytergateway-elastic-tensorflow
    namespace: default
  # Token-based auth is enabled here with a fixed example token;
  # use a strong secret instead in PROD.
  auth:
    mode: enable
    token: "sukai"
  template:
    metadata:
      labels:
        notebook: simple
    spec:
      containers:
      - name: notebook
        image: jupyter/base-notebook:python-3.8.6
        command: ["tini", "-g", "--", "start-notebook.sh"]
        env:
        - name: JUPYTER_ENABLE_LAB
          value: "true"
Access the Jupyter Notebook
sukai@sukai:~$ kubectl port-forward deploy/jupyternotebook-elastic-tensorlfow --address='0.0.0.0' 8889:8888
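While the port-forward is running, the notebook should be reachable on local port 8889. The short Python sketch below builds the login URL. It assumes the token set in the Notebook CR is accepted as Jupyter's standard `token` query parameter; `notebook_url` is a hypothetical helper, not part of the operator.

```python
from urllib.parse import urlencode


def notebook_url(host: str, port: int, token: str) -> str:
    """Build a Jupyter URL that passes the token as a query parameter."""
    query = urlencode({"token": token})
    return f"http://{host}:{port}/?{query}"


# 8889 is the local side of the port-forward; "sukai" is the example token.
url = notebook_url("localhost", 8889, "sukai")
print(url)
```

Using `urlencode` keeps the URL valid even if the token contains characters that need escaping.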