Elastic Use of Kubernetes Cluster GPU Resources with Jupyter Notebook

SuKai August 3, 2021

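Earlier posts introduced two ways to raise GPU utilization in a Kubernetes cluster: (1) GPU virtualization, so that several workloads share a GPU; and (2) elastic scheduling, which dynamically creates and destroys the Jupyter Pods that occupy GPU resources. This post covers how to implement elastic GPU scheduling with tkestack/elastic-jupyter-operator, open-sourced by Tencent.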

How Elastic Scheduling Works

Jupyter Enterprise Gateway

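Jupyter Enterprise Gateway is a pluggable framework that supports multi-user and multi-cluster environments. It lets Jupyter Notebook launch remote kernels in a distributed cluster. A remote kernel can be created when it is needed and destroyed when it is idle, so it no longer has to hold scarce GPU resources all the time.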

tkestack/elastic-jupyter-operator

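When using Jupyter Enterprise Gateway, remote kernels must first be registered with the Gateway through kernel configuration before remote kernel instances can be launched. elastic-jupyter-operator automates this process: it manages kernels dynamically, generates the kernel configuration for the Gateway, and adds a new KernelLauncher mechanism that manages the lifecycle of Kernel Pods. kubeflow-launcher creates the Jupyter kernel Pod in Kubernetes, and when a Kernel becomes idle its CR is deleted, so the resources the Kernel held are reclaimed.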
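The create-on-demand, delete-when-idle lifecycle described above can be illustrated with a small simulation. This is only a sketch: `FakeCluster` and `KernelRecord` are hypothetical in-memory stand-ins for the Kubernetes API and the Kernel CR, not the operator's actual code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KernelRecord:
    """Hypothetical stand-in for a Kernel CR and the Pod it owns."""
    kernel_id: str
    last_activity: float  # seconds on an arbitrary clock


@dataclass
class FakeCluster:
    """Hypothetical in-memory substitute for the Kubernetes API."""
    kernels: Dict[str, KernelRecord] = field(default_factory=dict)

    def launch_kernel(self, kernel_id: str, now: float) -> KernelRecord:
        # A record (in a real cluster, a GPU Pod) is created only when a
        # notebook actually asks for a kernel.
        record = KernelRecord(kernel_id=kernel_id, last_activity=now)
        self.kernels[kernel_id] = record
        return record

    def touch(self, kernel_id: str, now: float) -> None:
        # Any activity resets the kernel's idle clock.
        self.kernels[kernel_id].last_activity = now

    def cull_idle(self, now: float, idle_timeout: float) -> List[str]:
        # Delete every kernel that has been idle for at least idle_timeout.
        idle = [kid for kid, rec in self.kernels.items()
                if now - rec.last_activity >= idle_timeout]
        for kernel_id in idle:
            del self.kernels[kernel_id]
        return idle


cluster = FakeCluster()
cluster.launch_kernel("kernel-a", now=0.0)
cluster.launch_kernel("kernel-b", now=0.0)
cluster.touch("kernel-b", now=8.0)  # kernel-b is still being used
removed = cluster.cull_idle(now=10.0, idle_timeout=10.0)
print(removed)                  # ['kernel-a']
print(sorted(cluster.kernels))  # ['kernel-b']
```

Only the kernel with no recent activity is removed. In the real system, that removal is what frees the GPU for other workloads.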

Deployment and Usage

Deploy elastic-jupyter-operator
kubectl apply -f ./hack/enterprise_gateway/prepare.yaml
make deploy
Create the Gateway CR
apiVersion: kubeflow.tkestack.io/v1alpha1
kind: JupyterGateway
metadata:
  name: jupytergateway-elastic-tensorflow
spec:
  cullIdleTimeout: 10
  cullInterval: 10
  logLevel: DEBUG
  image: ccr.ccs.tencentyun.com/kubeflow-oteam/enterprise-gateway:dev
  # Use the kernel which is defined in JupyterKernelSpec CR.
  defaultKernel: python-tensorflow
  kernels:
    - python-tensorflow
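The two culling fields in the Gateway CR interact in a way that is easy to misread. The sketch below assumes that `cullIdleTimeout` and `cullInterval` map to Jupyter's `MappingKernelManager.cull_idle_timeout` and `cull_interval` settings, which are expressed in seconds. It further assumes that checks run at fixed multiples of the interval. The helper `first_cull_time` is hypothetical and only illustrates the arithmetic.

```python
import math


def first_cull_time(last_activity: float, idle_timeout: float,
                    interval: float) -> float:
    """Time of the first periodic check at which an idle kernel is culled.

    Assumes checks run at interval, 2 * interval, ... and that a kernel is
    culled once it has been idle for at least idle_timeout seconds.
    """
    earliest = last_activity + idle_timeout
    # Round up to the next scheduled check; the first check is at `interval`.
    checks = max(1, math.ceil(earliest / interval))
    return checks * interval


# Values from the CR above: cullIdleTimeout: 10, cullInterval: 10.
print(first_cull_time(last_activity=0.0, idle_timeout=10, interval=10))  # 10
print(first_cull_time(last_activity=0.5, idle_timeout=10, interval=10))  # 20
```

Under these assumptions a kernel can survive for up to roughly `cullIdleTimeout + cullInterval` seconds after its last activity. With the values above, that is close to 20 seconds rather than 10.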
Create the KernelSpec CR and KernelTemplate CR
apiVersion: kubeflow.tkestack.io/v1alpha1
kind: JupyterKernelSpec
metadata:
  name: python-tensorflow
spec:
  language: Python
  displayName: "Elastic tensorflow Kernel on Kubernetes"
  image: elyra/kernel-tf-py:2.5.1
  # Use the template defined in JupyterKernelTemplate CR.
  template:
    namespace: default
    name: jupyterkerneltemplate-tensorflow
  command:
    # Use the default scripts to launch the kernel.
    - "kubeflow-launcher"
    - "--verbose"
    - "--RemoteProcessProxy.kernel-id"
    - "{kernel_id}"
    - "--RemoteProcessProxy.port-range"
    - "{port_range}"
    - "--RemoteProcessProxy.response-address"
    - "{response_address}"
---
apiVersion: kubeflow.tkestack.io/v1alpha1
kind: JupyterKernelTemplate
metadata:
  name: jupyterkerneltemplate-tensorflow
spec:
  template:
    metadata:
      app: enterprise-gateway-tensorflow
      component: kernel-tensorflow
    spec:
      restartPolicy: Always
      containers:
        - name: kernel
          env:
            - name: "SUKAI"
              value: "sukai"
          resources:
            limits:
              cpu: "8"
              memory: 4Gi
              aliyun.com/gpu-mem: 1
            requests:
              cpu: "2"
              memory: 4Gi
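The `command` list in the KernelSpec contains `{kernel_id}`, `{port_range}` and `{response_address}` placeholders that are filled in when a kernel is launched. The sketch below shows this kind of substitution. The `render` helper and all three sample values are hypothetical, and the Gateway's real substitution logic may differ in detail.

```python
import re
from typing import Dict, List

# Command copied from the JupyterKernelSpec above.
COMMAND = [
    "kubeflow-launcher",
    "--verbose",
    "--RemoteProcessProxy.kernel-id", "{kernel_id}",
    "--RemoteProcessProxy.port-range", "{port_range}",
    "--RemoteProcessProxy.response-address", "{response_address}",
]

PLACEHOLDER = re.compile(r"\{([A-Za-z0-9_]+)\}")


def render(command: List[str], values: Dict[str, str]) -> List[str]:
    """Replace {name} placeholders; unknown names are left untouched."""
    def substitute(match: "re.Match[str]") -> str:
        return values.get(match.group(1), match.group(0))
    return [PLACEHOLDER.sub(substitute, arg) for arg in command]


# All three values are made up for illustration only.
argv = render(COMMAND, {
    "kernel_id": "3f2a-example",
    "port_range": "0..0",
    "response_address": "10.0.0.5:8877",
})
print(argv[3], argv[5], argv[7])  # 3f2a-example 0..0 10.0.0.5:8877
```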
Create the Notebook CR
apiVersion: kubeflow.tkestack.io/v1alpha1
kind: JupyterNotebook
metadata:
  name: jupyternotebook-elastic-tensorlfow
spec:
  gateway:
    name: jupytergateway-elastic-tensorflow
    namespace: default
  # Token-based auth is enabled in this example with a fixed sample token;
  # use a secret, randomly generated token in PROD.
  auth:
    mode: enable
    token: "sukai"
  template:
    metadata:
      labels:
        notebook: simple
    spec:
      containers:
        - name: notebook
          image: jupyter/base-notebook:python-3.8.6
          command: ["tini", "-g", "--", "start-notebook.sh"]
          env:
            - name: JUPYTER_ENABLE_LAB
              value: "true"
Access Jupyter Notebook
sukai@sukai:~$ kubectl port-forward deploy/jupyternotebook-elastic-tensorlfow --address='0.0.0.0' 8889:8888
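Once the port-forward is running, the notebook should be reachable on local port 8889. The sketch below builds a login URL under two assumptions: that the `token` in the Notebook CR becomes the Jupyter server token, and that JupyterLab is served under `/lab`. The helper `notebook_url` is hypothetical.

```python
from urllib.parse import urlencode, urlunsplit


def notebook_url(host: str, port: int, token: str, path: str = "/lab") -> str:
    """Build a Jupyter URL that carries the token as a query parameter."""
    query = urlencode({"token": token})
    return urlunsplit(("http", f"{host}:{port}", path, query, ""))


# Port 8889 comes from the port-forward command; the token from the CR.
print(notebook_url("localhost", 8889, "sukai"))
# http://localhost:8889/lab?token=sukai
```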
