# Intel Device Plugins Operator
Table of Contents
* [Introduction](#introduction)
* [Installation](#installation)
* [Usage](#usage)
* [Upgrade](#upgrade)
* [Known issues](#known-issues)
## Introduction
Intel Device Plugins Operator is a Kubernetes custom controller whose goal is to serve the
installation and lifecycle management of Intel device plugins for Kubernetes.
It provides a single point of control for GPU, QAT, SGX, FPGA, DSA and DLB devices to
cluster administrators.
## Installation
Install NFD (if it's not already installed) and node labelling rules (requires NFD v0.10+):
```
# either with default NFD installation
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd?ref=<RELEASE_VERSION>'
# or when setting up with SGX
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd/overlays/sgx?ref=<RELEASE_VERSION>'
# and finally, NodeFeatureRules
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/nfd/overlays/node-feature-rules?ref=<RELEASE_VERSION>'
```
Make sure both NFD master and worker pods are running:
```
$ kubectl get pods -n node-feature-discovery
NAME                          READY   STATUS    RESTARTS   AGE
nfd-master-599c58dffc-9wql4   1/1     Running   0          25h
nfd-worker-qqq4h              1/1     Running   0          25h
```
Note that labelling is not performed immediately. Give NFD a minute to pick up the rules and label the nodes.
As a result, all detected devices should have the corresponding labels, e.g. for Intel DLB devices the label is
`intel.feature.node.kubernetes.io/dlb`:
```
$ kubectl get no -o json | jq .items[].metadata.labels | grep intel.feature.node.kubernetes.io/dlb
"intel.feature.node.kubernetes.io/dlb": "true",
```
The full list of labels can be found in the `deployments/operator/samples` directory:
```
$ grep -r feature.node.kubernetes.io/ deployments/operator/samples/
deployments/operator/samples/deviceplugin_v1_dlbdeviceplugin.yaml: intel.feature.node.kubernetes.io/dlb: 'true'
deployments/operator/samples/deviceplugin_v1_qatdeviceplugin.yaml: intel.feature.node.kubernetes.io/qat: 'true'
deployments/operator/samples/deviceplugin_v1_sgxdeviceplugin.yaml: intel.feature.node.kubernetes.io/sgx: 'true'
deployments/operator/samples/deviceplugin_v1_gpudeviceplugin.yaml: intel.feature.node.kubernetes.io/gpu: "true"
deployments/operator/samples/deviceplugin_v1_fpgadeviceplugin.yaml: intel.feature.node.kubernetes.io/fpga-arria10: 'true'
deployments/operator/samples/deviceplugin_v1_dsadeviceplugin.yaml: intel.feature.node.kubernetes.io/dsa: 'true'
```
The default operator deployment depends on [cert-manager](https://cert-manager.io/) running in the cluster.
See installation instructions [here](https://cert-manager.io/docs/installation/kubectl/).
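For example, cert-manager can be installed from its static release manifest; the version below is only an illustration, pick the release you need:
```bash
# Install cert-manager from the upstream release manifest
# (replace v1.11.0 with the desired release).
$ kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml
```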
Make sure all the pods in the `cert-manager` namespace are up and running:
```
$ kubectl get pods -n cert-manager
NAME                                      READY   STATUS    RESTARTS   AGE
cert-manager-7747db9d88-bd2nl             1/1     Running   0          21d
cert-manager-cainjector-87c85c6ff-59sb5   1/1     Running   0          21d
cert-manager-webhook-64dc9fff44-29cfc     1/1     Running   0          21d
```
Also, if your cluster operates behind a corporate proxy, make sure that the API
server is configured not to send requests to cluster services through the
proxy. You can check that with the following command:
```bash
$ kubectl describe pod kube-apiserver --namespace kube-system | grep -i no_proxy | grep "\.svc"
```
If there is no output and your cluster was deployed with `kubeadm`, open
`/etc/kubernetes/manifests/kube-apiserver.yaml` on the control plane nodes and
append `.svc` and `.svc.cluster.local` to the `no_proxy` environment variable:
```yaml
apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  containers:
  - command:
    - kube-apiserver
    - --advertise-address=10.237.71.99
    ...
    env:
    - name: http_proxy
      value: http://proxy.host:8080
    - name: https_proxy
      value: http://proxy.host:8433
    - name: no_proxy
      value: 127.0.0.1,localhost,.example.com,10.0.0.0/8,.svc,.svc.cluster.local
    ...
```
**Note:** To build clusters using `kubeadm` with the right `no_proxy` settings from the very beginning,
set the cluster service names to `$no_proxy` before `kubeadm init`:
```
$ export no_proxy=$no_proxy,.svc,.svc.cluster.local
```
Finally, deploy the operator itself:
```
$ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/operator/default?ref=<RELEASE_VERSION>'
```
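The default kustomization deploys the operator into the `inteldeviceplugins-system` namespace (the namespace name here is an assumption based on the default deployment; verify it matches your setup). Check that the controller pod is running:
```bash
$ kubectl get pods -n inteldeviceplugins-system
```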
Now you can deploy the device plugins by creating corresponding custom resources.
The samples for them are available [here](/deployments/operator/samples/).
## Usage
Deploy your device plugin by applying its custom resource, e.g.
`GpuDevicePlugin` with
```bash
$ kubectl apply -f https://raw.githubusercontent.com/intel/intel-device-plugins-for-kubernetes/main/deployments/operator/samples/deviceplugin_v1_gpudeviceplugin.yaml
```
Observe it is up and running:
```bash
$ kubectl get GpuDevicePlugin
NAME                     DESIRED   READY   NODE SELECTOR   AGE
gpudeviceplugin-sample   1         1                       5s
```
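Once the plugin pods are running, nodes with Intel GPUs should advertise extended resources such as `gpu.intel.com/i915`. A quick way to check (the node name is a placeholder):
```bash
# List the gpu.intel.com resources advertised by a node.
$ kubectl describe node <node-name> | grep gpu.intel.com
```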
In order to limit the deployment to a specific device type,
use one of the kustomizations under `deployments/operator/device`.
For example, to limit the deployment to FPGA, use:
```bash
$ kubectl apply -k deployments/operator/device/fpga
```
The operator also supports deployments with multiple selected device types.
In this case, create a new kustomization with the necessary resources
that passes the desired device types to the operator via the `--device`
command line argument given multiple times, as sketched below.
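A minimal sketch of such a kustomization, assuming the default operator deployment; the Deployment name `inteldeviceplugins-controller-manager` and the patch paths are assumptions, so verify them against your cluster:
```yaml
# kustomization.yaml: deploy the operator restricted to FPGA and QAT.
resources:
- https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/operator/default?ref=<RELEASE_VERSION>
patches:
- target:
    kind: Deployment
    name: inteldeviceplugins-controller-manager  # assumed default name
  patch: |-
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --device=fpga
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --device=qat
```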
## Upgrade
The upgrade of the deployed plugins can be done by simply installing a new release of the operator.
The operator auto-upgrades operator-managed plugins (CR images and thus the corresponding deployed daemonsets) to the current release of the operator.
The `[registry-url]/[namespace]/[image]` parts of the image are kept intact on upgrade.
No upgrade is done for:
- Non-operator managed deployments
- Operator deployments without numeric tags
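For example, in an operator-managed CR like the sketch below (field values are illustrative), only the numeric tag would be bumped to match the new operator release:
```yaml
# Illustrative GpuDevicePlugin CR: on operator upgrade, only the
# numeric tag (here 0.26.0) changes; registry, namespace and image
# name are preserved.
apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
metadata:
  name: gpudeviceplugin-sample
spec:
  image: intel/intel-gpu-plugin:0.26.0
```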
## Known issues
When the operator is run with leader election enabled, that is with the option
`--leader-elect`, make sure the cluster is not overloaded with an excessive
number of pods. Otherwise, a heartbeat used by the leader election code may trigger
a timeout and a crash. We are going to use different clients for the controller and
leader election code to alleviate the issue. See more details in
https://github.com/intel/intel-device-plugins-for-kubernetes/issues/476.
If the deployment is limited to specific device type(s),
the CRDs for the other device types are still created, but no controllers
for them are registered.
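You can list the installed CRDs to see this; the CRD group name `deviceplugin.intel.com` is assumed from the samples above:
```bash
# All device plugin CRDs are created regardless of the --device selection.
$ kubectl get crds | grep deviceplugin.intel.com
```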