mirror of
https://github.com/intel/intel-device-plugins-for-kubernetes.git
synced 2025-06-03 03:59:37 +00:00
gpu: levelzero: documentation
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
parent 518a8606ff
commit 606ac77647
@@ -24,6 +24,7 @@ Table of Contents
* [IAA device plugin](#iaa-device-plugin)
* [Device Plugins Operator](#device-plugins-operator)
* [XeLink XPU Manager sidecar](#xelink-xpu-manager-sidecar)
* [Intel GPU Level-Zero sidecar](#intel-gpu-level-zero-sidecar)
* [Demos](#demos)
* [Workload Authors](#workload-authors)
* [Developers](#developers)
@@ -201,6 +202,12 @@ To support interconnected GPUs in Kubernetes, XeLink sidecar is needed.

The [XeLink XPU Manager sidecar README](cmd/xpumanager_sidecar/README.md) explains how the sidecar works and how to use it.

## Intel GPU Level-Zero sidecar

The sidecar uses the Level-Zero API to provide the GPU plugin with additional GPU information that the plugin cannot get through sysfs interfaces.

See the [Intel GPU Level-Zero sidecar README](cmd/gpu_levelzero/README.md) for more details.

## Demos

The [demo subdirectory](demo/readme.md) contains a number of demonstrations for
38
cmd/gpu_levelzero/README.md
Normal file
@@ -0,0 +1,38 @@
# Intel GPU Level-Zero sidecar

Table of Contents

* [Introduction](#introduction)
* [Install](#install)

## Introduction

The Intel GPU Level-Zero sidecar is an extension for the Intel GPU plugin that queries additional GPU details through the oneAPI/Level-Zero API. Because Level-Zero is a C/C++ API, it is preferable to keep the original GPU plugin as-is and add the extra functionality via the Level-Zero sidecar. The GPU plugin can be configured to use the Level-Zero sidecar with an overlay; see [Install](#install).

The Intel GPU plugin and the Level-Zero sidecar communicate via gRPC over a local socket that is visible only to their containers.

> **NOTE**: The Intel Device Plugins Operator doesn't yet support enabling the Level-Zero sidecar in the GPU CR object.
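
The local-socket transport mentioned above can be illustrated with a minimal Go sketch. This is not the plugin's or the sidecar's code: it uses only the standard library, leaves out gRPC entirely, and just sends one line of text over a Unix domain socket and reads it back. The function name `echoOverUnixSocket` and the temporary socket location are assumptions made for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// echoOverUnixSocket is a hypothetical helper: it starts a one-shot echo
// server on a Unix socket in a temporary directory, sends msg to it, and
// returns the reply. The real components speak gRPC over such a socket.
func echoOverUnixSocket(msg string) (string, error) {
	dir, err := os.MkdirTemp("", "levelzero-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "server.sock")

	// Server side: listen on the socket path.
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		line, err := bufio.NewReader(conn).ReadString('\n')
		if err != nil {
			return
		}
		fmt.Fprint(conn, line) // echo the line back unchanged
	}()

	// Client side: dial the same path.
	conn, err := net.Dial("unix", sock)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := fmt.Fprintln(conn, msg); err != nil {
		return "", err
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimSuffix(reply, "\n"), nil
}

func main() {
	reply, err := echoOverUnixSocket("ping")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("reply:", reply)
}
```

Because the socket is a file path, only processes that can see that path can connect, which is why a directory shared between two containers is enough to keep the channel local.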

## Modes and Configuration Options

| Flag | Argument | Default | Meaning |
|:---- |:-------- |:------- |:------- |
| -socket | unix socket path | /var/lib/levelzero/server.sock | Unix socket path where the server registers itself. |
| -wsl | - | disabled | Adapt sidecar to run in the WSL environment. |
| -v | verbosity | 1 | Log verbosity |
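
As a rough illustration of the table above, the following Go sketch declares the same three flags with the documented defaults using the standard `flag` package. The `config` struct and `parseFlags` function are hypothetical names for this example, and the sidecar's actual flag handling (for instance, how `-v` is wired to its logger) may differ.

```go
package main

import (
	"flag"
	"fmt"
)

// config mirrors the three documented flags. The type and its field
// names are illustrative and not taken from the sidecar source.
type config struct {
	socket    string
	wsl       bool
	verbosity int
}

// parseFlags parses args (without the program name) using the flag
// names and defaults listed in the table.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("gpu_levelzero", flag.ContinueOnError)
	fs.StringVar(&c.socket, "socket", "/var/lib/levelzero/server.sock",
		"Unix socket path where the server registers itself")
	fs.BoolVar(&c.wsl, "wsl", false, "adapt sidecar to run in the WSL environment")
	fs.IntVar(&c.verbosity, "v", 1, "log verbosity")
	err := fs.Parse(args)
	return c, err
}

func main() {
	c, err := parseFlags([]string{"-wsl", "-v", "4"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", c)
}
```

Flags that are not passed keep their defaults, so running with no arguments corresponds to the "Default" column.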

## Install

The sidecar is installed along with the GPU plugin via one of two overlays: [health](../../deployments/gpu_plugin/overlays/health/) and [wsl](../../deployments/gpu_plugin/overlays/wsl/).

The health overlay adds the sidecar to the base GPU plugin deployment and configures the GPU plugin to retrieve device health indicators from the Level-Zero API:

```bash
$ kubectl apply -k deployments/gpu_plugin/overlays/health
```

The WSL overlay enables Intel GPU detection in WSL (Windows Subsystem for Linux) Kubernetes clusters. It also uses the Level-Zero sidecar:

```bash
$ kubectl apply -k deployments/gpu_plugin/overlays/wsl
```
@@ -18,6 +18,7 @@ Table of Contents
* [SR-IOV use with the plugin](#sr-iov-use-with-the-plugin)
* [CDI support](#cdi-support)
* [KMD and UMD](#kmd-and-umd)
* [Health management](#health-management)
* [Issues with media workloads on multi-GPU setups](#issues-with-media-workloads-on-multi-gpu-setups)
* [Workaround for QSV and VA-API](#workaround-for-qsv-and-va-api)
@@ -56,6 +57,8 @@ For workloads on different KMDs, see [KMD and UMD](#kmd-and-umd).
|:---- |:-------- |:------- |:------- |
| -enable-monitoring | - | disabled | Enable '*_monitoring' resource that provides access to all Intel GPU devices on the node, [see use](./monitoring.md) |
| -resource-manager | - | disabled | Enable fractional resource management, [see use](./fractional.md) |
| -health-management | - | disabled | Enable health management by requesting data from oneAPI/Level-Zero interface. Requires [GPU Level-Zero](../gpu_levelzero/) sidecar. See [health management](#health-management) |
| -wsl | - | disabled | Adapt plugin to run in the WSL environment. Requires [GPU Level-Zero](../gpu_levelzero/) sidecar. |
| -shared-dev-num | int | 1 | Number of containers that can share the same GPU device |
| -allocation-policy | string | none | 3 possible values: balanced, packed, none. For shared-dev-num > 1: _balanced_ mode spreads workloads among GPU devices, _packed_ mode fills one GPU fully before moving to next, and _none_ selects first available device from kubelet. Default is _none_. Allocation policy does not have an effect when resource manager is enabled. |
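
The three `-allocation-policy` values in the table can be sketched in Go as follows. This is a simplified model written for illustration, not the plugin's implementation: the function `pickDevice` and the `used` slice (containers currently sharing each device) are assumptions, and "first available device from kubelet" is approximated by slice order.

```go
package main

import "fmt"

// pickDevice returns the index of the device to use next, or -1 when
// every device already holds sharedDevNum containers. used[i] is the
// number of containers currently sharing device i. Hypothetical helper.
func pickDevice(policy string, used []int, sharedDevNum int) int {
	best := -1
	for i, n := range used {
		if n >= sharedDevNum {
			continue // device is full
		}
		switch {
		case best == -1:
			best = i // first device with room; "none" keeps this one
		case policy == "balanced" && n < used[best]:
			best = i // spread: prefer the least loaded device
		case policy == "packed" && n > used[best]:
			best = i // fill: prefer the most loaded device with room
		}
	}
	return best
}

func main() {
	used := []int{1, 0, 2} // three GPUs, shared-dev-num = 3
	for _, p := range []string{"balanced", "packed", "none"} {
		fmt.Printf("%s picks device %d\n", p, pickDevice(p, used, 3))
	}
}
```

With the sample load above, the three policies choose three different devices, which makes the difference between spreading, filling, and taking the first free device easy to see.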

@@ -257,6 +260,14 @@ Creating a workload that would support all the different KMDs is not currently p
| Media | Default | [ENABLE_PRODUCTION_KMD](https://github.com/intel/media-driver/blob/a66b076e83876fbfa9c9ab633ad9c5517f8d74fd/CMakeLists.txt#L58) | [ENABLE_XE_KMD](https://github.com/intel/media-driver/blob/a66b076e83876fbfa9c9ab633ad9c5517f8d74fd/media_driver/cmake/linux/media_feature_flags_linux.cmake#L187-L190) | Xe with upstream or backport i915, not all three. |
| Graphics | Default | Unknown | [intel-xe-kmd](https://gitlab.freedesktop.org/mesa/mesa/-/blob/e9169881dbd1f72eab65a68c2b8e7643f74489b7/meson_options.txt#L708) | i915 and xe KMDs can be supported at the same time. |

### Health management

The Kubernetes Device Plugin API allows a plugin to report each device's health to the kubelet. By default, the GPU plugin reports all devices as `Healthy`. If health management is enabled, the GPU plugin retrieves health-related data from the oneAPI/Level-Zero interface via the [GPU Level-Zero](../gpu_levelzero/) sidecar. Based on the data received, the GPU plugin reports a device as `Unhealthy` if:
1) Direct health indicators report issues: [memory](https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-mem-health-t) & [pci](https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-pci-link-status-t)
2) Device temperature is over the limit

The temperature limit can be set with a command line argument; the default is 100 °C.
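
The decision rule above can be written down as a short Go sketch. It only models the two conditions listed in this section; the `deviceHealth` type, its fields, and `healthState` are hypothetical names, and how the real plugin maps Level-Zero Sysman values to "OK" or "issue" is not shown here.

```go
package main

import "fmt"

// deviceHealth holds the indicators named in this section. The fields
// are illustrative; real values come from Level-Zero via the sidecar.
type deviceHealth struct {
	memoryOK     bool    // memory health indicator reports no issue
	pciLinkOK    bool    // PCI link status reports no issue
	temperatureC float64 // current device temperature in Celsius
}

// defaultTempLimitC is the documented default temperature limit.
const defaultTempLimitC = 100.0

// healthState returns "Unhealthy" when a direct health indicator reports
// an issue or the temperature is over the limit, and "Healthy" otherwise.
func healthState(h deviceHealth, tempLimitC float64) string {
	if !h.memoryOK || !h.pciLinkOK {
		return "Unhealthy"
	}
	if h.temperatureC > tempLimitC {
		return "Unhealthy"
	}
	return "Healthy"
}

func main() {
	cool := deviceHealth{memoryOK: true, pciLinkOK: true, temperatureC: 65}
	hot := deviceHealth{memoryOK: true, pciLinkOK: true, temperatureC: 104}
	fmt.Println(healthState(cool, defaultTempLimitC))
	fmt.Println(healthState(hot, defaultTempLimitC))
}
```

The sketch treats "over the limit" as strictly greater than the limit, so a device at exactly the limit still counts as healthy; that boundary choice is an assumption.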

### Issues with media workloads on multi-GPU setups

OneVPL media API, 3D and compute APIs provide device discovery
10
cmd/internal/levelzero/README.md
Normal file
@@ -0,0 +1,10 @@
To update the Go gRPC/protobuf files, use the following `protoc` command line:

```
protoc --go_out=. --go_opt=paths=source_relative --go-grpc_out=. --go-grpc_opt=paths=source_relative levelzero.proto
# To fix bad package name
sed -i -e 's/gpu_levelzero/gpulevelzero/' levelzero.pb.go levelzero_grpc.pb.go
```

> *Note*: Running `protoc` erases the copyright header and changes the Go package name from "gpulevelzero" to "gpu_levelzero". After regeneration, the header needs to be re-added and the package name changed back; the `sed` command above handles the package name.