127 Commits

Author SHA1 Message Date
Tuomas Katila
fd3ad4003f gpu: restructure readme
Split readme into smaller chunks, show only one "easy installation"
and hide the rest. Add some notes about tile resources.

Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-12-08 08:42:08 +02:00
Tuomas Katila
8640b1501c gpu: default to flat/combined mode for l0 affinity mask
With tile requests, the level zero affinity mask now defaults to
flat/combined mode. If ZE_FLAT_DEVICE_HIERARCHY is set to COMPOSITE
in the Pod's specification, the plugin will use the previous "x.y" format
instead of the new "x" in the affinity mask.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-12-08 08:42:02 +02:00
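
The mask format change in 8640b1501c above can be pictured with a minimal Go sketch. It is not the plugin's code: the `tile` type, the `affinityMask` function, and the flat index formula `card*tilesPerCard+index` are all assumptions made for illustration. Only the `ZE_FLAT_DEVICE_HIERARCHY` value `COMPOSITE` and the "x.y" versus "x" formats come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// tile identifies one GPU tile by card index and tile index within the card.
// Hypothetical type, used only for this illustration.
type tile struct {
	card, index int
}

// affinityMask builds a comma-separated affinity mask for the given tiles.
// With hierarchy "COMPOSITE" each entry uses the older "x.y" (card.tile) form.
// Otherwise every tile gets a single "x" index; computing that index as
// card*tilesPerCard+index is an assumption of this sketch.
func affinityMask(hierarchy string, tiles []tile, tilesPerCard int) string {
	parts := make([]string, 0, len(tiles))
	for _, t := range tiles {
		if hierarchy == "COMPOSITE" {
			parts = append(parts, fmt.Sprintf("%d.%d", t.card, t.index))
		} else {
			parts = append(parts, fmt.Sprintf("%d", t.card*tilesPerCard+t.index))
		}
	}
	return strings.Join(parts, ",")
}

func main() {
	tiles := []tile{{0, 1}, {1, 0}}
	fmt.Println(affinityMask("COMPOSITE", tiles, 2)) // 0.1,1.0
	fmt.Println(affinityMask("FLAT", tiles, 2))      // 1,2
}
```

The same two tiles yield `0.1,1.0` in composite mode and `1,2` in flat mode, assuming two tiles per card.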
Eero Tamminen
3ade6d44ce List writable render devices with no render-device.sh args
To help debugging potential kubernetes device usage issues.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2023-12-01 21:07:30 +02:00
Eero Tamminen
4b3944600f Fix (harmless) render-device.sh shellcheck warnings
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2023-12-01 21:07:30 +02:00
Mikko Ylinen
33e0e21a8b gpu: fix klog formatting typo
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2023-11-03 09:29:21 +02:00
Mikko Ylinen
834f598f80 deployments: update to NFD v0.14.1 and drop custom GPU deployment
With recent NFD versions (v0.13+), it's no longer necessary to
start NFD with custom nfd-master args/rbac settings to get numeric
labels registered as extended resources.

The same can be specified via NodeFeatureRules which also works for
"local" source with feature files.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2023-09-20 14:02:52 +03:00
Tuomas Katila
031ee64590 gpu/doc: Add Max Series support and a note about SR-IOV
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-14 13:21:30 +03:00
Tuomas Katila
827b9a0ced fix crash with rm when kubelet request times out
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
ea659a5e4b nfd: add rules to label nodes with different GPUs
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
691dfc3483 gpu: refactor nfdhook functionality to plugin
NFD v0.14+ doesn't support binary NFD hooks by default, so there is
a need to move the label creation away from the GPU nfdhook.

Move extended resource label creation to plugin, and drop labels that were
already marked deprecated (platform_gen, media_version etc.).

Drop init-container from deployment files and operator. It is still possible
to use an initcontainer, but the default deployments do not support it.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
532f2fe8cd gpu/rm: add error check in kubelet flow
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-05-24 09:52:07 +03:00
Mikko Ylinen
e428cd6c19 go.mod: update to k8s 1.27.1 and controller runtime 0.15.x
k8s 1.27.x triggers build errors on controller-runtime 0.14.x
so we will need to update to 0.15.x at the same time.

Changes include:

* k8s e2e framework moved to use Ginkgo context so we add
  test context to all our test nodes.
* adapt Ginkgo parameter modifications.
* adapt SGX admissionwebhook to InjectDecoder removal.
* adapt deviceplugins and FPGA CRDs to controller-runtime
  API changes.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2023-05-09 14:49:24 +03:00
Tuomas Katila
4e645d823c gpu: change 'none' allocation policy
With shared-dev-num and multiple i915s in the resource request,
try to find as many individual GPUs as possible to expose to the container.

Previously, with multiple i915 resources, it was typical to
get only one GPU device in the container.

Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-05-04 13:39:10 +03:00
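
The behaviour described in 4e645d823c above, spreading a multi-i915 request over separate GPUs, could look roughly like the following Go sketch. It is an illustration, not the plugin's allocator: the function name `preferDistinctGPUs` is hypothetical, and the "cardN-M" device ID format (card name, dash, share index) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// preferDistinctGPUs picks up to size device IDs from available, taking one ID
// per physical card first and reusing already-picked cards only if needed.
// IDs are assumed to look like "card0-1" (card name, dash, share index).
func preferDistinctGPUs(available []string, size int) []string {
	picked := make([]string, 0, size)
	seen := map[string]bool{}
	var rest []string

	// First pass: at most one shared slot from each card.
	for _, id := range available {
		card, _, _ := strings.Cut(id, "-")
		if !seen[card] && len(picked) < size {
			seen[card] = true
			picked = append(picked, id)
		} else {
			rest = append(rest, id)
		}
	}

	// Second pass: fill the remainder from cards that are already in use.
	for _, id := range rest {
		if len(picked) >= size {
			break
		}
		picked = append(picked, id)
	}

	return picked
}

func main() {
	ids := []string{"card0-0", "card0-1", "card1-0", "card1-1"}
	fmt.Println(preferDistinctGPUs(ids, 2)) // [card0-0 card1-0]
	fmt.Println(preferDistinctGPUs(ids, 3)) // [card0-0 card1-0 card0-1]
}
```

With two cards shared twice each, a request for two i915 resources lands on both cards instead of two slots of the same card.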
Tuomas Katila
342554c666 lint fixes found from 0.26.1 release preparation
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-05-02 13:52:36 +03:00
Tuomas Katila
8971280215 gpu: add notes about gpu-plugin modes
Fixes: #1381

Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-04-26 14:28:36 +03:00
Tuomas Katila
2a365263b0 gpu: add note about dry-run and yaml output
Fixes: #1059

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-04-24 09:52:36 +03:00
Tuomas Katila
9cb08cffb8
Merge pull request #1386 from eero-t/gpu-drivers
Update GPU plugin README driver information
2023-04-20 15:02:08 +03:00
Tuomas Katila
943e34f3af gpu: mount by-path directory
oneCCL requires the /dev/dri/by-path folder to be available
to create a mapping between GPUs.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-04-20 14:56:59 +03:00
Eero Tamminen
92b8fe9380 Update GPU plugin README driver information
Fixes: #1382

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2023-04-20 13:53:13 +03:00
Tuomas Katila
974829ff7c gpu: try to fetch PodList from kubelet API
In large clusters and with resource management, the load
from gpu-plugins can become heavy for the api-server.
This change will start fetching pod listings from kubelet
and use api-server as a backup. Any error other than a timeout
will also move the logic back to using api-server.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-03-30 12:43:02 +03:00
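
A minimal Go sketch of the kubelet-first listing with api-server backup described in 974829ff7c above. All names (`listFunc`, `podLister`, the stub functions) are hypothetical. The policy shown is one reading of the commit message: a timeout falls back for that call only, while any other error switches to the api-server for good.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// listFunc returns the names of the pods on this node.
// Hypothetical stand-in for a real kubelet or api-server query.
type listFunc func() ([]string, error)

// podLister prefers the kubelet and keeps the api-server as a backup.
type podLister struct {
	fromKubelet   listFunc
	fromAPIServer listFunc
	useKubelet    bool
}

// list tries the kubelet first and falls back to the api-server on error.
func (l *podLister) list() ([]string, error) {
	if l.useKubelet {
		pods, err := l.fromKubelet()
		if err == nil {
			return pods, nil
		}
		// Assumption of this sketch: a timeout is transient, so the kubelet
		// is tried again next time; any other error disables the kubelet path.
		if !errors.Is(err, context.DeadlineExceeded) {
			l.useKubelet = false
		}
	}
	return l.fromAPIServer()
}

// Stubs used to exercise the fallback logic.
func kubeletTimeout() ([]string, error) { return nil, context.DeadlineExceeded }
func kubeletRefused() ([]string, error) { return nil, errors.New("connection refused") }
func apiServerPods() ([]string, error)  { return []string{"pod-from-api-server"}, nil }

func main() {
	l := &podLister{fromKubelet: kubeletTimeout, fromAPIServer: apiServerPods, useKubelet: true}
	pods, _ := l.list()
	fmt.Println(pods, l.useKubelet) // [pod-from-api-server] true

	l.fromKubelet = kubeletRefused
	pods, _ = l.list()
	fmt.Println(pods, l.useKubelet) // [pod-from-api-server] false
}
```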
Ukri Niemimuukko
3feb185277 randomize cleanup interval and increase it to 20 minutes
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2023-03-24 10:39:55 +02:00
Tuomas Katila
527f638367 test: gpu: add fake target for grpc.Dial
In preparation for grpc 1.52.0.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-01-12 11:50:47 +02:00
Tuomas Katila
d1e8350c6e gpu: add new nfd + monitoring + shared-dev deployment option
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-01-05 14:13:13 +02:00
Ukri Niemimuukko
8ed705d79c unexport internal types
ContainerAssignments and PodAssignmentDetails need not be exported.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-12-30 16:39:41 +02:00
Lukas Kalbertodt
ae0f9c5334
Fix command in docs by adding single quotes
Otherwise most shells will interpret `?` in an unintended way.
2022-12-16 12:28:52 +01:00
Ukri Niemimuukko
41b7b55727 gpu: log errors from pod listing
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-10-11 14:31:56 +03:00
Mikko Ylinen
75bff62ba1
Merge pull request #1183 from tkatila/gpu-demo-updates
gpu: improve demo run instructions
2022-10-07 13:08:54 +03:00
Eero Tamminen
0b47ebd3e7 Add information on new DKMS kernel GPU driver packages
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-10-06 18:08:53 +03:00
Tuomas Katila
56bc5ebeee Modifications based on Eero's comments
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-10-06 17:55:04 +03:00
Tuomas Katila
63cbe808a7 gpu: improve demo run instructions
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-10-05 16:10:03 +03:00
Eero Tamminen
647b484e7a Improve GPU drivers installation instructions
- Add note about LTS kernel DKMS source repo
- Correct note about the demo (unlike FPGA demo,
  GPU demo is not in docker hub)

Fixes: 89d3c5a4f3

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-09-28 12:40:30 +03:00
Eero Tamminen
9b3ee06cb1 Add GPU plugin README prerequisites section
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-09-23 20:32:46 +03:00
Tuomas Katila
eac635e439 gpu: fix documentation links
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-09-23 20:32:46 +03:00
Tuomas Katila
e375186458
Update cmd/gpu_plugin/README.md
Co-authored-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-09-15 15:30:23 +03:00
Tuomas Katila
c562db9b28 gpu: Improve installation options and documentation
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-09-15 15:19:23 +03:00
Tuomas Katila
230570f12e gpu: add mentions about data center gpu support
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-09-09 13:07:50 +03:00
Ed Bartosh
f0dd95274e
Merge pull request #1126 from mythi/PR-2022-054
docs: rework development guide
2022-09-02 17:59:15 +03:00
Mikko Ylinen
1b3accacc2 docs: rework development guide
Currently, each individual plugin README documents roughly the same
daily development steps to git clone, build, and deploy. Re-purpose
the plugin READMEs more towards cluster admin type of documentation
and start moving all development related documentation to DEVEL.md.

The same is true for e2e testing documentation, which is scattered
in places where it doesn't belong. It is good to have all day-to-day
development how-tos in a centralized place.

Finally, the cleanup includes some harmonization to plugins'
table of contents which now follows the pattern:

* [Introduction](#introduction)
(* [Modes and Configuration Options](#modes-and-configuration-options))
* [Installation](#installation)
    (* [Prerequisites](#prerequisites))
    * [Pre-built Images](#pre-built-images)
    * [Verify Plugin Registration](#verify-plugin-registration)
* [Testing and Demos](#testing-and-demos)
    * ...

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-08-31 20:00:15 +03:00
Eero Tamminen
fb18923298 Log GPU device share count & type count changes separately
And instead of accessing DeviceTree internals, add a suitable method for it.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-31 17:23:57 +03:00
Mikko Ylinen
d826548d29
Merge pull request #1113 from eero-t/gpu-count-log
More detailed log for number of found GPU devices / resource types
2022-08-29 09:58:53 +03:00
Eero Tamminen
ddf2c8bc8f More detailed log for number of found GPU devices / resource types
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-26 17:51:27 +03:00
Eero Tamminen
5666b8fa30 Add "prefix" option to GPU plugin for scalability testing
GPU plugin code assumes container paths to match host paths, and
container runtime prevents creating fake files under real paths. When
non-standard paths are used, devices can be faked for scalability
testing.

Note: If one wants to run both the normal GPU plugin and a faked one in the same
cluster, all nodes providing fake "i915" resources should be labeled
differently from ones with real GPU plugin + devices, so that real GPU
workloads can be limited to correct nodes with a suitable
nodeSelector.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-24 14:32:53 +03:00
Mikko Ylinen
642c4f7b59 build: move to Go 1.19 and golangci-lint 1.48 because of that
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-08-15 10:13:37 +03:00
Mikko Ylinen
2adad5ae76 drop deprecated grpc.WithInsecure()
grpc-go v1.43.0 deprecated grpc.WithInsecure() in favor of
insecure.NewCredentials(). Move to use the recommended approach
and drop the linter annotations.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-04-07 13:40:51 +03:00
Mikko Ylinen
0f36cde605
Merge pull request #935 from tkatila/gpu/tiles-support-and-numa-mapping
gpu: add tiles annotation support
2022-03-30 19:33:09 +03:00
Tuomas Katila
8f6a235b5d gpu: Start using GetPreferredAllocation with fractional resources
Move reallocate logic to getpreferredallocation and simplify
allocate to use the kubelet's device ids.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-03-30 11:32:49 +03:00
Hyeongju Johannes Lee
7eeaddc563 gpu: fix typo in implementation of preferredAllocator interface
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
2022-03-28 05:04:32 -07:00
Tuomas Katila
db7e5bfc55 Add support for gas-container-tiles annotation
Adds functionality to convert container's tile annotation
into the corresponding L0 affinity mask. This helps to target
container's workload to specific L0 subdevices.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-03-24 14:13:35 +02:00
Mikko Ylinen
c064bfc4f1 demo: add intel-opencl-icd
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-02-24 11:06:27 +02:00
Ed Bartosh
55f3e17dd0 add 'annotations' parameter to the NewDeviceInfo API
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
2022-02-07 15:15:30 +02:00