Commit Graph

144 Commits

Author SHA1 Message Date
Ukri Niemimuukko
daf64052e5
Update cmd/gpu_plugin/fractional.md
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
2025-01-24 15:33:15 +02:00
Ukri Niemimuukko
1c40eaaa83 Add deprecation notices about GAS 2025-01-23 20:21:36 +02:00
Mikko Ylinen
6255810e0d gpu/rm: move to fake.NewClientSet()
k8s v1.32 client-go makes FakePods private so the current
resourcemanager fake client won't work anymore.

client-go provides a simple fake Client that works easily so
just move to use it.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2025-01-02 12:00:34 +02:00
Tuomas Katila
d9cb0fc3f9 gpu: add a note about non-default namespaces with fractional resources
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-09-25 13:15:36 +03:00
Tuomas Katila
fc2dce588c Rename pci to PCI in various places
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-09-19 19:14:15 +03:00
Tuomas Katila
606ac77647 gpu: levelzero: documentation
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-09-19 19:14:15 +03:00
Tuomas Katila
518a8606ff gpu: add levelzero sidecar support for plugin and the deployment files
In addition to using Level-Zero health data, this adds support for
scanning devices in WSL. Scanning retrieves Intel device indices
from the Level-Zero API.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-09-19 19:14:15 +03:00
Tuomas Katila
402fb8d9cd gpu: add support for CDI devices
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-09-11 09:29:55 +03:00
Tuomas Katila
fa6d027b24 Fix some lint errors
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-08-27 11:40:29 +03:00
Tuomas Katila
20b7b5a4d7
Merge pull request #1748 from mythi/PR-2024-013
pkg/deviceplugin: move to grpc.NewClient()
2024-05-28 12:09:22 +03:00
Mikko Ylinen
4d858c5364 pkg/deviceplugin: move to grpc.NewClient()
grpc.NewClient(), added in grpc-go v1.63, is the preferred way to
create a new ClientConn. In most of our usages, moving away from
grpc.Dial*() to it is straightforward.

However, we've also relied on grpc.Dial*()'s behavior of automatically
making a new connection to "test" that a connection is successful, and that
isn't available anymore. Combined with the grpc.WithBlock dial option, this
usage is considered an "especially bad" way to handle a client connection.

The recommended approach to test a server connection is to separately
make a connection and watch the connection state to become Ready. This
change follows that recommendation.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2024-05-28 08:17:06 +03:00
Ed Bartosh
988fbed528 deviceplugin: add DeviceInfo.hooks field 2024-05-22 13:13:38 +03:00
Mikko Ylinen
54f9d730e9 ci: move to golangci-lint v1.57.2
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2024-05-02 09:18:27 +03:00
Tuomas Katila
4946b26018 gpu: doc: monitoring resource notes
Also align xelink-sidecar deployment with the new files in
the xpu manager project.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-03-13 08:16:16 +02:00
Tuomas Katila
1de1024530 gpu: add xe notes
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-03-12 15:41:44 +02:00
Tuomas Katila
e600fe9313 gpu: add support for the upcoming xe-driver
The plugin can support both i915 and xe drivers dynamically, but
having both drivers on the same node with RM is not possible.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-03-12 11:34:01 +02:00
hugo-syn
039865aec8
chore: Fix multiple typos (#1653)
* chore: Fix multiple typos

Signed-off-by: hugo-syn <hugo.vincent@synacktiv.com>
2024-01-25 08:18:48 +02:00
Tuomas Katila
fd3ad4003f gpu: restructure readme
Split readme into smaller chunks, show only one "easy installation"
and hide the rest. Add some notes about tile resources.

Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-12-08 08:42:08 +02:00
Tuomas Katila
8640b1501c gpu: default to flat/combined mode for l0 affinity mask
With tile requests, the level zero affinity mask now defaults to
flat/combined mode. If ZE_FLAT_DEVICE_HIERARCHY is set to COMPOSITE
in the Pod's specification, plugin will use the previous "x.y" format
instead of the new "x" in the affinity mask.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-12-08 08:42:02 +02:00
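
The two affinity mask formats described in the commit above can be illustrated with a minimal Go sketch. This is not the plugin's code: the `tile` type, the `affinityMask` function, the `tilesPerCard` parameter, and the flat numbering scheme (card index times tiles per card plus tile index) are assumptions made for illustration. Only the "x.y" versus "x" formats and the ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE switch come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// tile identifies one GPU tile by card index and tile index within that card.
// Hypothetical type, used only in this sketch.
type tile struct {
	Card, Index int
}

// affinityMask builds a comma-separated Level Zero affinity mask value.
// hierarchy is the Pod's ZE_FLAT_DEVICE_HIERARCHY value ("" when unset).
// With COMPOSITE, each tile is written as "x.y" (card.tile); otherwise
// each tile is written as a single flat index "x".
// Assumption: every card has the same number of tiles (tilesPerCard), and
// flat indices are numbered card by card.
func affinityMask(tiles []tile, hierarchy string, tilesPerCard int) string {
	parts := make([]string, 0, len(tiles))
	for _, t := range tiles {
		if hierarchy == "COMPOSITE" {
			parts = append(parts, fmt.Sprintf("%d.%d", t.Card, t.Index))
		} else {
			parts = append(parts, fmt.Sprintf("%d", t.Card*tilesPerCard+t.Index))
		}
	}
	return strings.Join(parts, ",")
}
```

With two tiles per card, requesting tile 1 of card 0 and tile 0 of card 1 gives "0.1,1.0" in COMPOSITE mode and "1,2" in the default flat mode under the numbering assumed here.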
Eero Tamminen
3ade6d44ce List writable render devices with no render-device.sh args
To help debugging potential kubernetes device usage issues.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2023-12-01 21:07:30 +02:00
Eero Tamminen
4b3944600f Fix (harmless) render-device.sh shellcheck warnings
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2023-12-01 21:07:30 +02:00
Mikko Ylinen
33e0e21a8b gpu: fix klog formatting typo
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2023-11-03 09:29:21 +02:00
Mikko Ylinen
834f598f80 deployments: update to NFD v0.14.1 and drop custom GPU deployment
With recent NFD versions (v0.13+), it's no longer necessary to
start NFD with custom nfd-master args/rbac settings to get numeric
labels registered as extended resources.

The same can be specified via NodeFeatureRules which also works for
"local" source with feature files.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2023-09-20 14:02:52 +03:00
Tuomas Katila
031ee64590 gpu/doc: Add Max Series support and a note about SR-IOV
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-14 13:21:30 +03:00
Tuomas Katila
827b9a0ced fix crash with rm when kubelet request times out
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
ea659a5e4b nfd: add rules to label nodes with different GPUs
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
691dfc3483 gpu: refactor nfdhook functionality to plugin
NFD v0.14+ doesn't support binary NFD hooks by default, so there is
a need to move the label creation away from the GPU nfdhook.

Move extended resource label creation to plugin, and drop labels that were
already marked deprecated (platform_gen, media_version etc.).

Drop init-container from deployment files and operator. It is still possible
to use an initcontainer, but the default deployments do not support it.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
532f2fe8cd gpu/rm: add error check in kubelet flow
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-05-24 09:52:07 +03:00
Mikko Ylinen
e428cd6c19 go.mod: update to k8s 1.27.1 and controller runtime 0.15.x
k8s 1.27.x triggers build errors on controller-runtime 0.14.x
so we will need to update to 0.15.x at the same time.

Changes include:

* k8s e2e framework moved to use Ginkgo context so we add
  test context to all our test nodes.
* adapt Ginkgo parameter modifications.
* adapt SGX admissionwebhook to InjectDecoder removal.
* adapt deviceplugins and FPGA CRDs to controller-runtime
  API changes.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2023-05-09 14:49:24 +03:00
Tuomas Katila
4e645d823c gpu: change 'none' allocation policy
With shared-dev-num and multiple i915s in the resource request,
try to find as many individual GPUs as possible to expose to the container.

Previously, with multiple i915 resources, it was typical to
get only one GPU device in the container.

Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-05-04 13:39:10 +03:00
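
The idea in the commit above, spreading a multi-resource request over as many distinct GPUs as possible, could be sketched as a round-robin pick. This is an illustrative sketch only, not the plugin's allocation code: `pickDevices`, the `free` map of device name to free shared slots, and the round-robin strategy itself are assumptions.

```go
package main

import "sort"

// pickDevices selects `want` shared slots, visiting devices round-robin so
// that the result covers as many distinct GPUs as possible before any GPU is
// used twice. free maps a device name to its number of free shared slots
// (the shared-dev-num capacity minus what is already in use).
// It returns nil when the devices cannot satisfy the request.
// Hypothetical helper, used only in this sketch.
func pickDevices(free map[string]int, want int) []string {
	names := make([]string, 0, len(free))
	remaining := make(map[string]int, len(free))
	for name, count := range free {
		names = append(names, name)
		remaining[name] = count
	}
	sort.Strings(names) // deterministic order for the sketch

	picked := make([]string, 0, want)
	for len(picked) < want {
		progress := false
		for _, name := range names {
			if len(picked) == want {
				break
			}
			if remaining[name] > 0 {
				remaining[name]--
				picked = append(picked, name)
				progress = true
			}
		}
		if !progress {
			return nil // no free slots left on any device
		}
	}
	return picked
}
```

For two GPUs with two free slots each, a request for two resources yields one slot on each GPU rather than two slots on the same one.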
Tuomas Katila
342554c666 lint fixes found from 0.26.1 release preparation
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-05-02 13:52:36 +03:00
Tuomas Katila
8971280215 gpu: add notes about gpu-plugin modes
Fixes: #1381

Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-04-26 14:28:36 +03:00
Tuomas Katila
2a365263b0 gpu: add note about dry-run and yaml output
Fixes: #1059

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-04-24 09:52:36 +03:00
Tuomas Katila
9cb08cffb8
Merge pull request #1386 from eero-t/gpu-drivers
Update GPU plugin README driver information
2023-04-20 15:02:08 +03:00
Tuomas Katila
943e34f3af gpu: mount by-path directory
oneCCL requires the /dev/dri/by-path folder to be available
to create a mapping between GPUs.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-04-20 14:56:59 +03:00
Eero Tamminen
92b8fe9380 Update GPU plugin README driver information
Fixes: #1382

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2023-04-20 13:53:13 +03:00
Tuomas Katila
974829ff7c gpu: try to fetch PodList from kubelet API
In large clusters and with resource management, the load
from gpu-plugins can become heavy for the api-server.
This change will start fetching pod listings from kubelet
and use api-server as a backup. Any error other than a timeout
will also move the logic back to using api-server.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-03-30 12:43:02 +03:00
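
A minimal Go sketch of the kubelet-first, api-server-backup flow described in the commit above. The names `podLister`, `fallbackLister`, and `errTimeout` are hypothetical, and the exact policy is one reading of the commit message: a timeout falls back to api-server for that call only, while any other kubelet error switches to api-server for all later calls.

```go
package main

import "errors"

// errTimeout stands in for a kubelet request timeout in this sketch.
var errTimeout = errors.New("kubelet request timed out")

// podLister returns the names of pods on the node. Hypothetical signature.
type podLister func() ([]string, error)

// fallbackLister prefers the kubelet API and uses api-server as a backup.
type fallbackLister struct {
	kubelet      podLister
	apiServer    podLister
	useAPIServer bool // set once the kubelet path is abandoned
}

// List tries kubelet first. On a timeout it falls back to api-server for
// this call only. On any other kubelet error it switches to api-server
// permanently (an assumed interpretation of the commit message).
func (f *fallbackLister) List() ([]string, error) {
	if !f.useAPIServer {
		pods, err := f.kubelet()
		if err == nil {
			return pods, nil
		}
		if !errors.Is(err, errTimeout) {
			f.useAPIServer = true
		}
	}
	return f.apiServer()
}
```

Keeping the switch in a single boolean means api-server only sees load from nodes where the kubelet path has actually failed.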
Ukri Niemimuukko
3feb185277 randomize cleanup interval and increase it to 20 minutes
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2023-03-24 10:39:55 +02:00
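
Randomizing a periodic interval, as in the commit above, is commonly done by adding jitter to a base period. The sketch below assumes a 20 minute base taken from the commit title. The `jitteredInterval` helper, the `baseCleanupInterval` constant, and the choice of additive uniform jitter are illustrative assumptions, not the plugin's implementation.

```go
package main

import (
	"math/rand"
	"time"
)

// baseCleanupInterval mirrors the 20 minutes named in the commit title.
const baseCleanupInterval = 20 * time.Minute

// jitteredInterval returns base plus a uniformly random offset in
// [0, maxJitter), so that many plugin instances do not all run their
// cleanup at the same moment. A non-positive maxJitter disables jitter.
// Hypothetical helper, used only in this sketch.
func jitteredInterval(base, maxJitter time.Duration) time.Duration {
	if maxJitter <= 0 {
		return base
	}
	return base + time.Duration(rand.Int63n(int64(maxJitter)))
}
```

A caller would typically compute `jitteredInterval(baseCleanupInterval, 2*time.Minute)` once per cycle and sleep for the result before the next cleanup.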
Tuomas Katila
527f638367 test: gpu: add fake target for grpc.Dial
In preparation for grpc 1.52.0.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-01-12 11:50:47 +02:00
Tuomas Katila
d1e8350c6e gpu: add new nfd + monitoring + shared-dev deployment option
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-01-05 14:13:13 +02:00
Ukri Niemimuukko
8ed705d79c unexport internal types
ContainerAssignments and PodAssignmentDetails need not be exported.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-12-30 16:39:41 +02:00
Lukas Kalbertodt
ae0f9c5334
Fix command in docs by adding single quotes
Otherwise most shells will interpret `?` in an unintended way.
2022-12-16 12:28:52 +01:00
Ukri Niemimuukko
41b7b55727 gpu: log errors from pod listing
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-10-11 14:31:56 +03:00
Mikko Ylinen
75bff62ba1
Merge pull request #1183 from tkatila/gpu-demo-updates
gpu: improve demo run instructions
2022-10-07 13:08:54 +03:00
Eero Tamminen
0b47ebd3e7 Add information on new DKMS kernel GPU driver packages
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-10-06 18:08:53 +03:00
Tuomas Katila
56bc5ebeee Modifications based on Eero's comments
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-10-06 17:55:04 +03:00
Tuomas Katila
63cbe808a7 gpu: improve demo run instructions
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-10-05 16:10:03 +03:00
Eero Tamminen
647b484e7a Improve GPU drivers installation instructions
- Add note about LTS kernel DKMS source repo
- Correct note about the demo (unlike FPGA demo,
  GPU demo is not in docker hub)

Fixes: 89d3c5a4f3

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-09-28 12:40:30 +03:00
Eero Tamminen
9b3ee06cb1 Add GPU plugin README prerequisites section
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-09-23 20:32:46 +03:00
Tuomas Katila
eac635e439 gpu: fix documentation links
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-09-23 20:32:46 +03:00