Commit Graph

135 Commits

Author SHA1 Message Date
Tuomas Katila
20b7b5a4d7
Merge pull request #1748 from mythi/PR-2024-013
pkg/deviceplugin: move to grpc.NewClient()
2024-05-28 12:09:22 +03:00
Mikko Ylinen
4d858c5364 pkg/deviceplugin: move to grpc.NewClient()
grpc.NewClient(), added in grpc-go v1.63, is the preferred way to
create a new ClientConn. In most of our usages, moving away from
grpc.Dial*() to it is straightforward.

However, we've also relied on grpc.Dial*() automatically making a new
connection to "test" that a connection is successful, and that behavior
isn't available anymore. Combined with the grpc.WithBlock dial option,
this usage is considered an "especially bad" way to handle a client
connection.

The recommended approach to test a server connection is to separately
make a connection and watch for the connection state to become Ready.
This change follows that recommendation.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2024-05-28 08:17:06 +03:00
Ed Bartosh
988fbed528 deviceplugin: add DeviceInfo.hooks field 2024-05-22 13:13:38 +03:00
Mikko Ylinen
54f9d730e9 ci: move to golangci-lint v1.57.2
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2024-05-02 09:18:27 +03:00
Tuomas Katila
4946b26018 gpu: doc: monitoring resource notes
Also align xelink-sidecar deployment with the new files in
the xpu manager project.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-03-13 08:16:16 +02:00
Tuomas Katila
1de1024530 gpu: add xe notes
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-03-12 15:41:44 +02:00
Tuomas Katila
e600fe9313 gpu: add support for the upcoming xe-driver
The plugin can support both i915 and xe drivers dynamically, but
having both drivers on the same node with RM is not possible.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2024-03-12 11:34:01 +02:00
hugo-syn
039865aec8
chore: Fix multiple typos (#1653)
* chore: Fix multiple typos

Signed-off-by: hugo-syn <hugo.vincent@synacktiv.com>
2024-01-25 08:18:48 +02:00
Tuomas Katila
fd3ad4003f gpu: restructure readme
Split readme into smaller chunks, show only one "easy installation"
and hide the rest. Add some notes about tile resources.

Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-12-08 08:42:08 +02:00
Tuomas Katila
8640b1501c gpu: default to flat/combined mode for l0 affinity mask
With tile requests, the level zero affinity mask now defaults to
flat/combined mode. If ZE_FLAT_DEVICE_HIERARCHY is set to COMPOSITE
in the Pod's specification, plugin will use the previous "x.y" format
instead of the new "x" in the affinity mask.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-12-08 08:42:02 +02:00
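The difference between the "x.y" and "x" mask formats described in commit 8640b1501c can be illustrated with the sketch below. It is an assumption-laden illustration, not the plugin's implementation: the `tile` type, the `affinityMask` function, the uniform `tilesPerCard` count, and the way the flat index is derived (card index times tiles per card, plus tile index) are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// tile identifies one GPU tile by its card index and its index within
// that card (hypothetical type for this sketch).
type tile struct{ card, index int }

// affinityMask builds a comma-separated Level Zero affinity mask.
// With hierarchy "COMPOSITE" each entry is "card.tile"; for any other
// value each tile is addressed by a single flat index. The flat index
// formula assumes every card has the same number of tiles.
func affinityMask(tiles []tile, tilesPerCard int, hierarchy string) string {
	parts := make([]string, 0, len(tiles))
	for _, t := range tiles {
		if hierarchy == "COMPOSITE" {
			parts = append(parts, fmt.Sprintf("%d.%d", t.card, t.index))
		} else {
			parts = append(parts, fmt.Sprintf("%d", t.card*tilesPerCard+t.index))
		}
	}
	return strings.Join(parts, ",")
}

func main() {
	// Second tile of the first card and first tile of the second card.
	tiles := []tile{{card: 0, index: 1}, {card: 1, index: 0}}

	fmt.Println(affinityMask(tiles, 2, "COMPOSITE")) // 0.1,1.0
	fmt.Println(affinityMask(tiles, 2, "FLAT"))      // 1,2
}
```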
Eero Tamminen
3ade6d44ce List writable render devices with no render-device.sh args
To help debugging potential kubernetes device usage issues.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2023-12-01 21:07:30 +02:00
Eero Tamminen
4b3944600f Fix (harmless) render-device.sh shellcheck warnings
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2023-12-01 21:07:30 +02:00
Mikko Ylinen
33e0e21a8b gpu: fix klog formatting typo
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2023-11-03 09:29:21 +02:00
Mikko Ylinen
834f598f80 deployments: update to NFD v0.14.1 and drop custom GPU deployment
With recent NFD versions (v0.13+), it's no longer necessary to
start NFD with custom nfd-master args/rbac settings to get numeric
labels registered as extended resources.

The same can be specified via NodeFeatureRules which also works for
"local" source with feature files.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2023-09-20 14:02:52 +03:00
Tuomas Katila
031ee64590 gpu/doc: Add Max Series support and a note about SR-IOV
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-14 13:21:30 +03:00
Tuomas Katila
827b9a0ced fix crash with rm when kubelet request times out
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
ea659a5e4b nfd: add rules to label nodes with different GPUs
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
691dfc3483 gpu: refactor nfdhook functionality to plugin
NFD v0.14+ doesn't support binary NFD hooks by default, so there is
a need to move the label creation away from the GPU nfdhook.

Move extended resource label creation to plugin, and drop labels that were
already marked deprecated (platform_gen, media_version etc.).

Drop init-container from deployment files and operator. It is still possible
to use an initcontainer, but the default deployments do not support it.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
532f2fe8cd gpu/rm: add error check in kubelet flow
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-05-24 09:52:07 +03:00
Mikko Ylinen
e428cd6c19 go.mod: update to k8s 1.27.1 and controller runtime 0.15.x
k8s 1.27.x triggers build errors on controller-runtime 0.14.x
so we will need to update to 0.15.x at the same time.

Changes include:

* k8s e2e framework moved to use Ginkgo context so we add
  test context to all our test nodes.
* adapt Ginkgo parameter modifications.
* adapt SGX admissionwebhook to InjectDecoder removal.
* adapt deviceplugins and FPGA CRDs to controller-runtime
  API changes.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2023-05-09 14:49:24 +03:00
Tuomas Katila
4e645d823c gpu: change 'none' allocation policy
With shared-dev-num and multiple i915s in the resource request,
try to find as many individual GPUs as possible to expose to the
container.

Previously, with multiple i915 resources, it was typical to
get only one GPU device in the container.

Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-05-04 13:39:10 +03:00
Tuomas Katila
342554c666 lint fixes found from 0.26.1 release preparation
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-05-02 13:52:36 +03:00
Tuomas Katila
8971280215 gpu: add notes about gpu-plugin modes
Fixes: #1381

Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-04-26 14:28:36 +03:00
Tuomas Katila
2a365263b0 gpu: add note about dry-run and yaml output
Fixes: #1059

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-04-24 09:52:36 +03:00
Tuomas Katila
9cb08cffb8
Merge pull request #1386 from eero-t/gpu-drivers
Update GPU plugin README driver information
2023-04-20 15:02:08 +03:00
Tuomas Katila
943e34f3af gpu: mount by-path directory
oneCCL requires the /dev/dri/by-path folder to be available
to create a mapping between GPUs.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-04-20 14:56:59 +03:00
Eero Tamminen
92b8fe9380 Update GPU plugin README driver information
Fixes: #1382

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2023-04-20 13:53:13 +03:00
Tuomas Katila
974829ff7c gpu: try to fetch PodList from kubelet API
In large clusters and with resource management, the load
from gpu-plugins can become heavy for the api-server.
This change will start fetching pod listings from kubelet
and use api-server as a backup. Any error other than a timeout
will also move the logic back to using api-server.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-03-30 12:43:02 +03:00
Ukri Niemimuukko
3feb185277 randomize cleanup interval and increase it to 20 minutes
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2023-03-24 10:39:55 +02:00
Tuomas Katila
527f638367 test: gpu: add fake target for grpc.Dial
In preparation for grpc 1.52.0.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-01-12 11:50:47 +02:00
Tuomas Katila
d1e8350c6e gpu: add new nfd + monitoring + shared-dev deployment option
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-01-05 14:13:13 +02:00
Ukri Niemimuukko
8ed705d79c unexport internal types
ContainerAssignments and PodAssignmentDetails need not be exported.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-12-30 16:39:41 +02:00
Lukas Kalbertodt
ae0f9c5334
Fix command in docs by adding single quotes
Otherwise most shells will interpret `?` in an unintended way.
2022-12-16 12:28:52 +01:00
Ukri Niemimuukko
41b7b55727 gpu: log errors from pod listing
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-10-11 14:31:56 +03:00
Mikko Ylinen
75bff62ba1
Merge pull request #1183 from tkatila/gpu-demo-updates
gpu: improve demo run instructions
2022-10-07 13:08:54 +03:00
Eero Tamminen
0b47ebd3e7 Add information on new DKMS kernel GPU driver packages
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-10-06 18:08:53 +03:00
Tuomas Katila
56bc5ebeee Modifications based on Eero's comments
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-10-06 17:55:04 +03:00
Tuomas Katila
63cbe808a7 gpu: improve demo run instructions
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-10-05 16:10:03 +03:00
Eero Tamminen
647b484e7a Improve GPU drivers installation instructions
- Add note about LTS kernel DKMS source repo
- Correct note about the demo (unlike FPGA demo,
  GPU demo is not in docker hub)

Fixes: 89d3c5a4f3

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-09-28 12:40:30 +03:00
Eero Tamminen
9b3ee06cb1 Add GPU plugin README prerequisites section
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-09-23 20:32:46 +03:00
Tuomas Katila
eac635e439 gpu: fix documentation links
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-09-23 20:32:46 +03:00
Tuomas Katila
e375186458
Update cmd/gpu_plugin/README.md
Co-authored-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-09-15 15:30:23 +03:00
Tuomas Katila
c562db9b28 gpu: Improve installation options and documentation
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-09-15 15:19:23 +03:00
Tuomas Katila
230570f12e gpu: add mentions about data center gpu support
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-09-09 13:07:50 +03:00
Ed Bartosh
f0dd95274e
Merge pull request #1126 from mythi/PR-2022-054
docs: rework development guide
2022-09-02 17:59:15 +03:00
Mikko Ylinen
1b3accacc2 docs: rework development guide
Currently, each individual plugin README documents roughly the same
daily development steps to git clone, build, and deploy. Re-purpose
the plugin READMEs more towards cluster admin type of documentation
and start moving all development related documentation to DEVEL.md.

The same is true for e2e testing documentation, which is scattered
in places where it doesn't belong. It is good to have all day-to-day
development how-tos in a centralized place.

Finally, the cleanup includes some harmonization to plugins'
table of contents which now follows the pattern:

* [Introduction](#introduction)
(* [Modes and Configuration Options](#modes-and-configuration-options))
* [Installation](#installation)
    (* [Prerequisites](#prerequisites))
    * [Pre-built Images](#pre-built-images)
    * [Verify Plugin Registration](#verify-plugin-registration)
* [Testing and Demos](#testing-and-demos)
    * ...

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-08-31 20:00:15 +03:00
Eero Tamminen
fb18923298 Log GPU device share count & type count changes separately
And instead of accessing DeviceTree internals, add a suitable method for it.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-31 17:23:57 +03:00
Mikko Ylinen
d826548d29
Merge pull request #1113 from eero-t/gpu-count-log
More detailed log for number of found GPU devices / resource types
2022-08-29 09:58:53 +03:00
Eero Tamminen
ddf2c8bc8f More detailed log for number of found GPU devices / resource types
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-26 17:51:27 +03:00
Eero Tamminen
5666b8fa30 Add "prefix" option to GPU plugin for scalability testing
GPU plugin code assumes container paths to match host paths, and
container runtime prevents creating fake files under real paths. When
non-standard paths are used, devices can be faked for scalability
testing.

Note: If one wants to run both the normal GPU plugin and a faked one in
the same cluster, all nodes providing fake "i915" resources should be
labeled differently from the ones with the real GPU plugin and devices,
so that real GPU workloads can be limited to the correct nodes with a
suitable nodeSelector.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-24 14:32:53 +03:00
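How a path prefix such as the one added in commit 5666b8fa30 lets device scanning point at faked files could be sketched as below. The `devicePaths` helper is hypothetical and the sketch assumes the plugin scans the usual `/sys/class/drm` and `/dev/dri` locations. It only shows the path handling, not the plugin's actual option parsing.

```go
package main

import (
	"fmt"
	"path"
)

// devicePaths returns the sysfs and devfs directories to scan. An empty
// prefix yields the real host paths. A non-empty prefix relocates both
// under a directory where fake device files can be created.
func devicePaths(prefix string) (sysfsDir, devfsDir string) {
	return path.Join(prefix, "/sys/class/drm"), path.Join(prefix, "/dev/dri")
}

func main() {
	sysfsDir, devfsDir := devicePaths("")
	fmt.Println(sysfsDir, devfsDir) // /sys/class/drm /dev/dri

	sysfsDir, devfsDir = devicePaths("/tmp/fake-gpu")
	fmt.Println(sysfsDir, devfsDir) // /tmp/fake-gpu/sys/class/drm /tmp/fake-gpu/dev/dri
}
```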