intel-device-plugins-for-kubernetes

mirror of https://github.com/intel/intel-device-plugins-for-kubernetes.git synced 2025-06-03 03:59:37 +00:00

Author	SHA1	Message	Date
Tuomas Katila	20b7b5a4d7	Merge pull request #1748 from mythi/PR-2024-013 pkg/deviceplugin: move to grpc.NewClient()	2024-05-28 12:09:22 +03:00
Mikko Ylinen	4d858c5364	pkg/deviceplugin: move to grpc.NewClient() grpc.NewClient(), added in grpc-go v1.63, is the preferred way to create a new ClientConn. In most of our usages, moving away from grpc.Dial() to it is straightforward. However, we've also relied on grpc.Dial()'s behavior of automatically making a new connection to "test" that a connection is successful, and that behavior isn't available anymore. Combined with the grpc.WithBlock dial option, this usage is considered an "especially bad" way to handle a client connection. The recommended approach to test a server connection is to separately make a connection and watch for the connection state to become Ready. This change follows that recommendation. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2024-05-28 08:17:06 +03:00
Ed Bartosh	988fbed528	deviceplugin: add DeviceInfo.hooks field	2024-05-22 13:13:38 +03:00
Mikko Ylinen	54f9d730e9	ci: move to golangci-lint v1.57.2 Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2024-05-02 09:18:27 +03:00
Tuomas Katila	4946b26018	gpu: doc: monitoring resource notes Also align xelink-sidecar deployment with the new files in the xpu manager project. Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2024-03-13 08:16:16 +02:00
Tuomas Katila	1de1024530	gpu: add xe notes Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com> Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2024-03-12 15:41:44 +02:00
Tuomas Katila	e600fe9313	gpu: add support for the upcoming xe-driver Plugin can support both i915 and xe drivers dynamically. But having both drivers on same node with RM is not possible. Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2024-03-12 11:34:01 +02:00
hugo-syn	039865aec8	chore: Fix multiple typos (#1653) * chore: Fix multiple typos Signed-off-by: hugo-syn <hugo.vincent@synacktiv.com>	2024-01-25 08:18:48 +02:00
Tuomas Katila	fd3ad4003f	gpu: restructure readme Split readme into smaller chunks, show only one "easy installation" and hide the rest. Add some notes about tile resources. Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com> Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-12-08 08:42:08 +02:00
Tuomas Katila	8640b1501c	gpu: default to flat/combined mode for l0 affinity mask With tile requests, the level zero affinity mask now defaults to flat/combined mode. If ZE_FLAT_DEVICE_HIERARCHY is set to COMPOSITE in the Pod's specification, plugin will use the previous "x.y" format instead of the new "x" in the affinity mask. Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-12-08 08:42:02 +02:00
Eero Tamminen	3ade6d44ce	List writable render devices with no render-device.sh args To help debugging potential kubernetes device usage issues. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2023-12-01 21:07:30 +02:00
Eero Tamminen	4b3944600f	Fix (harmless) render-device.sh shellcheck warnings Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2023-12-01 21:07:30 +02:00
Mikko Ylinen	33e0e21a8b	gpu: fix klog formatting typo Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2023-11-03 09:29:21 +02:00
Mikko Ylinen	834f598f80	deployments: update to NFD v0.14.1 and drop custom GPU deployment With recent NFD versions (v0.13+), it's no longer necessary to start NFD with custom nfd-master args/rbac settings to get numeric labels registered as extended resources. The same can be specified via NodeFeatureRules, which also work for the "local" source with feature files. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2023-09-20 14:02:52 +03:00
Tuomas Katila	031ee64590	gpu/doc: Add Max Series support and a note about SR-IOV Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-09-14 13:21:30 +03:00
Tuomas Katila	827b9a0ced	fix crash with rm when kubelet request timeouts Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-09-12 16:20:33 +03:00
Tuomas Katila	ea659a5e4b	nfd: add rules to label nodes with different GPUs Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-09-12 16:20:33 +03:00
Tuomas Katila	691dfc3483	gpu: refactor nfdhook functionality to plugin NFD v0.14+ doesn't support binary NFD hooks by default, so there is a need to move the label creation away from the GPU nfdhook. Move extended resource label creation to plugin, and drop labels that were already marked deprecated (platform_gen, media_version etc.). Drop init-container from deployment files and operator. It is still possible to use an initcontainer, but the default deployments do not support it. Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-09-12 16:20:33 +03:00
Tuomas Katila	532f2fe8cd	gpu/rm: add error check in kubelet flow Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-05-24 09:52:07 +03:00
Mikko Ylinen	e428cd6c19	go.mod: update to k8s 1.27.1 and controller runtime 0.15.x k8s 1.27.x triggers build errors on controller-runtime 0.14.x so we will need to update to 0.15.x at the same time. Changes include: * k8s e2e framework moved to use Ginkgo context so we add test context to all our test nodes. * adapt Ginkgo parameter modifications. * adapt SGX admissionwebhook to InjectDecoder removal. * adapt deviceplugins and FPGA CRDs to controller-runtime API changes. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2023-05-09 14:49:24 +03:00
Tuomas Katila	4e645d823c	gpu: change 'none' allocation policy With shared-dev-num and multiple i915s in the resource request, try to find as many individual GPUs as possible to expose to the container. Previously, with multiple i915 resources, it was typical to get only one GPU device in the container. Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com> Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-05-04 13:39:10 +03:00
Tuomas Katila	342554c666	lint fixes found from 0.26.1 release preparation Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-05-02 13:52:36 +03:00
Tuomas Katila	8971280215	gpu: add notes about gpu-plugin modes Fixes: #1381 Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com> Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-04-26 14:28:36 +03:00
Tuomas Katila	2a365263b0	gpu: add note about dry-run and yaml output Fixes: #1059 Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-04-24 09:52:36 +03:00
Tuomas Katila	9cb08cffb8	Merge pull request #1386 from eero-t/gpu-drivers Update GPU plugin README driver information	2023-04-20 15:02:08 +03:00
Tuomas Katila	943e34f3af	gpu: mount by-path directory oneCCL requires the /dev/dri/by-path folder to be available to create a mapping between GPUs. Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-04-20 14:56:59 +03:00
Eero Tamminen	92b8fe9380	Update GPU plugin README driver information Fixes: #1382 Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2023-04-20 13:53:13 +03:00
Tuomas Katila	974829ff7c	gpu: try to fetch PodList from kubelet API In large clusters and with resource management, the load from gpu-plugins can become heavy for the api-server. This change will start fetching pod listings from kubelet and use api-server as a backup. Any other error than timeout will also move the logic back to using api-server. Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-03-30 12:43:02 +03:00
Ukri Niemimuukko	3feb185277	randomize cleanup interval and increase it to 20 minutes Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>	2023-03-24 10:39:55 +02:00
Tuomas Katila	527f638367	test: gpu: add fake target for grpc.Dial In preparation for grpc 1.52.0. Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-01-12 11:50:47 +02:00
Tuomas Katila	d1e8350c6e	gpu: add new nfd + monitoring + shared-dev deployment option Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-01-05 14:13:13 +02:00
Ukri Niemimuukko	8ed705d79c	unexport internal types ContainerAssignments and PodAssignementDetauls need not be exported. Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>	2022-12-30 16:39:41 +02:00
Lukas Kalbertodt	ae0f9c5334	Fix command in docs by adding single quotes Otherwise most shells will interpret `?` in an unintended way.	2022-12-16 12:28:52 +01:00
Ukri Niemimuukko	41b7b55727	gpu: log errors from pod listing Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>	2022-10-11 14:31:56 +03:00
Mikko Ylinen	75bff62ba1	Merge pull request #1183 from tkatila/gpu-demo-updates gpu: improve demo run instructions	2022-10-07 13:08:54 +03:00
Eero Tamminen	0b47ebd3e7	Add information on new DKMS kernel GPU driver packages Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2022-10-06 18:08:53 +03:00
Tuomas Katila	56bc5ebeee	Modifications based on Eero's comments Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2022-10-06 17:55:04 +03:00
Tuomas Katila	63cbe808a7	gpu: improve demo run instructions Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2022-10-05 16:10:03 +03:00
Eero Tamminen	647b484e7a	Improve GPU drivers installation instructions - Add note about LTS kernel DKMS source repo - Correct note about the demo (unlike FPGA demo, GPU demo is not in docker hub) Fixes: `89d3c5a4f3` Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2022-09-28 12:40:30 +03:00
Eero Tamminen	9b3ee06cb1	Add GPU plugin README prerequisites section Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2022-09-23 20:32:46 +03:00
Tuomas Katila	eac635e439	gpu: fix documentation links Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2022-09-23 20:32:46 +03:00
Tuomas Katila	e375186458	Update cmd/gpu_plugin/README.md Co-authored-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>	2022-09-15 15:30:23 +03:00
Tuomas Katila	c562db9b28	gpu: Improve installation options and documentation Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2022-09-15 15:19:23 +03:00
Tuomas Katila	230570f12e	gpu: add mentions about data center gpu support Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2022-09-09 13:07:50 +03:00
Ed Bartosh	f0dd95274e	Merge pull request #1126 from mythi/PR-2022-054 docs: rework development guide	2022-09-02 17:59:15 +03:00
Mikko Ylinen	1b3accacc2	docs: rework development guide Currently, each individual plugin README documents roughly the same daily development steps to git clone, build, and deploy. Re-purpose the plugin READMEs more towards cluster admin type of documentation and start moving all development related documentation to DEVEL.md. The same is true for e2e testing documentation, which is scattered in places where it doesn't belong. It is good to have all day-to-day development how-tos in a centralized place. Finally, the cleanup includes some harmonization of the plugins' tables of contents, which now follow the pattern: * [Introduction](#introduction) (* [Modes and Configuration Options](#modes-and-configuration-options)) * [Installation](#installation) (* [Prerequisites](#prerequisites)) * [Pre-built Images](#pre-built-images) * [Verify Plugin Registration](#verify-plugin-registration) * [Testing and Demos](#testing-and-demos) * ... Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2022-08-31 20:00:15 +03:00
Eero Tamminen	fb18923298	Log GPU device share count & type count changes separately And instead of accessing DeviceTree internals, add suitable method for it. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2022-08-31 17:23:57 +03:00
Mikko Ylinen	d826548d29	Merge pull request #1113 from eero-t/gpu-count-log More detailed log for number of found GPU devices / resource types	2022-08-29 09:58:53 +03:00
Eero Tamminen	ddf2c8bc8f	More detailed log for number of found GPU devices / resource types Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2022-08-26 17:51:27 +03:00
Eero Tamminen	5666b8fa30	Add "prefix" option to GPU plugin for scalability testing GPU plugin code assumes container paths to match host paths, and container runtime prevents creating fake files under real paths. When non-standard paths are used, devices can be faked for scalability testing. Note: If one wants to run both normal GPU plugin and faked one in same cluster, all nodes providing fake "i915" resources should be labeled differently from ones with real GPU plugin + devices, so that real GPU workloads can be limited to correct nodes with a suitable nodeSelector. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2022-08-24 14:32:53 +03:00

135 Commits