intel-device-plugins-for-kubernetes

mirror of https://github.com/intel/intel-device-plugins-for-kubernetes.git synced 2025-06-03 03:59:37 +00:00

Author	SHA1	Message	Date
Mikko Ylinen	33e0e21a8b	gpu: fix klog formatting typo Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2023-11-03 09:29:21 +02:00
Tuomas Katila	691dfc3483	gpu: refactor nfdhook functionality to plugin NFD v0.14+ doesn't support binary NFD hooks by default, so there is a need to move the label creation away from the GPU nfdhook. Move extended resource label creation to plugin, and drop labels that were already marked deprecated (platform_gen, media_version etc.). Drop init-container from deployment files and operator. It is still possible to use an initcontainer, but the default deployments do not support it. Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-09-12 16:20:33 +03:00
Tuomas Katila	4e645d823c	gpu: change 'none' allocation policy With shared-dev-num and multiple i915s in the resource request, try to find as many individual GPUs to expose to the container. Previously, with multiple i915 resources, it was typical to get only one GPU device in the container. Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com> Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-05-04 13:39:10 +03:00
Tuomas Katila	342554c666	lint fixes found from 0.26.1 release preparation Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-05-02 13:52:36 +03:00
Tuomas Katila	943e34f3af	gpu: mount by-path directory oneCCL requires the /dev/dri/by-path folder to be available to create a mapping between GPUs. Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2023-04-20 14:56:59 +03:00
Eero Tamminen	fb18923298	Log GPU device share count & type count changes separately And instead of accessing DeviceTree internals, add suitable method for it. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2022-08-31 17:23:57 +03:00
Mikko Ylinen	d826548d29	Merge pull request #1113 from eero-t/gpu-count-log More detailed log for number of found GPU devices / resource types	2022-08-29 09:58:53 +03:00
Eero Tamminen	ddf2c8bc8f	More detailed log for number of found GPU devices / resource types Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2022-08-26 17:51:27 +03:00
Eero Tamminen	5666b8fa30	Add "prefix" option to GPU plugin for scalability testing GPU plugin code assumes container paths to match host paths, and container runtime prevents creating fake files under real paths. When non-standard paths are used, devices can be faked for scalability testing. Note: If one wants to run both normal GPU plugin and faked one in same cluster, all nodes providing fake "i915" resources should be labeled differently from ones with real GPU plugin + devices, so that real GPU workloads can be limited to correct nodes with a suitable nodeSelector. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>	2022-08-24 14:32:53 +03:00
Mikko Ylinen	0f36cde605	Merge pull request #935 from tkatila/gpu/tiles-support-and-numa-mapping gpu: add tiles annotation support	2022-03-30 19:33:09 +03:00
Tuomas Katila	8f6a235b5d	gpu: Start using GetPreferredAllocation with fractional resources Move reallocate logic to getpreferredallocation and simplify allocate to use the kubelet's device ids. Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>	2022-03-30 11:32:49 +03:00
Hyeongju Johannes Lee	7eeaddc563	gpu: fix typo in implementation of preferredAllocator interface Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>	2022-03-28 05:04:32 -07:00
Ed Bartosh	55f3e17dd0	add 'annotations' parameter to the NewDeviceInfo API Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>	2022-02-07 15:15:30 +02:00
Ed Bartosh	cec004c398	lint: enable wsl check Fixes: #392 Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>	2021-12-17 11:48:48 +02:00
Xu, Guoshu	e4c4a8f7ac	GPU devices resource preferred allocation methods. 1. Implement PreferredAllocator interface. 2. Provide 3 preferred allocation policies: balancedPolicy, packedPolicy and nonePolicy. 3. Provide the cmdline interface: -allocation-policy balanced/packed/none, to select which preferred allocation policy to use. 4. Add operator support. Co-authored-by: Mikko Ylinen <mikko.ylinen@intel.com>	2021-11-17 22:55:10 +08:00
Hyeongju Johannes Lee	8fc5df7e37	Add govet-fieldalignment Add govet-fieldalignment to .golangci.yml Fix errors that come from adding govet-fieldalignment - by reordering the fields of structs - by putting nolint:govet annotations Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>	2021-09-20 20:59:04 +03:00
Hyeongju Johannes Lee	09ba9fde00	Update tool versions and fix errors and warnings that originated from the update Update tool versions Fix the errors and warnings originated from the update: -Correct type deviceInfo (->DeviceInfo) to make it public -Fix gpu_plugin.go and vpu_plugin_test.go where stylecheck errors occur -Fix deprecation warnings -Rename type 'PatcherManager' to 'Manager' to solve exported errors -Rename type 'SgxMutator' to 'Mutator' to solve exported errors Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>	2021-08-25 07:09:34 +00:00
Ukri Niemimuukko	7ca5cfcfd6	add pf skip to gpu nfdhook This corresponds to the previous gpu-plugin skip code. Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>	2021-06-10 18:44:57 +03:00
Dmitry Rozhkov	6aa1a47c9a	Merge pull request #638 from uniemimu/fractional gpu_plugin: fractional resource management	2021-06-09 10:58:10 +03:00
Ukri Niemimuukko	2c4d529d66	gpu_plugin: fractional resource management Fractional resource management feature Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com> Signed-off-by: Dmitry Rozhkov <dmitry.rozhkov@intel.com>	2021-06-04 13:06:50 +03:00
Mikko Ylinen	facb4214a2	tree-wide: drop deprecated io/ioutil Go 1.16 release notes announced the deprecation of io/ioutil [1]. It's easy for us to move to what was recommended, so just do it. [1] https://golang.org/doc/go1.16#ioutil Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2021-06-02 13:41:15 +03:00
Eero Tamminen	ca9aa32556	Add "-enable-monitoring" option to GPU plugin Make "i915_monitoring" resource (granting access to all GPUs) optional so that it can be enabled only when it's needed.	2021-05-05 17:09:09 +03:00
Eero Tamminen	713c1ab170	Move GPU plugin CLI options to a struct To help in: * adding more CLI options in next and later commits, and * to replace magic newDevicePlugin() input parameters with explicitly named one(s)	2021-05-05 17:09:09 +03:00
Eero Tamminen	06fac8128f	Move GPU plugin sysfs device compatibility checks to own function To reduce scan() function complexity before adding more functionality to it.	2021-05-05 17:08:49 +03:00
Eero Tamminen	79b86fea2d	Skip PF for "i915" resource when it has VFs NOTE: this has impact only for GPUs which are virtualized with SR-IOV. Access to physical devices (PFs) is disabled for "i915" resource when they have configured virtual devices (VFs). This is because: * GPU resources are expected to be evenly split between VFs in such configurations * But PF resource amount is expected to differ from VFs and typically retain only enough resources (just few MB of RAM), to be able to provide GPU metrics that are not available from VFs * Neither the current GPU plugin, nor Kubernetes scheduling in general, has proper support for heterogeneous GPUs (= capability based scheduling) Therefore "i915" resource needs to be limited to GPU devices with homogeneous amount of resources, which in SR-IOV configurations is expected to be the case only with VFs (when such are present).	2021-05-05 14:13:48 +03:00
Eero Tamminen	e418c00fca	Add "i915_monitoring" resource to GPU plugin Which mounts all (Intel) GPU devices to requesting container. This is needed e.g. to get GPU metrics from the node. Requesting pod does not know how many GPUs are on the node it gets assigned to, so there needs to be a means to request them all. (The only alternative to the new resource would be requesting Privileged mode, which is clearly worse as that would also grant the pod access to all other devices and capabilities.) This commit also: * Adds "i915_monitoring" resource testing to: go test -v -run Scan * Splits GPU plugin tests mock file system setup to a separate createTestFiles() function because otherwise TestScan() does not pass project's golangci-lint complexity limits	2021-04-27 14:21:05 +03:00
Eero Tamminen	f9158c1c3b	Update GPU plugin copyrights	2021-04-01 15:20:35 +03:00
Eero Tamminen	8ca19d408f	Fix GPU plugin error messages	2021-04-01 15:20:35 +03:00
Mikko Ylinen	0892a34705	move to k8s.io v1.20.x and klog/v2 v2.4.0 Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2021-01-21 15:34:39 +02:00
Ukri Niemimuukko	b2991b94e1	gpu_plugin: reduce topology scanning for high shared dev count For every created device info, a new topology scan is performed in the filesystem. The shared dev count was implemented so that for each shared device, a new device info was created, which resulted in the topology scan happening as many times per Scan-round, as there were shared devs. This fixes the issue by making the device info to be shared among the shared devices. Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>	2020-09-08 18:57:29 +03:00
Mikko Ylinen	cd068c797a	ci: update tool versions Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2020-08-21 17:04:04 +03:00
Dmitry Rozhkov	aabc45cbb5	gpu: increase code coverage for unit tests	2020-05-19 16:14:40 +03:00
Graham Whaley	626bbb6ee2	gpu: move to using klog Move from fmt to klog for logging and debug. Also add an extra info level message noting when we find new devices. Signed-off-by: Graham Whaley <graham.whaley@intel.com>	2020-03-20 11:54:38 +00:00
Ed Bartosh	1f4928790f	Implement function for DeviceInfo creation - Made DeviceInfo fields private - Implement NewDeviceInfo constructor	2020-02-07 15:26:37 +02:00
Graham Whaley	6537e38499	gpu: do not fail if device scanning fails If we fail to scan for GPU devices (note, that is potentially different from not finding any devices during a scan), then warn on it, and go around the poll loop again. Do not treat it as a fatal error or we might end up in a re-launch death deploy loop... Of course, getting a warning in your logs every 5s could also be annoying, but is somewhat 'less fatal'. Fixes: #260 Fixes: #230 Signed-off-by: Graham Whaley <graham.whaley@intel.com>	2020-01-29 09:24:50 +00:00
Dmitry Rozhkov	814e2e1a50	bump k8s dependencies up to v1.17.0	2020-01-09 11:19:58 +02:00
Dmitry Rozhkov	44ff734be6	gpu: add log messages for not found cards Let a user know the plugin can't find any Intel GPU if that's the case. It might be cumbersome to realize that the plugin runs on a host which doesn't have any Intel GPUs. Also make the code less nested for better readability.	2019-05-24 16:19:06 +03:00
Dmitry Rozhkov	54332c5eea	announce deviceplugin API public	2019-01-21 17:20:01 +02:00
Dmitry Rozhkov	7662cb9154	extend API to receive full specs instead of strings	2019-01-21 17:15:27 +02:00
Dmitry Rozhkov	eccd70c600	replace glog with simpler home-grown debug logging	2018-08-16 17:40:16 +03:00
Dmitry Rozhkov	2ff6c5929a	Use annotated errors for tracing	2018-08-16 17:31:19 +03:00
Dmitry Rozhkov	40246f64ad	gpu_plugin: add -shared-dev-num option The DRM driver of Intel i915 GPUs allows sharing one device between many containers. Make it possible to use the same device from different containers. The exact number of containers sharing the same device can be limited with the new option -shared-dev-num set to 1 by default. closes #53	2018-08-14 14:54:49 +03:00
Dmitry Rozhkov	bbee3fde77	refactor device plugins to increase code reuse Every device plugin is supposed to implement PluginInterfaceServer interface to be exposed as a gRPC service. But this functionality is common for all our device plugins and can be hidden in a Manager which manages all gRPC servers dynamically. The only mandatory functionality that needs to be provided by a device plugin and which differentiates one plugin from another is the code scanning the host for devices present on it. Refactor the internal deviceplugin package to accept only one mandatory method implementation from device plugins - Scan(). In addition to that a device plugin can optionally implement a PostAllocate() method which mutates responses returned by PluginInterfaceServer.Allocate() method. Also, to narrow the gap between these device plugins and the kubevirt's collection, the naming scheme for resources has been changed. Now device plugins provide a namespace for the device types they operate with. E.g. for resources in format "color.example.com/<color>" the namespace would be "color.example.com". So, the resource name "intel.com/fpga-region-fffffff" becomes "fpga.intel.com/region-fffffff".	2018-07-30 15:29:33 +03:00
Alexander D. Kanevskiy	6c08dbdb64	Merge pull request #54 from zhenyw/gpu gpu_plugin: skip drm control node	2018-07-26 15:03:04 +03:00
Zhenyu Wang	ec632e0b38	gpu_plugin: skip drm control node DRM control node is deprecated and removed by latest kernel. This will skip possible drm control node found on host. v2: Fix lint error v3: Fix regex string Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>	2018-07-26 10:35:53 +08:00
Zhenyu Wang	6f3543884f	gpu_plugin: Fix regex string for drm card node As noted on pull request comment, fix regex for drm card node. Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>	2018-07-26 10:33:12 +08:00
ssehgal	3eb2b10f75	Enabling support for QuickAssist Devices	2018-07-23 17:35:37 +01:00
Alexander Kanevskiy	d4d77a71e4	Initial public code release	2018-05-18 18:30:54 +03:00

48 Commits