Commit Graph

48 Commits

Author SHA1 Message Date
Mikko Ylinen
33e0e21a8b gpu: fix klog formatting typo
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2023-11-03 09:29:21 +02:00
Tuomas Katila
691dfc3483 gpu: refactor nfdhook functionality to plugin
NFD v0.14+ doesn't support binary NFD hooks by default, so there is
a need to move the label creation away from the GPU nfdhook.

Move extended resource label creation to plugin, and drop labels that were
already marked deprecated (platform_gen, media_version etc.).

Drop init-container from deployment files and operator. It is still possible
to use an initcontainer, but the default deployments do not support it.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
4e645d823c gpu: change 'none' allocation policy
With shared-dev-num and multiple i915s in the resource request,
try to find as many individual GPUs to expose to the container.

Previously, with multiple i915 resources, it was typical to
get only one GPU device in the container.

Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-05-04 13:39:10 +03:00
Tuomas Katila
342554c666 lint fixes found from 0.26.1 release preparation
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-05-02 13:52:36 +03:00
Tuomas Katila
943e34f3af gpu: mount by-path directory
oneCCL requires the /dev/dri/by-path folder to be available
to create a mapping between GPUs.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-04-20 14:56:59 +03:00
Eero Tamminen
fb18923298 Log GPU device share count & type count changes separately
And instead of accessing DeviceTree internals, add suitable method for it.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-31 17:23:57 +03:00
Mikko Ylinen
d826548d29
Merge pull request #1113 from eero-t/gpu-count-log
More detailed log for number of found GPU devices / resource types
2022-08-29 09:58:53 +03:00
Eero Tamminen
ddf2c8bc8f More detailed log for number of found GPU devices / resource types
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-26 17:51:27 +03:00
Eero Tamminen
5666b8fa30 Add "prefix" option to GPU plugin for scalability testing
GPU plugin code assumes container paths to match host paths, and
container runtime prevents creating fake files under real paths. When
non-standard paths are used, devices can be faked for scalability
testing.

Note: If one wants to run both normal GPU plugin and faked one in same
cluster, all nodes providing fake "i915" resources should be labeled
differently from ones with real GPU plugin + devices, so that real GPU
workloads can be limited to correct nodes with a suitable
nodeSelector.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-24 14:32:53 +03:00
Mikko Ylinen
0f36cde605
Merge pull request #935 from tkatila/gpu/tiles-support-and-numa-mapping
gpu: add tiles annotation support
2022-03-30 19:33:09 +03:00
Tuomas Katila
8f6a235b5d gpu: Start using GetPreferredAllocation with fractional resources
Move reallocate logic to getpreferredallocation and simplify
allocate to use the kubelet's device ids.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-03-30 11:32:49 +03:00
Hyeongju Johannes Lee
7eeaddc563 gpu: fix typo in implmentation of preferredAllocator interface
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
2022-03-28 05:04:32 -07:00
Ed Bartosh
55f3e17dd0 add 'annotations' parameter to the NewDeviceInfo API
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
2022-02-07 15:15:30 +02:00
Ed Bartosh
cec004c398 lint: enable wsl check
Fixes: #392

Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
2021-12-17 11:48:48 +02:00
Xu, Guoshu
e4c4a8f7ac GPU devices resource preferred allocation methods.
1. Implement PreferredAllocator interface.
2. Provide 3 preferred allocation policies: balancedPolicy, packedPolicy and nonePolicy.
3. Provide the cmdline interface: -allocation-policy balanced/packed/none, to select which preferred allocation policy to use.
4. Add operator support.

Co-authored-by: Mikko Ylinen <mikko.ylinen@intel.com>
2021-11-17 22:55:10 +08:00
Hyeongju Johannes Lee
8fc5df7e37 Add govet-fieldalignment
Add govet-fieldalignment to .golangci.yml
Fix errors that come from adding govet-fieldalignment
- by reordering the fields of structs
- by putting nolint:govet annotations

Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
2021-09-20 20:59:04 +03:00
Hyeongju Johannes Lee
09ba9fde00 Update tool versions and fix errors and warnings that originated from the update
Update tool versions
Fix the errors and warnings originated from the update:
-Correct type deviceInfo (->DeviceInfo) to make it public
-Fix gpu_plugin.go and vpu_plugin_test.go where stylecheck errors occur
-Fix deprecation warnings
-Rename type 'PatcherManager' to 'Manager' to solve exported errors
-Rename type 'SgxMutator' to 'Mutator' to solve exported errors

Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
2021-08-25 07:09:34 +00:00
Ukri Niemimuukko
7ca5cfcfd6 add pf skip to gpu nfdhook
This corresponds to the previous gpu-plugin skip code.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2021-06-10 18:44:57 +03:00
Dmitry Rozhkov
6aa1a47c9a
Merge pull request #638 from uniemimu/fractional
gpu_plugin: fractional resource management
2021-06-09 10:58:10 +03:00
Ukri Niemimuukko
2c4d529d66 gpu_plugin: fractional resource management
Fractional resource management feature

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
Signed-off-by: Dmitry Rozhkov <dmitry.rozhkov@intel.com>
2021-06-04 13:06:50 +03:00
Mikko Ylinen
facb4214a2 tree-wide: drop deprecated io/ioutil
Go 1.16 release notes announced the deprecation of io/ioutil [1]. It's easy
for us to move to use what is was recommended so just do it.

[1] https://golang.org/doc/go1.16#ioutil

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2021-06-02 13:41:15 +03:00
Eero Tamminen
ca9aa32556 Add "-enable-monitoring" option to GPU plugin
Make "i915_monitoring" resource (granting access to all GPUs) optional
so that it can be enabled only when it's needed.
2021-05-05 17:09:09 +03:00
Eero Tamminen
713c1ab170 Move GPU plugin CLI options to a struct
To help in:
* adding more CLI options in next and later commits, and
* to replace magic newDevicePlugin() input parameters with
  explicitly named one(s)
2021-05-05 17:09:09 +03:00
Eero Tamminen
06fac8128f Move GPU plugin sysfs device compatibility checks to own function
To reduce scan() function complexity before adding more functionality
to it.
2021-05-05 17:08:49 +03:00
Eero Tamminen
79b86fea2d Skip PF for "i915" resource when it has VFs
NOTE: this has impact only for GPUs which are virtualized with SR-IOV.

Access to physical devices (PFs) is disabled for "i915" resource when
they have configured virtual devices (VFs).

This is because:

* GPU resources are expected to be evenly split between VFs in such
  configurations

* But PF resource amount is expected to differ from VFs and typically
  retain only enough resources (just few MB of RAM), to be able to
  provide GPU metrics that are not available from VFs

* Neither the current GPU plugin, nor Kubernetes scheduling in
  general, has proper support for heterogeneous GPUs (= capability
  based scheduling)

Therefore "i915" resource needs to be limited to GPU devices with
homogeneous amount of resources, which in SR-IOV configurations is
expected to be the case only with VFs (when such are present).
2021-05-05 14:13:48 +03:00
Eero Tamminen
e418c00fca Add "i915_monitoring" resource to GPU plugin
Which mounts all (Intel) GPU devices to requesting container.

This is needed e.g. to get GPU metrics from the node.  Requesting pod
does not know how many GPUs are on the node it gets assigned to, so
there needs to means to request them all.

(Only alternative for the new resource would be requesting Privileged
mode, which is clearly worse as that would grant pod access also to
all other devices and capabilities.)

This commit also:

* Adds "i915_monitoring" resource testing to: go test -v -run Scan

* Splits GPU plugin tests mock file system setup to a separate
  createTestFiles() function because otherwise TestScan() does not
  pass project's golangci-lint complexity limits
2021-04-27 14:21:05 +03:00
Eero Tamminen
f9158c1c3b Update GPU plugin copyrights 2021-04-01 15:20:35 +03:00
Eero Tamminen
8ca19d408f Fix GPU plugin error messages 2021-04-01 15:20:35 +03:00
Mikko Ylinen
0892a34705 move to k8s.io v1.20.x and klog/v2 v2.4.0
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2021-01-21 15:34:39 +02:00
Ukri Niemimuukko
b2991b94e1 gpu_plugin: reduce topology scanning for high shared dev count
For every created device info, a new topology scan is performed in
the filesystem. The shared dev count was implemented so that for each
shared device, a new device info was created, which resulted in the
topology scan happening as many times per Scan-round, as there were
shared devs.

This fixes the issue by making the device info to be shared among the
shared devices.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2020-09-08 18:57:29 +03:00
Mikko Ylinen
cd068c797a ci: update tool versions
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2020-08-21 17:04:04 +03:00
Dmitry Rozhkov
aabc45cbb5 gpu: increase code coverage for unit tests 2020-05-19 16:14:40 +03:00
Graham Whaley
626bbb6ee2 gpu: move to using klog
Move from fmt to klog for logging and debug.
Also add an extra info level message noting when we find
new devices.

Signed-off-by: Graham Whaley <graham.whaley@intel.com>
2020-03-20 11:54:38 +00:00
Ed Bartosh
1f4928790f Implement function for DeviceInfo creation
- Made DeviceInfo fields private
- Implement NewDeviceInfo constructor
2020-02-07 15:26:37 +02:00
Graham Whaley
6537e38499 gpu: do not fail if device scanning fails
If we fail to scan for GPU devices (note, that is potentially
different from not finding any devices during a scan), then
warn on it, and go around the poll loop again. Do not treat
it as a fatal error or we might end up in a re-launch death
deploy loop...

Of course, getting a warning in your logs every 5s could also
be annoying, but is somewhat 'less fatal'.

Fixes: #260
Fixes: #230

Signed-off-by: Graham Whaley <graham.whaley@intel.com>
2020-01-29 09:24:50 +00:00
Dmitry Rozhkov
814e2e1a50 bump k8s dependencies up to v1.17.0 2020-01-09 11:19:58 +02:00
Dmitry Rozhkov
44ff734be6 gpu: add log messages for not found cards
Let a user know the plugin can't find any Intel GPU if that's
the case. It might be cumbersome to realize that the plugin runs
on a host which doesn't have any Intel GPUs.

Also make the code less nested for better readability.
2019-05-24 16:19:06 +03:00
Dmitry Rozhkov
54332c5eea announce deviceplugin API public 2019-01-21 17:20:01 +02:00
Dmitry Rozhkov
7662cb9154 extend API to receive full specs instead of strings 2019-01-21 17:15:27 +02:00
Dmitry Rozhkov
eccd70c600 replace glog with simpler home-grown debug logging 2018-08-16 17:40:16 +03:00
Dmitry Rozhkov
2ff6c5929a Use annotated errors for tracing 2018-08-16 17:31:19 +03:00
Dmitry Rozhkov
40246f64ad gpu_plugin: add -shared-dev-num option
The DRM driver of Intel i915 GPUs allows sharing one device
between many containers.

Make it possible to use the same device from different containers.
The exact number of containers sharing the same device can be limited
with the new option -shared-dev-num set to 1 by default.

closes #53
2018-08-14 14:54:49 +03:00
Dmitry Rozhkov
bbee3fde77 refactor device plugins to increase code reuse
Every device plugin is supposed to implement PluginInterfaceServer
interface to be exposed as a gRPC service. But this functionality is
common for all our device plugins and can be hidden in a Manager
which manages all gRPC servers dynamically.

The only mandatory functionality that needs to be provided by a device
plugin and which differentiate one plugin from another is the code
scanning the host for devices present on it.

Refactor the internal deviceplugin package to accept only
one mandatory method implementation from device plugins - Scan().

In addition to that  a device plugin can optionally implement a
PostAllocate() method which mutates responses returned by
PluginInterfaceServer.Allocate() method.

Also to narrow the gap between these device plugins and the
kubevirt's collection the naming scheme for resources has been changed.
Now device plugins provide a namespace for the device types they
operate with. E.g. for resources in format "color.example.com/<color>"
the namespace would be "color.example.com". So, the resource name
"intel.com/fpga-region-fffffff" becomes "fpga.intel.com/region-fffffff".
2018-07-30 15:29:33 +03:00
Alexander D. Kanevskiy
6c08dbdb64
Merge pull request #54 from zhenyw/gpu
gpu_plugin: skip drm control node
2018-07-26 15:03:04 +03:00
Zhenyu Wang
ec632e0b38 gpu_plugin: skip drm control node
DRM control node is deprecated and removed by latest kernel.
This will skip possible drm control node found on host.

v2: Fix lint error
v3: Fix regex string

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
2018-07-26 10:35:53 +08:00
Zhenyu Wang
6f3543884f gpu_plugin: Fix regex string for drm card node
As noted on pull request comment, fix regex for drm card node.

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
2018-07-26 10:33:12 +08:00
ssehgal
3eb2b10f75 Enabling support for QuickAssist Devices 2018-07-23 17:35:37 +01:00
Alexander Kanevskiy
d4d77a71e4 Initial public code release 2018-05-18 18:30:54 +03:00