Commit Graph

56 Commits

Author SHA1 Message Date
Ukri Niemimuukko
2c4d529d66 gpu_plugin: fractional resource management
Fractional resource management feature

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
Signed-off-by: Dmitry Rozhkov <dmitry.rozhkov@intel.com>
2021-06-04 13:06:50 +03:00
Eero Tamminen
c575ce9099 Document GPU plugin test code test-case struct members 2021-05-06 11:02:57 +03:00
Eero Tamminen
57c8d76e1b Add minimal GPU plugin options testing
Tests plugin scan results in setups having none, one and multiple
eligible GPU devices, with and without SRIOV enabled, with two
different options values.

This does not cover verifying number of devices added under
"i915_monitoring" resource as that would be much larger change.
2021-05-05 17:09:09 +03:00
Eero Tamminen
ca9aa32556 Add "-enable-monitoring" option to GPU plugin
Make "i915_monitoring" resource (granting access to all GPUs) optional
so that it can be enabled only when it's needed.
2021-05-05 17:09:09 +03:00
Eero Tamminen
713c1ab170 Move GPU plugin CLI options to a struct
To help in:
* adding more CLI options in next and later commits, and
* to replace magic newDevicePlugin() input parameters with
  explicitly named one(s)
2021-05-05 17:09:09 +03:00
Eero Tamminen
06fac8128f Move GPU plugin sysfs device compatibility checks to own function
To reduce scan() function complexity before adding more functionality
to it.
2021-05-05 17:08:49 +03:00
Eero Tamminen
79b86fea2d Skip PF for "i915" resource when it has VFs
NOTE: this has impact only for GPUs which are virtualized with SR-IOV.

Access to physical devices (PFs) is disabled for "i915" resource when
they have configured virtual devices (VFs).

This is because:

* GPU resources are expected to be evenly split between VFs in such
  configurations

* But PF resource amount is expected to differ from VFs and typically
  retain only enough resources (just few MB of RAM), to be able to
  provide GPU metrics that are not available from VFs

* Neither the current GPU plugin, nor Kubernetes scheduling in
  general, has proper support for heterogeneous GPUs (= capability
  based scheduling)

Therefore "i915" resource needs to be limited to GPU devices with
homogeneous amount of resources, which in SR-IOV configurations is
expected to be the case only with VFs (when such are present).
2021-05-05 14:13:48 +03:00
Eero Tamminen
e418c00fca Add "i915_monitoring" resource to GPU plugin
Which mounts all (Intel) GPU devices to requesting container.

This is needed e.g. to get GPU metrics from the node.  Requesting pod
does not know how many GPUs are on the node it gets assigned to, so
there needs to means to request them all.

(Only alternative for the new resource would be requesting Privileged
mode, which is clearly worse as that would grant pod access also to
all other devices and capabilities.)

This commit also:

* Adds "i915_monitoring" resource testing to: go test -v -run Scan

* Splits GPU plugin tests mock file system setup to a separate
  createTestFiles() function because otherwise TestScan() does not
  pass project's golangci-lint complexity limits
2021-04-27 14:21:05 +03:00
Eero Tamminen
f9158c1c3b Update GPU plugin copyrights 2021-04-01 15:20:35 +03:00
Eero Tamminen
8ca19d408f Fix GPU plugin error messages 2021-04-01 15:20:35 +03:00
Eero Tamminen
384d37ead0 Add test for multiple GPU devices 2021-04-01 15:20:35 +03:00
Eero Tamminen
49354693fb Fix GPU plugin test setup + better error message
Tests fail depending in which order they are run, unless mocked files
are cleaned between test runs.

Without this, the next commit would fail.
2021-04-01 15:20:35 +03:00
DougTW
625b30fd1b Fixes 560. Edited gpu_plugin README. Restored master to line 157
Signed-off-by: DougTW <doug.martin@intel.com>
2021-02-09 16:49:30 -08:00
DougTW
28cbebc81b edited gpu_plugin README; changed 2 instances of master to main. Related to PR 499.
Signed-off-by: DougTW <doug.martin@intel.com>
2021-02-08 18:40:47 -08:00
Mikko Ylinen
0892a34705 move to k8s.io v1.20.x and klog/v2 v2.4.0
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2021-01-21 15:34:39 +02:00
Ed Bartosh
6b208f8acf documentation: remove kubelet configuration check
Removed device plugin socket check from the documentation as
device plugin support is enabled by default in Kubelet.

Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
2021-01-12 13:00:20 +02:00
Kevin Putnam
1d149ffee6 Documentation: Fixes broken links and standardizes headers.
Signed-off-by: Kevin Putnam <kevin.putnam@intel.com>
2020-09-22 08:32:21 -07:00
Dmitry Rozhkov
1b82ab9df6 sync README.md files with the current state of the code
Closes #356
2020-09-16 10:54:39 +03:00
Ukri Niemimuukko
b2991b94e1 gpu_plugin: reduce topology scanning for high shared dev count
For every created device info, a new topology scan is performed in
the filesystem. The shared dev count was implemented so that for each
shared device, a new device info was created, which resulted in the
topology scan happening as many times per Scan-round, as there were
shared devs.

This fixes the issue by making the device info to be shared among the
shared devices.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2020-09-08 18:57:29 +03:00
Ukri Niemimuukko
7244bd0f25 gpu_plugin: README.md update
Move remark about GVT-d to end of introduction. Remove remarks
about GVT-g for the time being.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2020-08-25 13:45:10 +03:00
Mikko Ylinen
cd068c797a ci: update tool versions
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2020-08-21 17:04:04 +03:00
Dmitry Rozhkov
73aea0aa1b linter: enable gosec check 2020-06-11 17:56:24 +03:00
Dmitry Rozhkov
70f862f2aa add golangci linter
In this initial commit the following checks are disabled due to
excessive amount of changes required:
- dupl (duplicate code)
- funlen (function length)
- goerr113 (errors handling expressions)
- gomnd (magic numbers)
- gosec (security)
- nakedret (naked returns)
- wsl (forces to use empty lines)
- errcheck (checking for unchecked errors)
- staticcheck (static analysis)
2020-06-08 14:01:13 +03:00
Dmitry Rozhkov
aabc45cbb5 gpu: increase code coverage for unit tests 2020-05-19 16:14:40 +03:00
Graham Whaley
626bbb6ee2 gpu: move to using klog
Move from fmt to klog for logging and debug.
Also add an extra info level message noting when we find
new devices.

Signed-off-by: Graham Whaley <graham.whaley@intel.com>
2020-03-20 11:54:38 +00:00
Mikko Ylinen
f145541caf READMEs: use git clone to get the code
go get'ing does not work due to our k8s.io/kubernetes dependency
so guide users to use git clone to get the code.

Fixes: #290

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2020-02-20 08:04:07 +02:00
Dmitry Rozhkov
3db440d2d4
Merge pull request #288 from askervin/kustomize-gpu
gpu_plugin: add kustomizations
2020-02-11 10:54:14 +02:00
Ed Bartosh
1f4928790f Implement function for DeviceInfo creation
- Made DeviceInfo fields private
- Implement NewDeviceInfo constructor
2020-02-07 15:26:37 +02:00
Antti Kervinen
d568f050c5 gpu_plugin: add kustomizations
- Default deployment: `kubectl apply -k deployments/gpu_plugin`
- Default deployment does not specify namespace anymore
  (was: `kube-system`).
- Variant: deploy only on nodes with Intel GPU label by NFD:
  `kubectl apply -k deployments/gpu_plugin/overlays/nfd_labeled_nodes`
- Variant: deploy to `kube-system` instead of user-defined namespace
  (or "default"):
  `kubectl apply -k deployments/gpu_plugin/overlays/namespace_kube-system`
- GPU plugin README updated.

Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
2020-02-07 14:56:52 +02:00
Graham Whaley
6537e38499 gpu: do not fail if device scanning fails
If we fail to scan for GPU devices (note, that is potentially
different from not finding any devices during a scan), then
warn on it, and go around the poll loop again. Do not treat
it as a fatal error or we might end up in a re-launch death
deploy loop...

Of course, getting a warning in your logs every 5s could also
be annoying, but is somewhat 'less fatal'.

Fixes: #260
Fixes: #230

Signed-off-by: Graham Whaley <graham.whaley@intel.com>
2020-01-29 09:24:50 +00:00
Graham Whaley
79a86c10e8 docs: gpu: Add more details, re-arrange section order
Re-arrange the section order a little (such as putting the use
of the DaemonSet before the sudo hand-deploy), and add a lot more
detail of what to expect, and how to check if the pod has launched
correctly.

Signed-off-by: Graham Whaley <graham.whaley@intel.com>
2020-01-17 13:34:13 +00:00
Graham Whaley
6705a8e461 docs: gpu: add high level details to README
Fill out the introduction to the GPU README to give some details around
what the plugin supports and how.

Signed-off-by: Graham Whaley <graham.whaley@intel.com>
2020-01-16 15:27:22 +00:00
Dmitry Rozhkov
814e2e1a50 bump k8s dependencies up to v1.17.0 2020-01-09 11:19:58 +02:00
Mikko Ylinen
fd631fc31c deployments/gpu_plugin: limit host mounts
The default deployment gives rather wide host mounts. We can limit
the mounts only to the subdirectories the plugin needs and mount
them read-only.

Also, add notes that both QAT and GPU plugins can be run as non-root
user.

Fixes: #228

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2019-12-11 12:54:36 +02:00
Dmitry Rozhkov
44ff734be6 gpu: add log messages for not found cards
Let a user know the plugin can't find any Intel GPU if that's
the case. It might be cumbersome to realize that the plugin runs
on a host which doesn't have any Intel GPUs.

Also make the code less nested for better readability.
2019-05-24 16:19:06 +03:00
Dmitry Rozhkov
54332c5eea announce deviceplugin API public 2019-01-21 17:20:01 +02:00
Dmitry Rozhkov
7662cb9154 extend API to receive full specs instead of strings 2019-01-21 17:15:27 +02:00
Frederik Carlier
d6016dedf9 Fix typos 2018-11-22 20:44:00 +00:00
Ed Bartosh
14b4168cbd add GPU plugin deployment
Added DaemonSet yaml
Added deployment instructions to plugin's README
2018-09-14 13:55:08 +03:00
Dmitry Rozhkov
47464fa034 Update gpu_plugin/README.md not to mention beignet 2018-09-04 11:09:23 +03:00
Dmitry Rozhkov
eccd70c600 replace glog with simpler home-grown debug logging 2018-08-16 17:40:16 +03:00
Dmitry Rozhkov
2ff6c5929a Use annotated errors for tracing 2018-08-16 17:31:19 +03:00
Dmitry Rozhkov
40246f64ad gpu_plugin: add -shared-dev-num option
The DRM driver of Intel i915 GPUs allows sharing one device
between many containers.

Make it possible to use the same device from different containers.
The exact number of containers sharing the same device can be limited
with the new option -shared-dev-num set to 1 by default.

closes #53
2018-08-14 14:54:49 +03:00
Ed Bartosh
21e0e5c518
Merge pull request #48 from rojkov/refactor-dev-plugins
Refactor dev plugins to increase code reuse
2018-07-31 13:37:16 +03:00
MCamp859
fb5e20c14a Edits for format and text flow.
Signed-off-by: MCamp859 <mary.camp@ptiglobal.net>
2018-07-30 13:59:26 -04:00
MCamp859
f3f749d4f5 Edits for format and text flow.
Signed-off-by: MCamp859 <mary.camp@ptiglobal.net>
2018-07-30 13:54:01 -04:00
Dmitry Rozhkov
1e7dbac162 Update README.md files to reflect changes caused by refactoring
update demo files
2018-07-30 15:29:33 +03:00
Dmitry Rozhkov
bbee3fde77 refactor device plugins to increase code reuse
Every device plugin is supposed to implement PluginInterfaceServer
interface to be exposed as a gRPC service. But this functionality is
common for all our device plugins and can be hidden in a Manager
which manages all gRPC servers dynamically.

The only mandatory functionality that needs to be provided by a device
plugin and which differentiate one plugin from another is the code
scanning the host for devices present on it.

Refactor the internal deviceplugin package to accept only
one mandatory method implementation from device plugins - Scan().

In addition to that  a device plugin can optionally implement a
PostAllocate() method which mutates responses returned by
PluginInterfaceServer.Allocate() method.

Also to narrow the gap between these device plugins and the
kubevirt's collection the naming scheme for resources has been changed.
Now device plugins provide a namespace for the device types they
operate with. E.g. for resources in format "color.example.com/<color>"
the namespace would be "color.example.com". So, the resource name
"intel.com/fpga-region-fffffff" becomes "fpga.intel.com/region-fffffff".
2018-07-30 15:29:33 +03:00
Alexander D. Kanevskiy
6c08dbdb64
Merge pull request #54 from zhenyw/gpu
gpu_plugin: skip drm control node
2018-07-26 15:03:04 +03:00
Zhenyu Wang
ec632e0b38 gpu_plugin: skip drm control node
DRM control node is deprecated and removed by latest kernel.
This will skip possible drm control node found on host.

v2: Fix lint error
v3: Fix regex string

Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
2018-07-26 10:35:53 +08:00