Commit Graph

445 Commits

Author SHA1 Message Date
Ed Bartosh
b4c2bd3afe
Merge pull request #1116 from eero-t/gpu_fakedev
Add fake GPU device generator for scalability testing
2022-12-07 18:44:08 +02:00
Ukri Niemimuukko
59cd72a66f fix gpu nfdhook numa labeling
Numa labeling only worked when card numbering started from 0.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-11-22 18:19:33 +02:00
Mikko Ylinen
afce0ed79c
Merge pull request #1196 from ozhuraki/e2e-operator
operator: Add e2e tests for DSA, IAA
2022-11-17 21:30:33 +02:00
Oleg Zhurakivskyy
ef7954c8e1 operator: Add e2e tests for DSA, IAA
Closes #1230

Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
2022-11-17 17:47:21 +02:00
Hyeongju Johannes Lee
9b203ba6b8 iaa: fix the name of the demo image
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
2022-11-11 15:37:11 +02:00
Mikko Ylinen
5876882066 operator: add support for Liveness and Readiness probes
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-11-03 10:25:07 +02:00
Hyeongju Johannes Lee
372dd73bfd iaa: fix readme to have correct web links
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
2022-10-31 13:07:17 +02:00
chaitanya1731
084bf53efb Added ocp_quickstart_guide for OCP users
Added operator installation steps for RedHat OpenShift Container Platform and updated main README to add the link

Signed-off-by: chaitanya1731 <chaitanya.kulkarni@intel.com>
2022-10-13 01:10:31 -07:00
Ukri Niemimuukko
41b7b55727 gpu: log errors from pod listing
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-10-11 14:31:56 +03:00
Mikko Ylinen
75bff62ba1
Merge pull request #1183 from tkatila/gpu-demo-updates
gpu: improve demo run instructions
2022-10-07 13:08:54 +03:00
Eero Tamminen
0b47ebd3e7 Add information on new DKMS kernel GPU driver packages
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-10-06 18:08:53 +03:00
Tuomas Katila
56bc5ebeee Modifications based on Eero's comments
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-10-06 17:55:04 +03:00
Tuomas Katila
63cbe808a7 gpu: improve demo run instructions
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-10-05 16:10:03 +03:00
Mikko Ylinen
fd1b25b9d4 docs: move away from 01.org doc links
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-10-03 18:22:07 +03:00
Eero Tamminen
647b484e7a Improve GPU drivers installation instructions
- Add note about LTS kernel DKMS source repo
- Correct note about the demo (unlike FPGA demo,
  GPU demo is not in docker hub)

Fixes: 89d3c5a4f3

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-09-28 12:40:30 +03:00
Eero Tamminen
9b3ee06cb1 Add GPU plugin README prerequisites section
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-09-23 20:32:46 +03:00
Tuomas Katila
eac635e439 gpu: fix documentation links
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-09-23 20:32:46 +03:00
Tuomas Katila
e375186458
Update cmd/gpu_plugin/README.md
Co-authored-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-09-15 15:30:23 +03:00
Tuomas Katila
c562db9b28 gpu: Improve installation options and documentation
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-09-15 15:19:23 +03:00
Ed Bartosh
92cd51bec3
Merge pull request #1152 from mythi/PR-2022-063
Update SGX and FPGA webhook flags
2022-09-13 19:58:44 +03:00
Ed Bartosh
f2db3826d8
Merge pull request #1134 from mythi/PR-2022-058
qat: read device capabilities from sysfs
2022-09-13 19:56:45 +03:00
Mikko Ylinen
b81d2dcba8 Update SGX and FPGA webhook flags
SGX Admission webhook was quickly forked from FPGA's
implementation. After a bit of thinking, it turns out
leader election and metrics are not necessary for a
(idempotent) webhook-only functionality.

For FPGA Admission webhook, metrics are not correctly
set up so it's better to disable the functionality. Leader
election is kept but the flag is renamed to align with
"kubebuilder v3 functionality" similar to how we changed it
in the operator as well.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-09-13 13:18:28 +03:00
Mikko Ylinen
3abf10d7ff qat: read device capabilities from sysfs
Linux 6.0 adds sysfs-driver-qat entries to read device capabilities:
42e66b1cc3/Documentation/ABI/testing/sysfs-driver-qat

Implement the logic for reading from sysfs and prefer that over debugfs.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-09-09 14:16:03 +03:00
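
The "prefer sysfs over debugfs" logic described in commit 3abf10d7ff can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `readPreferred`, `readFunc`, `errNoSource`, and both file paths are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// errNoSource is returned when none of the candidate files could be read.
var errNoSource = errors.New("no capability source readable")

// readFunc abstracts file reading so the lookup order can be exercised
// without touching the real filesystem.
type readFunc func(path string) ([]byte, error)

// readPreferred tries the paths in order and returns the trimmed content of
// the first one that can be read, together with the path it came from.
func readPreferred(read readFunc, paths ...string) (value, source string, err error) {
	for _, p := range paths {
		data, rerr := read(p)
		if rerr != nil {
			continue // fall through to the next, less preferred source
		}
		return strings.TrimSpace(string(data)), p, nil
	}
	return "", "", errNoSource
}

func main() {
	// Both paths are hypothetical placeholders, not the plugin's real paths.
	sysfs := "/sys/bus/pci/devices/0000:00:00.0/qat/cfg_services"
	debugfs := "/sys/kernel/debug/qat_example/dev_cfg"

	value, source, err := readPreferred(os.ReadFile, sysfs, debugfs)
	if err != nil {
		fmt.Println("capabilities unknown:", err)
		return
	}
	fmt.Printf("capabilities %q read from %s\n", value, source)
}
```

Listing the sysfs path first means debugfs is consulted only on kernels that lack the newer sysfs entries.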
Tuomas Katila
230570f12e gpu: add mentions about data center gpu support
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-09-09 13:07:50 +03:00
Mikko Ylinen
307e960871 docs: fix remaining review comments
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-09-06 14:28:25 +03:00
Mikko Ylinen
8ac321f5e3 sgx: send nil TopologyInfo
/dev/sgx_* cannot be mapped to any topology. SGX itself is topology
aware but we cannot control it with TopologyInfo.

Currently, pkg/topology returns empty TopologyInfo{Nodes:[]*NUMANode{}}
for /dev/sgx_* but kubelet TopologyManager (when enabled and with the
policy other than 'none') interprets that as "Hint Provider has no
possible NUMA affinities for resource" and rejects the SGX resources.

What we want is "Hint Provider has no preference for NUMA affinity with
resource". This is communicated using nil TopologyInfo.

See: https://github.com/kubernetes/kubernetes/issues/112234

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-09-06 08:43:04 +03:00
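
The distinction that commit 8ac321f5e3 relies on, between an empty and a nil TopologyInfo, can be illustrated with a small sketch. The `NUMANode` and `TopologyInfo` types below are local stand-ins that only mirror the shape quoted in the commit message, and `hintMeaning` is a hypothetical helper that paraphrases the two kubelet interpretations quoted there. It is not kubelet code.

```go
package main

import "fmt"

// NUMANode and TopologyInfo are local stand-ins with the same shape as the
// types named in the commit message; they keep the sketch self-contained.
type NUMANode struct{ ID int64 }

type TopologyInfo struct{ Nodes []*NUMANode }

// hintMeaning paraphrases how the commit message says the TopologyManager
// reads a device's topology (hypothetical helper, for illustration only).
func hintMeaning(t *TopologyInfo) string {
	switch {
	case t == nil:
		return "no preference"
	case len(t.Nodes) == 0:
		return "no possible NUMA affinities"
	default:
		return "affinity to listed NUMA nodes"
	}
}

func main() {
	// What pkg/topology returned for /dev/sgx_* before the change.
	fmt.Println(hintMeaning(&TopologyInfo{Nodes: []*NUMANode{}}))
	// What the SGX plugin sends after the change.
	fmt.Println(hintMeaning(nil))
}
```

In Go, a non-nil pointer to a struct holding an empty slice and a nil pointer are different values, which is why the two cases can carry different meanings.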
Ed Bartosh
f0dd95274e
Merge pull request #1126 from mythi/PR-2022-054
docs: rework development guide
2022-09-02 17:59:15 +03:00
Ed Bartosh
5756725b09 fix lint failure
Removed unused import. This should fix this golangci-lint failure:
  can't run linter goanalysis_metalinter:
  buildir: failed to load package :
  could not load export data:
  no export data for "cloud.google.com/go/compute/metadata"

Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
2022-09-02 12:02:06 +03:00
Mikko Ylinen
1b3accacc2 docs: rework development guide
Currently, each individual plugin README documents roughly the same
daily development steps to git clone, build, and deploy. Re-purpose
the plugin READMEs more towards cluster admin type of documentation
and start moving all development related documentation to DEVEL.md.

The same is true for e2e testing documentation, which is scattered
across places where it does not belong. It is good to have all
day-to-day development how-tos in one centralized place.

Finally, the cleanup includes some harmonization to plugins'
table of contents which now follows the pattern:

* [Introduction](#introduction)
(* [Modes and Configuration Options](#modes-and-configuration-options))
* [Installation](#installation)
    (* [Prerequisites](#prerequisites))
    * [Pre-built Images](#pre-built-images)
    * [Verify Plugin Registration](#verify-plugin-registration)
* [Testing and Demos](#testing-and-demos)
    * ...

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-08-31 20:00:15 +03:00
Eero Tamminen
fb18923298 Log GPU device share count & type count changes separately
And instead of accessing DeviceTree internals, add suitable method for it.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-31 17:23:57 +03:00
Mikko Ylinen
d826548d29
Merge pull request #1113 from eero-t/gpu-count-log
More detailed log for number of found GPU devices / resource types
2022-08-29 09:58:53 +03:00
Ed Bartosh
02446fca1d
Merge pull request #1114 from eero-t/prefix-option
Add "prefix" option to GPU plugin for scalability testing
2022-08-26 22:54:28 +03:00
Eero Tamminen
9d4b52188e Add "gpu_fakedev" documentation
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-26 19:05:10 +03:00
Eero Tamminen
cc3aebbefc Add minimal example JSON to test "gpu_fakedev" generator
Config file is suitably indented so that it can be directly
appended to a suitable configMap header.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-26 19:05:10 +03:00
Eero Tamminen
c15feea1f8 Add code for generating fake GPU sysfs + devfs files
To facilitate GPU plugin scalability testing on a real cluster.

Pre-existing (fake) sysfs & devfs content needs to be removed first:

* Fake devfs directory is mounted from host so OCI runtime can "mount"
  device files also to workloads requesting fake devices. This means
  that those files can persist over fake GPU plugin life-time, and
  earlier files need to be removed, as they may not match

* DaemonSet restarts failing init containers, so errors about content
  created on previous generator run would prevent getting logs of the
  real error on first generator run

* Before removal, check that removed directory content is as expected,
  to avoid accidentally removing host sysfs/devfs content (in case
  container was erroneously granted access to the real thing)

Container runtime requires fake device files to be real devices:

* Use NULL devices to represent fake GPU devices:
  https://www.kernel.org/doc/Documentation/admin-guide/devices.txt

* Give more detailed logging for MkNod() failures as device
  node creation is the most likely operation to fail when the
  container does not have the necessary access rights

Created content is based on JSON config file (instead of e.g.
commandline options) so that (configMap providing) it can be updated
independently of the pod where generator is run.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-26 19:04:43 +03:00
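
The "check before removal" idea from commit c15feea1f8 can be sketched as a pure function over directory entry names. This is only an illustration under stated assumptions: `looksGenerated` and the glob patterns are hypothetical, and the real generator's checks may well differ.

```go
package main

import (
	"fmt"
	"path"
)

// looksGenerated reports whether every entry name matches at least one of
// the glob patterns that an earlier generator run could have produced.
// An empty listing is trivially considered safe.
func looksGenerated(names, patterns []string) bool {
	for _, name := range names {
		matched := false
		for _, pat := range patterns {
			if ok, err := path.Match(pat, name); err == nil && ok {
				matched = true
				break
			}
		}
		if !matched {
			return false // unexpected entry: refuse to remove anything
		}
	}
	return true
}

func main() {
	// Hypothetical patterns for entries in a fake devfs directory.
	patterns := []string{"card[0-9]*", "renderD[0-9]*"}

	fmt.Println(looksGenerated([]string{"card0", "renderD128"}, patterns))
	fmt.Println(looksGenerated([]string{"card0", "by-path"}, patterns))
}
```

A caller would list the directory first and remove it only when the function returns true, so that a single unexpected entry aborts the cleanup.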
Eero Tamminen
ddf2c8bc8f More detailed log for number of found GPU devices / resource types
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-26 17:51:27 +03:00
Eero Tamminen
0b519ecf1e Deprecate debugfs GPU IP block version labels in NFD hook doc
There's no mapping available from IP block versions to actual product
features, which makes these version numbers fairly useless for end
users.

In mixed GPU clusters, running a job that adds/updates node labels for
the relevant GPU features to each relevant node would be much more
user-friendly.  This could be done easily by converting given GPU API
capability tool (e.g. "vainfo" for VA-API, "clinfo" for OpenCL) output
to a NFD feature file.

(Such a thing would be outside this project's scope though, except
maybe as an example / test-case.)

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-24 16:55:01 +03:00
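
The conversion suggested in commit 0b519ecf1e, from capability tool output to an NFD feature file, might look roughly like the sketch below. It assumes the capability names have already been extracted from the tool output, and that the feature file consists of `name=value` lines. `featureFile`, the `gpu-cap-` prefix, and the capability names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// featureFile renders capability names as "name=true" lines, sorted so that
// repeated runs over the same capabilities produce identical output.
func featureFile(prefix string, caps []string) string {
	sorted := append([]string(nil), caps...) // copy: do not reorder the caller's slice
	sort.Strings(sorted)

	var b strings.Builder
	for _, c := range sorted {
		fmt.Fprintf(&b, "%s%s=true\n", prefix, c)
	}
	return b.String()
}

func main() {
	// Hypothetical capability names, e.g. derived from "vainfo" or "clinfo".
	caps := []string{"opencl", "h264-decode"}
	fmt.Print(featureFile("gpu-cap-", caps))
}
```

Parsing the actual tool output is left out on purpose, since its format depends on the tool and its version.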
Eero Tamminen
0b7cbc862d Improve GPU NFD hook documentation
Add table of contents, simplify introduction text.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-24 16:55:01 +03:00
Eero Tamminen
5666b8fa30 Add "prefix" option to GPU plugin for scalability testing
GPU plugin code assumes container paths to match host paths, and
container runtime prevents creating fake files under real paths. When
non-standard paths are used, devices can be faked for scalability
testing.

Note: If one wants to run both normal GPU plugin and faked one in same
cluster, all nodes providing fake "i915" resources should be labeled
differently from ones with real GPU plugin + devices, so that real GPU
workloads can be limited to correct nodes with a suitable
nodeSelector.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-24 14:32:53 +03:00
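
The effect of the "prefix" option from commit 5666b8fa30 can be sketched as simple path composition. `withPrefix` and the example prefix are hypothetical, and the plugin's real option handling may differ.

```go
package main

import (
	"fmt"
	"path"
)

// withPrefix places an absolute host-style path under prefix.
// An empty prefix leaves the path unchanged, which is the normal
// (non-fake) case.
func withPrefix(prefix, p string) string {
	if prefix == "" {
		return p
	}
	return path.Join(prefix, p)
}

func main() {
	// Hypothetical prefix under which fake sysfs/devfs content was generated.
	prefix := "/tmp/fake-root"

	fmt.Println(withPrefix(prefix, "/sys/class/drm"))
	fmt.Println(withPrefix(prefix, "/dev/dri"))
	fmt.Println(withPrefix("", "/dev/dri"))
}
```

Because the fake files live under a non-standard root, the container runtime's restriction on creating files under the real paths does not apply to them.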
Ed Bartosh
6177dd0dfe
Merge pull request #1093 from mythi/PR-2022-050
build: move to Go 1.19
2022-08-16 00:25:57 +03:00
astronaut0131
2d155edac7 sgx: add kind deployment notes for aesmd
2022-08-15 15:26:01 +08:00
Mikko Ylinen
642c4f7b59 build: move to Go 1.19 and golangci-lint 1.48 because of that
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-08-15 10:13:37 +03:00
Chelsea Mafrica
24eb52a912 docs: Fix missing code block in operator doc
Add missing code block to a section of the operator README.

Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
2022-08-05 11:32:48 -07:00
Mikko Ylinen
3c948cc106
Merge pull request #1063 from bart0sh/PR144-upgrade-libDLB
dlb: update DLB to v7.7.0
2022-07-18 09:29:55 +03:00
Ed Bartosh
9f2db89da6 dlb: update DLB to v7.7.0
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
2022-07-03 15:08:14 +03:00
Huang Xin
89caad1cd4 doc: modify SGX device plugin deployments url from 'main' to '<RELEASE_VERSION>'
Signed-off-by: Huang Xin <xin1.huang@intel.com>
2022-06-25 17:33:46 +08:00
Ed Bartosh
c82b907472
Merge pull request #1055 from mythi/PR-2022-045
operator: align with kubebuilder v3 functionality
2022-06-20 23:12:21 +03:00
Mikko Ylinen
f9ca36cc26 set TLSMinVersion for webhook servers
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-06-20 19:04:50 +03:00
Mikko Ylinen
b48568c43a operator: align with kubebuilder v3 functionality
kubebuilder v3 based scaffolding has updated many things
and they are documented in [1].

Update operator's functionality to v3 level. We've done
most/some of the changes earlier (e.g., by not using
deprecated k8s APIs anymore) so the changes are minimal.

[1] https://book.kubebuilder.io/migration/v2vsv3.html

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2022-06-20 16:35:40 +03:00
Oleg Zhurakivskyy
f1ec14d106 iaa: Add e2e tests
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
2022-06-09 15:00:25 +03:00