Added operator installation steps for Red Hat OpenShift Container Platform and updated the main README to add the link
Signed-off-by: chaitanya1731 <chaitanya.kulkarni@intel.com>
- Add note about LTS kernel DKMS source repo
- Correct note about the demo (unlike the FPGA demo,
the GPU demo is not on Docker Hub)
Fixes: 89d3c5a4f3
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
The SGX admission webhook was quickly forked from FPGA's
implementation. After a bit of thinking, it turns out
leader election and metrics are not necessary for an
(idempotent) webhook-only functionality.
For the FPGA admission webhook, metrics aren't correctly
set up, so it's better to disable that functionality. Leader
election is kept, but the flag name is renamed to align with
kubebuilder v3, similar to how we changed it for the
operator.
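As a rough sketch of what that means in the webhook's manager setup
(controller-runtime option names are real; the flag default and the
election ID below are illustrative, not the actual values):

    // Rough sketch only; the real main.go differs.
    package main

    import (
        "flag"

        ctrl "sigs.k8s.io/controller-runtime"
    )

    func main() {
        var enableLeaderElection bool
        // Flag renamed from the kubebuilder v2 style "enable-leader-election"
        // to the v3 style "leader-elect".
        flag.BoolVar(&enableLeaderElection, "leader-elect", false,
            "Enable leader election for the webhook manager.")
        flag.Parse()

        mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
            MetricsBindAddress: "0", // "0" disables the metrics endpoint
            LeaderElection:     enableLeaderElection,
            LeaderElectionID:   "fpga.intel.com", // illustrative ID only
        })
        if err != nil {
            panic(err)
        }
        _ = mgr // webhook registration and mgr.Start() omitted
    }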
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Linux 6.0 adds sysfs-driver-qat entries to read device capabilities:
42e66b1cc3/Documentation/ABI/testing/sysfs-driver-qat
Implement the logic for reading from sysfs and prefer that over debugfs.
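A minimal sketch of that preference, with the attribute path taken from
the ABI document above and servicesFromDebugfs standing in for the
pre-existing debugfs parsing:

    package qat

    import (
        "errors"
        "os"
        "path/filepath"
        "strings"
    )

    // deviceServices returns the enabled services (e.g. "sym;asym" or "dc")
    // for a QAT device, preferring the Linux 6.0+ sysfs attribute.
    func deviceServices(bdf string) (string, error) {
        p := filepath.Join("/sys/bus/pci/devices", bdf, "qat", "cfg_services")
        data, err := os.ReadFile(p)
        if err == nil {
            return strings.TrimSpace(string(data)), nil
        }
        if !os.IsNotExist(err) {
            return "", err
        }
        // Older kernel: no sysfs entry, fall back to the debugfs lookup.
        return servicesFromDebugfs(bdf)
    }

    // servicesFromDebugfs stands in for the existing debugfs parsing.
    func servicesFromDebugfs(bdf string) (string, error) {
        return "", errors.New("debugfs parsing not shown in this sketch")
    }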
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
/dev/sgx_* cannot be mapped to any topology. SGX itself is topology
aware but we cannot control it with TopologyInfo.
Currently, pkg/topology returns empty TopologyInfo{Nodes:[]*NUMANode{}}
for /dev/sgx_*, but kubelet's TopologyManager (when enabled and with
a policy other than 'none') interprets that as "Hint Provider has no
possible NUMA affinities for resource" and rejects the SGX resources.
What we want is "Hint Provider has no preference for NUMA affinity with
resource". This is communicated using nil TopologyInfo.
See: https://github.com/kubernetes/kubernetes/issues/112234
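In device plugin API terms this means advertising SGX devices with a
nil Topology field; a sketch using
k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1 (device ID illustrative):

    package sgxplugin

    import pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"

    func sgxDevice(id string) *pluginapi.Device {
        return &pluginapi.Device{
            ID:     id, // e.g. "sgx_enclave-0"
            Health: pluginapi.Healthy,
            // nil = "no NUMA preference for this resource".
            // &pluginapi.TopologyInfo{Nodes: []*pluginapi.NUMANode{}} would
            // instead mean "no possible NUMA affinity" and gets rejected.
            Topology: nil,
        }
    }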
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Removed an unused import. This should fix the following golangci-lint failure:
can't run linter goanalysis_metalinter:
buildir: failed to load package :
could not load export data:
no export data for "cloud.google.com/go/compute/metadata"
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
Currently, each individual plugin README documents roughly the same
daily development steps to git clone, build, and deploy. Re-purpose
the plugin READMEs more towards cluster admin type of documentation
and start moving all development related documentation to DEVEL.md.
The same is true for the e2e testing documentation, which is
scattered in places where it doesn't belong. It's good to have
all the day-to-day development HOWTOs in one centralized place.
Finally, the cleanup includes some harmonization to plugins'
table of contents which now follows the pattern:
* [Introduction](#introduction)
(* [Modes and Configuration Options](#modes-and-configuration-options))
* [Installation](#installation)
(* [Prerequisites](#prerequisites))
* [Pre-built Images](#pre-built-images)
* [Verify Plugin Registration](#verify-plugin-registration)
* [Testing and Demos](#testing-and-demos)
* ...
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
The config file is indented so that it can be appended
directly to a suitable configMap header.
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
To facilitate GPU plugin scalability testing on a real cluster.
Pre-existing (fake) sysfs & devfs content needs to be removed first:
* The fake devfs directory is mounted from the host, so the OCI runtime
can "mount" device files also to workloads requesting fake devices. This
means those files can persist over the fake GPU plugin's lifetime, and
files from earlier runs need to be removed, as they may not match the
current setup
* The DaemonSet restarts failing init containers, so errors about
content created on a previous generator run would prevent getting the
logs of the real error from the first generator run
* Before removal, check that the directory content to be removed is as
expected, to avoid accidentally removing host sysfs/devfs content in
case the container was erroneously granted access to the real thing
(see the sketch after this list)
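A rough sketch of that check; the flat directory layout and expected
entry names are assumptions, not the generator's actual code:

    package fakegen

    import (
        "fmt"
        "os"
        "path/filepath"
        "strings"
    )

    // clearFakeDevfs removes earlier fake device files, but only if the
    // directory content looks like something the generator created.
    func clearFakeDevfs(dir string) error {
        entries, err := os.ReadDir(dir)
        if err != nil {
            return err
        }
        for _, e := range entries {
            name := e.Name()
            if !strings.HasPrefix(name, "card") && !strings.HasPrefix(name, "renderD") {
                return fmt.Errorf("unexpected %q under %s, refusing to remove", name, dir)
            }
        }
        // Remove only the content; the directory itself is a host mount.
        for _, e := range entries {
            if err := os.RemoveAll(filepath.Join(dir, e.Name())); err != nil {
                return err
            }
        }
        return nil
    }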
The container runtime requires the fake device files to be real devices:
* Use NULL devices to represent fake GPU devices:
https://www.kernel.org/doc/Documentation/admin-guide/devices.txt
* Give more detailed logging for MkNod() failures, as device node
creation is the operation most likely to fail when the container does
not have the necessary access rights (sketch below)
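For the NULL-device part, a minimal sketch (paths and names
illustrative; major 1, minor 3 is the NULL device in the list linked
above):

    package fakegen

    import (
        "fmt"
        "log"
        "path/filepath"

        "golang.org/x/sys/unix"
    )

    // makeFakeCard creates one fake DRI device node as a NULL char device
    // (major 1, minor 3), which satisfies the container runtime's checks.
    func makeFakeCard(devDir string, idx int) error {
        path := filepath.Join(devDir, "dri", fmt.Sprintf("card%d", idx))
        err := unix.Mknod(path, unix.S_IFCHR|0o666, int(unix.Mkdev(1, 3)))
        if err != nil {
            // Most likely cause: missing CAP_MKNOD / device access rights.
            log.Printf("Mknod(%s) as NULL device failed: %v", path, err)
        }
        return err
    }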
The created content is based on a JSON config file (instead of e.g.
command line options) so that the configMap providing it can be updated
independently of the pod where the generator is run.
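Purely to illustrate the configMap-driven mechanism, a hypothetical
config shape and loader (field names are made up, not the generator's
actual format):

    package fakegen

    import (
        "encoding/json"
        "os"
    )

    // fakeSysfsConfig is a hypothetical shape for the JSON config file.
    type fakeSysfsConfig struct {
        SysfsRoot      string `json:"sysfsRoot"`
        DevfsRoot      string `json:"devfsRoot"`
        DevicesPerNode int    `json:"devicesPerNode"`
    }

    func loadConfig(path string) (*fakeSysfsConfig, error) {
        data, err := os.ReadFile(path)
        if err != nil {
            return nil, err
        }
        var cfg fakeSysfsConfig
        if err := json.Unmarshal(data, &cfg); err != nil {
            return nil, err
        }
        return &cfg, nil
    }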
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
There's no mapping available from IP block versions to actual product
features, which makes these version numbers fairly useless for end
users.
In mixed GPU clusters, running a job that adds/updates node labels for
the relevant GPU features to each relevant node would be much more
user-friendly. This could be done easily by converting the output of a
given GPU API capability tool (e.g. "vainfo" for VA-API, "clinfo" for
OpenCL) to an NFD feature file.
(Such a thing would be outside of this project's scope though, except
maybe as an example / test case.)
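Purely as an illustration of that idea (outside this project's scope;
the features.d path, label name, and clinfo output check are all
assumptions):

    package nfdhook

    import (
        "fmt"
        "os"
        "os/exec"
        "strings"
    )

    // writeOpenCLFeature probes OpenCL availability with clinfo and exposes
    // the result via an NFD local feature file (hypothetical label name).
    func writeOpenCLFeature() error {
        out, err := exec.Command("clinfo").Output()
        hasOpenCL := err == nil && strings.Contains(string(out), "Platform Name")
        line := fmt.Sprintf("intel.gpu.opencl=%t\n", hasOpenCL)
        return os.WriteFile(
            "/etc/kubernetes/node-feature-discovery/features.d/gpu-caps",
            []byte(line), 0o644)
    }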
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
The GPU plugin code assumes container paths match host paths, and the
container runtime prevents creating fake files under the real paths.
When non-standard paths are used, devices can be faked for scalability
testing.
Note: If one wants to run both the normal GPU plugin and the faked one
in the same cluster, all nodes providing fake "i915" resources should be
labeled differently from the ones with the real GPU plugin + devices, so
that real GPU workloads can be limited to the correct nodes with a
suitable nodeSelector.
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
kubebuilder v3 based scaffolding has updated many things, and they
are documented in [1].
Update the operator's functionality to the v3 level. We had already
done some of the changes earlier (e.g., by not using deprecated k8s
APIs anymore), so the remaining changes are minimal.
[1] https://book.kubebuilder.io/migration/v2vsv3.html
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>