With the NFD recent versions (v0.13+), it's no longer necessary to
start NFD with custom nfd-master args/rbac settings to get numeric
labels registered as extended resources.
The same can be specified via NodeFeatureRules which also works for
"local" source with feature files.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
NFD v0.14+ doesn't support binary NFD hooks by default, so there is
a need to move the label creation away from the GPU nfdhook.
Move extended resource label creation to plugin, and drop labels that were
already marked deprecated (platform_gen, media_version etc.).
Drop init-container from deployment files and operator. It is still possible
to use an initcontainer, but the default deployments do not support it.
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
k8s 1.27.x triggers build errors on controller-runtime 0.14.x
so we will need to update to 0.15.x at the same time.
Changes include:
* k8s e2e framework moved to use Ginkgo context so we add
test context to all our test nodes.
* adapt Ginkgo parameter modifications.
* adapt SGX admissionwebhook to InjectDecoder removal.
* adapt deviceplugins and FPGA CRDs to controller-runtime
API changes.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
With shared-dev-num and multiple i915s in the resource request,
try to find as many individual GPUs to expose to the container.
Previously, with multiple i915 resources, it was typical to
get only one GPU device in the container.
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
In large clusters and with resource management, the load
from gpu-plugins can become heavy for the api-server.
This change will start fetching pod listings from kubelet
and use api-server as a backup. Any other error than timeout
will also move the logic back to using api-server.
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
- Add note about LTS kernel DKMS source repo
- Correct note about the demo (unlike FPGA demo,
GPU demo is not in docker hub)
Fixes: 89d3c5a4f3
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Currently, each individual plugin README documents roughly the same
daily development steps to git clone, build, and deploy. Re-purpose
the plugin READMEs more towards cluster admin type of documentation
and start moving all development related documentation to DEVEL.md.
The same is true for e2e testing documentation which is scattered
in places where they don't belong to. Having all day-to-day
development Howtos is good to have in a centralized place.
Finally, the cleanup includes some harmonization to plugins'
table of contents which now follows the pattern:
* [Introduction](#introduction)
(* [Modes and Configuration Options](#modes-and-configuration-options))
* [Installation](#installation)
(* [Prerequisites](#prerequisites))
* [Pre-built Images](#pre-built-images)
* [Verify Plugin Registration](#verify-plugin-registration)
* [Testing and Demos](#testing-and-demos)
* ...
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
GPU plugin code assumes container paths to match host paths, and
container runtime prevents creating fake files under real paths. When
non-standard paths are used, devices can be faked for scalability
testing.
Note: If one wants to run both normal GPU plugin and faked one in same
cluster, all nodes providing fake "i915" resources should be labeled
differently from ones with real GPU plugin + devices, so that real GPU
workloads can be limited to correct nodes with a suitable
nodeSelector.
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
grpc-go v1.43.0 deprecated grpc.WithInsecure() in favor of
insecure.NewCredentials(). Move to use the recommended approach
and drop the linter annotations.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Move reallocate logic to getpreferredallocation and simplify
allocate to use the kubelet's device ids.
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Adds functionality to convert container's tile annotation
in to corresponding L0 affinity mask. This helps to target
container's workload to specific L0 subdevices.
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
While the labeling limit is obvious after little thought, IMHO
limitations like this should either be stated out front, or be in
their own section in the README. Commit does former for the GPU
plugin fractional resources, and latter for the NFD hook / labeling.
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.42.0 to 1.43.0.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.42.0...v1.43.0)
---
updated-dependencies:
- dependency-name: google.golang.org/grpc
dependency-type: direct:production
update-type: version-update:semver-minor
...
---
In addition to changes made by dependabot, I add nolint comments to ignore staticcheck(SA1019) errors.
It is because insecure.NewCredentials() recommended as an alternative is still declared experimental.
So keep grpc.withInsecure() with nolint comment.
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
1. Implement PreferredAllocator interface.
2. Provide 3 preferred allocation policies: balancedPolicy, packedPolicy and nonePolicy.
3. Provide the cmdline interface: -allocation-policy balanced/packed/none, to select which preferred allocation policy to use.
4. Add operator support.
Co-authored-by: Mikko Ylinen <mikko.ylinen@intel.com>