grpc.NewClient(), added in grpc-go v1.63, is the preferred way to
create a new ClientConn. In most of our usages, moving away from
grpc.Dial*() to it is straightforward.
However, we have also relied on grpc.Dial*()'s behavior of automatically
initiating a connection, to "test" that connecting succeeds, and that
behavior is no longer available. Combined with the grpc.WithBlock dial
option, this usage is considered an "especially bad" way to handle a
client connection.
The recommended approach to test a server connection is to separately
make a connection and watch for its state to become Ready. This change
follows that recommendation.
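A minimal sketch of that pattern with grpc-go v1.63+; the target
address, credentials, and timeout here are placeholders, not the
plugin's actual values:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// NewClient() only creates an idle ClientConn; it does not connect.
	conn, err := grpc.NewClient("unix:///tmp/test.sock", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Trigger a connection attempt and watch its state until Ready.
	conn.Connect()
	for state := conn.GetState(); state != connectivity.Ready; state = conn.GetState() {
		if !conn.WaitForStateChange(ctx, state) {
			log.Fatalf("connection did not become Ready: %v", ctx.Err())
		}
	}
	log.Print("connection is Ready")
}
```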
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
The plugin can now support both the i915 and xe drivers dynamically.
However, having both drivers on the same node with resource management
(RM) is not possible.
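For illustration, per-card driver detection could resolve the sysfs
driver symlink; this is a hedged sketch (helper name and usage are
assumptions), not the plugin's actual code:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// detectDriver resolves which kernel driver a DRM card is bound to,
// e.g. "i915" or "xe", from its sysfs driver symlink.
func detectDriver(card string) (string, error) {
	link, err := os.Readlink(filepath.Join("/sys/class/drm", card, "device/driver"))
	if err != nil {
		return "", err
	}
	return filepath.Base(link), nil
}

func main() {
	drv, err := detectDriver("card0")
	if err != nil {
		fmt.Println("no driver found:", err)
		return
	}
	fmt.Println("card0 driver:", drv)
}
```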
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Split the README into smaller chunks, show only one "easy installation"
option and hide the rest. Add some notes about tile resources.
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
With tile requests, the Level Zero affinity mask now defaults to the
flat/combined mode. If ZE_FLAT_DEVICE_HIERARCHY is set to COMPOSITE in
the Pod's specification, the plugin will use the previous "x.y" format
in the affinity mask instead of the new "x".
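A hedged sketch of the two mask formats (function and parameter names
are illustrative only, not the plugin's actual code):

```go
package main

import "fmt"

// affinityMask formats one ZE_AFFINITY_MASK entry for a tile request.
// With the default FLAT/COMBINED hierarchy, each tile is addressed as
// a root device ("x", where the index is assumed to already point at
// the tile-as-device); with COMPOSITE the old "x.y" form is kept.
func affinityMask(deviceIdx, tileIdx int, hierarchy string) string {
	if hierarchy == "COMPOSITE" {
		return fmt.Sprintf("%d.%d", deviceIdx, tileIdx)
	}
	return fmt.Sprintf("%d", deviceIdx)
}

func main() {
	fmt.Println(affinityMask(0, 1, "COMPOSITE")) // "0.1"
	fmt.Println(affinityMask(1, 0, "FLAT"))      // "1"
}
```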
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
With recent NFD versions (v0.13+), it's no longer necessary to start
NFD with custom nfd-master args/RBAC settings to get numeric labels
registered as extended resources.
The same can be specified via NodeFeatureRules, which also work for the
"local" source with feature files.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
NFD v0.14+ doesn't support binary NFD hooks by default, so the label
creation needs to move away from the GPU nfdhook.
Move extended resource label creation to the plugin, and drop labels
that were already marked deprecated (platform_gen, media_version, etc.).
Drop the init-container from the deployment files and the operator. It
is still possible to use an init-container, but the default deployments
do not support it.
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
k8s 1.27.x triggers build errors with controller-runtime 0.14.x, so we
need to update to 0.15.x at the same time.
Changes include:
* the k8s e2e framework moved to Ginkgo contexts, so we add the test
context to all our test nodes (see the sketch after this list).
* adapt to the Ginkgo parameter changes.
* adapt the SGX admission webhook to the InjectDecoder removal.
* adapt the device plugins and FPGA CRDs to the controller-runtime API
changes.
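As referenced in the list above, a hedged sketch of a test node taking
the Ginkgo context (the spec names and body are illustrative):

```go
package e2e

import (
	"context"

	"github.com/onsi/ginkgo/v2"
)

// Ginkgo v2 can pass a per-spec context into the test body; the k8s
// e2e framework now expects that context to be propagated to the
// cluster calls made by the test.
var _ = ginkgo.Describe("Device plugin", func() {
	ginkgo.It("registers the plugin", func(ctx context.Context) {
		// pass ctx on to framework/client-go calls here, e.g.
		// f.ClientSet.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
		_ = ctx
	})
})
```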
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
With shared-dev-num and multiple i915 devices in the resource request,
try to find as many individual GPUs as possible to expose to the
container. Previously, with multiple i915 resources, it was typical to
get only one GPU device in the container.
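A hedged sketch of the spreading idea (the names and the simple
round-robin are illustrative; the real allocation also honors the
shared-dev-num counts and current usage):

```go
package main

import "fmt"

// pickDevices spreads count requested i915 devices across as many
// distinct GPUs as possible, reusing a GPU only after every available
// one has been handed out once. Illustrative sketch only.
func pickDevices(gpus []string, count int) []string {
	picked := make([]string, 0, count)
	for i := 0; i < count; i++ {
		picked = append(picked, gpus[i%len(gpus)])
	}
	return picked
}

func main() {
	// previously a container asking for 3 x i915 could end up with a
	// single GPU; spreading yields two distinct cards here
	fmt.Println(pickDevices([]string{"card0", "card1"}, 3)) // [card0 card1 card0]
}
```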
Co-authored-by: Eero Tamminen <eero.t.tamminen@intel.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
In large clusters and with resource management, the load from
gpu-plugins can become heavy for the api-server. This change starts
fetching pod listings from the kubelet and uses the api-server as a
backup. Any error other than a timeout also moves the logic back to
using the api-server.
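A hedged sketch of that fallback; the helpers below are hypothetical
stand-ins for a kubelet pod-listing request and an api-server list:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errTimeout = errors.New("kubelet request timed out")

type podLister struct {
	useKubelet bool // starts true; flips to false on a permanent error
}

// podList prefers the kubelet and falls back to the api-server. A
// timeout keeps the kubelet path enabled for the next call; any other
// kubelet error disables it. Illustrative sketch only.
func (p *podLister) podList(ctx context.Context) ([]string, error) {
	if p.useKubelet {
		pods, err := fetchFromKubelet(ctx)
		if err == nil {
			return pods, nil
		}
		if !errors.Is(err, errTimeout) {
			p.useKubelet = false // permanent failure: stop trying the kubelet
		}
		fmt.Println("kubelet fetch failed, falling back to api-server:", err)
	}
	return fetchFromAPIServer(ctx)
}

// fetchFromKubelet stands in for a request against the kubelet's pod
// listing; hypothetical helper.
func fetchFromKubelet(ctx context.Context) ([]string, error) {
	return nil, errTimeout
}

// fetchFromAPIServer stands in for a client-go pod list; hypothetical helper.
func fetchFromAPIServer(ctx context.Context) ([]string, error) {
	return []string{"pod-a"}, nil
}

func main() {
	p := &podLister{useKubelet: true}
	pods, _ := p.podList(context.Background())
	fmt.Println(pods)
}
```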
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
- Add a note about the LTS kernel DKMS source repo
- Correct the note about the demo (unlike the FPGA demo, the GPU demo
is not in Docker Hub)
Fixes: 89d3c5a4f3
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Currently, each individual plugin README documents roughly the same
daily development steps to git clone, build, and deploy. Re-purpose the
plugin READMEs more towards cluster-admin type documentation and start
moving all development-related documentation to DEVEL.md.
The same is true for the e2e testing documentation, which is scattered
in places where it doesn't belong. It is good to have all the
day-to-day development HOWTOs in one centralized place.
Finally, the cleanup includes some harmonization of the plugins'
tables of contents, which now follow the pattern:
* [Introduction](#introduction)
(* [Modes and Configuration Options](#modes-and-configuration-options))
* [Installation](#installation)
(* [Prerequisites](#prerequisites))
* [Pre-built Images](#pre-built-images)
* [Verify Plugin Registration](#verify-plugin-registration)
* [Testing and Demos](#testing-and-demos)
* ...
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
The GPU plugin code assumes container paths to match host paths, and
the container runtime prevents creating fake files under the real
paths. When non-standard paths are used, devices can be faked for
scalability testing.
Note: if one wants to run both a normal GPU plugin and a faked one in
the same cluster, all nodes providing fake "i915" resources should be
labeled differently from the ones with the real GPU plugin + devices,
so that real GPU workloads can be limited to the correct nodes with a
suitable nodeSelector.
Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>