Commit Graph

14 Commits

Author SHA1 Message Date
Tuomas Katila
ea659a5e4b nfd: add rules to label nodes with different GPUs
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
691dfc3483 gpu: refactor nfdhook functionality to plugin
NFD v0.14+ doesn't support binary NFD hooks by default, so there is
a need to move the label creation away from the GPU nfdhook.

Move extended resource label creation to plugin, and drop labels that were
already marked deprecated (platform_gen, media_version etc.).

Drop init-container from deployment files and operator. It is still possible
to use an initcontainer, but the default deployments do not support it.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2023-09-12 16:20:33 +03:00
Tuomas Katila
13e20c9abf gpu: fix documentation links
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-09-16 15:25:45 +03:00
Eero Tamminen
0b519ecf1e Deprecate debugfs GPU IP block version labels in NFD hook doc
There's no mapping available from IP block versions to actual product
features, which make these version numbers fairly useless for end
users.

In mixed GPU clusters, running a job that adds/updates node labels for
the relevant GPU features to each relevant node would be much more
user-friendly.  This could be done easily by converting given GPU API
capability tool (e.g. "vainfo" for VA-API, "clinfo" for OpenCL) output
to a NFD feature file.

(Such thing would be outside of this project scope though, except
maybe as an example / test-case.)

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-24 16:55:01 +03:00
Eero Tamminen
0b7cbc862d Improve GPU NFD hook documentation
Add table of contents, simplify introduction text.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
2022-08-24 16:55:01 +03:00
Tuomas Katila
bdd72c8cf7 gpu: Add numa node mapping label for GPUs
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-03-24 14:29:05 +02:00
Tuomas Katila
6f57c55ef8 Add a total tile count to node's labels
This label isn't dependent on the debugfs as the platform
specific tile count is.

Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
2022-01-26 09:57:33 +02:00
Ukri Niemimuukko
7520393041 gpu_nfdhook: gpu-numbers and pci-groups
This adds a new label "gpu-numbers" for short numbered lists of
gpus, omitting "card" from the names. Also adds splitting of long
label values.

Similarly this adds a new label "pci-groups" for PCI groups. Grouping
can be controlled by env var GPU_PCI_GROUPING_LEVEL. The env var
dictates, how many pci-folder names need to match, in order for GPUs
to be considered to belong in a group.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2022-01-25 09:17:56 +02:00
Eero Tamminen
36046d90a4 Make GPU plugin / resource label limitations more explicit
While the labeling limit is obvious after little thought, IMHO
limitations like this should either be stated out front, or be in
their own section in the README.  Commit does former for the GPU
plugin fractional resources, and latter for the NFD hook / labeling.
2022-01-04 11:43:08 +02:00
Ukri Niemimuukko
46dcffc33e README typofix
Label descriptions had extra underscores.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2021-12-28 12:01:40 +02:00
Eero Tamminen
bcc737bd2a Adapt GPU label support to debugfs DRM entry changes
GPU generation "gen" number is replaced in the capability files of
latest kernels with separate display, graphics, and media versions.

For compatibility with newer kernels, provide "gen" based on the new
labels (but without decimals), and for older kernel compatibility, new
labels based on the "gen".

Because different kernels match different items from the action map,
whole capability file will get parsed. Capability file parsing is
optimized by using prefix check instead of scanf.

"platform_gen" label is deprecated, and can be dropped whenever it
becomes inconvenient (lint complains about line count etc).
2021-12-16 21:22:31 +02:00
Ukri Niemimuukko
cbf7bab114 add link to gpu_nfdhook and update hook README
This adds a link from gpu-plugin README to the nfdhook README, and
updates the nfdhook README with label descriptions.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2021-06-11 18:54:44 +03:00
Ukri Niemimuukko
5b5180ae00 gpu_nfdhook memory amount reading from sysfs
This adds reading of the GPU memory amount from the sysfs. As a
fallback the environment variable GPU_MEMORY_OVERRIDE remains.

Another environment variable GPU_MEMORY_RESERVED can be used to
reserve a dedicated byte amount outside of kubernetes usage.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2020-10-26 09:45:43 +02:00
Ukri Niemimuukko
505eadaf94 gpu-plugin nfd-hook
This adds an nfd-hook for the gpu-plugin, which will create labels
for the GPUs that can then be used for POD deployment purposes or
creation of GPU extended resources which allow then finer grained
GPU resource management.

The nfd-hook will install to the host system when the
intel-gpu-initcontainer is run. It is added into the plugin deployment
yaml.

Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
2020-10-01 12:02:57 +03:00