Commit 00a59e8f7d was not complete in that it didn't update
the corresponding documentation. This commit fixes that.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
The devices enabled by default are different between the
kustomize and operator based deployments.
This change harmonizes the defaults to c6xxvf and 4xxxvf
in both deployment options.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
This changes the memory reading to be done through lmem_total_bytes
file instead of the addr_range file.
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
Add govet-fieldalignment to .golangci.yml
Fix errors that come from adding govet-fieldalignment
- by reordering the fields of structs
- by putting nolint:govet annotations
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
Update tool versions
Fix the errors and warnings originating from the update:
- Correct type deviceInfo (->DeviceInfo) to make it public
- Fix gpu_plugin.go and vpu_plugin_test.go where stylecheck errors occur
- Fix deprecation warnings
- Rename type 'PatcherManager' to 'Manager' to solve exported errors
- Rename type 'SgxMutator' to 'Mutator' to solve exported errors
Signed-off-by: Hyeongju Johannes Lee <hyeongju.lee@intel.com>
- Information on the specific HW & virtualization types on which the GPU
plugin is tested belongs in release notes, not in the README intro
(where it has already become obsolete)
- HW offloading is provided by driver backends, not frontends
(e.g. OneVPL is just one of the media driver frontends)
This adds a section heading, TOC link, command line flag description
and a short explanation of what other dependent configuration
changes are needed with fractional resources in order for the command
line flag to achieve something useful.
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
This adds a link from gpu-plugin README to the nfdhook README, and
updates the nfdhook README with label descriptions.
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
<device>/driver symlink does not exist if the device is not bound
to any driver. bindDevice() failed when writing to <device>/driver/unbind
errored but IsNotExist() error is acceptable in case there's no driver
to unbind.
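A minimal sketch of the tolerated error case, assuming a helper that writes the device address to <device>/driver/unbind. The function name unbindDriver, the demo helper and the device address are hypothetical; this is not the actual bindDevice() code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// unbindDriver writes the device address to <devicePath>/driver/unbind.
// A missing driver symlink means the device is not bound to any driver,
// so there is nothing to unbind and the call is treated as success.
func unbindDriver(devicePath, bdf string) error {
	unbind := filepath.Join(devicePath, "driver", "unbind")
	err := os.WriteFile(unbind, []byte(bdf), 0o600)
	if err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("unbind %s: %w", bdf, err)
	}
	return nil
}

// demo runs unbindDriver against an empty directory that stands in for
// a device without a driver symlink.
func demo() error {
	dir, err := os.MkdirTemp("", "unbound-device")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	return unbindDriver(dir, "0000:3b:01.0") // hypothetical address
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("unbind failed:", err)
		return
	}
	fmt.Println("no driver bound, nothing to unbind")
}
```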
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Go 1.16 release notes announced the deprecation of io/ioutil [1]. It's easy
for us to move to what is recommended so just do it.
[1] https://golang.org/doc/go1.16#ioutil
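For reference, the sketch below shows the Go 1.16 replacements for the most common io/ioutil helpers. The roundTrip helper and the file name are made up for the example and do not come from the plugin code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// roundTrip writes data to a file in a fresh temporary directory, reads
// it back and counts the directory entries, using only the os package.
func roundTrip(data string) (string, int, error) {
	dir, err := os.MkdirTemp("", "ioutil-demo") // was ioutil.TempDir
	if err != nil {
		return "", 0, err
	}
	defer os.RemoveAll(dir)

	name := filepath.Join(dir, "value")
	if err := os.WriteFile(name, []byte(data), 0o600); err != nil { // was ioutil.WriteFile
		return "", 0, err
	}
	content, err := os.ReadFile(name) // was ioutil.ReadFile
	if err != nil {
		return "", 0, err
	}
	entries, err := os.ReadDir(dir) // was ioutil.ReadDir, now returns []os.DirEntry
	if err != nil {
		return "", 0, err
	}
	return string(content), len(entries), nil
}

func main() {
	content, n, err := roundTrip("42")
	fmt.Println(content, n, err)
}
```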
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
All but one (VPU) of the published container images can be built with
static binaries which allows us to use distroless/static as the
base image. Moreover, when combined with stripping the plugin binaries,
we can get both build time and image size savings.
This is the part 1 (out of 2) of the rework. Part 2 will finish the
change by making some adjustments to VPU plugin image and moving the
FPGA/SGX/GPU initcontainers to distroless/static too.
Partial: #516
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
Tests plugin scan results in setups having none, one and multiple
eligible GPU devices, with and without SRIOV enabled, with two
different options values.
This does not cover verifying number of devices added under
"i915_monitoring" resource as that would be much larger change.
To help in:
* adding more CLI options in next and later commits, and
* to replace magic newDevicePlugin() input parameters with
explicitly named one(s)
NOTE: this has impact only for GPUs which are virtualized with SR-IOV.
Access to physical devices (PFs) is disabled for "i915" resource when
they have configured virtual devices (VFs).
This is because:
* GPU resources are expected to be evenly split between VFs in such
configurations
* But PF resource amount is expected to differ from VFs and typically
retain only enough resources (just few MB of RAM), to be able to
provide GPU metrics that are not available from VFs
* Neither the current GPU plugin, nor Kubernetes scheduling in
general, has proper support for heterogeneous GPUs (= capability
based scheduling)
Therefore "i915" resource needs to be limited to GPU devices with
homogeneous amount of resources, which in SR-IOV configurations is
expected to be the case only with VFs (when such are present).
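The rule above could be sketched roughly as follows, assuming the plugin decides based on the standard sriov_numvfs attribute in the PCI device directory. The function names and the way the directory is passed in are assumptions, not the plugin's actual scan code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// configuredVFs returns the number of VFs configured for a PCI device.
// Devices without the sriov_numvfs attribute (VFs, or GPUs without
// SR-IOV support) report zero.
func configuredVFs(pciDevDir string) int {
	data, err := os.ReadFile(filepath.Join(pciDevDir, "sriov_numvfs"))
	if err != nil {
		return 0
	}
	n, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return 0
	}
	return n
}

// eligibleForI915 reports whether a device may be offered through the
// "i915" resource: a PF with configured VFs is left out.
func eligibleForI915(pciDevDir string) bool {
	return configuredVFs(pciDevDir) == 0
}

// demo builds a mock PF with four VFs and a mock VF, then checks both.
func demo() (bool, bool, error) {
	root, err := os.MkdirTemp("", "sriov-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)

	pf := filepath.Join(root, "pf")
	vf := filepath.Join(root, "vf")
	for _, d := range []string{pf, vf} {
		if err := os.Mkdir(d, 0o755); err != nil {
			return false, false, err
		}
	}
	numvfs := filepath.Join(pf, "sriov_numvfs")
	if err := os.WriteFile(numvfs, []byte("4\n"), 0o644); err != nil {
		return false, false, err
	}
	return eligibleForI915(pf), eligibleForI915(vf), nil
}

func main() {
	pfOK, vfOK, err := demo()
	fmt.Println("PF eligible:", pfOK, "VF eligible:", vfOK, "err:", err)
}
```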
The SGX DCAP out-of-tree v1.41 driver is also known to work
with the SGX plugin. However, the default NFD labeling does not
work with the out-of-tree driver so warn users about it.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Which mounts all (Intel) GPU devices to requesting container.
This is needed e.g. to get GPU metrics from the node. Requesting pod
does not know how many GPUs are on the node it gets assigned to, so
there needs to be a means to request them all.
(Only alternative for the new resource would be requesting Privileged
mode, which is clearly worse as that would grant pod access also to
all other devices and capabilities.)
This commit also:
* Adds "i915_monitoring" resource testing to: go test -v -run Scan
* Splits GPU plugin tests mock file system setup to a separate
createTestFiles() function because otherwise TestScan() does not
pass project's golangci-lint complexity limits
Add --device command line to operator's main.go which defines
the controllers/webhooks to set up.
Signed-off-by: Oleg Zhurakivskyy <oleg.zhurakivskyy@intel.com>
As the operator container image is available from a registry, we should
guide users to use it rather than build and deploy it locally.
Further, drop (un)deploy-operator targets in favor of simply using
kubectl for deployment.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Replaced multiple instances of master with main.
Reworded line 15 ("Verify QAT device plugin is registered"): removed
'on master' there and in the corresponding section heading. Related to pr499.
Signed-off-by: DougTW <doug.martin@intel.com>
The device plugins daemonsets are cluster wide and currently only
one device plugin instance per device is possible so making the
corresponding deviceplugin/v1 CRDs non-namespaced (i.e., scope: cluster)
fits better.
Previously, the device plugin daemonset was deployed in the same
namespace as the CR for that device but with the cluster scoped CRDs
we default to use the same namespace as the operator, unless overridden
via DEVICEPLUGIN_NAMESPACE env variable or a command line parameter
to operator manager deployment.
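A sketch of how the namespace could be resolved. The precedence between the command line parameter and the env variable is an assumption here (flag first, then DEVICEPLUGIN_NAMESPACE), and resolveNamespace is a hypothetical name.

```go
package main

import (
	"fmt"
	"os"
)

// resolveNamespace picks the namespace for the device plugin daemonset.
// Assumed order: command line value, then DEVICEPLUGIN_NAMESPACE, then
// the operator's own namespace.
func resolveNamespace(operatorNS, flagNS string, getenv func(string) string) string {
	if flagNS != "" {
		return flagNS
	}
	if ns := getenv("DEVICEPLUGIN_NAMESPACE"); ns != "" {
		return ns
	}
	return operatorNS
}

func main() {
	// "operator-system" is a placeholder for the operator's namespace.
	fmt.Println(resolveNamespace("operator-system", "", os.Getenv))
}
```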
Three additional changes in this commit:
- enable DSA envtest tests
- update controller-runtime to v0.8.1
- change device plugin envtest suite to use klog/v2
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Decouple the default enclaveLimit/provisionLimit from core count. With
this change, the default limit is constant and it can be made relative
to core count by setting PODS_PER_CORE multiplier via env variable.
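The new default could be sketched like this. The constant value and the function name are placeholders, not the values used by the plugin.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// defaultLimit is a placeholder for the constant default limit.
const defaultLimit = 100

// resourceLimit returns the constant default unless podsPerCore holds a
// positive integer, in which case the limit scales with the core count.
func resourceLimit(cores int, podsPerCore string) int {
	n, err := strconv.Atoi(podsPerCore)
	if err != nil || n <= 0 {
		return defaultLimit
	}
	return n * cores
}

func main() {
	fmt.Println(resourceLimit(runtime.NumCPU(), os.Getenv("PODS_PER_CORE")))
}
```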
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Removed device plugin socket check from the documentation as
device plugin support is enabled by default in Kubelet.
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
- Implemented demo image that runs accel-config tests
- Added testing instructions to the documentation
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
It looks like for a long time now we have accepted a setup where a valid QAT
device ID is accepted as a QAT device resource even though the device is
not "enabled" via kernelVfDrivers parameter.
Fix device ID validation to skip valid QAT devices that are not
explicitly specified in kernelVfDrivers.
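The stricter validation can be sketched as below. The device ID table is illustrative and deliberately incomplete, and isEnabled is a hypothetical helper rather than the plugin's own function.

```go
package main

import "fmt"

// vfDrivers maps QAT VF PCI device IDs to their kernel VF driver.
// Illustrative subset only; treat the entries as assumptions.
var vfDrivers = map[string]string{
	"37c9": "c6xxvf",
	"4941": "4xxxvf",
}

// isEnabled accepts a device only if its ID is a known QAT VF ID and
// the matching driver is listed in kernelVfDrivers.
func isEnabled(deviceID string, kernelVfDrivers []string) bool {
	driver, known := vfDrivers[deviceID]
	if !known {
		return false
	}
	for _, d := range kernelVfDrivers {
		if d == driver {
			return true
		}
	}
	return false
}

func main() {
	enabled := []string{"c6xxvf"}
	fmt.Println(isEnabled("37c9", enabled), isEnabled("4941", enabled))
}
```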
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
The updated dp.scan() changes how VF devices are detected. The
main reason for the change is to take into account cases where the QAT VF
driver is not present in the system at all but only the PF driver is
loaded (and the SR-IOV devices are enabled).
The rework also takes into account bare metal and VM deployments and
adds a test case for checking the virtualized environment.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
The plugin now detects/accepts 4xxx and c4xxx devices too
and defaults to those drivers that are part of Linux mainline.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
We have both "path" and "path/filepath" but the latter provides
everything needed so move it completely.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
The code was stripping out "0000:" (bus) and then adding
it back in several places.
That's not necessary so this change simplifies QAT VF addr
handling by operating using full BDF IDs.
Moreover, simplify function calls: use getDpdkDevice() once
for each VF device.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
The SGX device nodes have changed from /dev/sgx/[enclave|provision]
to /dev/sgx_[enclave|provision] in v4x RFC patches according to the
LKML feedback.
This change moves to the new device nodes. Backwards compatibility
is provided by adding /dev/sgx directory mount to containers. This
assumes the cluster admin has installed the udev rules provided in the
README to make the old device nodes as symlinks to the new device nodes.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
This call is implemented by calling ioctl, which raises
"open /dev/intel-fpga-port.X: operation not permitted" error
when called inside unprivileged container.
This breaks FPGA plugin.
Calling this API from fpga_tool is still OK, so
moving calls there should fix the issue.
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
This commit documents the SGX building blocks for Kubernetes and
how to deploy them in the cluster.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Reimplemented discovering of the FPGA devices using
APIs from pkg/fpga/intel_fpga_linux. The APIs are also
used in the fpga_tool utility.
The API is more advanced and supports SR-IOV among other
things.
Fixes: #372
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
This adds reading of the GPU memory amount from the sysfs. As a
fallback the environment variable GPU_MEMORY_OVERRIDE remains.
Another environment variable GPU_MEMORY_RESERVED can be used to
reserve a dedicated byte amount outside of kubernetes usage.
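A rough sketch of the lookup order, assuming the sysfs value is a plain decimal byte count. The helper names are hypothetical, the sysfs file is passed in as a parameter rather than hard coded, and clamping to zero when the reservation exceeds the total is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseBytes parses a decimal byte count and returns 0 on any error.
func parseBytes(s string) uint64 {
	n, err := strconv.ParseUint(strings.TrimSpace(s), 10, 64)
	if err != nil {
		return 0
	}
	return n
}

// gpuMemory reads the memory amount from memFile, falls back to the
// override value when the file gives nothing, and subtracts the
// reserved amount.
func gpuMemory(memFile, override, reserved string) uint64 {
	var total uint64
	if data, err := os.ReadFile(memFile); err == nil {
		total = parseBytes(string(data))
	}
	if total == 0 {
		total = parseBytes(override)
	}
	res := parseBytes(reserved)
	if res >= total {
		return 0
	}
	return total - res
}

// demo exercises both the sysfs path and the override fallback.
func demo() (uint64, uint64, error) {
	dir, err := os.MkdirTemp("", "gpu-mem")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	memFile := filepath.Join(dir, "mem")
	if err := os.WriteFile(memFile, []byte("1000\n"), 0o644); err != nil {
		return 0, 0, err
	}
	fromSysfs := gpuMemory(memFile, "", "100")
	fromOverride := gpuMemory(filepath.Join(dir, "missing"), "500", "")
	return fromSysfs, fromOverride, nil
}

func main() {
	// The file name below is a placeholder, not the real sysfs path.
	fmt.Println(gpuMemory("mem", os.Getenv("GPU_MEMORY_OVERRIDE"), os.Getenv("GPU_MEMORY_RESERVED")))
}
```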
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
With the addition of SGX webhook in the operator, full SGX stack
depends on having the operator deployed first. SgxDevicePlugin CRD
is set to get intel-sgx-plugin and intel-sgx-initcontainer deployed
by the operator.
As a pre-requisite, node-feature-discovery must be deployed but it
is currently deployed via sgx_plugin kustomization overlay only.
It's better to allow NFD with the SGX specific settings deployed with
a kustomization of its own.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
This adds an nfd-hook for the gpu-plugin, which will create labels
for the GPUs that can then be used for pod deployment purposes or
creation of GPU extended resources, which in turn allow finer-grained
GPU resource management.
The nfd-hook will install to the host system when the
intel-gpu-initcontainer is run. It is added into the plugin deployment
yaml.
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
For every created device info, a new topology scan is performed in
the filesystem. The shared dev count was implemented so that for each
shared device, a new device info was created, which resulted in the
topology scan happening as many times per Scan-round, as there were
shared devs.
This fixes the issue by sharing a single device info among the
shared devices.
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
Move remark about GVT-d to end of introduction. Remove remarks
about GVT-g for the time being.
Signed-off-by: Ukri Niemimuukko <ukri.niemimuukko@intel.com>
The SGX plugin exposes two device files as separate resources:
* /dev/sgx/enclave as sgx.intel.com/enclave
* /dev/sgx/provision as sgx.intel.com/provision
The number of resources is configurable, but it's intended to be equal
to the pod count by default, so that any pod requiring access would have
it. The access control (who can do SGX remote attestation) is done
outside this plugin.
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
fpga: make AFU resource name 63 char long
webhook: drop mode from README
webhook: extend mappings description
webhook: tighten CRD definitions
webhook: drop mapping to non-existing afuId
explicitly state mappings names can be in any format
use consistent terminology across fpga webhook and plugin
DPDK uses /sys/class/uio/uioX/device/[control|resource*] and we
had special mounts for the individual uioX paths. However, it turned
out this wasn't working as expected: host's /sys/class/uio/uioX/device/
was mounted to container's /sys/class/uio and DPDK failed to find
uioX/device/[control|resource*] files. Moreover, workloads requesting
more than one QAT resource, still saw only one path.
While cri-o/containerd give sysfs read-only mounts, DPDK needs
device/config RW. Therefore, we need to mount host /sys/class/uio/uioX
to container /sys/class/uio/uioX for each requested device.
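The per-device mounts can be sketched as below. The mount struct is a local stand-in for the device plugin API type, and uioMounts is a hypothetical helper.

```go
package main

import "fmt"

// mount is a local stand-in for the mount entries a device plugin
// returns in its allocate response.
type mount struct {
	HostPath      string
	ContainerPath string
	ReadOnly      bool
}

// uioMounts returns one read-write mount per requested uio device,
// mapping host /sys/class/uio/uioX to the same path in the container.
func uioMounts(uioDevs []string) []mount {
	mounts := make([]mount, 0, len(uioDevs))
	for _, dev := range uioDevs {
		p := "/sys/class/uio/" + dev
		mounts = append(mounts, mount{HostPath: p, ContainerPath: p, ReadOnly: false})
	}
	return mounts
}

func main() {
	for _, m := range uioMounts([]string{"uio0", "uio1"}) {
		fmt.Printf("%s -> %s (read-only: %v)\n", m.HostPath, m.ContainerPath, m.ReadOnly)
	}
}
```

Keeping the uioX component in the container path is what lets a workload with several QAT resources see each device separately.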
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Changed code a little bit to improve test coverage:
* call Scan in test code
* call Scan without hddl socket
* call Scan with 0 SharedDevNum
* move SharedDevNum in newDevicePlugin
* use Ticker instead of Sleep
Signed-off-by: Alek Du <alek.du@intel.com>
Move all the fpga components to using klog for logging
and debug. This includes replacing our homebrew 'fatal()'
with klog.Error().
Modify the deployment files to move from `-debug` to
`-v`, and set their default level to '1' (Info), rather
than full debug mode ('4').
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
Add NewDevicePlugin() tests to improve test coverage. This also
contributes to "input validation" (part of #321) that wasn't done
properly before.
Fixes: #325
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Move from fmt to klog for logging and debug.
Also add an extra info level message noting when we find
new devices.
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
Move the framework, and the qat driver, to use `klog`
for logging and debug.
This has some noticeable effects:
1) Our default log output gains a bunch of annotation:
From:
QAT device plugin started in 'dpdk' mode
To:
I0312 11:51:02.057728 6053 qat_plugin.go:64] QAT device plugin started in 'dpdk' mode
(there is now a command line option to drop those annotations if
necessary).
2) We gain a bunch of command line parameters from klog for controlling log
levels and output. We go from 5 arguments to 17:
---
Usage of ./cmd/qat_plugin/qat_plugin:
-add_dir_header
If true, adds the file directory to the header
-alsologtostderr
log to standard error as well as files
-debug
enable debug output
-dpdk-driver string
DPDK Device driver for configuring the QAT device (default "vfio-pci")
-kernel-vf-drivers string
Comma separated VF Device Driver of the QuickAssist Devices in the system. Devices supported: DH895xCC,C62x,C3xxx and D15xx (default "dh895xccvf,c6xxvf,c3xxxvf,d15xxvf")
-log_backtrace_at value
when logging hits line file:N, emit a stack trace
-log_dir string
If non-empty, write log files in this directory
-log_file string
If non-empty, use this log file
-log_file_max_size uint
Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
-logtostderr
log to standard error instead of files (default true)
-max-num-devices int
maximum number of QAT devices to be provided to the QuickAssist device plugin (default 32)
-mode string
plugin mode which can be either dpdk (default) or kernel (default "dpdk")
-skip_headers
If true, avoid header prefixes in the log messages
-skip_log_headers
If true, avoid headers when opening log files
-stderrthreshold value
logs at or above this threshold go to stderr (default 2)
-v value
number for the log level verbosity
-vmodule value
comma-separated list of pattern=N settings for file-filtered logging
---
3) Our `-debug` flag is now replaced by the `klog` `-v n` flag.
*NOTE:* This is potentially a minor breaking change. Applying
this debug overlay to any previous (pre-klog edit) images will
cause the container to fail to launch, as it will not recognise
the new `-v` arguments.
We also update the kustomize deployment to move from using
DEBUG env vars to adding a VERBOSITY var that controls both
the log verbosity and now the debug mode enabling.
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
Kerneldrv checks for available devices based on adf_ctl output.
We only accepted two cases: PFs if IOMMU is off and VFs if IOMMU
is on.
The right check is to only skip PFs if IOMMU is on and allow other
cases. This fixes two scenarios: when run in VMs, we accept VFs
regardless of (v)IOMMU presence.
Moreover, do not hard code domain '0000:' because it is not the
case always.
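The corrected check boils down to a single condition. The sketch below uses plain booleans; how the plugin derives them from adf_ctl output and the system state is left out, and acceptDevice is a hypothetical name.

```go
package main

import "fmt"

// acceptDevice skips only PFs on systems with IOMMU on. VFs are always
// accepted, and PFs are accepted when IOMMU is off.
func acceptDevice(isVF, iommuOn bool) bool {
	return isVF || !iommuOn
}

func main() {
	for _, isVF := range []bool{false, true} {
		for _, iommuOn := range []bool{false, true} {
			fmt.Printf("VF=%v IOMMU=%v accepted=%v\n", isVF, iommuOn, acceptDevice(isVF, iommuOn))
		}
	}
}
```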
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
This commit drops fpga_plugin dependency to k8s.io/kubernetes which
was used to get GetHostname(). After this change, the plugin node
name can be set using the new -node-name parameter. The default value
is read from the NODE_NAME environment variable.
If the node annotation override check fails, we continue with the default
mode parameter and do not exit like we did previously.
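A minimal sketch of a flag whose default comes from the environment, using the standard flag package. The parseNodeName helper and the flag set name are made up for the example.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseNodeName parses -node-name from args and defaults to the value
// of the NODE_NAME environment variable.
func parseNodeName(args []string, getenv func(string) string) (string, error) {
	fs := flag.NewFlagSet("fpga_plugin", flag.ContinueOnError)
	nodeName := fs.String("node-name", getenv("NODE_NAME"), "name of the node the plugin runs on")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *nodeName, nil
}

func main() {
	name, err := parseNodeName(os.Args[1:], os.Getenv)
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("node name:", name)
}
```

In a daemonset, NODE_NAME would typically be filled in from spec.nodeName through the downward API.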
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
go get'ing does not work due to our k8s.io/kubernetes dependency
so guide users to use git clone to get the code.
Fixes: #290
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Not touching "orchestration programmed". Fixing only instances where
this refers directly to the mode recognized by the webhook-deploy.sh
script.
Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
crypto-perf instructions were outdated and had implicit
assumptions about the environment. More specifically:
Clear Linux builds DPDK libraries as shared so for the
compress and crypto test applications to run, the memory and
QAT PMD libraries must be explicitly preloaded using '-d' parameter.
Also, the test-crypto1 and test-compress1 deployments expect the
cluster is configured with CPU Manager's static policy.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Just follow the standard format to fix the vpu plugin readme.
Also added the ubuntu OpenVINO demo job long logs.
Signed-off-by: Alek Du <alek.du@intel.com>
Update the CRI-O webhook README, adding notes about what it is and
does, and that it is normally installed as part of the device
plugin daemonset.
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
Expand and re-arrange the README. Add some details about what the
plugin and other FPGA components provide.
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
If we fail to scan for GPU devices (note, that is potentially
different from not finding any devices during a scan), then
warn on it, and go around the poll loop again. Do not treat
it as a fatal error or we might end up in a re-launch death
deploy loop...
Of course, getting a warning in your logs every 5s could also
be annoying, but is somewhat 'less fatal'.
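The non-fatal handling can be sketched as a ticker-driven loop. The pollDevices and demo names are hypothetical, and the loop is bounded only so that the example terminates.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// pollDevices calls scan once per tick for the given number of rounds.
// A failed scan is logged as a warning and the loop carries on with the
// next round instead of exiting.
func pollDevices(scan func() error, interval time.Duration, rounds int) (succeeded, failed int) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for i := 0; i < rounds; i++ {
		if err := scan(); err != nil {
			log.Printf("warning: GPU scan failed, will retry: %v", err)
			failed++
		} else {
			succeeded++
		}
		<-ticker.C
	}
	return succeeded, failed
}

// demo simulates a scan that fails once and then recovers.
func demo() (int, int) {
	calls := 0
	scan := func() error {
		calls++
		if calls == 1 {
			return errors.New("cannot read sysfs")
		}
		return nil
	}
	return pollDevices(scan, time.Millisecond, 3)
}

func main() {
	succeeded, failed := demo()
	log.Printf("scans succeeded: %d, failed: %d", succeeded, failed)
}
```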
Fixes: #260
Fixes: #230
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
Update the QAT README. Add some descriptions. Add information about
the command line and config options.
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
Add draw.io and their generated PNG files for both
orchestrated and preprogrammed FPGA modes. These will
then be used in the documentation.
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
The fpga_tool had no README. Add a basic one.
Desired as we should at least reference the tool from the
fpga_plugins document.
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
Re-arrange the section order a little (such as putting the use
of the DaemonSet before the sudo hand-deploy), and add a lot more
detail of what to expect, and how to check if the pod has launched
correctly.
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
Fill out the introduction to the GPU README to give some details around
what the plugin supports and how.
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
The default deployment gives rather wide host mounts.
Limited sysfs mount only to the subdirectory the plugin
needs.
Mounted sysfs and dev mounts read-only.
Added notes that FPGA plugin can be run as non-root user.
The default deployment gives rather wide host mounts. We can limit
the mounts only to the subdirectories the plugin needs and mount
them read-only.
Also, add notes that both QAT and GPU plugins can be run as non-root
user.
Fixes: #228
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
In the pods generated automatically by Deployment/ReplicaSets
fields name and namespace might be missing.
We can use information about namespace from request itself.
Initcontainer is now built in main build process, no need to download
anything special.
Added note about checking OCI hooks configuration parameter in CRI-O
Fixes: #192
- user readable output for fpgainfo/fmeinfo/portinfo commands
- new commands: list, list-fme, list-port
- new -q flag to suppress headers, progress and too verbose messages
- install command will now fail if destination file already exists
- new --force flag: allows overwrite files in install command
- removed development and debug output
Extended fpga plugin to support both in-tree(DFL) and
out-of-tree (OPAE) kernel drivers.
- fpga_crihook: move JSON parsing to separate functions
- decreased cyclomatic complexity of the CRI hook main() function
- increased readability
- increased test coverage
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
- Migrate to OPAE 1.3.2
- Build all the tools from the source
- ignore files in workspace
- minimal fpga_tool utility to check gbs/aocx file parsing and flashing
- implemented kernel IOCTL based flashing of bitstreams
- add PCI and sysfs functions
We plan to use crypto-perf for simple QAT testing. This commit adds
kustomization to make the deployment easier. The original .yaml is
also moved to deployments/ with some changes.
For instance, it turns out also vfio-pci mode with DPDK needs CAP_SYS_ADMIN
(See PR: #187 which states that only igb_uio would need it).
kustomize is available as part of kubectl since Kubernetes v1.14.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
-mode kerneldrv comes with no documentation. This patch adds a few
notes about it and instructions on how to get it built if a user chooses
to have it enabled.
Closes: #197
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
When IOMMU is on in the system, the physical function (PF)
devices cannot be used. This prevented using kerneldrv as it
was only written to work with PFs.
However, QAT bare metal functions can also be used when IOMMU
is enabled. In this case, they must be used via virtual functions
(VF).
This commit makes it possible to use kerneldrv when IOMMU is
on. The added side benefit is we can now slice the same QAT HW
for both "dpdk" and "kernel" usages simultaneously.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
In adf_ctl output, qat_devX is a sequence number that includes both
PF and VF devices:
qat_dev0 - type: c6xx, inst_id: 0, node_id: 1, bsf: 84:00.0, #accel: 5 #engines: 10 state: up
qat_dev1 - type: c6xx, inst_id: 1, node_id: 1, bsf: 85:00.0, #accel: 5 #engines: 10 state: up
qat_dev2 - type: c6xx, inst_id: 2, node_id: 1, bsf: 86:00.0, #accel: 5 #engines: 10 state: up
qat_dev3 - type: c6xxvf, inst_id: 0, node_id: 1, bsf: 84:01.0, #accel: 1 #engines: 1 state: up
qat_dev4 - type: c6xxvf, inst_id: 1, node_id: 1, bsf: 84:01.1, #accel: 1 #engines: 1 state: up
...
X cannot be used as the config file identifier because it does not match
the real id of the device. inst_id gives this, so use that to find
the right config file.
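A sketch of picking inst_id out of an adf_ctl line such as the ones quoted above. The regular expression and helper names are hypothetical, and the <type>_dev<inst_id>.conf naming is an assumption that this message does not spell out.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// adfLine captures the device type and inst_id and ignores the qat_devX
// sequence number.
var adfLine = regexp.MustCompile(`^qat_dev\d+ - type: (\w+),\s*inst_id: (\d+),`)

// parseAdfLine returns the device type and inst_id of one adf_ctl line.
func parseAdfLine(line string) (string, int, bool) {
	m := adfLine.FindStringSubmatch(strings.TrimSpace(line))
	if m == nil {
		return "", 0, false
	}
	id, err := strconv.Atoi(m[2])
	if err != nil {
		return "", 0, false
	}
	return m[1], id, true
}

// configName builds the config file name from type and inst_id.
// The naming scheme is an assumption made for this example.
func configName(devType string, instID int) string {
	return fmt.Sprintf("%s_dev%d.conf", devType, instID)
}

func main() {
	line := " qat_dev3 - type: c6xxvf, inst_id: 0, node_id: 1, bsf: 84:01.0, #accel: 1 #engines: 1 state: up"
	if devType, instID, ok := parseAdfLine(line); ok {
		fmt.Println(configName(devType, instID))
	}
}
```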
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Let a user know the plugin can't find any Intel GPU if that's
the case. It might be cumbersome to realize that the plugin runs
on a host which doesn't have any Intel GPUs.
Also make the code less nested for better readability.
This commit makes it possible for qat2_plugin to use PCI
devices with communication chipsets 8925 to 8955.
Signed-off-by: Rivera Gonzalez, Julio C <julio.c.rivera.gonzalez@intel.com>
CRDs for AF or Region mappings are scoped to namespaces. So an
admitted pod has to be mutated with CRDs existing in the same
namespace as the pod's.
Closes #167
Not all QAT chips (e.g., 37c9) are available in pci.ids, so
"grep QAT" does not show them.
Scan all known VF PCI ids in a loop to ensure all configured devices
are shown.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
For easier deployments, fetch plugin command line arguments from ConfigMap.
When using ConfigMaps, qat_plugin.yaml needs no changes and can always
be used as is.
qat_plugin_default_configmap.yaml uses built-in defaults.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>