The DRM driver of Intel i915 GPUs allows sharing one device
between many containers.
Make it possible to use the same device from different containers.
The exact number of containers sharing the same device can be limited
with the new option -shared-dev-num set to 1 by default.
closes#53
When the device's firmware crashes the OPAE driver reports all properties
of the device as a stream of binary ones. This effectively makes
interface and afu IDs look like "ffffffffffffffffffffffffffffffff".
Mark such devices as Unhealthy.
closes#77
Currently we have hardcoded mapping from human readable names of
AFs and FPGA regions like arria10-nlb0 to the resource names
produced by the FPGA device plugin. This is not sustainable
long term solution.
Implement CRD based mappings so that a new mapping can be added or
removed dynamically by cluster admins with CRD resources.
Every device plugin is supposed to implement PluginInterfaceServer
interface to be exposed as a gRPC service. But this functionality is
common for all our device plugins and can be hidden in a Manager
which manages all gRPC servers dynamically.
The only mandatory functionality that needs to be provided by a device
plugin and which differentiate one plugin from another is the code
scanning the host for devices present on it.
Refactor the internal deviceplugin package to accept only
one mandatory method implementation from device plugins - Scan().
In addition to that a device plugin can optionally implement a
PostAllocate() method which mutates responses returned by
PluginInterfaceServer.Allocate() method.
Also to narrow the gap between these device plugins and the
kubevirt's collection the naming scheme for resources has been changed.
Now device plugins provide a namespace for the device types they
operate with. E.g. for resources in format "color.example.com/<color>"
the namespace would be "color.example.com". So, the resource name
"intel.com/fpga-region-fffffff" becomes "fpga.intel.com/region-fffffff".
DRM control node is deprecated and removed by latest kernel.
This will skip possible drm control node found on host.
v2: Fix lint error
v3: Fix regex string
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Check if container annotation com.intel.fpga.mode is set to
"intel.com/fpga-region". This annotation is set by device plugin.
So, the check should help to filter out unwanted workflow that
device plugin is not aware of.
When kubelet notifies the plugin about its restart by removing
the plugin's socket we do reconnect to kubelet, but we don't
send the current list of monitored devices to kubelet. As result
kubelet is not aware of discovered devices if it restarts.
Always send the current list of monitored devices to kubelet
upon ListAndWatch() request.
The hook gets FPGA_REGION and FPGA_BITSTREAM environment variables
defined in a pod spec, finds bitstream file, verifies it and programs
FPGA device with it using fpga-configure tool from OPAE.
When running in the region mode we don't need to know AFU IDs
thus don't read them while in the mode.
It's important not to try to read them because in the region mode
AFUs are supposed to be reprogrammed in the fly anyway and the
afu_id files may become busy.
Device descovery can get EBUSY error when AFU is being programmed.
It causes plugin to crash with error:
Device scan failed: read /sys/class/fpga/intel-fpga-dev.0/intel-fpga-port.0/afu_id:
device or resource busy
This error should be ignored as this is valid use case.
This is harmless as afu will be rediscovered on the next run, when
reprogramming is done.
Moved code that goes through sysfs to the separate function
getSysFsInfo to decrease cyclomatic complexity of the scanFPGAs
function.
This is required to get the next commit through our CI check.