Support for Nvidia backend is implemented via SYCL CUDA backend. The feature is
disabled by default. Users must enable it at build time with a CMake option
DNNL_GPU_VENDOR=NVIDIA. The Nvidia GPUs can be used via oneDNN engine
abstraction. The engine should be created using dnnl::engine::kind::gpu engine
kind or the user can provide a sycl::device objects that corresponds to Nvidia
GPUs.
- oneAPI DPC++ Compiler with support for CUDA or Intel oneAPI DPC++/C++ Compiler with NVIDIA plugin
- NVIDIA CUDA* driver
- cuBLAS 10.1 or later
- cuDNN 7.6 or later
export CC=/path/to/dpcpp/install/bin/clang
export CXX=/path/to/dpcpp/install/bin/clang++
mkdir build
cd build
cmake -DDNNL_CPU_RUNTIME=DPCPP -DDNNL_GPU_RUNTIME=DPCPP \
-DDNNL_GPU_VENDOR=NVIDIA -G Ninja ..Both buffer-based and USM-based oneDNN APIs are supported for Nvidia backend.
The following table documents the supported data types.
| Data Type | Computation Mode |
|---|---|
| f32 | Training, Inference |
| f16 | Inference |
| s8 | Inference (when applicable) |
| bf16 | Training, Inference |
cuDNN functions are not necessarily the same as oneDNN primitives due to lack of standard API for DNN. For each primitive the cuDNN equivalent function is added to the Nvidia backend for oneDNN. However, the added backend cannot provide all functionalities supported by oneDNN primitives. The detailed limitations of each cuDNN primitive are explained as follow.
The closest equivalent to oneDNN batch normalization can be
cudnnBatchNormalizationForward and cudnnBatchNormalizationBackward
operations. However, there are some difference between cuDNN and oneDNN batch
normalization.
- When
global_statsflag is set for batch normalization, the mean and variance are input only parameters. However, cuDNN does not have the option to accept the mean and variance as inputs in the forward training operation. Therefore,cudnnBatchNormalizationForwardInferenceis used to match the oneDNN feature. Although inference is not supported withoutglobal_statsflags set. - The cuDNN precision is different from that of oneDNN for Batch Normalization.
(e.g
fp:0.0170898 dt:0.0170907 diff:8.27014e-07 rdiff:4.83922e-05) - The forward training with no flags accepts mean and variance as an output.
However, in cuDNN the mean and variance are running mean and variance
respectably so they are both input and output variable. Therefore, they are
required to have a sensible value (cannot be NaN). Since oneDNN will not set
value for the mean and variance when no flag is passed, the NaN can be
propagated as a result. To avoid NaN propagation,
cudaMemsetfunction is used to initialize the mean and variance with zero. - cuDNN always requires the values for scale and shift. When shift and scale are
not defined in oneDNN,
cudaMemsetis used to initialize scale to 1 and shift to 0. - For performance reason in the backward pass, cuDNN requires the mean and inverse variance to be saved in the forward pass. Therefore, when Nvidia backend is used for batch normalization, the workspace must be provided to save the mean and inverse variance.
- When
dnnl_fuse_norm_reluflag is set for batch normalization, thecudnnActivationForwardoperation is called immediately after the batch normalization, since cuDNN does not have a fused batch normalization withRELU. The implementation for element-wise post operations is the same. - When
dnnl_fuse_norm_reluis used, the intermediate output of batch normalization, which is used as an input to the activation function, is saved in the workspace as well. This is required to compute the backward pass fordnnl_fuse_norm_reluflag. - Forward pass supports f32, f16, bf16 and s8 data types. Although blocking is not supported for s8.
- cuDNN uses
alphaandbetaparameters to blend thedy,shiftandscale. Since oneDNN does not have this feature, thealphaandbetavalues in the backward direction are set to 1 and 0 respectively to avoid blending. - Nvidia backend for backward direction requires the workspace as an input containing the mean and inverse variance computed in the forward pass.
- The Nvidia backend for oneDNN does not support the backward direction for
batch normalization when the flag is set to
global_stats. This is due to the fact that oneDNN will skip the
- Supported data types for source and destination:
f32,bf16,f16,s8 - Supported data types for mean and variance:
f32 - Supported post-ops:
eltwise_relu - Supported flags:
dnnl_global_stats,dnnl_use_scale,dnnl_use_shift,dnnl_fuse_norm_relu,dnnl_fuse_norm_add_relu
- Supported data types for source and destination:
f32,bf16,f16 - Supported data types for mean and variance:
f32 - Supported flags:
dnnl_global_stats,dnnl_use_scale,dnnl_use_shift,dnnl_fuse_norm_relu,dnnl_fuse_norm_add_relu
The cudnnOpTensor is equivalent of oneDNN binary primitives.
- Only scales attribute is supported. Post-op attribute is not supported.
- Blocking is only supported for
int8and only in the C dimension with either 4 or 32 block size (same as other cuDNN primitives). - Supported data types are f32, f16, bf16 and s8.
The concat operation uses the reorder primitive to concatenate tensors over the chosen dimension, so the same limitation as reorder applies here.
The cudnnConvolutionForward, cudnnConvolutionBackward and
cudnnConvolutionBackwardFilter is used to compute forward, backward by data or
backward by weights for a convolution operation.
- Blocking is only supported for
int8and only in the C dimension with block size of 4. Input and output tensors must have the same data type. - For int8 (s8s8s8) with post-ops the operations are performed as s8s8f32 (due
to cuDNN limitations) then reordered to
s8at the end which impacts performance. - Direct convolution is not supported, so implicit GEMM is used in those cases.
- "Left" padding must be greater or equal to "right" padding, and the requested spatial output should match the output formula for two "left" padding used.
- Eltwise post-op limitations are the same as our eltwise limitation as post-ops are not fused.
- cuDNN requires padding tensors to 4 dimensions, so 1D convolutions are supported but are performed as 2D.
The following table shows the convolution status for the oneDNN Nvidia backend:
| Weights Format | Winograd Supported | Supported Input Format | Supported Output Format | Supported Data Type | Limitations |
|---|---|---|---|---|---|
| 2D NCHW | YES | NCHW, NHWC | NCHW, NHWC | f32, f16, bf16 | The Winograd algorithm has limitations: * Filter size must be 3x3 or 5x5. * Dilation must be zero for all dimensions. * Horizontal and vertical filter stride must be 1. |
| 2D NHWC | NO | NHWC | NHWC | f32, f16, bf16, int8 | * Dilation must be zero in all dimensions. * Output feature maps must be multiple of 4 for int8 type. |
| 3D NCHW | NO | NCHW, NHWC | NCHW, NHWC | f32, f16, bf16 |
| Weights Format | Winograd Supported | Supported Input Format | Supported Output Format | Supported Data Type | Limitations |
|---|---|---|---|---|---|
| 2D NCHW | YES | NCHW, NHWC | NCHW | f32, f16, bf16 | 1. Dilation must be zero for all dimensions. 2. The Winograd algorithm has limitations: * Filter size must be 3x3 or 5x5. * Dilation must be zero for all dimensions. * Horizontal and vertical filter stride must be 1. |
| 2D NHWC | NO | NHWC | NHWC | f32, f16, bf16 | |
| 3D NCHW | NO | NCHW, NHWC | NCHW | f32, f16, bf16 |
Deconvolution primitive is implemented through the convolution with swapped input abd output channels.
- Currently, there is a bug, likely in this code, which causes crashes in memory_tracking for 3D backward_weights with bias when backward_weights without bias was also a part of the run. Cache interrogation is suspected due to cache-free runs are successful. Switched off in benchdnn until further investigation and the fix.
The cudnnActivationForward and cudnnActivationBackward is the equivalent of
eltwise forward and eltwise backward in oneDNN respectively. There are some
limitations when using Nvidia backend for eltwise primitive:
- cuDNN only supports the following operations -
RELU,ELU,TANH,LOGISTIC. RELUis only supported with alpha = 0.- cuDNN expects
x,yanddyas inputs to the backward pass, hence, onlyRELUoperation supports backward proragation kind. TODO: addELU_DST,TANH_DSTandLOGISTIC_DSTsupport which requiredy. - Forward pass supports
f32,f16,bf16ands8data types. Although blocking is not supported fors8. - Backward pass supports
f32andbf16data types.
- Supported data types:
f32,bf16,f16,s32,s8andu8 - Supported post-ops:
binary
- Supported data types:
f32,bf16
The inner product primitives is an implementation of matrix multiplication plus bias activation. There are two implementation of inner product in cuDNN backend.
With sum post-op, the accumulation mode attribute affects the behavior as
follows:
relaxed: Uses GEMM’s beta parameter for a fused and optimized sum post-op but may reduce output precision for largef16inputs.strict(default): Converts GEMM output tof32, performssumpost-op as a separate operation, then converts it back to the original type. This attribute provides better output precision but reduced performance.
The default backend for inner product is the gemm backend using cublasGemmEx
for forward, backward data, and backward weight and cudnnReduceTensor for
backward bias. A function called gemm_consitency_check(), dense_check() is
used to see if the gemm backend can be used for inner product. reorder_check()
is used when reorder is required. If none of the above condition are met, it
falls back to the convolution backend. cudnnActivationForward operation is
used for eltwise operation and cudnnAddTensor is used for bias operation. The
beta parameter in gemm is used for the sum scale and alpha parameter is used
for the output scale.
- Forward pass supports
f32,f16,bf16ands8data types. - Backward pass supports
f32andbf16data types.
For the forward direction, this operation can be implemented by using
cudnnConvolutionBiasActivation by converting the inner product to 1x1
convolution. For the backward direction the inner product operation will be
equivalent of cudnnConvolutionBackwardData, cudnnConvolutionBackwardWeights
and cudnnConvolutionBackwardBias when applied. This implementation of inner
product has the following restrictions and performance implications:
- The only blocked layouts are those that are supported in cuDNN - namely that
the blocking is done on the C dimension, the block size is 4, and only for
int8inference. The additional requirement is that both the input and filter must be blocked. - The
ReLUand sum are supported as a fused post-op, for other post-op a separate call to eltwise primitive is performed. So the limitation for the eltwise primitive is applied here. - Only
mask = 0case is supported for output scale. - The restrictions for the convolution primitive are applied here for input and filter format. When required, the filter is internally reordered to match the convolution restriction.
- For
int8cuDNN requires both input and output feature maps to be a multiple of 4. - Forward pass supports
f32,f16,bf16ands8data types. - Backward pass supports
f32andbf16data types.
The local response normalization primitive in the Nvidia backend is implemented
with the cudnnLRNForward and cudnnLRNBackward functions for forward and
backward propagation respectively.
WITHINalgorithm is not supported.- There is a difference in the LRN algorithm used in oneDNN and cuDNN which causes a mismatch when the local size is even.
- cuDNN supports NCHW tensor formats for all valid dimensions. However, it does not support the NHWC tensor format for above 5 dimensions.
- Supported data types:
f32,bf16,f16 - Supported algorithms:
ACROSS,WITHIN
- Supported data types:
f32,bf16 - Supported algorithms:
ACROSS,WITHIN
The matrix multiplication primitive in the Nvidia backend is implemented with
cublasGemmEx and cublasGemmStridedBatchedEx functions.
- Zero points support is not provided by cuBLAS and, hence, not supported by the Nvidia backend.
- Post-ops and output scale limitations are same as for Inner Product.
- Supported data types are
f32,f16,bf16ands8.
The pooling primitive in the Nvidia backend is implemented with the
cudnnPoolingForward and cudnnPoolingBackward functions for forward and
backward propagation respectively.
-
cuDNN only allows the use of symmetric padding, i.e. padding at the beginning of a dimension must be the same as the padding at the end of that dimension. oneDNN doesn't have this limitation. Therefore,
- Configurations where padding in the beginning is larger than padding at the end are supported and work as expected.
- For configurations where padding at the end is larger than padding in the
beginning of any dimension, the primitive returns
status::unimplemented.
-
For backward propagation cuDNN requires the parameters
x,y,dxanddy, while oneDNN requires onlydx,dyand workspace when theMAXalgorithm is used. Hence, the workspace is used to store thexandyparameters in the forward pass for the Nvidia backend. Therefore, the workspace is always required when the Nvidia backend is used (except for the forward inference). -
Supported data type are
f32,f16,bf16ands8.
- Supported data types:
f32,bf16,f16,s8,u8ands32 - Supported post-ops:
binary,eltwise_linear - Supported algorithms:
max,avg_p,avg_np
- Supported data types:
f32,bf16,f16 - Supported algorithms:
max,avg_p,avg_np
The Primitive layer normalization is implemented through SYCL kernels.The implementation supports both forward and backward directions.
- Supported formats:
NCDHW,NDHWC,NCHW,NHWC,NCW,NWC,NC
- Supported data types for source and destination:
f32,bf16,f16,s8,u8 - Supported attributes:
Scales - Supported flags:
dnnl_global_stats,dnnl_use_scale,dnnl_use_shift
- Supported data types for source and destination:
f32,bf16,f16 - Supported data types for mean and variance:
f32 - Supported flags:
dnnl_global_stats,dnnl_use_scale,dnnl_use_shift
The cudnnTransform function is the equivalent of oneDNN reorder function.
However, there are some limitations when using reorder on Nvidia
GPU:
- Per dimension scaling is not supported (a single alpha and beta value is accepted by the transform tensor function).
- Blocking is only permitted for the channel dimension in cuDNN. This primitive currently supports block size of 4.
- Blocking is only supported when channel dimension is a multiple of the block
size and the datatype is
int8. - Forward pass supports
f32,f16,bf16ands8data types. - Backward pass supports
f32andbf16data types.
The cudnnSoftmaxForward and cudnnSoftmaxBackward are used to implement the
softmax primitive. For logsoftmax primitive the same functions will be used and
the algorithm selection in cuDNN for the above mentioned functions will be
changed to CUDNN_SOFTMAX_LOG.
- The softmax axis is supported for only the channel dimension, (i.e., axis=1).
- There is a bug in cuDNN softmax for 5D tensor with format
NHWC. When the channel size is greater than 1, it only applies softmax for a single channel and leave the others untouched. - Forward pass supports
f32,f16,bf16ands8data types. - Backward pass supports
f32andbf16data types.
- Supported data types:
f32,bf16,f16,u8ands8 - Supported scales: common scales for
s8andu8data types
- Supported data types:
f32,bf16andf16
The sum operation uses the reorder primitive to sum tensors, so the same limitation as reorder applies here.
Some missing primitives/features are supported through generic SYCL kernels.