🐛 Describe the bug
TL;DR
- We should use the safer data_ptr accessing APIs. The newer template APIs perform additional checks. If possible, use tensor.mutable_data_ptr<T>() and tensor.const_data_ptr<T>(). Stop using (scalar_t*)tensor.data_ptr().
- Stop using char*. Either use int* or use the allocator directly for buffer access.
1. Use template data_ptr API for tensor
1.1 Existing code usage and its problem
Existing code uses raw pointers like this:
// non-template usage. Avoid using this
kernel_call(
    (scalar_t*)tensor.data_ptr());
The above usage doesn't check whether the tensor is initialized. When the tensor's storage is not initialized (for example, a FakeTensor), this data_ptr is null; the kernel may then write through a pointer to non-initialized memory, causing a page fault.
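To make the failure mode concrete, here is a minimal standalone sketch (my own illustration, not code from this repo) that uses a meta-device tensor as a stand-in for a FakeTensor whose storage is never materialized:
#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Meta tensors carry shape/dtype metadata but no allocated data,
  // similar to a FakeTensor during tracing.
  at::Tensor t = at::empty({8}, at::TensorOptions().device(at::kMeta));
  std::cout << std::boolalpha
            << "has_storage: " << t.unsafeGetTensorImpl()->has_storage() << "\n"
            << "storage_initialized: "
            << t.unsafeGetTensorImpl()->storage_initialized() << "\n";
  // A raw (scalar_t*)t.data_ptr() cast at this point would hand a kernel
  // a null pointer; the first store through it page-faults.
  return 0;
}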
1.2 template data_ptr
The template data_ptr performs additional checks, such as has_storage() and storage_initialized(). Please refer to PyTorch's TensorImpl.h for details.
In view of that, migrate old usages to the safer form:
// template usage. Not encouraged.
input_.data_ptr<scalar_t>();
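For illustration, a minimal sketch (my own, not from the issue) showing the template accessor refusing an unallocated tensor instead of returning null; the meta device again stands in for FakeTensor:
#include <ATen/ATen.h>
#include <iostream>

int main() {
  at::Tensor t = at::empty({8}, at::TensorOptions().device(at::kMeta));
  try {
    // Checked accessor: verifies the storage before handing out a pointer.
    float* p = t.data_ptr<float>();
    (void)p;
  } catch (const c10::Error& e) {
    // Fails loudly here instead of page-faulting inside a kernel later.
    std::cout << "rejected: " << e.what_without_backtrace() << "\n";
  }
  return 0;
}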
1.3 New APIs: mutable_data_ptr() and const_data_ptr()
PyTorch has introduced new APIs to supersede tensor.data_ptr(). Their usage is the same as the template data_ptr, and their meaning is just what the names suggest: const_data_ptr<T>() for read-only access, mutable_data_ptr<T>() for writable access.
// Use this
input_.const_data_ptr<scalar_t>();
input_.mutable_data_ptr<scalar_t>();
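Beyond the storage checks, const_data_ptr() buys const-correctness. A hypothetical elementwise kernel (illustration only, not code from this repo) shows the benefit: taking the input through a const pointer lets the compiler reject accidental writes to it.
#include <cstdint>

template <typename scalar_t>
void scale_kernel(scalar_t* out, const scalar_t* in, scalar_t alpha, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    out[i] = alpha * in[i];
    // in[i] = 0;  // would not compile: `in` points to const data
  }
}

int main() {
  float src[4] = {1, 2, 3, 4};
  float dst[4] = {};
  // With tensors: out.mutable_data_ptr<float>() and in.const_data_ptr<float>().
  scale_kernel(dst, src, 2.0f, int64_t{4});
  return 0;
}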
2. Avoid using a char* data ptr for buffers
The existing code has the following pattern.
For example, the code snippet in torch-xpu-ops/...Reduce.h:
Tensor buffer;
Tensor semaphores;
buffer = at::empty(
config.global_memory_size(),
at::TensorOptions().dtype(kChar).device(kXPU));
semaphores = at::empty(
config.semaphore_size(), at::TensorOptions().dtype(kChar).device(kXPU));
at::detail::Array<char*, 1> data;
data[0] = (char*)semaphores.data_ptr();
buf_ptr = buffer.defined() ? (void*)buffer.data_ptr() : nullptr,
The template tensor.data_ptr<T>() does not support the char type, so we should avoid this usage. Instead, we can align with PyTorch's practice in either of the following ways.
2.1 Use int* instead of char*
For example, PyTorch has the following code in Normalization.cuh:
at::Tensor semaphores;
semaphores = at::zeros({grid.x}, input.options().dtype(at::kInt));
int* semaphores_ptr = grid.y > 1 ? semaphores.mutable_data_ptr<int>() : nullptr;
However, the above is still not best practice: it constructs a whole tensor just to use the data inside it. We could use an allocator directly instead; see Section 2.2.
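As a sketch (hypothetical, not committed code), the kChar semaphore buffer from the torch-xpu-ops snippet above could be rewritten in this style; the size is a placeholder for config.semaphore_size() / sizeof(int), and an XPU device is assumed to be available at runtime:
#include <ATen/ATen.h>

int main() {
  // Placeholder for config.semaphore_size() / sizeof(int).
  const int64_t num_semaphores = 128;
  // kInt instead of kChar, so the checked template accessor applies.
  at::Tensor semaphores = at::zeros(
      {num_semaphores}, at::TensorOptions().dtype(at::kInt).device(at::kXPU));
  int* semaphores_ptr = semaphores.mutable_data_ptr<int>();
  (void)semaphores_ptr;
  return 0;
}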
2.2 Use the allocator directly
For example, see the code snippet in PyTorch's Reduce.cuh:
at::DataPtr buffer;
at::DataPtr semaphores;
if (config.should_global_reduce()) {
auto& allocator = *c10::cuda::CUDACachingAllocator::get();
buffer = allocator.allocate(config.global_memory_size());
semaphores = allocator.allocate(config.semaphore_size());
//...
}
auto reduce_kernel = ReduceOp<scalar_t, ops_t, uint32_t, out_scalar_t, vt0>(
buffer.get(),
(int*)semaphores.get(),
// ... (remaining arguments elided)
);
This directly uses the allocator rather than constructing a tensor. I believe that this usage could have some performance gain, since it avoids constructing a tensor just to own a scratch buffer.
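A corresponding sketch for XPU (my assumption: c10::xpu::XPUCachingAllocator exposes a get() entry point analogous to the CUDA one; the sizes are placeholders for config.global_memory_size() and config.semaphore_size()):
#include <c10/core/Allocator.h>
#include <c10/xpu/XPUCachingAllocator.h>

int main() {
  // Assumed analogue of c10::cuda::CUDACachingAllocator::get().
  auto& allocator = *c10::xpu::XPUCachingAllocator::get();
  // Placeholder sizes standing in for config.global_memory_size()
  // and config.semaphore_size().
  c10::DataPtr buffer = allocator.allocate(4096);
  c10::DataPtr semaphores = allocator.allocate(256);
  void* buf_ptr = buffer.get();
  int* semaphores_ptr = static_cast<int*>(semaphores.get());
  (void)buf_ptr;
  (void)semaphores_ptr;
  // c10::DataPtr is RAII: both blocks return to the caching allocator
  // when they go out of scope, so no tensor is ever constructed.
  return 0;
}
This mirrors the Reduce.cuh pattern: the kernel receives raw pointers, and no TensorImpl bookkeeping is paid for a scratch buffer.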