could be as simple as reshaping the input arrays before calling the kernel. NNlib would need similar changes to be consistent.