* Use output directory instead of /tmp directory for building
* Can run GEMM with a wide B matrix
* Moved buffer creation for buffer lists B/C inside the golden reference so that the major-order logic lives there
* Simplifying some logic in the operator
* Further simplification of the gemm operator
* Formatting
* Adjust comments for C tile streaming through shim DMA based on the separate_c_tiles parameter
* Remove using the separated C tile runtime streams in test
* Can offload the last linear layer, but TTFT goes way up. The forward pass for the last linear layer takes ~4.5 s, likely due to reading the output buffers and converting the output from np to torch, since the GEMM kernel itself should take ~200 ms with bfp16 emulation enabled
* Modified the torch_to_numpy/numpy_to_torch conversions to use zero-copy reinterprets, and removed unnecessary .to() calls which could result in extra, unnecessary passes over memory
* Make read_buffer for BOs zero-copy
* Fix functionality when copy is True with read_buffer()
* Run decode stage on CPU for the final linear layer, which fixes toks per sec, but the output tokens are still inconsistent with CPU-only inference
* Fix CPU final linear layer run with KV cache enabled and formatting
* Use map view for writing buffers like with reading buffers
* Make separate_c_tiles parameter based on partition_N value
* Formatting
* Use correct shapes for forward pass (padded N vs actual N)
* Formatting
* Clean up code and comments in new versions of read_buffer()/write_buffer() methods
* Fix comments in numpy/torch conversion utils
* fixes after merge
* format
* fixes
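The zero-copy torch/numpy conversions mentioned above can be sketched as follows. This is a minimal illustration, not the repository's actual utility code: `torch.from_numpy` and `Tensor.numpy` share memory with their input rather than copying, which avoids the extra passes over memory that the commit removed.

```python
import numpy as np
import torch


def numpy_to_torch(a: np.ndarray) -> torch.Tensor:
    # torch.from_numpy wraps the array's memory; no copy is made.
    return torch.from_numpy(a)


def torch_to_numpy(t: torch.Tensor) -> np.ndarray:
    # Tensor.numpy shares memory with the (detached, CPU) tensor.
    return t.detach().numpy()


a = np.zeros((2, 2), dtype=np.float32)
t = numpy_to_torch(a)
a[0, 0] = 1.0  # the tensor observes the write: memory is shared
```

Because the memory is shared, the ndarray must stay alive (and unmodified, if the tensor is in use) for as long as the tensor references it.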
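The zero-copy `read_buffer()` behavior, including the `copy=True` fix, might look roughly like this sketch. The function name and the idea of reading from a mapped buffer come from the commits above; the `memoryview`-based signature is an assumption for illustration.

```python
import numpy as np


def read_buffer(bo_map: memoryview, dtype, shape, copy: bool = False) -> np.ndarray:
    # Reinterpret the mapped buffer object as an ndarray without copying.
    arr = np.frombuffer(bo_map, dtype=dtype).reshape(shape)
    # With copy=True, detach from the underlying mapping so the caller
    # owns the data even after the BO is unmapped or reused.
    return arr.copy() if copy else arr
```

The default path returns a view, so later writes to the underlying buffer are visible through the array; `copy=True` trades one memcpy for ownership.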
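The padded-N vs actual-N distinction in the forward pass can be illustrated with a slicing helper. This is a hypothetical sketch: the kernel computes on an N padded up to a tile multiple, and the result is sliced back to the true width (a view, so no extra copy).

```python
import numpy as np


def unpad_output(c_padded: np.ndarray, actual_N: int) -> np.ndarray:
    # The GEMM kernel produces a (M, padded_N) result; return only the
    # (M, actual_N) region the model actually uses. Slicing is zero-copy.
    return c_padded[:, :actual_N]
```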
---------
Co-authored-by: andrej <[email protected]>