Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P]