I've implemented the OpenCL vload/vstore builtin functions in two parts:

1) Pure CL C implementation. No assembly.
2) Add assembly optimizations for 32-bit int/uint loads/stores of 4+ component vectors.

Note: The vstore implementation assumes that the hardware back end supports byte-addressable stores. This may not always be optimal.
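
For readers unfamiliar with these builtins, here is a rough sketch of what a component-wise implementation along the lines of part 1 might look like. It is written as plain C so it compiles outside an OpenCL toolchain. The types sketch_int4 and sketch_char4 and all sketch_* functions are hypothetical stand-ins, not the names or code used in the patch. The only behavior taken as given is the OpenCL rule that vloadn/vstoren access n elements starting at p + offset * n.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the OpenCL int4 and char4 vector types. */
typedef struct { int x, y, z, w; } sketch_int4;
typedef struct { signed char x, y, z, w; } sketch_char4;

/* vload4-style load: read 4 consecutive ints starting at p + offset * 4. */
static sketch_int4 sketch_vload4(size_t offset, const int *p)
{
    const int *base = p + offset * 4;
    sketch_int4 v = { base[0], base[1], base[2], base[3] };
    return v;
}

/* vstore4-style store: write 4 consecutive ints starting at p + offset * 4. */
static void sketch_vstore4(sketch_int4 v, size_t offset, int *p)
{
    int *base = p + offset * 4;
    base[0] = v.x;
    base[1] = v.y;
    base[2] = v.z;
    base[3] = v.w;
}

/* The same pattern for a char vector. Each assignment is a single-byte
 * store, which is where the byte-addressable-store assumption comes in. */
static void sketch_vstore4_char(sketch_char4 v, size_t offset, signed char *p)
{
    signed char *base = p + offset * 4;
    base[0] = v.x;
    base[1] = v.y;
    base[2] = v.z;
    base[3] = v.w;
}
```

In this sketch every component is accessed separately, which keeps the code portable but leaves it to the compiler to merge the accesses. The 32-bit int/uint case with 4 or more components is the one that part 2 targets with assembly instead.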