The reason is SIMD: vectorized assembly written by hand with delay slots, instruction latency, memory access times and so on taken into account. The GCC and Clang optimizers are both poor at this and cannot automatically transform C code into such instructions in anything but simple cases.
This is also one reason why specialized DSP processors with SIMD capabilities ship with their own dedicated proprietary compilers.
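To make the point concrete, here is a rough sketch of one case where the compiler usually does not vectorize for you. It assumes the GCC/Clang `vector_size` extension, and the names `dot_scalar` and `dot_simd` are made up for this illustration. At default floating-point settings (no `-ffast-math`), the compiler generally leaves the scalar loop alone, because vectorizing the sum would reorder the float additions; the second version does that reordering by hand.

```c
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension: four packed floats (16 bytes). */
typedef float v4f __attribute__((vector_size(16)));

/* Plain scalar dot product: one multiply and one add per iteration. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-vectorized version: keeps four partial sums in one vector. */
float dot_simd(const float *a, const float *b, size_t n)
{
    v4f acc = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        v4f va, vb;
        memcpy(&va, a + i, sizeof va);  /* load without alignment assumptions */
        memcpy(&vb, b + i, sizeof vb);
        acc += va * vb;                 /* element-wise multiply and add */
    }

    /* Combine the four partial sums, then handle the leftover elements. */
    float sum = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Note that the two functions add the products in a different order, so for general inputs they can differ in the last bits. That is exactly why the compiler will not make this change on its own, and it is still a long way from hand-scheduled assembly that accounts for latency and memory access patterns.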