Is there any way to write programs that bypass the processor's front end for most the processors in use? If not, it what sense is it true that "you can go further than assembler"?
This is often, though not always, the case with deeply embedded processors of the kind BearSSL seems to be designed for. There's generally some way to get predictable cycle-exact performance on them because it matters for some embedded applications.