They aren't just used for loads and stores; they are used when there are multiple 128-bit output vectors. See, for example,
It's a nice way of wrapping up operations that require multiple vectors without resorting to pointers, which are easy to mess up. This includes loads and stores that load/store multiple vectors in a single operation. For example, it would probably have been possible to implement the vld4_s8 function as something like:
void vld4_s8(int8x8_t* a, int8x8_t* b, int8x8_t* c, int8x8_t* d, int8_t* data);
If I saw that signature I would have to find some documentation to figure out WTF was going on with the first four parameters. Are they inputs or outputs? Are they pointers to arrays of vectors, or just a single vector? Do they all need to be set, or can/should I pass NULL if I don't need the value?
The current function, on the other hand, is hard to get wrong. There is one input, which is an array of 8-bit integers, and it returns a struct wrapping an array of four vectors. The only thing that I would have done differently would be to use a conformant array parameter for the input data so you know exactly how many elements it needs (32 for vld4_s8), but that won't work in C++ or MSVC (possibly until a few months ago when they added C99/C11 support) anyway.
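To make the comparison concrete, here is a minimal portable sketch of the two API shapes. Everything in it is hypothetical: vec8_t and vec8x4_t are plain-C stand-ins for int8x8_t and int8x8x4_t, and load4_ptr and load4 are made-up names, not anything from arm_neon.h. The loops only mimic the de-interleaving that vld4_s8 performs, and the `static 32` parameter is the C99 array-parameter form mentioned above, so this is C only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for int8x8_t and int8x8x4_t so this builds anywhere. */
typedef struct { int8_t lane[8]; } vec8_t;
typedef struct { vec8_t val[4]; } vec8x4_t;

/* Out-pointer style: the signature alone does not say which arguments are
   outputs, whether they are arrays, or whether NULL is allowed. */
void load4_ptr(vec8_t *a, vec8_t *b, vec8_t *c, vec8_t *d, const int8_t *data)
{
    assert(a && b && c && d && data);
    for (size_t i = 0; i < 8; i++) {
        a->lane[i] = data[4 * i + 0];
        b->lane[i] = data[4 * i + 1];
        c->lane[i] = data[4 * i + 2];
        d->lane[i] = data[4 * i + 3];
    }
}

/* Struct-return style: one input, one result. The C99 `static 32` tells the
   reader (and the compiler) that at least 32 elements are required. */
vec8x4_t load4(const int8_t data[static 32])
{
    vec8x4_t out;
    for (size_t j = 0; j < 4; j++)
        for (size_t i = 0; i < 8; i++)
            out.val[j].lane[i] = data[4 * i + j]; /* element j of each group of 4 */
    return out;
}
```

With the second form the call site is just `vec8x4_t v = load4(buf);` and there is no way to pass the outputs in the wrong order.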
"Would I expect any performance difference to vld1q_f64_x4-ing a float64x2x4_t and using the same functions on elements of the array?"
Possibly, but not from the multiply. The advantage comes from using a single instruction to load all four vectors; see https://godbolt.org/z/aY8a95 for example. In this case, you have one load and four multiplies instead of four loads and four multiplies.
In practice, I suspect that most good compilers would be able to optimize this to the same code (though I would want to check), but intrinsics typically map 1:1 to instructions, so you don't have to trust the compiler to be smart. In this case, there is an ld4 instruction, so there is a vld4_* family of functions, and these types are used in those APIs.
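As a rough sketch of the one-load-plus-four-multiplies pattern (not the code behind the godbolt link), something like the following should work on AArch64. The function name scale8 is made up for illustration, and the scalar branch exists only so the snippet also builds on non-ARM targets. I'm assuming a toolchain recent enough to provide the _x4 load/store intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: multiply 8 doubles by k. */
void scale8(double *dst, const double *src, double k)
{
    assert(dst && src);
#if defined(__aarch64__)
    /* One multi-register load fills all four 128-bit vectors. */
    float64x2x4_t v = vld1q_f64_x4(src);
    for (int i = 0; i < 4; i++)
        v.val[i] = vmulq_n_f64(v.val[i], k); /* four multiplies */
    /* One multi-register store writes them back. */
    vst1q_f64_x4(dst, v);
#else
    /* Portable fallback, only here so the example compiles elsewhere. */
    for (size_t i = 0; i < 8; i++)
        dst[i] = src[i] * k;
#endif
}
```

Whether this actually beats four separate vld1q_f64 calls depends on the compiler and the core, so it is worth comparing the generated assembly before relying on it.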