Ah yes, I see how this works. I wonder if there's some way we
can abstract out this sort of index calculation into a macro
or function so that we can comment what it's doing there and
then all the use-sites are more "obviously correct". Perhaps:
/*
* When considering the ZA storage as an array of elements of
* type T, the index within that array of the Nth element of
* a vertical slice of a tile can be calculated like this,
* regardless of the size of type T. This is because the tiles
* are interleaved, so if type T is size N bytes then row 1 of
* the tile is N rows away from row 0. The division by N to
* convert a byte offset into an array index and the multiplication
* by N to convert from vslice-index-within-the-tile to
* the index within the ZA storage cancel out.
*/
#define tile_vslice_index(i) ((i) * sizeof(ARMVectorReg))
/*
* When doing byte arithmetic on the ZA storage, the element
* byteoff bytes away in a tile vertical slice is always this
* many bytes away in the ZA storage, regardless of the
* size of the tile element, assuming that byteoff is a multiple
* of the element size. Again this is because of the interleaving
* of the tiles. For instance if we have 1 byte per element then
* each row of the ZA storage has one byte of the vslice data,
* and (counting from 0) byte 8 goes in row 8 of the storage
* at offset (8 * row-size-in-bytes).
* If we have 8 bytes per element then each row of the ZA storage
* has 8 bytes of the data, but there are 8 interleaved tiles and
* so byte 8 of the data goes into row 1 of the tile,
* which is again row 8 of the storage, so the offset is still
* (8 * row-size-in-bytes). Similarly for other element sizes.
*/
#define tile_vslice_offset(byteoff) ((byteoff) * sizeof(ARMVectorReg))
(or use functions if you like. Maybe we want versions that
take (row,col) arguments too.)