I had come to a similar conclusion from looking at (and reproducing) Nicholas' timing results. In my original vectorized version, the time spent computing the index ranges outweighs what's saved by vectorizing. I did some precomputing of the ranges, similar to yours, and managed to get the vectorized version no slower than the loop.
As dismaying as that is, I think I see why: at each i, only a few elements are copied, often none, so there's nothing to be gained by vectorizing the per-iteration work. This code fragment can only be sped up by eliminating the loop on i -- and I see no way to do that.
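For concreteness, here's a minimal sketch of the kind of pattern I mean (the function names and the indexing scheme are mine, not the original code): a loop that copies a small, data-dependent range at each i, and a "vectorized" alternative that has to build a flat index array up front -- the index construction costs roughly what the loop saves.

```python
import numpy as np

def copy_ranges_loop(src, starts, stops, out):
    # At each i, copy the (often empty) slice starts[i]:stops[i].
    # Per-iteration work is tiny, so the loop overhead dominates.
    pos = 0
    for i in range(len(starts)):
        n = stops[i] - starts[i]          # usually 0, 1, or a few
        if n > 0:
            out[pos:pos + n] = src[starts[i]:stops[i]]
            pos += n
    return out[:pos]

def copy_ranges_vectorized(src, starts, stops):
    # Build one flat index array covering every range, then do a single
    # fancy-indexing copy. Computing these indices is itself O(total copied)
    # plus per-range overhead, which is why it doesn't win here.
    lengths = stops - starts
    group_starts = np.cumsum(lengths) - lengths          # flat start of each range
    offsets = np.arange(lengths.sum()) - np.repeat(group_starts, lengths)
    idx = np.repeat(starts, lengths) + offsets
    return src[idx]
```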
My conclusion is that there's no alternative to implementing this in C++ (which I've already done).