
frombuffer byte order #306

Closed
ubIQio opened this issue Feb 6, 2021 · 18 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@ubIQio (Contributor) commented Feb 6, 2021

I have a buffer of data from an ADC, which I am trying to access as an ndarray.

The 1024-byte buffer is organised in 'frames' of 8 bytes; the bytes of interest in each frame are bytes 1 through 6, three pairs of bytes holding three 16-bit samples.

If I ignore the last byte of the buffer, I can get the buffer as an ndarray using

buffer = b'\x40\x00\xda\xff\xd6\xc1\x2e\x0d\x40\x00\xde\xff\xca\xc1\x3c\x0d\x40\x00\xe6\xff\xc4\xc1\x40\x0d\x40\x00\xe0\xff\xb8\xc1\x32'


a = np.frombuffer(buffer, dtype=np.uint16, offset=1)

but the result contains byte-swapped values, e.g.

a:	 array([55808, 55039, 11969, ..., 57344, 47359, 12993], dtype=uint16)

dtype = np.>uint16 causes a reboot; any ideas how this can be achieved?

@v923z (Owner) commented Feb 6, 2021

I have a buffer of data from an ADC, which I am trying to access as an ndarray.

The 1024-byte buffer is organised in 'frames' of 8 bytes; the bytes of interest in each frame are bytes 1 through 6, three pairs of bytes holding three 16-bit samples.

If I ignore the last byte of the buffer, I can get the buffer as an ndarray using

buffer = b'\x40\x00\xda\xff\xd6\xc1\x2e\x0d\x40\x00\xde\xff\xca\xc1\x3c\x0d\x40\x00\xe6\xff\xc4\xc1\x40\x0d\x40\x00\xe0\xff\xb8\xc1\x32'


a = np.frombuffer(buffer, dtype=np.uint16, offset=1)

but the result contains byte-swapped values, e.g.

a:	 array([55808, 55039, 11969, ..., 57344, 47359, 12993], dtype=uint16)

I am not sure I see the problem here. First, this is the numpy behaviour:

a = np.frombuffer(buffer, dtype=np.uint16, offset=1)

array([55808, 55039, 11969, 16397, 56832, 51967, 15553, 16397, 58880,
       50431, 16577, 16397, 57344, 47359, 12993], dtype=uint16)

Second, the frombuffer method is relatively primitive in the sense that it doesn't really do anything: it simply grabs the buffer's pointer and attaches a small header to it, so that later on the interpreter knows how the bytes are to be interpreted. It does not swap bytes or re-arrange them in any way. There are no provisions for that; ulab simply implements the numpy specification: https://numpy.org/doc/stable/reference/generated/numpy.frombuffer.html
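
To illustrate (a minimal sketch, assuming a mutable bytearray and the MCU's native little-endian byte order): because only the pointer is grabbed, changing the buffer afterwards shows up in the ndarray as well.

from ulab import numpy as np

buf = bytearray([1, 0, 2, 0])
a = np.frombuffer(buf, dtype=np.uint16)
print(a)      # array([1, 2], dtype=uint16)
buf[0] = 5    # mutate the underlying buffer ...
print(a)      # ... and the view reflects it: array([5, 2], dtype=uint16)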

dtype = np.>uint16 causes a reboot; any ideas how this can be achieved?

Do you mean you tried a = np.frombuffer(buffer, dtype=np.>uint16, offset=1)? Basically, you want to change the endianness of the array, right? https://numpy.org/doc/stable/user/basics.byteswapping.html

Would it help if we implemented <i2 or something similar as a dtype for this single method, and then swapped the bytes in the array? Caveat emptor: this would overwrite the array, so if you wanted to pump the numbers into a DAC or something similar, you would have to keep that in mind.

Finally, if you find out how this is done in numpy, let me know! The best solution is the one that conforms to numpy.
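
For reference, the two standard routes in CPython numpy look roughly like this (a sketch only, using the buffer from your snippet):

import numpy as np

buffer = b'\x40\x00\xda\xff\xd6\xc1\x2e\x0d'

# option 1: change the dtype to match the data; no bytes are touched,
# the result is a read-only, big-endian view of the buffer
a = np.frombuffer(buffer, dtype='>u2')

# option 2: keep the native dtype and swap the bytes into a new array
b = np.frombuffer(buffer, dtype=np.uint16).byteswap()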

v923z added the question label on Feb 6, 2021
@v923z (Owner) commented Feb 6, 2021

@ubIQio I think the cleanest solution is the byteswap array method: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.byteswap.html, so you could then do something like this:

buffer = b'\x40\x00\xda\xff\xd6\xc1\x2e\x0d\x40\x00\xde\xff\xca\xc1\x3c\x0d\x40\x00\xe6\xff\xc4\xc1\x40\x0d\x40\x00\xe0\xff\xb8\xc1\x32'
a = np.frombuffer(buffer, dtype=np.uint16, offset=1)
a.byteswap(inplace=True)

This would still break strict numpy-compatibility, because in numpy, a is a read-only array.

@ubIQio (Contributor, Author) commented Feb 6, 2021

Thanks for the quick response, and also for creating a landmark piece of open source software for the microcontroller world.

The Numpy docs on byteswapping give an overview of the problem.

I am in the case 'Data and dtype endianness don’t match, change dtype to match data'.

So in Numpy I would just use dtype='>H' to give the correct endianness, which does not change the array in memory.

This numpy code shows the problem and solution:

buffer = b'\x40\x00\xda\xff\xd6\xc1\x2e\x0d\x40\x00\xde\xff\xca\xc1\x3c\x0d\x40\x00\xe6\xff\xc4\xc1\x40\x0d\x40\x00\xe0\xff\xb8\xc1\x32\xc1\xff'

print('buffer: ', buffer)

a = np.frombuffer(buffer, dtype=np.uint16, offset=1)
print('\nCurrent:\t', a)

b = np.frombuffer(buffer, dtype='>H', offset=1)
print('\nDesired:\t', b)

I can then perform my FFT before passing the result over the wire

The byteswap method is not preferred, as it changes the data in memory and is therefore slower and more memory intensive; that is to be avoided because these frames arrive frequently, and, as you point out, it is not numpy compliant.

I can imagine a more sophisticated dtype is not an easy task to implement, but for reading streams from hardware, a key use case of the frombuffer method, it would be extremely useful.

However, supporting either method would be greatly appreciated.

@v923z (Owner) commented Feb 6, 2021

I am in the case 'Data and dtype endianness don’t match, change dtype to match data'.

So in Numpy I would just use dtype='<i2' to give the correct endianness, which does not change the array in memory.

I can then perform my FFT before passing the result over the wire

The byteswap method is not preferred, as it changes the data in memory and is therefore slower and more memory intensive,

If you mean that we have to operate on the RAM, then the statement is correct, but it doesn't imply that we have to use a lot of RAM: you can swap the two bytes in place. Also, while dtype='<i2' leaves the underlying array alone, the burden of the byte swapping is simply transferred to the readout function.
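
Just to illustrate that the swap itself needs no extra memory, here is the idea in pure Python (in ulab this would, of course, be done in C on the array's data):

def byteswap16_inplace(buf):
    # swap adjacent byte pairs in place; buf must be mutable, e.g. a bytearray
    for i in range(0, len(buf) - 1, 2):
        buf[i], buf[i + 1] = buf[i + 1], buf[i]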

I can imagine a more sophisticated dtype is not an easy task to implement, but for reading streams from hardware, a key use case of the frombuffer method, it would be extremely useful.

I agree, and I also see that this is going to be a recurring issue. The only question is how we should pull this off. Here are the implementation difficulties:

  1. I already pointed out that you cannot avoid byte swapping; you can only choose where you do it. (This is also the gist of the numpy docs.)
  2. Moving the byte swapping to the readout phase would mean that all functions that operate on arrays would have to have this option, which would certainly influence both the firmware size and the execution speed. Also, if you operate on the same array again and again, you would have to swap the bytes each time, because the data pointer is in the original format.
  3. This latter problem can be eliminated if we swap the bytes when we create the array from the buffer, but that breaks numpy-compatibility. However, I would mention that frombuffer() already breaks this compatibility, because numpy returns a new array (the buffer is also copied), while ulab grabs the pointer. We chose this implementation because it saves RAM and the cost of initialising it. I think this is an important point, because MCUs are low on RAM; with micropython the amount is even more limited, and fragmentation might become an issue.
  4. ndarray(..., buffer=buffer) also just grabs the pointer, so it might make sense to remove frombuffer, and move the implementation to ndarray().

Let me know your thoughts!
@jepler Jeff, I would like you to weigh in.
@CallumJHays Cal, does point 4 significantly affect the ndarray() implementation that you are working on? #305

@jepler (Collaborator) commented Feb 7, 2021

Ignoring the problem of possibly moving an element past the end of the buffer, I think "offset=1" is wrong but looks plausible much of the time.

Plausible:

>>> buf = struct.pack("<4h", 1020, 1021, 1020, 1021)
>>> struct.unpack(">3h", buf[1:-1])
(1021, 1020, 1021)

clearly incorrect:

>>> buf = struct.pack("<4h", 1020, 1030, 1020, 1030)
>>> struct.unpack(">3h", buf[1:-1])
(774, 1276, 774)

If the underlying buffer is immutable according to Python (e.g., item assignment would raise an exception), then I doubt ulab should go modifying it.

Adding 4 more data types seems like it's not good for firmware size.

I think either ulab needs to take a copy before modifying the byte order, or else the original process producing the buffer should be modified so that it can provide a mutable buffer as an output (or swap the bytes itself!)
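
A sketch of that second option (device address, register, and chunk size are placeholders): the capture side fills a mutable bytearray, which can later be byte-swapped in place:

from machine import I2C

i2c = I2C(0)                   # pin configuration depends on the board
buf = bytearray(8193)          # mutable capture buffer: 8192 data bytes + 1 pad byte
mv = memoryview(buf)
CHUNK = 32                     # hypothetical FIFO size
for i in range(0, 8192, CHUNK):
    i2c.readfrom_mem_into(0x1d, 0x00, mv[i:i + CHUNK])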

@ubIQio (Contributor, Author) commented Feb 7, 2021

Hi Zoltán,

  1. I already pointed out that you cannot avoid byte swapping; you can only choose where you do it. (This is also the gist of the numpy docs.)

Indeed; I'm far from being a compiler expert, but since at least the Tensilica Xtensa and Cortex-M0 ISAs have byte-swap instructions, and the ndarray may be in PSRAM, I would think that in the general case it would be more efficient to perform the swap in registers rather than in main memory, i.e. on read; however, as you say

Moving the byte swapping to the readout phase would mean that all functions that operate on arrays would have to have this option, which would certainly influence both the firmware size and the execution speed.

Presumably this would affect execution speed only in the degenerate case where byte swapping is required, since a dtype lookup of some sort is always performed?

Also, if you operate on the same array again and again, you would have to swap the bytes each time, because the data pointer is in the original format.

Agreed; in this use case the byteswap is likely the best way to go. You would know much better than I how frequently this is done.

So in the end, I guess both options (as provided by numpy) will be required; and so a project grows...

I'd be happy with an extra byteswap keyword argument just for FFT, but that is certainly not numpy compliant...

Maybe a 'swap dtype' could be allowed as a type; since the dtype is known to functions, functions could then gradually be enabled to support it, starting with FFT of course :)

@ubIQio (Contributor, Author) commented Feb 7, 2021

@jepler yes, the offset is a bit of a hack to align the bytes of interest from the hardware, not part of the byte-swap issue. (I pad the capture buffer by 1 byte at the end to allow for this, so an 8192-byte buffer of 1024*8 bytes is passed to frombuffer as an 8193-byte buffer.)

i.e.

 
# create fake sample buffer

n = 1024*8  # must be a multiple of 8

buf = np.array(range(n+1), dtype=np.uint8)  # create buffer with pad byte

a = np.frombuffer(buf, dtype='>H', offset=1).reshape(n>>3, 4)[:, 0:3]

results in

buf: [  0   1   2 ... 254 255   0]

a:[[  258   772  1286]
 [ 2314  2828  3342]
 [ 4370  4884  5398]
 ...
 [59882 60396 60910]
 [61938 62452 62966]
 [63994 64508 65022]]

@v923z (Owner) commented Feb 7, 2021

Hi Zoltán,

  1. I already pointed out that you cannot avoid byte swapping; you can only choose where you do it. (This is also the gist of the numpy docs.)

Indeed; I'm far from being a compiler expert, but since at least the Tensilica Xtensa and Cortex-M0 ISAs have byte-swap instructions, and the ndarray may be in PSRAM, I would think that in the general case it would be more efficient to perform the swap in registers rather than in main memory,

The problem is not necessarily where the swapping happens, but that you still have to treat both cases. As Jeff pointed out above, what you are, in effect, asking for is 3 extra data types, and that is expensive. To put this in perspective: the firmware size for many of the functions and for all binary operators scales with the square of the number of data types. Currently we have 5, which means 25 possible combinations. In certain cases you can take shortcuts, so e.g. the associative operators can be condensed into 13 cases, but the others can't. Now, if you add 3 more data types, you end up with 64 combinations. That is really too many. And on top of that, you could then ask why the complex dtype is not added (it would be useful for FFT). Now we are at 100. (Complex values also have endianness.)

Presumably this would affect execution speed only in the degenerate case where byte swapping is required, since a dtype lookup of some sort is always performed?

That is right, but you still have to implement and compile the code for the degenerate case.

Also, if you operate on the same array again and again, you would have to swap the bytes each time, because the data pointer is in the original format.

Agreed; in this use case the byteswap is likely the best way to go. You would know much better than I how frequently this is done.

So in the end, I guess both options (as provided by numpy) will be required; and so a project grows...

The main reason for developing ulab is that numpy doesn't fit on an MCU. So, when re-implementing numpy features, we have to tread very carefully and think hard about how we do that. We should not underestimate the fact that you can reduce ulab's compiled size to around 5 kB and still perform meaningful operations with it, while numpy is 30 MB, plus all the external stuff that it depends on.

RAM is another concern. I can assure you that, when implementing a new function, I evaluate many possible implementations in terms of speed, RAM, and firmware size. It would be really foolish to throw everything into the wind in a haphazard manner. (As a side note: for me, this is what makes the whole project so exciting. The various constraints, and how you can negotiate your way around them.)

I'd be happy with an extra byteswap keyword argument just for FFT, but that is certainly not numpy compliant...

No, it is not.

Maybe a 'swap dtype' could be allowed as a type; since the dtype is known to functions, functions could then gradually be enabled to support it, starting with FFT of course :)

Partial implementations are a nightmare. Just think about the documentation effort, and the number of cases where it would trip people up. I wouldn't like to go down that path.

@v923z (Owner) commented Feb 7, 2021

@jepler

If the underlying buffer is immutable according to Python (e.g., item assignment would raise an exception), then I doubt ulab should go modifying it.

I think this is a numpy feature, but I am not sure.

Adding 4 more data types seems like it's not good for firmware size.

It is definitely a no-go.

I think either ulab needs to take a copy before modifying the byte order, or else the original process producing the buffer should be modified so that it can provide a mutable buffer as an output (or swap the bytes itself!)

What we could do is add the reversed byte order only in frombuffer(...) as an option, and return a copy of the buffer in the native format. On the other hand, in ndarray(...), we could add the buffer keyword, and that would return only a view. In this way, we would implement everything in a numpy-compatible way, and one could still use the fast (grab only the pointer) option, if byte swapping is not required.

@ubIQio I think this is the solution. Let's keep the issue open. I will set out to work out the implementation details. It shouldn't take too long. I hope we can wrap this up in a week, or less. I really appreciate that you brought this up.

@CallumJHays Cal, it seems that we should add keyword arguments to the ndarray constructor anyway, so we could definitely implement your suggestion.

v923z added the enhancement label on Feb 7, 2021
@ubIQio (Contributor, Author) commented Feb 7, 2021

Sounds good, it will be interesting to see what the time penalty will be for the byte swapped buffer copy.

It is a shame that many ADCs, and worst of all I2S, use big-endian data: the data arrives fast (for a microcontroller), is in the wrong format, and has a real-time requirement.

Of course there are those who would say, if real time is a concern, why are you using Python anyway?
But after decades of programming micros in C, I just find MicroPython with ulab and asyncio so powerful it's not funny.

Many thanks again for your quick response.

@v923z (Owner) commented Feb 7, 2021

Sounds good, it will be interesting to see what the time penalty will be for the byte swapped buffer copy.

It is next to impossible to predict. Results from https://github.com/thiagofe/ulab_samples scatter a great deal, even if you normalise by the clock rate.

@v923z (Owner) commented Feb 11, 2021

@ubIQio I think I made a mistake here: both frombuffer and ndarray produce a view of the buffer, without the option of copying its content. What we could do is add byteswap without keyword arguments (assuming inplace=False). I believe this should fulfil your requirements.

@ubIQio (Contributor, Author) commented Feb 11, 2021

That would be awesome!

@v923z (Owner) commented Feb 12, 2021

@ubIQio Could you please check out #317 and see if it does what it is supposed to, and whether the performance is acceptable? You can now do something like

from ulab import numpy as np

a = np.array([1, 2, 3, 4, 5, 6], dtype=np.uint16)
b = a.byteswap(inplace=False)
print(a)
print(b)
c = a.byteswap(inplace=True)
print(a)
print(c)

I haven't yet had time to write up the docs.

Do you think you could whip up a small test suite? If you do, then open a pull request against the byteswap branch! Thanks!
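
A minimal test could be as simple as printing known values and their swapped counterparts (the values here are only illustrative):

from ulab import numpy as np

a = np.array([1, 2, 3, 4], dtype=np.uint16)
print(a.byteswap(inplace=False))   # expected: array([256, 512, 768, 1024], dtype=uint16)
print(a)                           # original should be unchanged
a.byteswap(inplace=True)
print(a)                           # now swapped in place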

@ubIQio (Contributor, Author) commented Feb 13, 2021

Truly awesome.

FYI, on my platform (ESP32 w. SPIRAM) a raw 8193-byte frombuffer takes 1.1 ms, and with in-place byteswap it's only 1.9 ms.

(I was surprised that the in-place version ran a lot faster than the copy, which takes about 4.5 ms.)

You may be interested to know that the entire graph

s=spy.signal.spectrogram(np.linalg.norm(np.frombuffer(buf, dtype=np.uint16, offset=1).byteswap(inplace=True).reshape((1024,4))[:,0:3], axis=1)).tobytes()

takes 9ms, which is absolutely amazing.
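
Spelled out step by step (spy being ulab's scipy module, buf the 8193-byte capture buffer), the one-liner does roughly this:

a = np.frombuffer(buf, dtype=np.uint16, offset=1)  # view of the raw bytes, skipping the pad byte
a = a.byteswap(inplace=True)                       # fix the big-endian samples in place
frames = a.reshape((1024, 4))                      # 1024 frames of 4 uint16 words each
xyz = frames[:, 0:3]                               # keep the three axis samples per frame
mag = np.linalg.norm(xyz, axis=1)                  # per-frame magnitude
s = spy.signal.spectrogram(mag).tobytes()          # spectrogram, serialised for the wire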

Thanks v.much Zoltán, I'll go generate a test suite for that.

@v923z (Owner) commented Feb 14, 2021

Truly awesome.

FYI, on my platform (ESP32 w. SPIRAM) a raw 8193-byte frombuffer takes 1.1 ms, and with in-place byteswap it's only 1.9 ms.

(I was surprised that the in-place version ran a lot faster than the copy, which takes about 4.5 ms.)

Well, that is the overhead of allocating that much extra memory. However, frombuffer shouldn't have taken that long. I found out that this was the result of an implementation glitch. I have fixed that and merged everything into master. I believe that if you check out the latest version, you should be able to save approx. 1 ms.

You may be interested to know that the entire graph

s=spy.signal.spectrogram(np.linalg.norm(np.frombuffer(buf, dtype=np.uint16, offset=1).byteswap(inplace=True).reshape((1024,4))[:,0:3], axis=1)).tobytes()

takes 9ms, which is absolutely amazing.

I am glad it worked out for you, and especially that I could prove you wrong 😉 You were concerned that we might consume too much time shifting the bytes to and fro.

Since frombuffer now works on the buffer itself, if you can't swap the bytes in place, you have to change the keyword's value to False. That would be the case if the other end of your buffer (it was an ADC, right?) makes the buffer immutable.
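
In that case the call would become something like this (same buf as in your snippet):

a = np.frombuffer(buf, dtype=np.uint16, offset=1)
b = a.byteswap(inplace=False)   # returns a new, byte-swapped array; buf is left untouched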

I'll go generate a test suite for that.

Thanks! I'll delete the byteswap branch, you can PR against master.

@v923z (Owner) commented Feb 15, 2021

#325 completes the implementation.

@ubIQio (Contributor, Author) commented Feb 15, 2021

I am glad it worked out for you, and especially that I could prove you wrong 😉

I like to think that it was my hunch that was wrong, not me 😉

The buffer is mutable in this case; it is created once and repeatedly filled in FIFO-sized chunks by I2C.readfrom_mem_into using a memoryview, until a complete frame is gathered (it's actually an accelerometer, used as a vibration sensor).

On the speed front, the 'implementation glitch' fix has made a dramatic improvement on my platform: the raw frombuffer is now ~120 µs, the in-place byteswap ~750 µs, and my graph is now down to ~6.8 ms.

Again, a big thank you, Zoltán.
