ESP32 I2S 3-step Cadence Faster encoding #855
Replies: 6 comments 10 replies
-
Make note of the endianness of variables. I believe I ran into this: one of the ESP32 variants (the C3?) uses a different core architecture (RISC-V rather than Xtensa), and the endianness layout was different. Part of the reason for the move to working only in bytes was to reduce complexity and fix some bugs around this. So you need to test all the platforms to make sure nothing breaks.
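A quick runtime check is one way to make that cross-platform testing concrete. This is just a generic sketch (both helper names are mine, not library code), using two independent methods so a platform port can sanity-check itself:

```cpp
#include <cstdint>
#include <cstring>

// Report whether this core stores the least significant byte first.
bool isLittleEndian() {
    uint32_t probe = 0x01020304;
    // On a little-endian core the lowest-addressed byte is the LSB.
    return *reinterpret_cast<uint8_t*>(&probe) == 0x04;
}

// Same check via memcpy, avoiding the pointer cast.
bool isLittleEndianMemcpy() {
    uint32_t probe = 0x01020304;
    uint8_t first;
    std::memcpy(&first, &probe, 1);
    return first == 0x04;
}
```

Both ESP32 (Xtensa and RISC-V variants) and AVR are little-endian, so such a check mostly guards against a future port rather than any board shipping today.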
-
I have created a pull request, as part of the DMX512 encoding work, for the faster 3-step encoding, and I have added support for big-endian byte order to it as well. I have not found any boards that use it, and on the Arduino forum no one knew of one either, but if the macro is defined it should work just fine. I did not go as far as providing for the (new to me) PDP-endian order. Anyway, it's there; have a look if you have time.
-
Hi Mr. Miller, hope you are doing well. As a continuation of this discussion, I have written a faster encoder for NeoEspI2sMuxBusSize8Bit3Step. It uses a 48 (3 x 16) x 32-bit lookup table that converts a nibble of pixel data into 3 x 32-bit words and ORs those together. Here is the code for the lookup table generator, and here is the code for the encoding and its speed test. Since the x8 and x16 methods tend to use quite a lot of memory anyway, the 3-step is a particularly big saver, but as a result the encoding time can also be quite significant, and therefore so is the time saved. Anyway, have a look. I will first do the x16 as well, now that I have figured out how the encoding works, and then later do a fork of a branch, or however that is supposed to work.
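Since the linked code isn't included in this export, here is my own reconstruction of the nibble-to-3-words table idea. The names, the 1 -> 110 / 0 -> 100 step waveform, and the assumption that the earliest time slot lands in the lowest byte of each word (little-endian in-memory order) are all mine; the real encoder may differ:

```cpp
#include <cstdint>

// 16 nibble values x 3 words = 48 x 32-bit entries, built for channel 0
// of an 8-channel mux bus. Each output time slot is one byte on the bus,
// with channel c occupying bit c; four slots are packed per 32-bit word,
// earliest slot in the lowest byte.
uint32_t table3Step[16][3];

void buildTable3Step() {
    for (uint32_t nibble = 0; nibble < 16; ++nibble) {
        uint32_t words[3] = {0, 0, 0};
        int slot = 0;
        for (int bit = 3; bit >= 0; --bit) {       // MSB of the nibble first
            uint32_t b = (nibble >> bit) & 1;
            uint32_t steps[3] = {1, b, 0};         // 3-step waveform: 1,b,0
            for (int s = 0; s < 3; ++s, ++slot) {
                words[slot / 4] |= steps[s] << ((slot % 4) * 8);
            }
        }
        for (int w = 0; w < 3; ++w) table3Step[nibble][w] = words[w];
    }
}

// Encode one source byte for one channel by OR-ing shifted table words.
// Shifting a whole word left by `channel` moves each slot's bit from
// bit 0 to bit `channel` of its byte (valid for channels 0..7).
void encodeByte3Step(uint8_t value, uint8_t channel, uint32_t out[6]) {
    const uint32_t* hi = table3Step[value >> 4];
    const uint32_t* lo = table3Step[value & 0x0F];
    for (int w = 0; w < 3; ++w) {
        out[w]     |= hi[w] << channel;
        out[w + 3] |= lo[w] << channel;
    }
}
```

The appeal of the single shift-and-OR per word is that it replaces eight per-bit shift/mask operations per source byte with six table reads.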
-
So I got around to doing the x16, reducing 16 x 4 universes from something like 12ms to less than half that. Then I spent a whole day and part of a night thinking about how to solve the ESP8266 puzzle. I think it is actually the same puzzle as with the ESP32-S2 (but I may be wrong; I am getting one in the mail one of these days). The only thing I can come up with to speed it up even more would be to use more lookup tables. I considered a 3-5 bit or a 5-3 bit split, but in my head I couldn't work out which would be more advantageous. 5 bits would be a 32-entry table of 16-bit values, which could be OK(ish), and I guess both would then be better. I even considered shifting the final 3 bits to the left to combine them with the first 3 and making another table for that, but that would be a 64-entry table, and then I reverted (in my head) to the final result.

Now that trick of performing the same masking and shifting on 4 bytes at a time is so good that I think I should also incorporate it into the x16, where the multiplication (* 3) can already be done quite easily, but it would be worth doing the masking and shifting as well. It just wasn't as obvious to me there, since only 1 source byte at a time is being processed, whereas for the ESP8266 4 bytes go into 3 x 32-bit words. But with the number of times that bit of code is executed, it will be worth it.

This is what I have for the ESP32 x16 method at the moment, resulting in 11407us vs 4705us. That is for encoding 10880 pixels, though. I have a 32x80 pixel matrix in my workspace for testing, but with that many I could control something 4x as big!
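The 4-bytes-at-a-time trick can be illustrated generically like this (a SWAR-style sketch of the idea, not the actual code from the fork):

```cpp
#include <cstdint>

// Instead of extracting the high and low nibble of each source byte one
// byte at a time, load 4 bytes into one 32-bit word and split all four
// at once with a single shift and mask.
void splitNibbles4(uint32_t fourBytes, uint32_t* highs, uint32_t* lows) {
    *highs = (fourBytes >> 4) & 0x0F0F0F0Fu;  // high nibble of each byte
    *lows  = fourBytes & 0x0F0F0F0Fu;         // low nibble of each byte
}

// The *3 step/index offset can likewise be formed with a shift and add,
// avoiding a multiply on cores where that matters.
uint32_t times3(uint32_t i) { return (i << 1) + i; }
```

One shift and two masks per 4 bytes instead of two shifts and two masks per byte is where the saving comes from.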
-
So I did some of that git stuff and implemented the 3-step speed optimizations I have into this fork. I was also looking at the ESP8266 UART method and thinking that at 7N1 a 3-step could easily be created, although that would require 3 source bytes to be processed at once. I guess that would result in less CPU use and fewer interrupts, though. Anyway, have a look.
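For what it's worth, here is one way the 7N1 idea could work, assuming the UART output is inverted (so the start bit supplies the leading high of the first 3-step pulse and the stop bit the trailing low of the last). This is purely my own sketch, not anything from the fork; the bit numbering and helper are hypothetical:

```cpp
#include <cstdint>

// One 7N1 frame is 9 bits on the wire: start + d0..d6 (LSB first) + stop.
// With an inverted line the wire reads: 1, ~d0..~d6, 0, which is exactly
// enough for three 3-step encoded source bits: 1,b0,0, 1,b1,0, 1,b2,0.
// Solving per position: d0=~b0, d1=1, d2=0, d3=~b1, d4=1, d5=0, d6=~b2.
uint8_t encode3Bits7N1(uint8_t b0, uint8_t b1, uint8_t b2) {
    uint8_t frame = (1u << 1) | (1u << 4);  // fixed d1=1, d4=1
    frame |= (b0 ? 0u : 1u) << 0;           // data bits are wire-inverted
    frame |= (b1 ? 0u : 1u) << 3;
    frame |= (b2 ? 0u : 1u) << 6;
    return frame;
}
```

An 8-entry lookup table indexed by the 3 source bits would replace the three conditionals in practice; the point is just that 3 source bits fit one frame, so 3 source bytes map cleanly onto 8 frames.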
-
Oh hey, hold on. I just realized that I have been testing on a buffer of 680 bytes, not pixels. Well, that makes the speed differences even more relevant. But it also means that some of the numbers in the comments I wrote should be corrected. The order of the difference is still the same, though.
-
With the new 3-step encoding for the ESP32 I2S, the trade-off is less memory use vs encoding speed. As I was busy with the DMX512 method (which, as a by-product, was much easier for me to add), it occurred to me that the 3-step encoding could be made significantly quicker, so I decided to have a go. My first idea was to use a double (32-bit) lookup table, similar to the 4-step, but double because the 12-bit result would eventually have to end up as a multiple of 8 bits. The way to speed things up over the current method was to remove as many bit-shifts as possible. Bit-shifts can be relatively slow: on cores without a barrel shifter, shifting a variable by 8 bits takes 8 times as long as shifting it by 1 bit (unlike, for instance, additions, where x + 8 takes as much time as x + 17).
I was even looking for a way to directly read the high nibble from the source byte. If I remember correctly, the Z80 had a specific instruction to do just that, but that probably doesn't exist on a modern ESP anymore. So anyway, I came up with this.
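Since the linked code isn't shown in this export, here is a minimal reconstruction of the double-lookup idea. Names and the exact bit order are my own assumptions, and I use explicit shifts for the 3 output bytes where the original juggled memcpy() pointers:

```cpp
#include <cstdint>

// One source byte expands to 24 output bits (8 bits x 3 steps), so the
// high-nibble table is pre-shifted left by 12 bits and a single OR
// combines both halves. 2 x 16 x 32-bit entries = 128 bytes of tables.
uint32_t hiTable[16], loTable[16];

// 12-bit 3-step pattern for one nibble, MSB of the nibble first:
// source bit 1 -> 110, source bit 0 -> 100.
static uint32_t encode12(uint32_t nibble) {
    uint32_t out = 0;
    for (int bit = 3; bit >= 0; --bit) {
        uint32_t b = (nibble >> bit) & 1;
        out = (out << 3) | 0b100 | (b << 1);
    }
    return out;
}

void buildTables() {
    for (uint32_t n = 0; n < 16; ++n) {
        hiTable[n] = encode12(n) << 12;
        loTable[n] = encode12(n);
    }
}

// Emit the 3 output bytes, earliest-on-the-wire first. Explicit shifts
// sidestep the endianness fiddling a raw memcpy() would need.
void encodeByte(uint8_t value, uint8_t* out) {
    uint32_t bits = hiTable[value >> 4] | loTable[value & 0x0F];
    out[0] = (uint8_t)(bits >> 16);
    out[1] = (uint8_t)(bits >> 8);
    out[2] = (uint8_t)(bits);
}
```

Per source byte this is two table reads, one OR, and three byte stores, with no per-bit shifting at all.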
First I tested the resulting bit patterns on my UNO and compared them to what the current encoding produces, and after some fiddling with the memcpy() pointers I got them to match. A quick speed comparison showed great promise.
Then another thought occurred to me: get rid of the memcpy() entirely, assign directly into 16-bit variables, and use 6 x 16-bit lookup tables.
Again a bit of fiddling to get it right, but it appeared to be marginally slower than the first attempt.
So I migrated to the ESP32 (which, unlike the UNO, is not always on my desk) and performed a speed test using this sketch.
And the results (using a 160MHz clock rate):
Conclusion: the first attempt at 3-step encoding is marginally quicker than the second attempt and more than 5x as fast as the current method. With small pixel buffers it is less than twice as slow as the 4-step, and with large buffers it is almost as fast as the 4-step. The temporary memory demand is a bit more than the 4-step's (2 x 16 x 32-bit lookup tables vs 16 x 16-bit lookup tables = 96 bytes more in lookup tables).
The quickest would of course be 256-entry lookup tables, but that seems excessive, wasting a whole KB on them.
Anyway, I thought I'd share it. I'll get the whole cloning and branching thing sorted soon.