ESP32 I2S 3-step Cadence Faster encoding #855
Replies: 6 comments 10 replies
-
Make note of the endianness of variables. I believe I ran into this: one of the ESP32 variants (the C3?) uses a different core architecture (RISC-V rather than Xtensa), and the endianness layout was different. Part of the reason for the move to working only in bytes was to reduce complexity and fix some bugs around this. So you need to test all the platforms to make sure nothing breaks.
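A quick runtime check is one way to make that cross-platform testing concrete. This is just a generic sketch (both helper names are mine, not library code), using two independent methods so a platform port can sanity-check itself:

```cpp
#include <cstdint>
#include <cstring>

// Report whether this core stores the least significant byte first.
bool isLittleEndian() {
    uint32_t probe = 0x01020304;
    // On a little-endian core the lowest-addressed byte is the LSB.
    return *reinterpret_cast<uint8_t*>(&probe) == 0x04;
}

// Same check via memcpy, avoiding the pointer cast.
bool isLittleEndianMemcpy() {
    uint32_t probe = 0x01020304;
    uint8_t first;
    std::memcpy(&first, &probe, 1);
    return first == 0x04;
}
```

Both ESP32 (Xtensa and RISC-V variants) and AVR are little-endian, so such a check mostly guards against a future port rather than any board shipping today.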
-
I have created a pull request, as part of the DMX512 encoding work, for the faster 3-step encoding, and I have added support for big-endian byte order to it as well. I have not found any boards that use it, and on the Arduino forum no one knew of one either, but if the macro is defined it should work just fine. I did not go as far as providing for the (new to me) PDP-endian order. Anyway, it's there; have a look if you have time.
-
Hi Mr. Miller, hope you are doing well. As a continuation of this discussion, I have written a faster encoder for NeoEspI2sMuxBusSize8Bit3Step. It uses a 48 (3 x 16) x 32-bit lookup table that converts a nibble of pixel data into 3 x 32-bit words and ORs those together. Here is the code for the lookup table generator, and here is the code for the encoding and its speed test. Since the x8 and x16 methods tend to use quite a lot of memory anyway, the 3-step is a particularly big saver, but as a result the encoding time can also be quite significant, and therefore so is the time saved. Anyway, have a look. I will first do the x16 as well, now that I have figured out how the encoding works, and then later do a fork of a branch, or however that is supposed to work.
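Since the linked code isn't included in this export, here is my own reconstruction of the nibble-to-3-words table idea. The names, the 1 -> 110 / 0 -> 100 step waveform, and the assumption that the earliest time slot lands in the lowest byte of each word (little-endian in-memory order) are all mine; the real encoder may differ:

```cpp
#include <cstdint>

// 16 nibble values x 3 words = 48 x 32-bit entries, built for channel 0
// of an 8-channel mux bus. Each output time slot is one byte on the bus,
// with channel c occupying bit c; four slots are packed per 32-bit word,
// earliest slot in the lowest byte.
uint32_t table3Step[16][3];

void buildTable3Step() {
    for (uint32_t nibble = 0; nibble < 16; ++nibble) {
        uint32_t words[3] = {0, 0, 0};
        int slot = 0;
        for (int bit = 3; bit >= 0; --bit) {       // MSB of the nibble first
            uint32_t b = (nibble >> bit) & 1;
            uint32_t steps[3] = {1, b, 0};         // 3-step waveform: 1,b,0
            for (int s = 0; s < 3; ++s, ++slot) {
                words[slot / 4] |= steps[s] << ((slot % 4) * 8);
            }
        }
        for (int w = 0; w < 3; ++w) table3Step[nibble][w] = words[w];
    }
}

// Encode one source byte for one channel by OR-ing shifted table words.
// Shifting a whole word left by `channel` moves each slot's bit from
// bit 0 to bit `channel` of its byte (valid for channels 0..7).
void encodeByte3Step(uint8_t value, uint8_t channel, uint32_t out[6]) {
    const uint32_t* hi = table3Step[value >> 4];
    const uint32_t* lo = table3Step[value & 0x0F];
    for (int w = 0; w < 3; ++w) {
        out[w]     |= hi[w] << channel;
        out[w + 3] |= lo[w] << channel;
    }
}
```

The appeal of the single shift-and-OR per word is that it replaces eight per-bit shift/mask operations per source byte with six table reads.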
-
So I got around to doing the x16, reducing 16 x 4 universes from something like 12ms to less than half that. Then I spent a whole day and part of a night thinking about how to solve the ESP8266 puzzle. I think it is actually the same puzzle as with the ESP32-S2 (but I may be wrong; I am getting one in the mail one of these days). The only thing I can come up with to speed it up even more would be to use more lookup tables. I considered a 3-5 bit or a 5-3 bit split, but in my head I couldn't work out which would be more advantageous. 5 bits would be a 32-entry table of 16-bit values, which could be OK(ish), and I guess both would then be better. I even considered shifting the final 3 bits to the left to combine them with the first 3 and making another table for that, but that would be a 64-entry table, and then I reverted (in my head) to the final result.

Now that trick of performing the same masking and shifting on 4 bytes at a time is so good that I think I should also incorporate it into the x16, where the multiplication (* 3) can already be done quite easily, but it would be worth doing the masking and shifting as well. It just wasn't as obvious to me there, since only 1 source byte at a time is being processed, whereas for the ESP8266 4 bytes go into 3 x 32-bit words. But with the number of times that bit of code is executed, it will be worth it.

This is what I have for the ESP32 x16 method at the moment, resulting in 11407us vs 4705us. That is for encoding 10880 pixels, though. I have a 32x80 pixel matrix in my workspace for testing, but with that many I could control something 4x as big!
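The 4-bytes-at-a-time trick can be illustrated generically like this (a SWAR-style sketch of the idea, not the actual code from the fork):

```cpp
#include <cstdint>

// Instead of extracting the high and low nibble of each source byte one
// byte at a time, load 4 bytes into one 32-bit word and split all four
// at once with a single shift and mask.
void splitNibbles4(uint32_t fourBytes, uint32_t* highs, uint32_t* lows) {
    *highs = (fourBytes >> 4) & 0x0F0F0F0Fu;  // high nibble of each byte
    *lows  = fourBytes & 0x0F0F0F0Fu;         // low nibble of each byte
}

// The *3 step/index offset can likewise be formed with a shift and add,
// avoiding a multiply on cores where that matters.
uint32_t times3(uint32_t i) { return (i << 1) + i; }
```

One shift and two masks per 4 bytes instead of two shifts and two masks per byte is where the saving comes from.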
-
So I did some of that git stuff and implemented the 3-step speed optimizations I have into this fork. I was also looking at the ESP8266 UART method and thinking that at 7N1 a 3-step could easily be created, although that would require 3 source bytes to be processed at once. I guess that would result in less CPU use and fewer interrupts, though. Anyway, have a look.
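For what it's worth, here is one way the 7N1 idea could work, assuming the UART output is inverted (so the start bit supplies the leading high of the first 3-step pulse and the stop bit the trailing low of the last). This is purely my own sketch, not anything from the fork; the bit numbering and helper are hypothetical:

```cpp
#include <cstdint>

// One 7N1 frame is 9 bits on the wire: start + d0..d6 (LSB first) + stop.
// With an inverted line the wire reads: 1, ~d0..~d6, 0, which is exactly
// enough for three 3-step encoded source bits: 1,b0,0, 1,b1,0, 1,b2,0.
// Solving per position: d0=~b0, d1=1, d2=0, d3=~b1, d4=1, d5=0, d6=~b2.
uint8_t encode3Bits7N1(uint8_t b0, uint8_t b1, uint8_t b2) {
    uint8_t frame = (1u << 1) | (1u << 4);  // fixed d1=1, d4=1
    frame |= (b0 ? 0u : 1u) << 0;           // data bits are wire-inverted
    frame |= (b1 ? 0u : 1u) << 3;
    frame |= (b2 ? 0u : 1u) << 6;
    return frame;
}
```

An 8-entry lookup table indexed by the 3 source bits would replace the three conditionals in practice; the point is just that 3 source bits fit one frame, so 3 source bytes map cleanly onto 8 frames.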
-
Oh hey, hold on. I just realized that I have been testing on a buffer of 680 bytes, not pixels. Well, that makes the speed differences even more relevant. But it also means that some of the numbers in the comments I wrote should be corrected. The order of the difference is still the same, though.
-
With the new 3-step encoding for the ESP32 I2S, the trade-off is less memory use vs encoding speed. As I was busy with the DMX512 method (which, as a by-product, was much easier for me to add), it occurred to me that the 3-step encoding could be made significantly quicker, so I decided to have a go. My first idea was to use a double (32-bit) lookup table, similar to the 4-step, but double because the 12-bit result would eventually have to end up as a multiple of 8 bits. The way to speed things up over the current method was to remove as many bit-shifts as possible. Bit-shifts can be relatively slow: on cores without a barrel shifter, shifting a variable by 8 bits takes 8 times as long as shifting it by 1 bit (unlike, for instance, additions, where x + 8 takes as much time as x + 17).
I was even looking for a way to directly read the high nibble from the source byte. If I remember correctly, the Z80 had a specific instruction to do just that, but that probably doesn't exist on a modern ESP anymore. So anyway, I came up with this.
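Since the linked code isn't shown in this export, here is a minimal reconstruction of the double-lookup idea. Names and the exact bit order are my own assumptions, and I use explicit shifts for the 3 output bytes where the original juggled memcpy() pointers:

```cpp
#include <cstdint>

// One source byte expands to 24 output bits (8 bits x 3 steps), so the
// high-nibble table is pre-shifted left by 12 bits and a single OR
// combines both halves. 2 x 16 x 32-bit entries = 128 bytes of tables.
uint32_t hiTable[16], loTable[16];

// 12-bit 3-step pattern for one nibble, MSB of the nibble first:
// source bit 1 -> 110, source bit 0 -> 100.
static uint32_t encode12(uint32_t nibble) {
    uint32_t out = 0;
    for (int bit = 3; bit >= 0; --bit) {
        uint32_t b = (nibble >> bit) & 1;
        out = (out << 3) | 0b100 | (b << 1);
    }
    return out;
}

void buildTables() {
    for (uint32_t n = 0; n < 16; ++n) {
        hiTable[n] = encode12(n) << 12;
        loTable[n] = encode12(n);
    }
}

// Emit the 3 output bytes, earliest-on-the-wire first. Explicit shifts
// sidestep the endianness fiddling a raw memcpy() would need.
void encodeByte(uint8_t value, uint8_t* out) {
    uint32_t bits = hiTable[value >> 4] | loTable[value & 0x0F];
    out[0] = (uint8_t)(bits >> 16);
    out[1] = (uint8_t)(bits >> 8);
    out[2] = (uint8_t)(bits);
}
```

Per source byte this is two table reads, one OR, and three byte stores, with no per-bit shifting at all.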
First I tested the resulting bit patterns on my UNO and compared them to what the current encoding produces, and after some fiddling with the memcpy() pointers I got them to match. A quick speed comparison showed great promise.
Then another thought occurred to me: get rid of the memcpy() entirely, assign directly into 16-bit variables, and use 6 x 16-bit lookup tables.
Again a bit of fiddling to get it right, but it appeared to be marginally slower than the first attempt.
So I migrated to the ESP32 (which, unlike the UNO, is not always on my desk) and performed a speed test using this sketch.
And the results (using a 160MHz clock rate):
Conclusion: the first attempt at 3-step encoding is marginally quicker than the second attempt and more than 5x as fast as the current method. With small pixel buffers it is less than twice as slow as the 4-step, and with large buffers it is almost as fast as the 4-step. The temporary memory demand is a bit more than the 4-step's (2 x 16 x 32-bit lookup tables vs 16 x 16-bit lookup tables = 96 bytes more in lookup tables).
The quickest would of course be 256-entry lookup tables, but that seems excessive, wasting a whole KB on them.
Anyway, I thought I'd share it. I'll get the whole cloning and branching thing sorted soon.