Today, on “Why we can’t have nice things theater”
I’ve been digging into why interrupts aren’t behaving nicely on AVR and discovered two things. The first is that Adafruit’s NeoPixels only tolerate a gap of about 8-9µs between pixels, not the 50µs that I have seen with pixels from other suppliers. The second is that the default Arduino implementation of the timer used for counting is really, really dumb and can take, on its own, 95 clocks. Combine that with the ~12+ clock overhead of servicing interrupts and the 10-20 clock loop/housekeeping overhead the library has between LEDs, and you quickly realize that the extra 8 clocks necessary to check for interrupt overruns between pixels push you into the realm of unhappy pixels.
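To put rough numbers on that budget, here is a small back-of-the-envelope sketch. It assumes a 16 MHz part (the clock speed isn't stated above), and the function names are made up for illustration. It just converts the pixel gap into clocks and adds up the costs quoted in the paragraph.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: a 16 MHz AVR, i.e. 16 clocks per microsecond.
constexpr uint32_t kClocksPerMicrosecond = 16;

// How many CPU clocks fit into a gap of gap_us microseconds.
constexpr uint32_t gap_budget_clocks(uint32_t gap_us) {
    return gap_us * kClocksPerMicrosecond;
}

// Add up the per-pixel costs quoted in the text for a given
// loop/housekeeping overhead (the text gives 10-20 clocks).
inline uint32_t worst_case_clocks(uint32_t loop_overhead) {
    assert(loop_overhead >= 10 && loop_overhead <= 20);
    return 95              // stock timer0 overflow handler
         + 12              // interrupt servicing overhead ("~12+")
         + loop_overhead   // loop/housekeeping between LEDs
         + 8;              // check for interrupt overruns
}
```

Under that assumption an 8µs gap is 128 clocks and a 9µs gap is 144, while the costs add up to somewhere between 125 and 135 clocks, so there is essentially no headroom left.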
This is no fun, and leaves the world looking like interrupt handling on avr is not doable, courtesy of the millis timer! So, I did some digging. Here is what the interrupt handler looks like:
volatile unsigned long timer0_overflow_count = 0;
volatile unsigned long timer0_millis = 0;
static unsigned char timer0_fract = 0;
#if defined(__AVR_ATtiny24__) || defined(__AVR_ATtiny44__) || defined(__AVR_ATtiny84__)
ISR(TIM0_OVF_vect)
#else
ISR(TIMER0_OVF_vect)
#endif
{
    // copy these to local variables so they can be stored in registers
    // (volatile variables must be read from memory on every access)
    unsigned long m = timer0_millis;
    unsigned char f = timer0_fract;

    m += MILLIS_INC;
    f += FRACT_INC;
    if (f >= FRACT_MAX) {
        f -= FRACT_MAX;
        m += 1;
    }

    timer0_fract = f;
    timer0_millis = m;
    timer0_overflow_count++;
}
Well, there’s a bunch of our problems right there. It’s working with not one, but two 32-bit values (each of which will take 8 clocks to load, 8 clocks to save, not to mention 4 clocks to increment, which happens up to 3 times/interrupt - 44 clocks just for this!). But wait, it gets worse: there’s also this conditional in there for the fractional handling. What’s up with that? Let’s move that logic off into millis where it belongs and replace the interrupt handler with something a little bit saner looking:
#if defined(__AVR_ATtiny24__) || defined(__AVR_ATtiny44__) || defined(__AVR_ATtiny84__)
ISR(TIM0_OVF_vect)
#else
ISR(TIMER0_OVF_vect)
#endif
{
    // fastinc32(FastLED_timer0_overflow_count);
    FastLED_timer0_overflow_count++;
}
(Ignore that fastinc32 reference for now.) This improved things enough that interrupts were working right and LEDs weren’t getting cut short once a millisecond (or, after 32 WS2812 LEDs). Yay! However, since I still had my scope hooked up I took a look at the timing, and still saw more of a gap when the interrupt fired than I would’ve liked. What was going on?
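As an aside, here is roughly what "moving that logic off into millis" could look like. This is a sketch, not the actual FastLED code: the function names are hypothetical, and the constants are the values the stock Arduino core works out for a 16 MHz clock (assumed here), where each timer0 overflow is 1024µs. The second function replays the original handler's bookkeeping so the two can be compared.

```cpp
#include <cassert>
#include <cstdint>

// Values the stock core derives at 16 MHz (assumed): every overflow adds
// 1 ms plus 3/125 of a millisecond.
constexpr uint32_t MILLIS_INC = 1;
constexpr uint32_t FRACT_INC  = 3;
constexpr uint32_t FRACT_MAX  = 125;

// Hypothetical millis(): do the fractional math on demand from the raw
// overflow count instead of on every interrupt. The 64-bit intermediate
// keeps n * FRACT_INC from overflowing.
inline uint32_t millis_from_overflows(uint32_t overflow_count) {
    const uint64_t n = overflow_count;
    return static_cast<uint32_t>(n * MILLIS_INC + (n * FRACT_INC) / FRACT_MAX);
}

// Reference: the original handler's per-interrupt bookkeeping, replayed
// once per overflow.
inline uint32_t millis_by_accumulation(uint32_t overflows) {
    uint32_t m = 0;
    uint32_t f = 0;
    for (uint32_t i = 0; i < overflows; ++i) {
        m += MILLIS_INC;
        f += FRACT_INC;
        if (f >= FRACT_MAX) {
            f -= FRACT_MAX;
            m += 1;
        }
        assert(f < FRACT_MAX);  // the fraction always stays below one unit
    }
    return m;
}
```

On real hardware the overflow count would have to be read with interrupts briefly disabled, and a 64-bit divide is not cheap on an 8-bit part, but that cost lands on whoever calls millis rather than on every timer interrupt.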
So, back into the disassembly:
00000e7c <__vector_16>:
e7c: 1f 92 push r1
e7e: 0f 92 push r0
e80: 0f b6 in r0, 0x3f ; 63
e82: 0f 92 push r0
e84: 11 24 eor r1, r1
e86: 8f 93 push r24
e88: 9f 93 push r25
e8a: af 93 push r26
e8c: bf 93 push r27
e8e: 80 91 a3 01 lds r24, 0x01A3
e92: 90 91 a4 01 lds r25, 0x01A4
e96: a0 91 a5 01 lds r26, 0x01A5
e9a: b0 91 a6 01 lds r27, 0x01A6
e9e: 01 96 adiw r24, 0x01 ; 1
ea0: a1 1d adc r26, r1
ea2: b1 1d adc r27, r1
ea4: 80 93 a3 01 sts 0x01A3, r24
ea8: 90 93 a4 01 sts 0x01A4, r25
eac: a0 93 a5 01 sts 0x01A5, r26
eb0: b0 93 a6 01 sts 0x01A6, r27
eb4: bf 91 pop r27
eb6: af 91 pop r26
eb8: 9f 91 pop r25
eba: 8f 91 pop r24
ebc: 0f 90 pop r0
ebe: 0f be out 0x3f, r0 ; 63
ec0: 0f 90 pop r0
ec2: 1f 90 pop r1
ec4: 18 95 reti
Ah! Right. Working with 32 bit values will also take up 4 registers, which is 8 clocks to save values off on entry into the function and another 8 clocks to restore them on exiting the function. Also, there’s still that 4 clocks every time through for the add. 55 clocks. I bet we can squish this down further.
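For anyone who wants to check the arithmetic, the listing can be tallied mechanically. The sketch below assumes the usual cycle costs for a classic AVR core such as the ATmega328P (2 for push/pop/lds/sts/adiw, 1 for in/out/eor/adc, 4 for reti); the table and function names are just for illustration.

```cpp
#include <cassert>
#include <cstdint>

// One row per mnemonic in the disassembly above: how many times it
// appears and the assumed cycle cost of each occurrence.
struct Op {
    const char* mnemonic;
    uint32_t count;
    uint32_t cycles;
};

const Op kIsrOps[] = {
    {"push", 7, 2},  // r1, r0, saved SREG, r24-r27
    {"in",   1, 1},  // read SREG
    {"eor",  1, 1},  // clear r1 (the zero register)
    {"lds",  4, 2},  // load the 4 bytes of the counter
    {"adiw", 1, 2},  // add 1 to the low word
    {"adc",  2, 1},  // propagate carry into the upper bytes
    {"sts",  4, 2},  // store the 4 bytes back
    {"pop",  7, 2},  // restore everything pushed on entry
    {"out",  1, 1},  // restore SREG
    {"reti", 1, 4},
};

// Total cycles spent inside the handler body.
inline uint32_t total_cycles() {
    uint32_t sum = 0;
    for (const Op& op : kIsrOps) {
        sum += op.count * op.cycles;
    }
    return sum;
}
```

With those costs the total comes out to the 55 clocks mentioned above. It only covers the instructions in the listing, not the hardware's own latency getting into the handler.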
I’m incrementing a 32-bit value. That means that 255 times out of 256 I’m only changing the value of one byte, the low byte (and so on for the 3rd and 4th bytes). I bet this means I can abuse the world a bit more and load just the low byte, increment it, save it back out, and if it’s still a non-zero value, well then I know I haven’t wrapped, and I’m done. In fact, this takes 31 clocks. (It’d be 30 if gcc was smart enough to realize that it doesn’t need an and to test the value for zero, since it already has that result from the increment.) However, this means I’m only using 1 register, not 4. It also means I’m only loading data when it’s going to change, not always.
This new version is 31 clocks 255/256 times. If it has to adjust two bytes it is 38 clocks. If it has to adjust three bytes it is 45 clocks. Finally, if it has to adjust four bytes it is 49 clocks (but that happens only once every 4.5 hours). So even my worst-case scenario is still better than the original’s every-case scenario. Over those 4.5 hours, the original code would eat up 99 seconds of CPU time. The new code will only eat up 32 seconds. Sure, these numbers don’t sound like a lot. But when it’s the difference between 31 clocks, where you can use interrupts with WS2812s, and 95 clocks, where you can’t, it all adds up.
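Those totals can be reproduced with a few lines of arithmetic. The sketch below assumes a 16 MHz clock and a timer0 overflow every 1024µs (the stock Arduino setup at that speed); the names are hypothetical. It counts clocks over 2^24 overflows, which is how often the increment carries all the way into the fourth byte.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kCpuHz         = 16000000;    // assumed 16 MHz clock
constexpr uint64_t kOverflowMicro = 1024;        // assumed overflow period
constexpr uint64_t kOverflows     = 1ULL << 24;  // overflows per 4-byte carry

// Original handler: 95 clocks on every overflow.
constexpr uint64_t original_clocks() {
    return kOverflows * 95;
}

// Byte-at-a-time version: every call pays the 31 clock one-byte path,
// and each deeper carry adds the difference to the next quoted figure.
constexpr uint64_t fastinc_clocks() {
    return kOverflows   * 31
         + (1ULL << 16) * (38 - 31)   // carries into the second byte
         + (1ULL << 8)  * (45 - 38)   // carries into the third byte
         + 1ULL         * (49 - 45);  // the one carry into the fourth byte
}

// Whole seconds of CPU time for a given clock count.
constexpr uint64_t whole_seconds(uint64_t clocks) {
    return clocks / kCpuHz;
}

// Wall-clock length of the window, in whole seconds.
constexpr uint64_t window_seconds() {
    return kOverflows * kOverflowMicro / 1000000;
}
```

Under these assumptions the window works out to 17179 seconds, a bit under 4.8 hours and in the same ballpark as the 4.5 quoted above, with 99 whole seconds for the original handler and 32 for the new one.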
And this is that fastinc32 function:
typedef union { unsigned long _long; uint8_t raw[4]; } tBytesForLong;

LIB8STATIC void __attribute__((always_inline)) fastinc32 (volatile uint32_t & _long) {
    uint8_t b = ++((tBytesForLong&)_long).raw[0];
    if(!b) {
        b = ++((tBytesForLong&)_long).raw[1];
        if(!b) {
            b = ++((tBytesForLong&)_long).raw[2];
            if(!b) {
                ++((tBytesForLong&)_long).raw[3];
            }
        }
    }
}
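If you want to convince yourself the carry logic is right without flashing a board, something like the following can be compiled on a desktop. It is a stand-in rather than the library function: fastinc32_sketch and incremented are hypothetical names, it uses a byte pointer instead of the union cast, and it assumes a little-endian machine (true for AVR and for most desktops).

```cpp
#include <cassert>
#include <cstdint>

// Same byte-at-a-time idea as fastinc32: bump the low byte and only touch
// the next byte up when the one below it wrapped around to zero.
// Assumes little-endian byte order, so raw[0] is the low byte.
inline void fastinc32_sketch(volatile uint32_t& value) {
    volatile uint8_t* raw = reinterpret_cast<volatile uint8_t*>(&value);
    for (int i = 0; i < 4; ++i) {
        const uint8_t b = static_cast<uint8_t>(raw[i] + 1);
        raw[i] = b;
        if (b != 0) {
            return;  // no wrap in this byte, so no carry to propagate
        }
    }
}

// Convenience wrapper: returns start + 1 as computed by the sketch.
inline uint32_t incremented(uint32_t start) {
    volatile uint32_t v = start;
    fastinc32_sketch(v);
    return v;
}
```

Comparing incremented(x) against x + 1 at the byte boundaries (0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF) covers each of the four paths through the function.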
I know that not everyone is going to want these things (either the FastLED custom versions of wiring, or the interrupt availability, as it does slow down WS2812 writing a little bit, as well as interfere with using PWM on pin 5). However, I haven’t come up with a good way yet to make these easily changeable features, but I’m working on some ideas there.
Also - for you folks using IRRemote or other interrupt handler based libraries - those interrupt handlers are going to need to do a LOT of slimming down before they’ll fit in the ~90 clock cycles you have, give or take.
(All this code is now in the FastLED 3.1 branch)

