Today, on "Why we can't have nice things theater" I've been looking around and

Daniel_Garcia · November 12, 2014, 10:35am

Today, on “Why we can’t have nice things theater”

I’ve been looking around and digging into why interrupts aren’t behaving nicely on avr and discovered two things. The first is that adafruit’s neopixels only have about an 8-9µs threshold for a gap between pixels, not the 50 that I have seen with other pixel suppliers. The second is that the default arduino implementation of the timer for counting is really really dumb and can take, on it’s own, 95 clocks. Combine that with the ~12+ clock overhead of servicing interrupts and the 10-20 clock loop/housekeeping overhead the library has between leds and you quickly realize that the extra 8 clocks necessary to check for interrupt overruns between pixels pushes you into the realm of unhappy pixels.

This is no fun, and leaves the world looking like interrupt handling on avr is not doable, courtesy of the millis timer! So, I did some digging. Here is what the interrupt handler looks like:

volatile unsigned long timer0_overflow_count = 0;
volatile unsigned long timer0_millis = 0;
static unsigned char timer0_fract = 0;

#if defined(AVR_ATtiny24) || defined(AVR_ATtiny44) || defined(AVR_ATtiny84)
ISR(TIM0_OVF_vect)
#else
ISR(TIMER0_OVF_vect)
#endif
{
// copy these to local variables so they can be stored in registers
// (volatile variables must be read from memory on every access)
unsigned long m = timer0_millis;
unsigned char f = timer0_fract;

    m += MILLIS_INC;
    f += FRACT_INC;
    if (f >= FRACT_MAX) {
            f -= FRACT_MAX;
            m += 1;
    }

    timer0_fract = f;
    timer0_millis = m;
    timer0_overflow_count++;

}

Well, there’s a bunch of our problems right there. It’s working with not one, but two 32-bit values (each of which will take 8 clocks to load, 8 clocks to save, not to mention 4 clocks to increment which happens up to 3 times/interrupt - 44 clocks just for this!). But wait, it gets worse. Also, there’s this conditional in there for fractional handling. What’s up with that? Let’s move that logic off into millis where it belongs and replace the interrupt handler with something a little bit saner looking:

#if defined(AVR_ATtiny24) || defined(AVR_ATtiny44) || defined(AVR_ATtiny84)
ISR(TIM0_OVF_vect)
#else
ISR(TIMER0_OVF_vect)
#endif
{
// fastinc32(FastLED_timer0_overflow_count);
FastLED_timer0_overflow_count++;
}

(Ignore that fastinc32 reference for now). This improved things, enough that interrupts were working right and leds weren’t getting cut short once a milllisecond (or, after 32 WS2812 leds). Yay! However, since I still had my scope hooked up I took a look at the timing, and still saw more of a gap when the interrupt fired than I would’ve liked. What was going on?

So, back into the disassembly:

00000e7c <__vector_16>:
e7c: 1f 92 push r1
e7e: 0f 92 push r0
e80: 0f b6 in r0, 0x3f ; 63
e82: 0f 92 push r0
e84: 11 24 eor r1, r1
e86: 8f 93 push r24
e88: 9f 93 push r25
e8a: af 93 push r26
e8c: bf 93 push r27
e8e: 80 91 a3 01 lds r24, 0x01A3
e92: 90 91 a4 01 lds r25, 0x01A4
e96: a0 91 a5 01 lds r26, 0x01A5
e9a: b0 91 a6 01 lds r27, 0x01A6
e9e: 01 96 adiw r24, 0x01 ; 1
ea0: a1 1d adc r26, r1
ea2: b1 1d adc r27, r1
ea4: 80 93 a3 01 sts 0x01A3, r24
ea8: 90 93 a4 01 sts 0x01A4, r25
eac: a0 93 a5 01 sts 0x01A5, r26
eb0: b0 93 a6 01 sts 0x01A6, r27
eb4: bf 91 pop r27
eb6: af 91 pop r26
eb8: 9f 91 pop r25
eba: 8f 91 pop r24
ebc: 0f 90 pop r0
ebe: 0f be out 0x3f, r0 ; 63
ec0: 0f 90 pop r0
ec2: 1f 90 pop r1
ec4: 18 95 reti

Ah! Right. Working with 32 bit values will also take up 4 registers, which is 8 clocks to save values off on entry into the function and another 8 clocks to restore them on exiting the function. Also, there’s still that 4 clocks every time through for the add. 55 clocks. I bet we can squish this down further.

I’m incrementing a 32-bit value. Which means 255/256 times, I’m only changing the value of one byte, the low byte (and so one for the 3rd and 4th bytes). I bet this means I can abuse the world a bit more and load just the low byte, increment it, save it back out, and if it’s still a non-zero value, well then I know I haven’t wrapped, and i’m done. In fact, this takes 31 clocks. (It’d be 30 if gcc was smart enough to realize that it doesn’t need to do an and to get the value to check in the zero register, it already has it from the increment operator). However - this means I’m only using 1 register, not 4. It also means i’m only loading data when it’s going to change, not always.

This new version is 31 clocks 255/256 times. If it has to adjust two bytes it is 38 clocks. If it has to adjust three bytes it is 45 clocks. Finally, if it has to adjust four bytes it is only 49 clocks (but only once every 4.5 hours). So even in my worst case scenario, the performance is still better than the every case scenario. Over those 4.5 hours, the original code would eat up 99 seconds of CPU time. The new code will only eat up 32 seconds. Sure, these numbers don’t sound like a lot. But when it’s the difference between 32 clocks and you can use interrupts with WS2812’s and 99 clocks and you can’t, it all adds up

And this is that fastinc32 function:

typedef union { unsigned long _long; uint8_t raw[4]; } tBytesForLong;
LIB8STATIC void attribute((always_inline)) fastinc32 (volatile uint32_t & _long) {
uint8_t b = ++((tBytesForLong&)_long).raw[0];
if(!b) {
b = ++((tBytesForLong&)_long).raw[1];
if(!b) {
b = ++((tBytesForLong&)_long).raw[2];
if(!b) {
++((tBytesForLong&)_long).raw[3];
}
}
}
}

I know that not everyone is going to want these things (either the FastLED custom versions of wiring, or the interrupt availability - as it does slow down WS2812 writing a little bit, as well as interfere with using PWM on pin 5). However, I haven’t come up with a good way yet to make these easily changeable features - but working on some ideas there.

Also - for you folks using IRRemote or other interrupt handler based libraries - those interrupt handlers are going to need to do a LOT of slimming down before they’ll fit in the ~90 clock cycles you have, give or take.

(All this code is now in the FastLED 3.1 branch)

Daniel_Garcia · November 12, 2014, 10:38am

And for those that are curious, here’s what the disassembly for FastLED’s TIMER0 interrupt handler looks like:

00000e74 <__vector_16>:
e74: 1f 92 push r1
e76: 0f 92 push r0
e78: 0f b6 in r0, 0x3f ; 63
e7a: 0f 92 push r0
e7c: 11 24 eor r1, r1
e7e: 8f 93 push r24
e80: 80 91 a3 01 lds r24, 0x01A3
e84: 8f 5f subi r24, 0xFF ; 255
e86: 80 93 a3 01 sts 0x01A3, r24
e8a: 88 23 and r24, r24
e8c: 99 f4 brne .+38 ; 0xeb4 <__vector_16+0x40>
e8e: 80 91 a4 01 lds r24, 0x01A4
e92: 8f 5f subi r24, 0xFF ; 255
e94: 80 93 a4 01 sts 0x01A4, r24
e98: 88 23 and r24, r24
e9a: 61 f4 brne .+24 ; 0xeb4 <__vector_16+0x40>
e9c: 80 91 a5 01 lds r24, 0x01A5
ea0: 8f 5f subi r24, 0xFF ; 255
ea2: 80 93 a5 01 sts 0x01A5, r24
ea6: 88 23 and r24, r24
ea8: 29 f4 brne .+10 ; 0xeb4 <__vector_16+0x40>
eaa: 80 91 a6 01 lds r24, 0x01A6
eae: 8f 5f subi r24, 0xFF ; 255
eb0: 80 93 a6 01 sts 0x01A6, r24
eb4: 8f 91 pop r24
eb6: 0f 90 pop r0
eb8: 0f be out 0x3f, r0 ; 63
eba: 0f 90 pop r0
ebc: 1f 90 pop r1
ebe: 18 95 reti

(Note: this function would be 6 bytes shorter and run in 1/2/3 fewer clock cycles if it got rid of the and r24,r24’s in there)

Daniel_Garcia · November 12, 2014, 11:01am

Also, I’m a little irritated that I had to blow time on this when there’s other AVR things i’d rather be spreading my time on.

Mark_Kriegsman · November 12, 2014, 12:17pm

I want to editorialize here for a moment:

@Daniel_Garcia , take a bow. Everyone else, applaud, and take note: THIS is what optimization is about.

You start with your goal, no matter how crazy (eg allow interrupts fire between every LED), and you build it as best you can. And it falls short. It doesn’t meet your goal. But instead of giving up, you ask WHY is it not fast enough? What is actually in my way? What’s actually going on here?

And you discover things you didn’t know before, things that seem to clash with your very idea about how the world works: Surprise! Some NeoPixels are 5X pickier than others about timing! Surprise! The Arduino interrupt handler for millis that fires a thousand times a second just to increment a counter, is so slow that it’s actually getting in the way!

And then you sort out what you’ve learned into things you can potentially control (interrupt handlers), and things you can’t (manufacturing of Adafruit Neopixels). And then you roll up your sleeves, grab the things you can control, and you squeeze with all your might and skill.

And you iterate. Hit your goal yet? No? Go back and dig in again. What’s standing in your way? What could you change? Dig in and try. And try. And try.

And when the dust clears, if you’re lucky, AND persistent, AND skilled, there it is: you’ve cleared the bar. You’ve done it. You’ve done the impossible. Again. And since you’re doing open source, the whole world benefits from what you’ve done. Not just the people running your code, but the developers who will come afterwards, and learn from you how to make things really, really fast.

THIS is what optimization is about.

THIS is what FastLED is about.

Mark_Kriegsman · November 12, 2014, 12:20pm

And THIS is what KLF is about.

Kasper_Kamperman · November 12, 2014, 2:23pm

Wow! Applause (Clapping loudly). @Daniel_Garcia very good and interesting read to see what kind of trouble you are solving and making the impossible possible. @Mark_Kriegsman nice write up on persistence.

Andrew_Tuline · November 12, 2014, 2:27pm

Scotty (of Star Trek) would be hugely proud. This update now works with WS2812B’s and Nico Hood’s IRL Remote at: https://github.com/NicoHood/IRLremote

PaulStoffregen · November 12, 2014, 2:39pm

The difficult trouble with millis() on Arduino is the timer0 interrupt rate is actually 976.5625 Hz, not 1 kHz. It’s due to the clock division of 16e6 / 16384. The AVR timer can not divide by 16000.

All that fractional stuff allows millis() and micros() to work properly, giving accurate millisecond and microsecond output (with a non-millisecond interrupt rate), including proper 32 bit overflow on both.

Many Arduino sketchs and libraries will break if you decrease the millis count to 0.9765625 kHz, or make micros() discontinuous (it reads the timer register to gain the extra timing resolution, assuming fully 8 bit rollover in the timer), or try to post-processes and not end up with clean 32 bit rollover, especially on micros().

Here’s my optimization of the timer0 ISR used on Teensy 2.0. It preserves all that stuff needed for Arduino compatibility. I believe this is the fastest possible implementation that’s fully compatible.

github.com

PaulStoffregen/cores/blob/master/teensy/pins_teensy.c#L1732


#elif F_CPU == 1000000L
  #define TIMER0_MILLIS_INC  	16
  #define TIMER0_FRACT_INC	48
  #define TIMER0_MICROS_INC  	64
#endif

volatile unsigned long timer0_micros_count = 0;
volatile unsigned long timer0_millis_count = 0;
volatile unsigned char timer0_fract_count = 0;

void TIMER0_OVF_vect() __attribute__((naked));
void TIMER0_OVF_vect()
{
	asm volatile(
		"push	r24"				"\n\t"
		"in	r24, __SREG__"			"\n\t"
		"push	r24"				"\n\t"

		"lds	r24, timer0_fract_count"	"\n\t"
		"subi	r24, 256 - %0"			"\n\t"
		"cpi	r24, 125"			"\n\t"

Adam_Haile · November 12, 2014, 2:47pm

I apologize for making you think about the busted WS2811 interrupts again

Justin_Eastman · November 12, 2014, 3:09pm

Not sure how to really express my gratitude for all you guys do. Thank you!

Mark_Kriegsman · November 12, 2014, 3:45pm

TrueMillis = (OverflowCount * 262) >> 8; //edited

better yet

// More accurate than the oscillator
TrueMillis = (OverflowCount * 67109) >> 16;

Chris_Hughes · November 12, 2014, 3:45pm

Forgive my lack of knowledge here. I understand the normal use of ISRs but what is the purpose of ISRs between pixels?

Anton_Mulidran · November 12, 2014, 3:59pm

That is some real programming

Mark_Kriegsman · November 12, 2014, 4:26pm

There are lots of sensors and input devices that need interrupt service faster than waiting to the end of the led string would allow.

Chris_Hughes · November 12, 2014, 5:19pm

Ah, Ok. Thanks for explaining.

Daniel_Garcia · November 12, 2014, 5:30pm

@PaulStoffregen I still need to account for millis rollover, but I am accounting for micros. I’m incrementing an overflow coutner, not a millis counter - since the overflow counter is all that micros is using, no harm done to micros.

For millis, I do still need decide whether or not to deal with the difference in overflow between the overflow counter and milliseconds[1]

I’m just moving the work of this accounting into millis which is called less frequently than the interrupt. Also - I want to hand redo the asm for this function anyway (since gcc is stupid sometimes and which I know better than to dig into at 3am and get rid of, among other things, those extra and’s I was mentioning above, as well as strip down the function prolog a bit more. I figure there’s another 9 clocks to save in the general case, as well as another 14 bytes to strip from the length of the function.

Also - I’m not at all surprised to hear that you tackled this for the teensy as well. I am disappointed that the arduino folks aren’t paying attention to/incorporating that work back in the library, even in 1.5.8.

[1] but I figure I have about 49 days to get a good solution for that - and even then, at the moment (assuming gcc is handling the mul as 64 bit internally, i need to check and make sure) the millis rollover will happen twice, once at 49.7 days (when the number of milliseconds crosses 2^32), and then once again at (roughly) 1.2 days later (when the number of overflows crosses 2^32). Anyone who is using millis to count any time gap longer than 1.2 days and wants it done reliably should be using an RTC anyway

Joseph_Crum · November 12, 2014, 6:08pm

This is a fantastic write up! Thank you for sharing these details and spending the time to look into it, even though you would rather place it in more important areas. If I read the entire post correctly, does this timing issue cause the strange jitter in long strands of WS2812s?

Daniel_Garcia · November 12, 2014, 6:14pm

And actually, now that I think about it - the way that it is implemented currently (fuck you gcc for stopping at 32bit math), millis is going to roll over about once every 9.5 hours - and that’d be a regular rollover (We’re only going to be using the low 25 bits of the overflow counter, since there will be a <<7 applied to it before a /125 in my implementation of millis).

Of course, if I want absolute parity with the arduino implementation of millis (which is that millis rolls over every 2^32 milliseconds (49.7 days) precisely) then I still have some extra housekeeping to do.

Daniel_Garcia · November 12, 2014, 6:24pm

@Joseph_Crum No, it shouldn’t cause jitter. What do you mean by jitter for that matter - are you talking about bits getting dropped? (that typically only happens if you push the data rate above 800khz) or about leds at the end getting skipped (happens if you have more than ~8-10µs between each pixel, the leds reset and your remaining pixel data gets read by the first N leds in your strip rather than the last) - this is why most ws2812 implementations disable interrupts entirely, and I have the interrupt handling/detecting code, which stops writing out the frame if an interrupt takes more than 9(ish)µs. Of course, if you have a long strand of leds (say, something that takes multiple ms to write out) and you have a lot of poorly written interrupts firing, it increases the odds of an interrupt taking too long and cutting off writing out the strand. All of this is leaving out/ignoring issues related to power/voltage drop along the strand if you aren’t injecting power every hundred leds or so.

That said, with WS2812 style leds, you shouldn’t be doing long strands. It’s a stupidly slow protocol, and if you’re doing more than a hundred or two hundred WS2812 leds you should really be using something with parallel output (OctoWS2811 or FadeCandy if you’re generating visualizations on a computer, or OctoWS2811 or FastLED on a teensy 3.x, or FastLED on a due).

Because of how stupid AVR is with interrupts and how low/tight the timing of the WS2812 gaps are, I need to put in a way to have WS2812’s always block interrupts, for people who have inefficient interrupt handlers that can take being delayed by a couple ms while writing out WS2812 data.

People who have inefficient interrupts that can’t be delayed by more than a couple of ms should switch to either arm based platforms (which won’t 100% fix the inefficient interrupt problem, but will give you some more head room) or get off of WS2812’s and switch over to APA102 or LPD8806 based rigs where you can interrupt the writing of leds until the cows come home and the leds won’t care.

Mark_Kriegsman · November 12, 2014, 6:30pm

FastLED v3.1. Now with homing cows.