Want to play with octaves with the new noise functions but don't want to

Want to play with octaves with the new noise functions but don’t want to wait for me to put full examples/apis together and in the library. Here’s a simple function for 3d noise:

uint8_t octaveNoise8(uint8_t octaves, uint16_t x, uint16_t y, uint16_t z) {
uint8_t accum = 0;
uint16_t freq = 1;
for(int octave = 0; octave < octaves && octave < 8; octave++) {
// right shifting by our octave count effectively gives us an amplitude of 1/2^octave
accum = qadd8(accum, inoise8(xfreq,yfreq,z*freq)>>o);
// increase our multiplier
freq <<= 1;
}
}

uint16_t octaveNoise16(uint8_t octaves, uint32_t x, uint32_t y, uint32_t z) {
uint16_t accum = 0;
uint16_t freq = 1;
for(int octave = 0; octave < octaves && octave < 16; octave++) {
// right shifting by our octave count effectively gives us an amplitude of 1/2^octave
accum = qadd8(accum, inoise8(xfreq,yfreq,z*freq)>>o);
// increase our multiplier
freq <<= 1;
}
}

Why didn’t I just include this in the library as it is? Mostly because it has a serious performance issue. Let’s say you’re going to generate 100 points of 4-octave noise per frame. The way the implementation above works, in addition to the 400 calls to inoise8/inoise16 that would happen, there would be 1200 8 or 16 bit multiplies (at 2 cycles for the 8bit multiplies, that’d be 2400 cycles, or 300µs). Not only that, but you’d have 400 passes through setting up, running, and exiting a loop, which has its own overhead.

Here’s what I’m playing with in my own code that i’m using to tune/explore the noise functions more, where x,y, and z are my coordinates, which are managed outside the function, adj is my scaling for x/y point spacing, NX/NY are the width/height of my grid.

void fillnoise(int nOctaves) {
// local versions of adj adn x,y,z so I can mutilate them
uint32_t ladj = adj;
uint32_t lz = z;
uint32_t lx = x;
uint32_t ly = y;

for(int o = 0; o < nOctaves; o++) {
for(int i = 0, iadj=0; i < NX; i++, iadj+=ladj) {
for(int j = 0, jadj=0; j < NY; j++, jadj+=ladj) {
uint8_t accum = inoise8(lx + iadj,ly + jadj,lz)>>o;
if(o > 0) { accum = qadd8(noise[i][j],accum); }
noise[i][j] = accum;
}
}
// scale up our indexing values
ladj <<= 1;
lx <<= 1;
ly <<= 1;
lz <<= 1;
}

z += spd;
}

So with this function, and the example above, there’s still going to be 400 calls to inoise8 (no way around that one, sadly). However, in this version, i’m basically filling in my noise matrix one octave at a time, rather than filling each point in the matrix once with a 4-octave noise value. There’s a number of optimizations worth digging into that help minimize the rest of the math.

The first is how i’m adjusting the co-ordinates to pass into the noise function. A first pass implementation for this might look something like:

for(int o = 0; o < nOctaves; o++) {
for(int i = 0; i < NX; i++) {
for(int j = 0; j < NY; j++) {
uint8_t accum = inoise8((x+(iadj))(2^o), (y+(jadj))(2^o),z*(2^o))>>o;
}
}
}

(note, just including the calls to the noise functions above. Of course, that loop above will end up generating 5 multiplies and 3 power functions for each point for each octave. Or, 2000 multiplies and 900 powerfunctions. (And for completeness, 800 additions - giving us a total of 3700 math operations, not including the loop control functions)

So, how can things be pulled out? First off, we know what we really want is, for each point we’re computing in the inner loops, are x/y values that are adjusted as we walk across the matrix. So, for example, (0,0) == (x,y), (1,0) == (x+adj,y), (2,0) == (x+2*adj,y), etc… We can get rid of the multiply and turn it into an add by keeping a separate value tracking the current “value to add” - and that gives us:

uint32_t ladj = adj;
for(int o = 0; o < nOctaves; o++) {
for(int i = 0, iadj=0; i < NX; i++,iadj+=ladj) {
for(int j = 0, jadj=0; j < NY; j++,iadj+=ladj) {
uint8_t accum = inoise8((x+(iadj))(2^o), (y+(jadj))(2^o),z*(2^o))>>o;
}
}
}

Now we’re at 1200 multiplies, 900 power function calls, our 800 original additions, 400 additions to jadj, but only (assuming a 10x10 grid), 40 additions to iadj for a total of 1240 additions. Still at 3340 math operations, however additions are often far cheaper than multiplies, especially on the avr.

Still, though - lots of multiplies in there. What more can be done? Remember from math classes that Z * (A + B) === ZA * ZB? If we do that here, we see something like (x * 2^o) + (iadj * 2^o) for each of our coordinates. 2^o is basically a power of 2 - when octave is 0, that’ll ahve the value 1, when octave is 1, the value will be 2, when octave is 2, the value will be 4, etc… but basically, it’s just doubling it. Well, what if we just adjusted our base x,y,z and adjustment values once per octave, then?

uint32_t ladj=adj,lx=x,ly=y,lz=z;

for(int o = 0; o < nOctaves; o++) {
for(int i = 0, iadj=0; i < NX; i++,iadj+=ladj) {
for(int j = 0, jadj=0; j < NY; j++,iadj+=ladj) {
uint8_t accum = inoise8(lx + iadj, ly + jadj, lz)>>o;
}
}
ladj *= 2;
lx *= 2;
ly *= 2;
lz *= 2;
}

Et voila. We still have our 800 basic additions, our 440 additions for adjusting iadj/jadj, but now only 16 multiplies (4 per octave) for each of our ladj, lx, ly, and lz values.

That is much much much nicer, isn’t it?

Of course you could also change the var*=2 to var=var<<1
Some compilers will optimise multiply by 2’s with a left bit shift but I prefer to make sure myself :slight_smile:

Yes, although when you are only doing 16/frame it matters less, and so the clarity in code I’m giving others is preferred. (Also, GCC is pretty good about mul to left shift. Div to right shift? Not at all. Also - failure to convert costs you one cycle with mul and dozens with div

Good point, heavily optimised code can become quite unreadable except to the author!
One aspect I miss when using the Arduino dev system is the lack of an assembler output file as this is very useful to see what the compiler is doing with your C code quickly and easily.