Mature and premature optimisation

Earlier this week I wrote some code that wasted 90% of its time moving data around in memory, because I just ‘grew’ a long vector with the idiom:

> stuff<-c(stuff, morestuff)

Here’s the GitHub commit that changed the code.

I’m writing about it because it illustrates a few useful points. First, the inefficient code was absolutely the right choice initially. I didn’t know how long each additional vector would be, and while I could have worked it out in principle, in practice I would quite likely have got it wrong, or at least not been sure it was right.

Second, the reason I know the code was inefficient is that I profiled it. There were lots of potential inefficiencies in the code because I was trying to write it as simply as possible. I didn’t know in advance which (if any) would matter, and I was surprised when the profiler found most of the time was being spent in the function c().
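For anyone who hasn’t profiled R code before, here is a rough sketch of the kind of thing I mean, using the built-in sampling profiler. The function grow_vector() and its arguments are made up for illustration; they are a stand-in for the real code, not a copy of it.

```r
# Hypothetical stand-in for the slow code: grow a vector with c().
grow_vector <- function(n_pieces, piece_len) {
  stuff <- numeric(0)
  for (i in seq_len(n_pieces)) {
    morestuff <- rep(i, piece_len)
    stuff <- c(stuff, morestuff)   # copies all of 'stuff' on every pass
  }
  stuff
}

prof_file <- tempfile()
Rprof(prof_file)                   # start the sampling profiler
x <- grow_vector(5000, 20)
Rprof(NULL)                        # stop profiling
prof <- summaryRprof(prof_file)    # summarise the samples by function
head(prof$by.self)                 # functions ranked by time spent in them
```

The by.self table is where a surprise like c() would show up, since it ranks functions by the time spent in the function itself rather than in the things it calls.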

Third, the new code illustrates an approach to use when you don’t know (or don’t want to work out) how big the pieces of vector will be. I work out a crude upper bound for the possible size of the vector and keep a variable that points to the first unused position, so I can put new numbers in the right place. Then, at the end, I return just the part of the vector that I actually used.
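In outline, the approach looks something like the sketch below. This is not the code from the commit: fill_preallocated(), pieces and upper_bound are hypothetical names, and I’m assuming the pieces arrive as a list of numeric vectors.

```r
# Hypothetical helper: collect pieces of unknown length into one vector
# without copying the whole vector each time a piece is added.
fill_preallocated <- function(pieces, upper_bound) {
  out <- numeric(upper_bound)      # allocate once, at the crude upper bound
  next_free <- 1L                  # index of the first unused position
  for (piece in pieces) {
    n <- length(piece)
    if (n > 0L) {
      # the 'upper bound' has to really be one
      stopifnot(next_free + n - 1L <= upper_bound)
      out[next_free:(next_free + n - 1L)] <- piece
      next_free <- next_free + n
    }
  }
  out[seq_len(next_free - 1L)]     # return only the part actually used
}

pieces <- list(c(1, 2, 3), numeric(0), c(4, 5))
result <- fill_preallocated(pieces, upper_bound = 100)
```

The check on the upper bound is cheap insurance: if the crude bound turns out to be wrong, the function stops with an error rather than letting R quietly extend the vector.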