Earlier this week I wrote some code that wasted 90% of its time moving data around in memory, because I just ‘grew’ a long vector with the idiom
> stuff<-c(stuff, morestuff)
Here’s the github commit that changed the code.
I’m writing about it because it illustrates a few useful points. First, the inefficient code was absolutely the right choice initially. I didn’t know how long each additional vector would be, and while I could have worked it out in principle, in practice I would quite likely have got it wrong. Or at least not been sure it was right.
Second, the reason I know the code was inefficient was that I profiled it. There were lots of potential inefficiencies in the code because I was trying to write it as simply as possible. I didn’t know in advance which (if any) would matter, and I was surprised when the profiler found most of the time was being spent in the function c()
.
Third, the new code illustrates an approach to use when you don’t know (or don’t want to work out) how big the pieces of vector will be. I work out a crude upper bound for the possible size of the vector and have a variable that points to the first unused position so I can put new numbers in the right place. Then, at the end, I return just the part of the vector that I actually used.