Search...

Friday, May 27, 2016

Inlining a Method Call for Performance

Caveat

I am NOT a math person. I often have to deep dive/study anything related to math that is beyond what most people would consider "easy" math. You know the stuff, addition, subtraction, multiplication and division. If I can break it down to the basics then I can understand it. I may at times while talking about matrix math go down a path that is just plain wrong. I'm admitting that up front. I am however engaging a friend that is a math teacher/wizard to help me out with some of these concepts. So I hope I don't get too much wrong about matrix math while working in an area I do understand - performance.

On with the show

So previously we saw that with large data structures the method of accessing memory can be important. http://mikeytalksaboutcode.blogspot.com/2016/05/faster-code-understanding-memory-access.html

Today we are going to look at inlining a method call for performance reasons.

Why

When a CPU encounters a branch instruction - any instruction that causes the instruction pointer to move somewhere that is not the next instruction. When this happens the CPU does all manner of non performant things.

It may jump to a location that is not in the instruction cache and then it has to flush/load the cache.

It will likely kick out work that has been completed in the CPU pipeline.

If you're interested here is an article to read on the instruction pipeline/prefetch queue
https://en.wikipedia.org/wiki/Prefetch_input_queue is pretty good.

Slower Matrix Multiply

Remember we optimized the memory layout last time. So lets look at the unmodified, "slow" code. This code is in a new class that I am checking in MatrixBaseNonPerformant. This class will implement all functionality in a base, easy to understand way, so that we can always look and compare functionality to ensure that an optimized class does what it is supposed to do. Because if an optimized class returns garbage results really fast it still returns garbage and that would make it unusable.

public virtual IMatrix Multiply(IMatrix m2)
{
    IMatrix m3 = new MatrixBase(m2.Columns, m2.Rows);
    for (int row = 0; row < this.rows; row++)
    {
        var tempc = (row * this.columns);
        for (int cola = 0; cola < m2.Columns; cola++)
        {
            float val = 0;
            for (int colb = 0; colb < this.Columns; colb++)
            {
                val += this[row, colb] * m2[colb, cola];
            }
            m3[tempc + cola] = val;
        }
    }
    return m3;
}

So in the innermost loop we are accessing two matrices which effectively makes two method calls.

val += this[row, colb] * m2[colb, cola];

We can inline these two method calls, inlining simply means that we take whatever code is in the method and move it to the place we are calling it from.

So the code that is being called looks like this

public virtual float this[int row, int column]
{
    get { return data[(column * rows) + row]; }
    set { data[(column * rows) + row] = value; }
}

So we just need to perform this simple math inside our Multiply method. Which results in the inner value calculation looking like this:

val += this[(colb * rows) + row] * m2[(cola * columns) + colb];

Let's take a look and see if we have accomplished anything.

Performance Results

Using the MatrixBaseNonPerformant class.
1000 x 1000 matrix 
With the method call : 11,194.67
With the method call inlined: 7,084.33

So we can see that a simple repetitive call inlined can make a big difference, about 30%, in the performance of the code we are running.

Next Time

Refactoring. This code is getting very messy and it's time to clean it up.

No comments:

Post a Comment