
Saturday, July 2, 2016

Matrix Multiply On a Beast

The Beast - Scaling up

I spun up a beast instance on Azure

Standard DS5_v2 (16 Cores, 56 GB memory)

First I changed the executable to run five iterations, from 1,000 x 1,000 up to 5,000 x 5,000. Then I changed the data type from float to double. That adds to the memory load, since a double takes 8 bytes instead of 4, and to the processing time, since doubles take longer to multiply. So how fast did the matrix multiply run on the beast?

Test Run

----------------------------------------
Score:

1000: RowOptimized * RowOptimized = NonOptimized - 121.67
1000: RowOptimized * ColumnOptimized = NonOptimized - 125.00
1000: ColumnOptimized * ColumnOptimized = NonOptimized - 131.67
1000: ColumnOptimized * RowOptimized = NonOptimized - 169.67
2000: ColumnOptimized * RowOptimized = NonOptimized - 843.67
2000: RowOptimized * ColumnOptimized = NonOptimized - 863.33
2000: ColumnOptimized * ColumnOptimized = NonOptimized - 892.33
2000: RowOptimized * RowOptimized = NonOptimized - 988.00
3000: ColumnOptimized * RowOptimized = NonOptimized - 3,295.67
3000: RowOptimized * ColumnOptimized = NonOptimized - 3,363.33
3000: RowOptimized * RowOptimized = NonOptimized - 3,565.00
3000: ColumnOptimized * ColumnOptimized = NonOptimized - 3,784.33
4000: RowOptimized * ColumnOptimized = NonOptimized - 10,433.33
4000: ColumnOptimized * RowOptimized = NonOptimized - 10,793.33
4000: RowOptimized * RowOptimized = NonOptimized - 11,515.33
4000: ColumnOptimized * ColumnOptimized = NonOptimized - 11,567.33

----------------------------------------

The data is pretty simple to read. Take this line:

1000: RowOptimized * RowOptimized = NonOptimized - 121.67

  • 1000 is the size of the matrix - 1,000 by 1,000 in this case
  • RowOptimized * RowOptimized = NonOptimized indicates the matrix types:
    • "Row Optimized" means the matrix is laid out in memory to optimize access by row
    • "Column Optimized" means it is optimized for access by column
    • "Non Optimized" means there are no optimizations if you call the Multiply method on this object. It is laid out in memory so that it would be considered row optimized, but there are no compute optimizations in the code.
  • The final value, 121.67, is the number of milliseconds it took to complete the multiplication.

Pretty amazing, really. If you have a lot of horsepower you can multiply a lot of matrices, and you can multiply them with reasonable speed and efficiency.

Notice that when you get into very large matrices, the sheer volume of data being accessed slows things down quite a bit.

A 1,000 by 1,000 matrix using doubles is 8,000,000 bytes. Change that to a 2,000 by 2,000 matrix and the size quadruples to 32,000,000 bytes.

Matrix Size    Bytes
1,000          8,000,000
2,000          32,000,000
3,000          72,000,000
4,000          128,000,000
5,000          200,000,000

A 4,000 by 4,000 cell matrix takes up 128 megabytes of memory. I guess this explains the time it takes to multiply that beast. It's a good thing we have beasts to do the calculations on.
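
The arithmetic is easy to check. Here is a quick sketch that reproduces the table above (this little program is mine, not from the repo):

using System;

class MatrixSize
{
    // Bytes needed for an n x n matrix of doubles (8 bytes per cell).
    static long BytesForSquareMatrix(long n) => n * n * sizeof(double);

    static void Main()
    {
        foreach (var n in new long[] { 1000, 2000, 3000, 4000, 5000 })
            Console.WriteLine($"{n:N0} -> {BytesForSquareMatrix(n):N0} bytes");
    }
}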

Next

Soon I may try moving this code to what I hope will be more powerful graphics hardware. I think it would be interesting to see what that could do for speed.

Friday, May 27, 2016

Inlining a Method Call for Performance

Caveat

I am NOT a math person. I often have to deep dive into and study anything related to math beyond what most people would consider "easy" math - you know the stuff: addition, subtraction, multiplication and division. If I can break it down to the basics then I can understand it. While talking about matrix math I may at times go down a path that is just plain wrong; I'm admitting that up front. I am, however, engaging a friend who is a math teacher/wizard to help me out with some of these concepts. So I hope I don't get too much wrong about matrix math while working in an area I do understand - performance.

On with the show

So previously we saw that with large data structures the memory access pattern can be important: http://mikeytalksaboutcode.blogspot.com/2016/05/faster-code-understanding-memory-access.html

Today we are going to look at inlining a method call for performance reasons.

Why

When a CPU encounters a branch instruction - any instruction that causes the instruction pointer to move somewhere other than the next instruction - it does all manner of non-performant things.

It may jump to a location that is not in the instruction cache, forcing the cache to be flushed and reloaded.

It will likely throw away work that has already been completed in the CPU pipeline.

If you're interested, the Wikipedia article on the instruction pipeline/prefetch queue is pretty good: https://en.wikipedia.org/wiki/Prefetch_input_queue
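
As a tiny, hedged illustration of how much a hard-to-predict branch can cost (this demo is mine, not from the repo, and results will vary by CPU), here are two loops that sum the non-negative values in a random array, one with a branch in the hot loop and one without:

using System;
using System.Diagnostics;

class BranchDemo
{
    static void Main()
    {
        var rnd = new Random(42);
        var data = new int[50000000];
        for (int i = 0; i < data.Length; i++)
            data[i] = rnd.Next(-100, 100);

        // Branchy: the CPU has to predict this if on every element,
        // and random data makes it unpredictable.
        var sw = Stopwatch.StartNew();
        long sum1 = 0;
        foreach (var v in data)
            if (v >= 0) sum1 += v;
        Console.WriteLine($"branchy:     {sw.ElapsedMilliseconds} ms (sum {sum1})");

        // Branch-free: a bit mask keeps v when v >= 0 and zeroes it
        // otherwise, so there is no jump inside the loop.
        sw.Restart();
        long sum2 = 0;
        foreach (var v in data)
            sum2 += v & ~(v >> 31);
        Console.WriteLine($"branch-free: {sw.ElapsedMilliseconds} ms (sum {sum2})");
    }
}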

Slower Matrix Multiply

Remember, we optimized the memory layout last time. So let's look at the unmodified, "slow" code. This code is in a new class that I am checking in, MatrixBaseNonPerformant. This class will implement all functionality in a basic, easy to understand way, so that we can always compare results and ensure that an optimized class does what it is supposed to do. If an optimized class returns garbage results really fast, it still returns garbage, and that makes it unusable.

public virtual IMatrix Multiply(IMatrix m2)
{
    // Result matrix. The tests use square matrices, so this
    // (m2.Columns, m2.Rows) ordering works out.
    IMatrix m3 = new MatrixBase(m2.Columns, m2.Rows);
    for (int row = 0; row < this.rows; row++)
    {
        // Start of the output row in m3's flat backing array.
        var tempc = (row * this.columns);
        for (int cola = 0; cola < m2.Columns; cola++)
        {
            float val = 0;
            // Dot product of this matrix's row with m2's column -
            // two indexer (method) calls per iteration.
            for (int colb = 0; colb < this.Columns; colb++)
            {
                val += this[row, colb] * m2[colb, cola];
            }
            m3[tempc + cola] = val;
        }
    }
    return m3;
}

So in the innermost loop we are accessing two matrices through their indexers, which effectively makes two method calls per iteration.

val += this[row, colb] * m2[colb, cola];

We can inline these two method calls. Inlining simply means taking whatever code is in the method and moving it to the place it is called from.

So the code that is being called looks like this

public virtual float this[int row, int column]
{
    get { return data[(column * rows) + row]; }
    set { data[(column * rows) + row] = value; }
}

So we just need to perform this simple math inside our Multiply method, which results in the inner value calculation looking like this:

val += this[(colb * rows) + row] * m2[(cola * columns) + colb];
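
Putting it all together, the Multiply method with both indexer calls inlined would look roughly like this (a sketch built from the two snippets above; note that the m2 term uses columns where a non-square matrix would need m2's row count - the test matrices are square, so the two are equal):

public virtual IMatrix Multiply(IMatrix m2)
{
    IMatrix m3 = new MatrixBase(m2.Columns, m2.Rows);
    for (int row = 0; row < this.rows; row++)
    {
        var tempc = (row * this.columns);
        for (int cola = 0; cola < m2.Columns; cola++)
        {
            float val = 0;
            for (int colb = 0; colb < this.Columns; colb++)
            {
                // Inlined: this[row, colb] * m2[colb, cola]
                val += this[(colb * rows) + row] * m2[(cola * columns) + colb];
            }
            m3[tempc + cola] = val;
        }
    }
    return m3;
}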

Let's take a look and see if we have accomplished anything.

Performance Results

Using the MatrixBaseNonPerformant class, 1000 x 1000 matrix:

With the method call: 11,194.67 ms
With the method call inlined: 7,084.33 ms

So we can see that inlining a simple, repetitive call can make a big difference - roughly 37% - in the performance of the code we are running.

Next Time

Refactoring. This code is getting very messy and it's time to clean it up.

Saturday, May 21, 2016

Faster Code - Understanding Memory Access Patterns

The Problem

When you are working with large data sets you need to understand the way the data is laid out in memory and how the code will be accessing that data.

I am going to take a look at some example code to help us understand how memory access patterns can affect the performance of our software.

Our example is matrix multiplication - see here https://en.wikipedia.org/wiki/Matrix_multiplication.

A small 10 x 10 matrix filled with 8-byte doubles only takes 800 bytes. Unless you are working with something closer to 10,000 x 10,000, you may not need to worry about your memory access patterns.

As has been said many times in many other places "Premature optimization is the enemy". A lot of time can be wasted working on the wrong problem.

Let's get back to our matrix multiplication problem. Think about the memory footprint of two 1,000 x 1,000 matrices - or three, if you want the answer in another matrix. 1,000 x 1,000 = 1,000,000 cells, and 1,000,000 x 8 bytes = 8,000,000 bytes, so each one takes 8 megabytes of memory. That's not a lot of memory, but your CPU doesn't access memory directly; it accesses memory through the cache. Since the cache is much smaller than main memory, data has to be swapped in and out. The smaller your data set, the less swapping will happen. The larger your data set, the more potential you have for invalidating the cache and forcing it to load data from main memory repeatedly.

I've posted the example code at https://github.com/mikeywashere/Matrix-Multiply

Since the matrices are large, and we are multiplying two matrices and placing the results into another, we will be accessing about 24 MB of memory. So this is a nice way to show that when an algorithm needs to access large amounts of memory, it should try to do so in an orderly, non-random manner when possible.

You can read here for more information on cpu/memory cache: https://en.wikipedia.org/wiki/CPU_cache

Accessing Data

When you are accessing memory or disk, one thing is always true: sequential access is the fastest way to access data. For spinning disks this is because of the mechanical time spent moving the head to reach data that is not stored sequentially. For random access memory it is because the CPU is faster than RAM. If we just coupled our speedy CPUs to our slow RAM, we'd have a CPU that spent most of its time waiting for something to work on.

This is why fast CPUs have cache memory. Cache memory sits between the CPU and the random access memory. Random access memory has close to equal access speed to any location, but to keep the CPU working we need to supply it with new instructions and new data all the time, or it will stall, meaning it cannot work until data is available.

I'm going to be talking about the x86 architecture; I'm not going to pretend I know anything about any other. I know there are other caching architectures, but I don't know enough about them to discuss them intelligently, nor do I have hardware to run any kind of test on.

So with that said...

On the x86 architecture the Level 1 cache (referred to as L1) is split into "cache lines". A common cache line size is 64 bytes. When memory is accessed that is not in the cache, a cache line's worth of data is fetched. If that cache line contained data that is "dirty" (not yet written to main memory), that data needs to be written to main memory first, and then the cache line is filled with the data from main memory.

If our memory access is sequential, the cache lines are filled and written in a predictable way. If our memory access is random, a cache line might be filled on one instruction, flushed, then filled again a few instructions later. With a random access pattern we could force the CPU to stall while waiting for data, and then stall again while waiting for that same data. Each stall isn't a terrific waste of time if the CPU has something else to work on, but if our code forces stall after stall we will definitely see a performance impact.
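
To see the cost in isolation (a sketch of my own, not code from the repo), you can walk the same flat array sequentially and then with a large stride, and time both:

using System;
using System.Diagnostics;

class AccessPatternDemo
{
    const int N = 4000; // 4,000 x 4,000 doubles = 128 MB, far larger than any cache

    static void Main()
    {
        var data = new double[N * N];

        // Sequential: the index moves by one, so every 64-byte cache
        // line is fully used before the next one is fetched.
        var sw = Stopwatch.StartNew();
        double sum = 0;
        for (int row = 0; row < N; row++)
            for (int col = 0; col < N; col++)
                sum += data[row * N + col];
        Console.WriteLine($"sequential: {sw.ElapsedMilliseconds} ms");

        // Strided: the index jumps 32,000 bytes per read, so nearly
        // every access touches a new cache line, and lines get evicted
        // before the rest of their data is used.
        sw.Restart();
        sum = 0;
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                sum += data[row * N + col];
        Console.WriteLine($"strided:    {sw.ElapsedMilliseconds} ms");
    }
}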

The Code

In the Matrix project in the example code at https://github.com/mikeywashere/Matrix-Multiply you will find four files of interest.

The interface IMatrix, which describes how to access a basic matrix: an indexer with an x and a y position, and two properties, Rows and Columns, that tell how large the matrix is.

MatrixBase, which implements a few basic methods - MatrixMultiply, RandomFill and ToString - along with basic data access. This class also implements the data storage as a single-dimensional array, which lets us control the access order, row or column optimized. Row optimized accesses the memory sequentially if the row index is incremented by one; column optimized accesses the data sequentially if the column index is incremented by one.

RowOptimizedMatrix, which is basically just a redefinition of MatrixBase, because MatrixBase already accesses the data in a row optimized way.

ColumnOptimizedMatrix, which changes the data access to be column optimized.
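
The difference between the two layouts comes down to the indexer math. Here is a sketch of the two indexers, following the row/column convention used in these posts (the first one matches the indexer shown in the inlining post; the real classes are in the repo):

// Row optimized: incrementing the row index moves through the
// backing array sequentially.
public virtual float this[int row, int column]
{
    get { return data[(column * rows) + row]; }
    set { data[(column * rows) + row] = value; }
}

// Column optimized: incrementing the column index moves through the
// backing array sequentially.
public override float this[int row, int column]
{
    get { return data[(row * columns) + column]; }
    set { data[(row * columns) + column] = value; }
}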

The Test

The basis of the test is this data:

const int repeat = 3;
const int size = 1000;

static MatrixBase co1 = new ColumnOptimizedMatrix(size, size);
static MatrixBase ro1 = new RowOptimizedMatrix(size, size);
static MatrixBase co2 = new ColumnOptimizedMatrix(size, size);
static MatrixBase ro2 = new RowOptimizedMatrix(size, size);

static MatrixBase aco1 = new ColumnOptimizedMatrix(size, size);
static MatrixBase aro1 = new RowOptimizedMatrix(size, size);

You'll also find a few parallel options defined in the matrix multiply code.

var parallelOptions1 = new ParallelOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount * 2
};

var parallelOptions2 = new ParallelOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount * 4
};

I found that using option 2 on my machine yielded the best results. Your mileage may vary.
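
For reference, those options plug into Parallel.For. The general shape (a sketch of mine, not the exact repo code) is to parallelize the outer row loop of the multiply, since each output row is computed independently:

// using System.Threading.Tasks;
// Sketch: hand the outer row loop to Parallel.For with the chosen
// options; the body per row matches the serial version shown earlier.
Parallel.For(0, rows, parallelOptions2, row =>
{
    for (int cola = 0; cola < m2.Columns; cola++)
    {
        float val = 0;
        for (int colb = 0; colb < columns; colb++)
            val += this[row, colb] * m2[colb, cola];
        m3[(row * columns) + cola] = val;
    }
});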

In program.cs there are tests for optimized and several non-optimized scenarios.

The Results

On my machine - an AMD A10 6800K APU with four logical processors - the winner is "Optimized input and column output", although "Optimized input and row output" was a very close second.

Running: Optimized input and column output
ms: 3,427.00
ms: 3,238.00
ms: 3,184.00
Optimized input and column output: 3,283.00
Running: Optimized input and row output
ms: 3,320.00
ms: 3,565.00
ms: 3,260.00
Optimized input and row output: 3,381.67
Running: Non-Optimized, input 2 matrices by column and column output
ms: 6,920.00
ms: 6,829.00
ms: 6,799.00
Non-Optimized, input 2 matrices by column and column output: 6,849.33
Running: Non-Optimized, input 2 matrices by column and row output
ms: 6,946.00
ms: 7,014.00
ms: 6,671.00
Non-Optimized, input 2 matrices by column and row output: 6,877.00
Running: Non-Optimized, input 2 matrices by row and column output
ms: 6,981.00
ms: 7,174.00
ms: 7,283.00
Non-Optimized, input 2 matrices by row and column output: 7,146.00
Running: Non-Optimized, input 2 matrices by row and row output
ms: 7,288.00
ms: 7,656.00
ms: 7,345.00
Non-Optimized, input 2 matrices by row and row output: 7,429.67

Why

The matrix multiply method accesses memory from one matrix in row order and from the second matrix in column order. By creating data structures that are optimized for their usage scenario, you can see a performance gain of up to two times (on my system).

Being twice as fast at anything is good. Understanding and being able to prove why one piece of code is faster than another is very good.