Search...

Saturday, July 2, 2016

Matrix Multiply On a Beast

The Beast - Scaling up

I spun up a beast instance on Azure

Standard DS5_v2 (16 Cores, 56 GB memory)

First I changed the executable to run five iterations - from 1,000 x 1,000 size to 5,000 x 5,000 size. Then I changed the data type from float to double. So that added to the memory load as the double takes 8 bytes not 4 and then to the processing time because doubles take longer to multiply. So how fast did the matrix multiply run on the beast?

Test Run

----------------------------------------
Score:

1000: RowOptimized * RowOptimized = NonOptimized - 121.67
1000: RowOptimized * ColumnOptimized = NonOptimized - 125.00
1000: ColumnOptimized * ColumnOptimized = NonOptimized - 131.67
1000: ColumnOptimized * RowOptimized = NonOptimized - 169.67
2000: ColumnOptimized * RowOptimized = NonOptimized - 843.67
2000: RowOptimized * ColumnOptimized = NonOptimized - 863.33
2000: ColumnOptimized * ColumnOptimized = NonOptimized - 892.33
2000: RowOptimized * RowOptimized = NonOptimized - 988.00
3000: ColumnOptimized * RowOptimized = NonOptimized - 3,295.67
3000: RowOptimized * ColumnOptimized = NonOptimized - 3,363.33
3000: RowOptimized * RowOptimized = NonOptimized - 3,565.00
3000: ColumnOptimized * ColumnOptimized = NonOptimized - 3,784.33
4000: RowOptimized * ColumnOptimized = NonOptimized - 10,433.33
4000: ColumnOptimized * RowOptimized = NonOptimized - 10,793.33
4000: RowOptimized * RowOptimized = NonOptimized - 11,515.33
4000: ColumnOptimized * ColumnOptimized = NonOptimized - 11,567.33

----------------------------------------

The data is pretty simple to read

1000 : RowOptimized * RowOptimized = NonOptimized - 121.67

  • 1000 is the size of the matrix - 1,000 by 1,000 in this case
1000 : RowOptimized * RowOptimized = NonOptimized - 121.67
  • RowOptimized * RowOptimized = NonOptimized
    • Indicates the matrix type
    • "Row Optimized" meaning that it is laid out in memory to optimize the access by row
    • "Column Optimized" is optimized for access by column
    • "Non Optimized" indicates that there are no optimizations if you call the Multiply method on this object. It is laid out in memory so that it would be considered row optimized but there are no compute optimizations in the code.
1000 : RowOptimized * RowOptimized = NonOptimized - 121.67
  • The final value is the number of milliseconds that it took to complete the multiplication.
Pretty amazing really. If you have a lot of horsepower you can multiply a lot of matrices. And you can multiply them with reasonable speed and efficiency.

Notice that when you get into very large matrices that the sheer volume of data that is being accessed causes things to slow down quite a bit.

A 1,000 by 1,000 matrix using doubles is 8,000,000 bytes. Change that to a 2,000 by 2,000 matrix and the size quadruples to 32,000,000 bytes.

Matrix Size Bytes
1,000 8,000,000
2,000 32,000,000
3,000 72,000,000
4,000 128,000,000
5,000 200,000,000

A 4,000 by 4,000 cell matrix takes up 128 megabytes of memory. I guess this explains the time it takes to multiply that beast. It's a good thing we have beasts to do the calculations on.

Next

Soon I may try to move this code to using what I hope will be the more powerful graphics hardware. I think it would be interesting to see what that could do for speed.

Wednesday, June 1, 2016

This Is Not a Lock

Locking

Education means a lot. When we are working on a system and we are developing features we need to understand the constructs that we use. Locking for instance. Yes it can be tricky, yes it may not be easy to understand, it may even be hard to test. If you are not sure about something you need to read about it, consult a more experienced developer, whatever it takes for you to reasonably understand what it is you are writing. Nothing you write as a programmer should be magic to you - at least once you are past the Junior Developer stage.

Lets Dig In

I've seen this a few times in code that is in the wild at large corporations.

How Not To Lock

In the code below look at the method NotALock(). An experienced developer should immediately see this and know there is something wrong

using System;
using System.Threading;
using System.Threading.Tasks;
 
namespace This_Is_Not_A_Lock
{
    public class NotLocked
    {
        private int counter = 0;
        private int timesCalled = 0;
        private static object locker = new object();
 
        public void NotALock()
        {
            Interlocked.Increment(ref timesCalled);
            locker = new object();
            lock (locker)
            {
                Interlocked.Increment(ref counter);
                Thread.Sleep(50);
                Interlocked.Decrement(ref counter);
            }
        }
 
        public void IsALock()
        {
            Interlocked.Increment(ref timesCalled);
            lock (locker)
            {
                Interlocked.Increment(ref counter);
                Thread.Sleep(50);
                Interlocked.Decrement(ref counter);
            }
        }
 
        public void WriteValue()
        {
            Console.WriteLine("Counter={0} Called={1}", counter, timesCalled);
        }
 
        public void Reset()
        {
            counter = 0;
            timesCalled = 0;
        }
    }
 
    internal class Program
    {
        private static void Main(string[] args)
        {
            Console.WriteLine("\r\nDemonstrating not a lock.\r\n");
 
            var nl = new NotLocked();
            var t = Task.Factory.StartNew(() =>
            {
                while (true)
                {
                    nl.WriteValue();
                    Thread.Sleep(500);
                }
            });
            Parallel.For(0500new ParallelOptions { MaxDegreeOfParallelism = 128 }, (i) =>
            {
                nl.NotALock();
            });
 
            Console.WriteLine("\r\nTransitioning to an actual lock.\r\n");
            nl.Reset();
 
            Parallel.For(0500new ParallelOptions { MaxDegreeOfParallelism = 128 }, (i) =>
            {
                nl.IsALock();
            });
        }
    }
}

Locking explained

Here is a very nice and technical explanation by Joe Albahari http://www.albahari.com/threading/part2.aspx. This is a mutually exclusive lock - only one thread is allowed inside the lock(locker) { } constraint at a time. When a lock is written correctly it uses the variable lock(variable) to "track" the mutual exclusion. When I say "track" I am just trying to simplify things. You can read more about the actual technical details and realities involved in locking in the link.

Here is how I like to explain a lock/mutex. A mutex is a door that controls access to a room. The room is the code between the curly braces. At the final/end curly brace the "door" is unlocked. Correct code looks like this. If you are a geek and following along - the door object would be static and be initialized once in the static constructor.

lock (door)
{
    room();
}

The door is the control point, no one can gain access to the room except through the door. If it is written properly locks work just fine. However if you write the lock like the NotALock() method then you basically get this.

door = new object();
lock (door)
{
    room();
}

What this code does is create a new door to the "room" for everyone - we call those threads - that wants to enter the room. Each door will only let one thread in - but we are creating a new door every single time we want in. Imagine this...



Yeah, if you've seen Monsters Inc you know the scene. Doors everywhere. Well that's what happens when you new up a door object just before using it in a lock statement.

Not only are you creating a new door, you are creating an object that as soon as that lock is over, has to be garbage collected.

What To Do

If you see this in code that you work on. First, be brave. Second, explain to your team/manager/product owner why it HAS to be fixed. This is clearly not the intention of the original writer of the code. Locks written like this have no purpose, they do nothing and whatever they are supposed to be protecting is no longer protected.

If you have to explain it to someone who is not as technical then use the door analogy, it's not perfect but I find that it typically works pretty well.

The Code

You can just copy and paste the code into a command line C# app (program.cs) and run it.

The lock should only allow the value of counter to become 1/one. Never more than one

You'll see that NotALock() does not keep the counter from becoming more than 1/one. It seems to max out at about 5/five on my machine which is a quad core AMD A10. When the code gets to the transition point and switches to the correct locking mechanism you will see that the value NEVER increases above 1/one. Another side effect of the code being correct. The code that calls IsALock() takes longer to execute. Why is that? That is because the lock is working and only one thread is inside at anytime. And each thread has to wait individually at the Thread.Sleep(50) method call. So each thread has to wait 50 milliseconds. so the given number of calls is 500, in the Parallel.For, so 500 * 50 is 25,000 milliseconds, so the code should take at least 25 seconds to run.

Egoless programming

You are not the code you wrote yesterday. Your code does not define you. Code reviews are important. If you see something that you feel or you know is broken you MUST say something. The person who wrote the code should not take offense because we are not the code we wrote... No one is attacking anyone for the code they wrote, we are just asking questions, checking assumptions and sometimes educating.

We all have conversations with other developers that include teaching, and being taught. A lot of what we do is grunt work, and a lot is very technical, very demanding, very difficult work. We need to get it right. If you work on a piece of code that you feel is not important, that doesn't have the newest whiz bang feature doesn't mean that it does not need to be crafted as carefully as that front line application that everyone in the company uses or that everyone on the Internet uses.

Friday, May 27, 2016

Inlining a Method Call for Performance

Caveat

I am NOT a math person. I often have to deep dive/study anything related to math that is beyond what most people would consider "easy" math. You know the stuff, addition, subtraction, multiplication and division. If I can break it down to the basics then I can understand it. I may at times while talking about matrix math go down a path that is just plain wrong. I'm admitting that up front. I am however engaging a friend that is a math teacher/wizard to help me out with some of these concepts. So I hope I don't get too much wrong about matrix math while working in an area I do understand - performance.

On with the show

So previously we saw that with large data structures the method of accessing memory can be important. http://mikeytalksaboutcode.blogspot.com/2016/05/faster-code-understanding-memory-access.html

Today we are going to look at inlining a method call for performance reasons.

Why

When a CPU encounters a branch instruction - any instruction that causes the instruction pointer to move somewhere that is not the next instruction. When this happens the CPU does all manner of non performant things.

It may jump to a location that is not in the instruction cache and then it has to flush/load the cache.

It will likely kick out work that has been completed in the CPU pipeline.

If you're interested here is an article to read on the instruction pipeline/prefetch queue
https://en.wikipedia.org/wiki/Prefetch_input_queue is pretty good.

Slower Matrix Multiply

Remember we optimized the memory layout last time. So lets look at the unmodified, "slow" code. This code is in a new class that I am checking in MatrixBaseNonPerformant. This class will implement all functionality in a base, easy to understand way, so that we can always look and compare functionality to ensure that an optimized class does what it is supposed to do. Because if an optimized class returns garbage results really fast it still returns garbage and that would make it unusable.

public virtual IMatrix Multiply(IMatrix m2)
{
    IMatrix m3 = new MatrixBase(m2.Columns, m2.Rows);
    for (int row = 0; row < this.rows; row++)
    {
        var tempc = (row * this.columns);
        for (int cola = 0; cola < m2.Columns; cola++)
        {
            float val = 0;
            for (int colb = 0; colb < this.Columns; colb++)
            {
                val += this[row, colb] * m2[colb, cola];
            }
            m3[tempc + cola] = val;
        }
    }
    return m3;
}

So in the innermost loop we are accessing two matrices which effectively makes two method calls.

val += this[row, colb] * m2[colb, cola];

We can inline these two method calls, inlining simply means that we take whatever code is in the method and move it to the place we are calling it from.

So the code that is being called looks like this

public virtual float this[int row, int column]
{
    get { return data[(column * rows) + row]; }
    set { data[(column * rows) + row] = value; }
}

So we just need to perform this simple math inside our Multiply method. Which results in the inner value calculation looking like this:

val += this[(colb * rows) + row] * m2[(cola * columns) + colb];

Let's take a look and see if we have accomplished anything.

Performance Results

Using the MatrixBaseNonPerformant class.
1000 x 1000 matrix 
With the method call : 11,194.67
With the method call inlined: 7,084.33

So we can see that a simple repetitive call inlined can make a big difference, about 30%, in the performance of the code we are running.

Next Time

Refactoring. This code is getting very messy and it's time to clean it up.

Saturday, May 21, 2016

Faster Code - Understanding Memory Access Patterns

The Problem

When you are working with large data sets you need to understand the way the data is laid out in memory and how the code will be accessing that data.

I am going to take a look at some example code to help us understand how memory access patterns can affect the performance of our software.

Our example is matrix multiplication - see here https://en.wikipedia.org/wiki/Matrix_multiplication.

A small 10 x 10 matrix filled with 8 byte doubles will only take 800 bytes. Unless you are working with 10,000 you may not need to worry about your memory access patterns.

As has been said many times in many other places "Premature optimization is the enemy". A lot of time can be wasted working on the wrong problem.

Let's get back to our matrix multiplication problem. Think about the memory footprint of two 1000 x 1000 matrices, or three if you want your answers in another matrix. 1,000 * 1,000 = 1,000,000 * 8 = 8,000,000. So each one takes 8 megabytes in memory. That's not a lot if memory but your CPU doesn't access memory directly, it accesses memory through the cache. Since the cache is much smaller than main memory data has to be swapped in and out. The smaller your data set the less swapping that will happen. The larger your data set the more potential you have for invalidating the cache and causing it to have to load data from main memory repeatedly.

I've posted the example code at https://github.com/mikeywashere/Matrix-Multiply

Since the matrices are large and we are multiplying two matrices and placing the results into another we will be accessing about 24mb of memory. So this is a nice way to show that when an algorithm needs to access larges amounts of memory it should try to do so in an orderly non-random manner when possible.

You can read here for more information on cpu/memory cache: https://en.wikipedia.org/wiki/CPU_cache

Accessing Data

When you are accessing memory or disk there is one thing that is true. Sequential access is the fastest way to access data. For spinning disks this is because of the actual mechanical time spent moving the head to get to data that is not stored sequentially. For random access memory it is because the CPU is faster than ram memory, If we just coupled our speedy CPUs to our slow ram we'd have a CPU that spent most of its time waiting to get something to work on.

This is why fast CPUs have cache memory. Cache memory sits between our CPU and the random access memory. Random access memory has close to equal access speeds to any location in memory, The issue is that to keep the CPU working we need to supply it with new instructions and new data all of the time or it will stall, meaning that it cannot work until data is available.

I'm going to be talking about the x86 architecture, I'm not going to pretend I know anything about any other architecture. I know there are other caching architectures but I don't know enough about them to intelligently discuss them nor do I have hardware to run any kind of test on.

So with that said...

On the x86 architecture the Level 1 cache (referred to as L1) is split into "cache lines". A common cache line size is 64 bytes. When memory is accessed that is not in the cache a cache line worth of data is fetched. If that cache line contained data that is "dirty" (not yet written to main memory) then that data needs to be written to main memory and then the cache line needs to be filled with the data from the main memory.

If our memory access is sequential then the cache lines are filled and written in a predictable way, If our memory access is random then a cache line might be filled on one instruction, flushed, then filled again a few instructions later. With a random pattern of access we could force the CPU to stall while waiting for data. And then stall again while waiting for that same data. Each stall isn't a terrific waste of time of the CPU has something else to work on, but if our code forces stall after stall we will definitely see a performance impact.

The Code

In the project Matrix at I've posted the example code at https://github.com/mikeywashere/Matrix-Multiply you will find four files of interest.

The interface IMatrix that describes how to access a basic matrix. An indexer with an x and a y position and two properties Rows and Columns that tell how large the matrix is.

MatrixBase witch implements a few basic methods. MatrixMultiply, RandomFill and ToString along with basic data access. This class also implements the data storage as a single dimensional array which lets us control the access order, row or column optimized. Row optimized accesses the memory sequentially if the row index is incremented by one, column optimized accessed the data sequentially

RowOptimizedMatrix which is basically just redefining MatrixBase because it is already accessing the data in a row optimized way.

ColumnOptimizedMatrix which changes the data access to be in a column optimized way.

The Test

The basis of the test is this data:

const int repeat = 3;
const int size = 1000;

static MatrixBase co1 = new ColumnOptimizedMatrix(size, size);
static MatrixBase ro1 = new RowOptimizedMatrix(size, size);
static MatrixBase co2 = new ColumnOptimizedMatrix(size, size);
static MatrixBase ro2 = new RowOptimizedMatrix(size, size);

static MatrixBase aco1 = new ColumnOptimizedMatrix(size, size);
static MatrixBase aro1 = new RowOptimizedMatrix(size, size);

You'll also find in the matrix multiply code that there are a few parallel options defined.

var parallelOptions1 = new ParallelOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount * 2
};

var parallelOptions2 = new ParallelOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount * 4
};

I found that using option 2 on my machine yielded the best results. You mileage may vary.

In program.cs there are tests for optimized and several non-optimized scenarios.

On my machine, an AMD A10 6800K APU with four logical processors the winner is "Optimized input and column output" although "Optimized input and row output" was a very close second

The Results

On my machine, an AMD A10 6800K APU with four logical processors the winner is "Optimized input and column output" although "Optimized input and row output" was a very close second

Running: Optimized input and column output
ms: 3,427.00
ms: 3,238.00
ms: 3,184.00
Optimized input and column output: 3,283.00
Running: Optimized input and row output
ms: 3,320.00
ms: 3,565.00
ms: 3,260.00
Optimized input and row output: 3,381.67
Running: Non-Optimized, input 2 matrices by column and column output
ms: 6,920.00
ms: 6,829.00
ms: 6,799.00
Non-Optimized, input 2 matrices by column and column output: 6,849.33
Running: Non-Optimized, input 2 matrices by column and row output
ms: 6,946.00
ms: 7,014.00
ms: 6,671.00
Non-Optimized, input 2 matrices by column and row output: 6,877.00
Running: Non-Optimized, input 2 matrices by row and column output
ms: 6,981.00
ms: 7,174.00
ms: 7,283.00
Non-Optimized, input 2 matrices by row and column output: 7,146.00
Running: Non-Optimized, input 2 matrices by row and row output
ms: 7,288.00
ms: 7,656.00
ms: 7,345.00
Non-Optimized, input 2 matrices by row and row output: 7,429.67

Why

The matrix multiply method accesses memory from one matrix in a row ordered method, and from a second matrix in a column ordered method. By creating data structures that are optimized for their usage scenario you can see a performance gain of up to two times (on my system).

Being twice as fast at anything is good. Understanding and being able to prove why one piece of code is faster than another is very good.