Go vs C benchmark. Could Go be faster than C?
During last semester I was attending Multiprocessor Architectures course, given at Facultad de Informática where I study my Computer Science degree.
As part of the assignments due to pass the course, we had to do several programs written in C to benchmark matrix multiplication by testing different techniques and technologies. First of all we had to do a secuential program in three different versions:
- A normal one where the result matrix is ordered by rows and the loops range the matrix by rows too
- An “inter” version where the result matrix is ordered by rows but the loops range the matrix by columns (loops exchange).
- A “noacum” version where the variable keeping up the partial sums of the elements is deleted and each time a value is calculated, it is written to and read from the result matrix.
Later on, we had to do a parallel implementation using OpenMP, which will not be explained here, and another one using 3 threads. In this version we have other three versions:
- One where we use bounded threads (system or kernel level threads) and we parallelize elements of the result matrix. Each thread is bounded to a system process and there is real parallelism.
- Another one where we use bounded threads but we parallelize entire rows of the result matrix.
- A last one where we use unbounded threads (user level threads) parallelizing elements of the result matrix where there is no real parallelism as we multiplex several threads into a system process.
There are some considerations to be made. The programs where made and compiled for a 64 bit machine using Linux CentOS 6.0 with kernel 2.6.32. This machine is part of a cluster though programs were running in a single computer, and mounts two Quad-core ©Intel Xeon X5450 microprocessor at 3GHz. So we have a 8 core machine in 2 dies, which each of them has separate Cache L1 for instructions and data with 32KB capacity each and a write-back policy, and they share a 12MB Cache L2. Some interesting values in the configuration of the machine are that max locked memory is limited to 64KB per user and the stack size is 10,240 KBytes.
Another considerations are that this version of Linux doesn’t support unbounded threads in C, so we will not have into account their execution times and that the Go versions have been done using the GOMAXPROCS function to determine the existence of real parallelism.
Before digging into the code and the time results let’s see which tests have been passed and how are the matrix we will be multiplying:
- Test 1
- Matrix 1: 800×900 – 720,000 elements.
- Matrix 2: 900×3000 – 2,700,000 elements.
- Result Matrix: 800×3000 – 2,400,000 elements.
- Test 2
- Matrix 1: 8000×90 – 720,000 elements.
- Matrix 2: 90×8000 – 720,000 elements.
- Result Matrix: 8000×8000 – 64,000,000 elements
- Test 3
- Matrix 1: 8000×10 – 80,000 elements.
- Matrix 2: 10×80000 – 800,000 elements.
- Result Matrix: 8000×80000 – 640,000,000 elements.
So let’s analyze the source code proposed. To minimize the differences between the C an Go programs the global structure is being coherent between both languages although there could be better implementations in both. This test is supposed to compare how each language performs with a common implementation, rather than having its best performance in each version.
Source code shown here is explained. The full version can be downloaded on GitHub (https://github.com/rcostu/Go-vs-C-Matrix-benchmark).
Note.- GCC version used is [gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC)] and Go version is [0c39eee85b0d weekly/weekly.2011-12-06].
There are two different implementations depending on how matrix are declared. The files called Matrix, contain matrix defined as vectors of vectors in C as follows:
int **matrix1 = (int **) calloc(matrix1_rows, sizeof(int*)); for (i = 0; i < matrix1_rows; i++){ matrix1[i] = (int *) calloc(matrix1_cols, sizeof(int)); }
Go Version:
matrix1 := make([][]int, matrix1_rows) for i := 0; i < matrix1_rows; i++ { matrix1[i] = make([]int, matrix1_cols) }
The files named with vector declare matrix as a single vector making displacements such as matrix[i * cols + j], where i and j are the indexes of an element and cols is the number of columns in the matrix. They are declared in C as follows:
int *matrix1 = (int *) calloc(matrix1_rows * matrix1_cols, sizeof(int));
Go version:
matrix1 := make([]int, matrix1_rows * matrix1_cols)
Every single version has a init_matrix function which sets the matrix elements to 1. This function is executed in one thread both in secuential and parallel versions and parallelism is only applied to calculations.
Matrix multiplication is done as follows in C (or its equivalent in Go):
for (i = 0; i < matrix1_rows; i++) { for (j = 0; j < matrix2_cols; j++) { acum = 0; for (k = 0; k < matrix1_cols; k++) { acum += matrix1[i][k] * matrix2[k][j]; } matrixR[i][j] = acum; } }
The parallel version has this variables defined at the top (C version listed):
// Thread list pthread_t *thread_list; // Variables to get exclusive access pthread_mutex_t mutex; // Number of pending jobs int pending_jobs = 0; // Struct to store the job to be done. struct job { int i, j; struct job *next; }; // Simple linked list of total jobs. struct job *job_list = NULL; struct job *last_job = NULL;
Jobs are added an deleted from the list as indicated in this functions:
void add_job(int i, int j){ struct job *job = malloc(sizeof(struct job)); job->i = i; job->j = j; job->next = NULL; if(pending_jobs == 0){ job_list = job; last_job = job; } else{ last_job->next = job; last_job = job; } pending_jobs++; }  struct job* get_job(){ struct job *job = NULL; if(pending_jobs > 0){ job = job_list; job_list = job->next; if(job_list == NULL){ last_job = NULL; } pending_jobs--; } return job; }
Each launched thread executes the dispatch_job() function, where a job is obtained from the list and its calculations are done:
void do_job(struct job *job) { int k, acum; acum = 0; for (k = 0; k < matrix1_cols; k++) { acum += matrix1[job->i][k] * matrix2[k][job->j]; } matrixR[job->i][job->j] = acum; } void* dispatch_job () { struct job *job; while(1) { pthread_mutex_lock(&mutex); job = get_job(); pthread_mutex_unlock(&mutex); if (job) { do_job(job); free(job); } else { pthread_exit(NULL); } } }
And the main thread waits for all the threads to finish. Go version is slightly different. The dispatch_job() function is launched in a new goroutine (and will run in parallel if we have set the GOMAXPROCS to one or more processors). I have defined two different channels ch which will signal that a goroutine has ended, and chm which will give the goroutine exclusive access to the shared list.
The dispatch_job() function is as follows:
func dispatch_job() { var job *struct_job for { chm <- 1 job = get_job() <- chm if (job != nil) { do_job(job) } else { ch <- 1 runtime.Goexit() } } }
Now that we have analyzed the global structure of the code in both versions, let’s see how are time calculated and which are the final results. Each time has been taken by running the make testX command with the given makefiles for the X test (which runs the executable taking times with time command). Each time was taken 3 times to avoid high load of the server (which is a shared machine at the university), and the final taken time is the result of the formula: ((fastest time + slowest time) / 2 + middle time) / 2.
With this formula we take into account margin values but getting near the two nearest values.
So here are the times for each test, being the first figure the execution of the secuential versions, and the second one the parallelized one. Note that we don’t have unbounded threads (user level threads) in C because of Linux implementation. System stands for bounded threads (kernel level threads).
- Test 1
- Test 2
- Test 3





There are some things to remark here:
- Go version implementing the matrix as double-dimension arrays is very, very slow on test1. It gets better when the matrix are bigger.
- Single-dimension arrays are better than double-dimension arrays. This is logical because we have contiguous memory against non-contiguous memory. Go vector version is amazingly fast.
- Overall speedup is between 10-20% faster, being Go faster than C.
Although this test is not very scientific and implementations can be improved, I think it makes a point where in some kind of applications and depending on the implementation Go is faster than C. Where it is not, it should improve in the next Go 1 stable version.
I hope this post gets discussed and is able to gain new users in the Go community. Do not hesitate to fork the GitHub repository or leaving comment.
Profile cancel
Tweets
- RT @go_nuts: The Go sessions at Google I/O 2012 have been announced. https://t.co/C6yyVMqO #golang #io2012 7 hours ago
- RT @WilI_Smith: If you're important they'll make a way to be with you, if you're not they'll make an excuse. 7 hours ago
- RT @WilI_Smith: Over-thinking ruins you. Ruins the situation, twists things around, makes you worry & just makes everything much wor ... 7 hours ago
- @marsicor ¿Qué tal por mi tierra? Veo que siguen las murallas donde las dejé el fin de semana pasado ;-) 17 hours ago
- “@javiercuervo: Una ley injusta es como un diseño feo, cosas que pueden parecer prácticas pero que no van a funcionar.” /cc @nelsonmedinilla 22 hours ago
Blogs I recommend







Hi, in you Makefile you compile with -g. I know that additional debug information is stored in the executable, does this have an negative effect on performance? Additional you could use -O3 for your c code. I tested it on my side and saw that the whole textX process only takes 0.067 seconds in total for C on my 3ghz single core (just ht). Still if I do this 3 times the result wold not be useful. (http://www.azulsystems.com/events/javaone_2002/microbenchmarks.pdf)
Java once said (well not java itself =) ) that it would be faster than C++. Which was true until you use compiler optimization.
Please correct me if I see something wrong.
PS: I like go.
Hi,
Thanks for your comment. As far as I know additional debug included by compiling with -g does not affects performance but increases the executable size. If you see the next post I re-run all the tests compiling with -O2 to be fair with C comparing to Go compilation system. You can check it and see how times have changed quite a lot.
Running the tests 3 times was due to high loads in the machine, to avoid other processes interfere in the actual results as it is a shared machine with lots of people executing at the same time and with no control over queue planification.
Thanks for the link. Every benchmark has a little lie on it as there’s no such thing as equal performance due to a lot of factors to take into account.
Check the second part of my post and if you want we can discuss the new results. And if you like Go I suggest you to follow the official go-nuts mailing list if you do not yet!
I had this “old” link from G+. Seems like I missed the mailing list post. =)
hello Roberto,
the vector library has been made obsolete.
it is no longer part of Go1
there is something similar as a “builtin” version in go.
so if you try something like this (not complete listing just general idea)
type row struct {
line []int
}
matrix1 := make(row, matrix1_rows)
for i, _ := range matrix1 {
matrix1[i].line = make([]int, matrix1_cols)
}
then you can use slicing.
you see that i range over the matrix.
i don’t know if it’s faster but shouldn’t be slower than
a for loop and shows some syntactic sugar.
on slices:
to append a row to the matrix you could for example do following
matrix1= append(matrix1, somerow)
not sure if this is a faster mnethod but i think the cool stuff about Go is that you can express yourselves very concise and get an easily readable sourcecode.
thanx for the cool post and i wish you good times