Sparks

Coroutines 5: Performance of C++03, C++20 and Python generators

Coroutines 5: Benchmarking Generators (& Conclusion)

So far in this series we’ve covered what coroutines are, what (python PEP255 style) simple generators are, how they can look in C++, looked at C++20 and C++03 implementations. The question arises - what is the performance?

Coroutines and generators can be used for things a lot more interesting than fibonacci sequence generation. However it it does form a useful basis for benchmarking the cost of:

… at least when comparing like with like. Primarily it gives you a “cost” for context switching - since you switch contexts when a value is yielded and consumed.

Given this I’ve done some very simplistic benchmarking, specifically:

The harness for each of implementation follows.

The C++03 harness:

#include "cpp03generators.h"

class Fibonnaci: public Generator {
    int a, b, s;
public:
    Fibonnaci() : a(1), b(1), s(0)  { }
    int next() {
    GENERATOR_CODE_START
        while ( true ) {
            YIELD(a);
            s = a + b;
            a = b;
            b = s;
        };
    GENERATOR_CODE_END
    };
};

void timeFibonnaci() {
    Fibonnaci a;
    for(int j=0; j<22; j++) {
        try {
            a.next();
        } catch(StopIteration null){
            std::cout << " Exception Caught" << "...\n";
            break;
        }
    };
}

int main(int, char **) {
    for(long k=0; k<1000; k++) {
        for(long j=0; j<1000; j++) {
            for(long i=0; i<1000; i++) {
                timeFibonnaci();
            }
        }
    }
    return 0;
}

The C++20 harness:

#include "cpp20generators.h"

simple_generator<int> fib(int max) {
    int a{1}, b{1}, n{0};
    for(int i=0; i< max; i++) {
        co_yield a;
        n = a + b;
        a = b;
        b = n;
    }
}

void timeFibonnaci() {
    auto a = fib();
    for(int j=0; j<22; j++) {
        if (a.running()) {
            a.try_next();
            if (a.running()) {
                a.take();
            }
        }
    }
}
int main(int, char **) {
    for(long k=0; k<1000; k++) {
        for(long j=0; j<1000; j++) {
            for(long i=0; i<1000; i++) {
                timeFibonnaci();
            }
        }
    }
    return 0;
}

Finally, python harness:

def fib():
    n, a, b = 0, 1, 1
    while n<91: # Match C++, not really needed
        yield a
        a, b = b, a+b
        n +=1

def timeFibonnaci():
    f = fib()
    for i in range(22):
        f.__next__()

for k in range(1000):
    for j in range(1000):
        for i in range(1000):
            timeFibonnaci()

Benchmark Results

This is the output for each of the 3 test harness using the linux time command.

C++03 C++20 Python
real 1m47.686s 16m21.805s 54m27.302s
user 1m47.683s 16m21.680s 54m26.746s
sys 0m0.000s 0m0.048s 0m0.080s

Note:

This really surprised me. I expected the C++03 approach to be quicker because it does less, but not that much faster! Secondly, while I expected the C++20 code to be faster than python, I didn’t expect it to be just 3.3 times faster.

That said, I don’t think this is enough of a bad strike against the C++20 version to discount using C++20 coroutines.

However, it does give pause.

Discussion, some Conclusions

What matters most in implementations? There can be a lot of possible answers to this. These are a few that spring to might.

Out of those points, given coroutines are a mechanism for implementing concurrency, I think clarity and correctness are the most important. On each point:

On the upside, C++20 coroutines are still faster than python - just not as fast as we expected or would like really! We’re not measuring raw language speed, but rather the overhead of a certain kind of context switch. This simple benchmark isn’t great, however I’d really expected the C++20 code to be similar in speed to the C++03 code. I definitely expected it to be faster than it is. Maybe it can be, but nonetheless this was very surprising.

The C++03 code essentially uses an object version of Duff’s device to build it’s coroutines, rather than the C99 style. There’s definitely some advantages to clarity - since essentially you’re building a state machine and using pre-processor Macros to hold it together. This leaks through to the code created by someone using these snippets. This could be prone to error.

By contrast the C++20 code has a fair amount of heavyweight shenanigans going on under the hood - but the result in terms of code clarity can’t really be denied. If I was a maintainer, I might prefer the C++20 version to debug and understand over the C++03 version. It’s problematic for performance - which is an issue in a language primarily used for performance. (Not exclusively - embedded systems spring to mind)

If we were ever in the situation that required raw speed in the context switch, if we could (perhaps) provide a common API to both the C++03 and C++20 coroutines, then we have a mechanism for performing optimisations should they become necessary.

What this doesn’t mean

This doesn’t mean that C++20 is slow. It means that the infrastructure around co-routines in C++20 is more complex. C++20 defers more decisions to the user of the functionality. This benchmark is (I think, I hope) testing that infrastructure difference cost - and reflects the increased complexity.

It’s likely that real world performance between C++20 and C++03 coroutines with realistic workloads will be similar. After all, the bulk of someone’s code is likely to be application related activities rather than inside that infrastructure. That’s a guess though at this stage. This is fairly simple to check though: by building similar systems with both and then benchmarking at a systems level.

OK, that’s twice as much work as I intended but it should be useful and the results are likely to be interesting.

Moving Forward

Going forward when C++-ifying the mini-guild tutorial, I’ll probably use both languages when implementing the steps involved.

Furthermore, when taking this forward, it does suggest though is that going forward and implementing my guild API (not just miniguild) in C++ would be best done in C++03 and C++20 in parallel - simply because of this huge performance difference. If the difference was more like “within 10-30% between the two versions” that’d enable focus on just one implementation, but this is a bigger difference than expected.

I’ll probably come to this after (or maybe alongside) the other posts in this series about ModernC++ as and when I write them.

Anyway, that’s this mini-series over. I learnt a fair amount about C++20 writing the code for it and then explaining it - I hope it was of interest. Let me know?



Updated: 2023/09/15 21:12:35
Copyright © 2023 Michael Sparks, All Rights Reserved (unless stated otherwise!)