0: A short series on Coroutines and Modern C++

Coroutines: Basics, Py->C++20, Benchmarking

This is a mini-series leading up to implementing coroutines in C++20, (“what I learnt on my holiday”). It also has comparison with other implementations, as part of my ModernCPP explorations. (first post)

Introduction

This is the first in a mini-series of posts leading up to implementing coroutines in C++20.

This is something I did while on holiday, as were most of these notes, I’ve just editted this into some better shape. It’s really part of an infrequent series of posts on Modern C++. The reason this became a (sub) series of posts itself rather than one is because it became, frankly, huge and unwieldy. So instead, I’ve split it and this is an overview post which links to the individual posts in the headings - which go into greater depth.

First of all I cover what coroutines are, using python’s PEP 255 simple generators as an example. I then move on to what they can look like in both old style C++ and modern C++20 form.

Given the real focus of this series of posts is “Modern C++”, the post on the implementation of C++20 coroutines is remarkably lengthy, but stay with it. As a contrast, I include a post on implementing coroutines in C++03 - which is by contrast, remarkably short. I’m really undecided on which is actually “better” at this stage.

The reason I’m undecided is because after benchmarking, the performance of C++20 coroutines isn’t what I would expect. While I expected the C++03 version to have a difference, I didn’t expect it to be that much better. (However, the C++03 version is definitely a little … odd .. even from a C++ perspective)

The rest of this post gives a brief summary of each post under headings/links to each post in the series.

Where’s the code?

Update: 11/4/2023

Want to read and runcode, not the blog? I know the feeling. I recently created a github repo specifically for putting code from blog posts in. Though recognising that it might also have non-code stuff added at some point, I’ve called it blog-extras.

The blog-extras repo can be found here:

https://github.com/sparkslabs/blog-extras

The code specific to this series of short posts can be found here:

https://github.com/sparkslabs/blog-extras/tree/main/by-date/2023/Coroutines/0

That repo will get fleshed out with code/examples from other posts in this blog too.

What are coroutines?

Coroutines are subroutines that can return and continue. Consider the following simple function:

def printing_fibonnaci():
    a, b = 1,1
    while True:
        print("Fib:", a)
        a,b = b, a+b

When run, this prints the fibonacci sequence. If we have another function that wants to use this sequence to calculate the golden ratio, it can’t do this:

# NOTE: This is not valid, it won't work
last = 1
for i in printing_fibonnaci(): # NOTE: This won't work
    gr = i/last
    print("Golden Ratio:", gr)
    last = i

If however we replace “print” with “yield”, we can do this:

def fibonnaci():
    a, b = 1,1
    while True:
        yield a
        a,b = b, a+b

last = 1
for i in fibonnaci():
    gr = i/last
    print("Golden Ratio:", gr)
    last = i

The reason for this is because the “yield” statement turns fibonacci from being a function (subroutine) to a generator (simplified coroutine). In particular calling the function returns an object that can be iterated over to yield new values.

>>> x = fibonnaci()
>>> x.__next__()
1
>>> x.__next__()
1
>>> x.__next__()
2
>>> x.__next__()
3
>>> x.__next__()
5

From this simple mechanism you can compose all sorts of useful systems and it underlies an awful lot of the modern async systems. For some 20 year tech in this arena see my work on Kamaelia and in particular mini-axon.

However, the whole async world in Python over the past 20 years has grown significantly beyond PEP255 generators - so this just scratches the surface.

What CAN C++ Coroutines look like?

C++ doesn’t enforce a particular style of coroutine. So any discussion of coroutines tends to include discussion of the implementation of the coroutine library. I think this is a bit of mistake in C++20 because it replicates the problems of memory management prior to the existence of unique_ptr/shared_ptr. You get the pieces to build things, but you’re expected to DIY a working coroutine system. (or pick someone else’s idea of what they should look like)

C++20 Based generator

This can look similar to the Python version:

// Generator definition:
simple_generator<int> fibs(int max) {
    int a{1}, b{1}, n{0};
    for(int i=0; i< max; i++) {
        co_yield a;
        n = a + b;
        a = b;
        b = n;
    }
}

// Usage:
for(auto i : fibs(20) ) {
    print("FIB : {}\n", i);
}

Pros: Looks clean, not too dissimilar to other languages, looks “C++”-y
Cons:
- Implementation overhead (cognitive and performance)
- What the lines above actually mean is NOT standardised. You might think they mean specific things, and behaviour but you have no guarantees.
- In particular, there’s no hint that this is dependent entirely on the implementation of generators that this code is relying on. (It’s essentially no better than pseudocode without specifying that implementation)

C++03 Based generator

This builds on some Duff’s device style trickery and doesn’t look too bad, but has some oddities. (The implementation this uses is about 18 1/2 years old and somewhat naive C++)

// Generator definition
class Fibonnaci: public Generator {
    int a, b, s;
public:
    Fibonnaci() : a(1), b(1), s(0)  { }
    int next() {
    GENERATOR_CODE_START
        while ( true ) {
            YIELD(a);
            s = a + b;
            a = b;
            b = s;
        };
    GENERATOR_CODE_END
    };
};

// Usage:
void runFibonnaci() {
    Fibonnaci a;
    for(int j=0; j<22; j++) {
        try {
            std::cout << "Yield in Fibonnaci:"<< j<< " value:" << a.next() << std::endl;
        } catch(StopIteration null){
            std::cout << " Exception Caught" << "...\n";
            break;
        }
    };
}

Pros:
- The usage here is similar to how python’s works under the hood
- Lightweight
- Much higher performance
- Clear route to serialisation/pausing of state
- Relies on a feature of the language that won’t go away
- Oddness means you’re more likely to look up what this actually does?
- Not likely to be implemented differently by a “similar” library
Cons:
- A bit odd even for C++
- “Local” variables are explicitly attributes of the generator object leading to initialisation of local variables being a bit “odd”.
- Relies on preprocessor macros to function.
- Probably more around the oddity of implementation
- Without enabling exception handling in emscripton this will not work in a webassembly compiled version of this. (we’d need to use a sentinels instead) (See emscripton:C++ Exceptions Support)

Implementing C++20 and C++03 Generators

The implementation of these is somewhat out of scope in an overview like this. I can provide some headlines here though.

C++03 Implementation

The C++03 version is fundamentally a combination of:

Duff’s device - to implement “yield”
The __LINE__ magic variable that C compilers auto populate.
Classes - to provide the equivalent of a stack frame
Pre-processor macros - to clean up the code

The actual code for this is 24 lines of relatively low complexity code. It’s also 18 1/2 years old, and not really needed changes in that time. (I also use this code in Pyxie - a sometimes worked on py-/-C++ compiler of mine)

It’s also loosely based on a post by Simon Tatham about coroutines in C as used by the Putty ssh client.

C++20 Implementation

The C++20 version has the following core attributes:

You define a base generator class. Part of this has subclass which defines how the coroutine operates
This s templated to allow changing the return type.
This class provides an external API for people implementing application specific coroutines (like fibonacci) to use.
Your coroutine class contains a subclass that defines an internal API. This is hooked into by the compiler when specific things happen at runtime (like co_yield, co_return, etc)
A second external API to enable use in for loops.

The overview of this looks like this:

template<typename T>
class simple_generator {
public:
    class promise_type; // Forward decl - to be defined lower inside this class
    using handle_type = std::coroutine_handle<promise_type>;
private:
    handle_type mCoro;

public:
    // External API implementation

    class promise_type { } // Internal coroutine API
    class iterator { } // Actual iterator impl
    // iterator external API
    // TBD - see below
}

The details of this are somewhat involved and definitely a lot longer than 24 lines, and there’s definitely a higher cognitive complexity involved! I’m going to leave discussing implementation there in this post, and encourage you to look at the post that covers this instead.

This leads to one of the downsides of C++ coroutines: any you write are very specific to the particular brand of coroutine or generator implementation libraries. On the surface of things, this leads to people going “Oh, just use Lewis Baker’s cppcoro library` – except that’s now not maintained.

IMO, C++20 coroutines as a result feel a bit like how smart pointers used to be before unique_ptr/shared_ptr came into existence/gained traction. (Do you use your own, do you use boost’s or …)

Benchmarking Generators

Not to put too fine a point on it, performance follows underlying implementation complexity. The benchmarking post also “only” really measures the “context switch” time.

The code that’s being run in the body of the co-routine will run at the same speed for the language so the benchmarks are a little pathological.

As a result it’s not a spoiler to note that the Python code is the slowest and the C++03 code is the fastest. The margin of difference between C++03 and C++20 does give me pause.

Discussion

The key comments in the benchmarking post contain a brief discussion of the results. However, in summary, I think the key points are:

The context switch in C++ is faster than in python
The C++20 coroutines/generators are clearer than the C++03 equivalents and more idiomatic for the language and look closer to the python equivalents
The C++03 implementation is dramatically faster than both python and C++20.
If you need raw performance for context switching, switching to a C++03 style implementation is perhaps a good idea.

If you’re interested and writing systems in C++, the C++20 approach is probably easier to start off with at least. However if you’re generating C++ code from another system - like I do with Pyxie (my slowly evolving python->C++ compiler) sticking with the C++03 coroutines has real benefits.

I do find C++20 coroutines a little disappointing considering the very naive C++ code I wrote 18 1/2 years ago performs so much better. However, the clarity of the C++20 code cannot be underestimated - especially when debugging systems.

NEXT POST: The first post in this short series asks (and answers!) “What are coroutines?”

Updated: 2023/09/15 21:12:30

Sparks

C++20 Coroutines: a short series