My Experience of Building a Hybrid Rust/C++ Project

Since April 2025, I have been actively contributing to a new Rust–C++ project. Through this work, I have gained many valuable insights. Although I cannot disclose most project details, there are numerous technical challenges worth discussing.

One of the most notable aspects of this project is that it has been developed alongside the rapid evolution of AI agents, which led us to encounter many pitfalls when practicing vibe coding.

About Vibe Coding: The benefits and the pitfalls

Pitfalls of Vibe Coding

In early 2025, at the initial stage of our project, one of our core contributors quickly prototyped a demo using Cursor, covering multiple modules such as the read scheduler, index writer, and meta service.

Traces of this early implementation can still be found in the following pull requests:

Prompt is the code itself

We can treat prompt as code. This is not a new idea in the industry, but many teams apply it unevenly. A common practice is to store prompts as markdown or templates under version control, so they can be reviewed, diffed, and rolled back like any other artifact. Some teams go further and build a “prompt registry” or config service to version prompts outside the codebase, and pair it with evaluation suites that act like unit tests for prompts (golden outputs, A/B runs, regression checks). Others embed prompts directly in application code as constants, which makes deployment easy but tends to hide intent and lose reviewability. The shared direction is clear: treat prompts as first-class assets with explicit structure, reviews, and tests.

I just wrote a PR, https://github.com/pingcap-inc/tici/pull/692/files, where the core file is prompts/0001-gc-cdc.md. That file is the prompt itself, and it is committed with the PR, so anyone can start a session by loading the same prompt. It becomes versioned, diffable, and reviewable like real code, and the team no longer depends on a hidden chat history. We can also divide the overall goal into several sub-goals and let AI agents implement them sequentially or in parallel, without having to restate the context every time.

I think this can effectively keep the AI agent focused on what it needs to do, so it generates better code with fewer resources.

Reviewers can write feedback directly in the Reviews section of the prompt doc, and I can pull that back into the next round of vibe. This workflow makes context length much less of a concern because the prompt is the canonical spec. A minimal shape looks like this:

# Goal

...

## Sub goal 1

In this sub goal, you need to ...

### Programming Style

### Musts

### Tests

## Reviews

Rust as the language of the “Vibe Coding Era”

As Rust becomes increasingly adopted in the “vibe coding era”, it does offer stronger guarantees against concurrency and memory errors. However, it is still too early to say that Rust is THE ONE.

Such an AI-agent-native language will likely consist of at least three distinct sub-languages:

  • One for expressing intent
    This is the most critical language, because developers need a more efficient way to understand what AI agents have actually done.
  • One for validating correctness
    This language corresponds to the intent language and is designed to describe test workflows more efficiently. Developers can fine-tune this part of the code to guide AI agents toward correct handling of corner cases.
  • One for concrete implementation
    This language is responsible for concrete implementation. Many AI agents, such as Codex, can already handle this layer well, as they rarely make trivial mistakes. Developers do not need to review this part of the code frequently, as other AI agents can handle the review instead.

Only with this separation can we truly balance readability, robustness, and long-term maintainability. Furthermore, many existing libraries could be rewritten to be more friendly to AI agents.

Use Skills

I introduced several SKILLs into our new project. For example, ManualTest enables AI agents to execute manual test cases automatically.

Implement tests

Hierarchy of tests

The problem

Because each TiDB component is maintained in a separate repository, breaking changes in one component require coordinated adaptations across multiple repositories. Unfortunately, such adaptations cannot be performed atomically and are often non-trivial. While compilation flags or configuration options can sometimes be used to temporarily disable new features, this strategy is not always applicable. In particular, interface changes such as FFI definitions may break compatibility immediately.

In our project, an end-to-end (e2e) test starts a full cluster and asserts it from a client’s perspective, which in our case means sending SQL queries to the database service. Some of these tests are included in our CI pipeline. However, CI-based e2e tests cannot reliably detect adaptation issues. This creates a classic catch-22: resolving an adaptation problem requires updating all related components, yet the e2e tests cannot pass while you are still fixing the first component. As a result, most e2e tests are deferred to what we call the “daily tests”.

Nevertheless, we still need a subset of e2e tests in the CI pipeline. Although these tests may occasionally produce false positives due to compilation or adaptation issues, they provide valuable systematic checks to ensure that a new commit in one component does not break existing rules or behaviors. Deferring such checks to daily tests would be disastrous, as it makes bugs significantly harder to triage. When issues accumulate over time, the project can easily fall into a “bug jail,” where fixing new problems becomes increasingly expensive.

It is also worth noting that integration tests cannot practically detect all logical bugs. In many cases, module owners write integration tests mainly to verify that their own modules work with others, while overlooking the impact their changes may have on the system as a whole. This issue becomes even more critical when AI agents are used to refactor our code, as we need safeguards to ensure that unexpected behavior does not compromise the foundation of the project.

During the development and PoC stage of our project, several critical issues occurred because the tests were not correctly implemented, including:

  • Module A uses the API of Module B in a wrong way. There is neither an integration test for Module A, nor is this scenario covered by the e2e tests.
  • Module C fails to verify a corner case, which is later caught by my embedded e2e test (introduced below). This kind of error is easy to overlook, because the code passes every test except one. However, that single test protects our system from an availability failure caused by a deadlock in Module C.
  • Another component changed its convention for constructing a field in an RPC request without informing us, which caused the system to malfunction at the SQL layer and made the issue difficult to investigate. This problem was also detected by my embedded e2e test.

The embedded e2e test

This idea is based on the observation that a component’s behavior is defined by how it communicates with other components, through RPC, FFI, shared memory, and similar mechanisms.

Therefore, mocking these communications in integration tests provides the following benefits:

  • We don’t need to start a full cluster, so we won’t face the adaptation problem.
  • If an adaptation issue occurs, it can be easily reproduced at this level. This not only simplifies the debugging process, but also increases our confidence in the code.
  • This test treats our program as a black box, which makes it easier to implement because we do not need to understand how each module is implemented. These tests are expected to remain stable unless the interfaces or communication frameworks change.

Tests as the Backbone of Vibe Coding

In a Vibe Coding workflow, tests become the primary communication channel between intention and code. Among all types of tests, the embedded end-to-end (e2e) tests play an increasingly important role.

Unlike unit tests, which specify local behavior, or integration tests, which usually verify a limited subsystem, my embedded e2e tests define system-level behavioral contracts. They describe what the system should do rather than how it should do it. This makes them naturally aligned with Test-Driven Development (TDD): they serve as executable specifications that drive the implementation.

Systematic choices

Thread or coroutine?

Benefits of using tokio:

  • Lower memory cost, so we can create more coroutines.
  • Context switches are faster because no syscall is involved.

Pitfalls of using tokio:

  • We cannot control the scheduling strategy of tokio’s runtime. For example, we cannot assign a priority to a specific task, nor can we limit the CPU quota of a particular class of tasks.
  • Switching to async code is often painful, as even the simplest function may become suspendable due to the use of tokio::sync locks.
  • It is hard to investigate deadlock / starvation problems.
  • It is hard to use itertools with async code. futures::stream can help, but it produces complex types.

Use separate Runtimes for different task pools?

A tokio Runtime can only be created outside the “async context” of another Runtime. So if we need tuned Runtimes, we have to create them in advance, which requires a lot of refactoring.

Propagate the panic outward

We must pay attention to panics inside the actor’s message loop: the handler, whether a thread or a coroutine, will only surface the panic when it is eventually joined, by which time the failure may have gone unnoticed for too long. What I recommend is to:

  • Employ the panic_hook to capture the exact scene where things go wrong.

    panic::set_hook(Box::new(|info| {
        eprintln!("Task panicked: {}", info);
        println!("Task panicked: {}", info);
    }));
  • Eliminate unwraps and expects

    #![cfg_attr(not(test), deny(clippy::unwrap_used))]
    #![cfg_attr(not(test), deny(clippy::expect_used))]

Shared Memory or Actor model?

If we use the coroutine runtime, we may need to decide how to handle race conditions.

Why are “deadlocks” so hard to diagnose when using coroutines?

  1. There is no wait-for graph, neither in the coroutine runtime nor in the OS
    await does not block a thread, so gdb/strace/perf show nothing useful.
    Meanwhile, these “deadlocks” are hard to detect because there is no CPU usage, no blocked thread, and the program sits in a “vegetative state”.
    Coroutine frameworks like tokio provide some observability (o11y) tools; however, they are hard to use and add performance overhead.
  2. No actual “deadlock”
    These stalls are mostly “waiting for a train at a bus stop” errors. For example, we may read from a channel that will never be written to, which is an easy mistake to make when we bail on an error without calling .send() first.
    So we recommend sending a Result<T>, and implementing a Drop guard that automatically sends Err(Error::DropWithoutReport) as a last-minute remedy.
  3. No actual “stack”
    Coroutines don’t carry a real stack. When they hit an await they yield a continuation, and that continuation may be resumed on the same or a different thread.

tokio::RwLock or std::sync::Mutex?

There is a common belief that we must always use tokio locks in asynchronous code. However, according to the tokio documentation, it is fine, and often better, to use synchronous locks such as std::sync::Mutex or parking_lot::Mutex, as long as the lock is never held across an .await point.

I’d like to refer to these cases as “atomic access structures”, because they all follow this pattern:

struct Wrapped {
    inner: Mutex<String>,
}

impl Wrapped {
    pub fn change_inner(&self, s: String) {
        // Deref the guard to overwrite the protected value; the lock
        // is released as soon as the guard goes out of scope.
        *self.inner.lock().expect("lock poisoned") = s;
    }
}

The key point of this code is to avoid exposing the lock itself: external callers must not be able to access it, and the lock must be released immediately after the protected value is mutated. The underlying rationale is that a coroutine must never “sleep” while holding the lock; this guarantees that no deadlocks will occur, because:

  • If a coroutine holds the lock, it will not “sleep”, because the code change_inner is structured to avoid calling .await while the lock is held. Moreover, the executor thread will not sleep either, since it is not waiting on any condition.
  • If a coroutine does not hold the lock, it can eventually acquire it, because the current holder will release the lock promptly. And of course, the lock is released before any suspension point.

Implementing an “incomplete” actor mode

In the traditional actor model, each actor node encapsulates its own private data. However, this model is difficult to implement because:

  • To rebalance data across nodes, we must introduce new message types and corresponding handlers.
  • Inspecting the internal state of actor nodes is difficult.

So, as a simpler alternative, we can:

  • Use a concurrent hash map to store all data, with each actor node mutating a portion of the map.
  • Allow other components to read or inspect entries in the concurrent hash map. Such inspectors cannot mutate the entries, and their access must be atomic.

A preferred candidate for the hash map is DashMap. Although this structure frees us from requiring &mut self, most of its methods return a Ref or RefMut that holds a lock guard, so incorrect usage can lead to deadlocks. The following code shows a simple example.

#[test]
fn test_dashmap() {
    let map = DashMap::new();
    map.insert(1, 1);
    map.insert(2, 2);

    for entry in map.iter() {
        println!("{} -> {}", entry.key(), entry.value());
        // `entry` keeps a shard lock alive; inserting into the same
        // shard here may deadlock.
        map.insert(3, 3);
    }

    println!("test end");
}

There is a simple yet effective way to detect potential issues in our code: use #[tokio::test] instead of #[tokio::test(flavor = "multi_thread")]. With the single-threaded runtime, the program will deadlock immediately if a coroutine “sleeps” while holding a lock, which makes such bugs easy to reproduce.

The linking problem

FFI

TODO

How to support TLS?

TODO

Online config change

There are some ways to update configs without restarting the program:

  • For every actor, introduce a new UpdateConfig event, and handle it in the message loop.
  • Use arc_swap.

I don’t think the service itself should persist the updated configuration to the config file. Instead, this should be handled by the operator.