For the past four months, I’ve been actively contributing to a new Rust-C++ project, and the process has taught me some valuable lessons. While I can’t disclose many project details, there are plenty of technical challenges worth elaborating on.
The linking problem
How to support TLS?
Thread or coroutine?
Benefits of using coroutines:
- Smaller memory cost, so we can create many more coroutines than OS threads.
- Context switches are faster because there is no syscall.
Pitfalls of using tokio:
A `Runtime` can only be created outside the “async context” of tokio. So if we need to use tuned `Runtime`s, we have to create them in advance, which involves a lot of refactoring.
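A minimal sketch of this “create the tuned Runtimes in advance” pattern, assuming two dedicated pools; the pool names, sizes, and the I/O-vs-CPU split are placeholders, not the project’s actual setup:

```rust
use std::thread::available_parallelism;
use tokio::runtime::{Builder, Handle, Runtime};

fn main() -> std::io::Result<()> {
    // Build the tuned runtimes up front, in plain (non-async) code, so no
    // async code ever has to construct a Runtime itself.
    let io_rt: Runtime = Builder::new_multi_thread()
        .worker_threads(4)
        .thread_name("io-pool")
        .enable_all()
        .build()?;

    let cpu_rt: Runtime = Builder::new_multi_thread()
        .worker_threads(available_parallelism()?.get())
        .thread_name("cpu-pool")
        .enable_all()
        .build()?;

    // Handles are cheap to clone and safe to use from inside async code.
    let io_handle: Handle = io_rt.handle().clone();
    let cpu_handle: Handle = cpu_rt.handle().clone();

    io_rt.block_on(async move {
        // From async code we only touch the pre-built handles.
        let io_task = io_handle.spawn(async { /* I/O-bound work */ });
        let cpu_task = cpu_handle.spawn(async { /* CPU-heavy work */ });
        let _ = tokio::join!(io_task, cpu_task);
    });

    Ok(())
}
```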
Shared Memory or Actor?
If we use the coroutine runtime, we still have to decide how to handle race conditions: protect shared state with locks (shared memory), or confine the state to a single owner and communicate via messages (actor). A minimal sketch of both options follows.
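The sketch below assumes a tokio runtime; the counter and the `Msg` type are illustrative only:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, oneshot, Mutex};

// Shared memory: every coroutine clones the Arc and locks around the
// critical section.
async fn increment_shared(counter: Arc<Mutex<u64>>) {
    let mut guard = counter.lock().await;
    *guard += 1;
}

// Actor: exactly one coroutine owns the state; everyone else sends messages.
enum Msg {
    Increment,
    Get(oneshot::Sender<u64>),
}

async fn counter_actor(mut rx: mpsc::Receiver<Msg>) {
    let mut count: u64 = 0;
    while let Some(msg) = rx.recv().await {
        match msg {
            Msg::Increment => count += 1,
            Msg::Get(reply) => {
                let _ = reply.send(count);
            }
        }
    }
}
```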
Why are “deadlocks” so hard to diagnose when using coroutines?
- There is neither a wait-for graph in the coroutine runtime nor one in the OS. `await` does not block a thread, so we can’t find anything with gdb/strace/perf. These “deadlocks” are hard to detect because it looks like there is no CPU usage and no blocked thread; the program is simply in a “vegetative state”. Coroutine frameworks like `tokio` provide some observability (o11y) tools, but they are hard to use and have performance overhead.
- No actual “deadlock”. These stalls are mostly “waiting for a train at a bus stop” errors. For example, we may read from a channel that will never be written to, which is an easy mistake to make when we bail out on an error without calling `.send()` first. So we recommend sending a `Result<T>` and implementing the `Drop` trait so that it automatically sends `Err(Error::DropWithoutReport)` as a last-minute remedy (see the sketch after this list).
- No actual “stack”. Coroutines don’t carry a real stack. When they hit an `await` they yield a continuation, and that continuation may be resumed on the same or a different thread.
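Here is a minimal sketch of that Drop-based remedy, assuming a tokio oneshot reply channel; the `ReplyGuard` name and the shape of the `Error` enum are illustrative, only `Error::DropWithoutReport` comes from the text above:

```rust
use tokio::sync::oneshot;

#[derive(Debug)]
enum Error {
    DropWithoutReport,
    // ... real error variants ...
}

/// Owns the reply sender for one request. If the handler bails out early
/// (`?`, early return, panic unwinding) without replying, `Drop` sends a
/// fallback error so the waiting coroutine is not parked forever.
struct ReplyGuard {
    tx: Option<oneshot::Sender<Result<u64, Error>>>,
}

impl ReplyGuard {
    fn new(tx: oneshot::Sender<Result<u64, Error>>) -> Self {
        Self { tx: Some(tx) }
    }

    /// Report the real result; taking the sender makes the later `Drop` a no-op.
    fn send(mut self, res: Result<u64, Error>) {
        if let Some(tx) = self.tx.take() {
            let _ = tx.send(res);
        }
    }
}

impl Drop for ReplyGuard {
    fn drop(&mut self) {
        // Last-minute remedy: wake the receiver with an error instead of
        // leaving it waiting on a channel that will never be written to.
        if let Some(tx) = self.tx.take() {
            let _ = tx.send(Err(Error::DropWithoutReport));
        }
    }
}
```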
Actor-Model Pitfalls
We must pay attention to panics inside the actor’s message loop: the handler, whether a thread or a coroutine, will only surface the panic when it is eventually joined, by which time the failure may have gone unnoticed for too long. What I recommend is to:
- Employ a panic hook (`std::panic::set_hook`) to capture the exact scene where things go wrong:

```rust
use std::panic;

// Install a process-wide hook so the panic location and message are reported
// the moment the panic happens, not when the task is eventually joined.
panic::set_hook(Box::new(|info| {
    // Report to both stderr and stdout.
    eprintln!("Task panicked: {}", info);
    println!("Task panicked: {}", info);
}));
```

- Eliminate `unwrap`s and `expect`s.
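One way to enforce the second point mechanically is with clippy’s restriction lints (a sketch; the project may rely on code review or a different lint set):

```rust
// In the crate root (lib.rs / main.rs): turn every unwrap()/expect() into a
// warning (or an error with `deny`), so fallible paths must be handled
// explicitly instead of panicking inside an actor's message loop.
#![warn(clippy::unwrap_used, clippy::expect_used)]
```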
Online config change
There are a couple of ways to do this (both are sketched below):
- For every actor, introduce a new UpdateConfig event and handle it in the message loop.
- Use `arc_swap` to share a single hot-swappable config value.
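A minimal sketch of both options; the `Config` fields, the `Event` enum, and the default values are placeholders:

```rust
use std::sync::Arc;
use arc_swap::ArcSwap;
use tokio::sync::mpsc;

// Hypothetical config type, for illustration only.
#[derive(Debug, Clone)]
struct Config {
    max_inflight: usize,
}

// Option 1: each actor keeps its own copy and handles an UpdateConfig event
// in its message loop.
enum Event {
    DoWork,
    UpdateConfig(Config),
}

async fn actor(mut rx: mpsc::Receiver<Event>) {
    let mut cfg = Config { max_inflight: 64 };
    while let Some(event) = rx.recv().await {
        match event {
            Event::UpdateConfig(new) => cfg = new,
            Event::DoWork => {
                // ... use cfg.max_inflight for this unit of work ...
                let _ = cfg.max_inflight;
            }
        }
    }
}

// Option 2: one hot-swappable value shared by everyone via arc_swap;
// no per-actor event is needed.
fn arc_swap_style() {
    let shared: Arc<ArcSwap<Config>> =
        Arc::new(ArcSwap::from_pointee(Config { max_inflight: 64 }));

    // Readers: lock-free load that always sees the latest stored snapshot.
    let snapshot = shared.load();
    let _ = snapshot.max_inflight;

    // The config watcher: swap the whole value atomically.
    shared.store(Arc::new(Config { max_inflight: 256 }));
}
```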