Load Balancing in Wasm Libraries
In implementing Trealla Prolog's Go embedding, I took the approach of porting the interpreter to WebAssembly and wrapping that, instead of calling the native library directly.
This has a few advantages. We avoid the pain of cross-compiling cgo by letting wasmtime-go take care of that for us. Another is isolation: C oopsies like buffer overflows won't leak into our host program. The big disadvantage is speed: it runs about half as fast as the native version for me. There's a lot more to say on this topic, but I'll focus on an interesting approach to reducing lock contention that we can take advantage of with Wasm.
One cool thing about Wasm is that you can work with a guest program's memory on a higher level. Want to reduce startup times? Snapshot the guest's memory and copy it to a fresh instance later on. There are a lot of neat optimizations we can do with this.
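For example, with wasmtime-go you can snapshot and restore a guest's linear memory with a couple of copies. Here's a minimal sketch: `Memory.UnsafeData` is a real wasmtime-go method, but the helper names are mine, and the exact import path depends on which release you're using.

```go
package main

import (
	wasmtime "github.com/bytecodealliance/wasmtime-go"
)

// snapshotMemory copies the guest's entire linear memory out of the
// instance so it can be restored into a fresh instance later.
func snapshotMemory(store *wasmtime.Store, mem *wasmtime.Memory) []byte {
	// UnsafeData returns a []byte view over the guest's linear memory;
	// we copy it because the view is invalidated if the memory grows.
	return append([]byte(nil), mem.UnsafeData(store)...)
}

// restoreMemory copies a snapshot into another instance's memory.
// It assumes the destination is already at least as large as the
// snapshot (e.g. both were instantiated from the same module).
func restoreMemory(store *wasmtime.Store, mem *wasmtime.Memory, snap []byte) {
	copy(mem.UnsafeData(store), snap)
}
```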
A pool of interpreters
Let's say we're using Prolog to store some information we want to query from many different concurrently running goroutines. This is a common scenario when you're building a web service, for example, where every request runs concurrently.
Trealla, and Wasm in general at the moment, is inherently single-threaded. We have to protect the interpreter with a mutex to prevent multiple queries from stepping on each other. With many concurrent queries, this can lead to lock contention and negatively impact performance.
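Concretely, the single-interpreter setup amounts to something like this. It's a sketch, not the actual trealla-prolog/go API; `Interpreter` stands in for the real Wasm-backed type.

```go
package main

import (
	"context"
	"sync"
)

// Interpreter stands in for the Wasm-backed Prolog interpreter.
// The Wasm instance is single-threaded, so it is not safe for
// concurrent use.
type Interpreter struct{ /* wasm instance, memory, ... */ }

func (i *Interpreter) QueryOnce(ctx context.Context, goal string) (string, error) {
	// Runs a query to completion inside the Wasm instance.
	return "", nil
}

// SafeInterpreter serializes every query behind one mutex. Under
// many concurrent queries, this lock becomes the bottleneck.
type SafeInterpreter struct {
	mu sync.Mutex
	in *Interpreter
}

func (s *SafeInterpreter) QueryOnce(ctx context.Context, goal string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.in.QueryOnce(ctx, goal)
}
```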
However, we can run separate interpreters in parallel with no issue. What if we created a bunch of identical interpreters and distributed read requests among them? This is not a revolutionary idea. Indeed, Apache's venerable prefork model works the same way, and so do read replicas in the database world. Essentially, you're trading resources (in our case, mainly memory) for better parallelism.
I took a shot at implementing an interpreter pool like this for trealla-prolog/go. In a rough benchmark of 100 concurrent requests on my 12-core computer, the pool of 12 interpreters was about 4x faster than a single interpreter (average query returning in 0.18ms). Only 4x seems a bit low to me, so I'm sure there's room for further improvement.
The implementation
I experimented with a few different approaches, but settled on one that is very simple and still provides a decent performance boost for few-writes, many-reads scenarios.
- When a pool is created, it instantiates a leader interpreter and a configurable number of replicas.
- The pool is protected by a `sync.RWMutex`.
- Write transactions grab the write lock and write to the leader interpreter. Afterwards, the leader's memory is copied to all of the replicas.
- Read transactions grab a read lock and are executed against one of the replica (follower) interpreters. The replica is chosen by receiving from a buffered channel of idle interpreters, and is reinserted when the transaction is finished.
All in all, it was about 300 lines of relatively straightforward code.
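Condensed, the pool looks something like this. Again, a sketch rather than the actual trealla-prolog/go code: `Interpreter` and its `Run`/`SnapshotMemory`/`RestoreMemory` methods are hypothetical stand-ins.

```go
package main

import (
	"context"
	"sync"
)

// Interpreter is a stand-in for the Wasm-backed interpreter.
type Interpreter struct{ mem []byte }

func NewInterpreter() *Interpreter { return &Interpreter{} }

func (i *Interpreter) Run(ctx context.Context, goal string) error { return nil }

func (i *Interpreter) QueryOnce(ctx context.Context, goal string) (string, error) {
	return "", nil
}

func (i *Interpreter) SnapshotMemory() []byte    { return append([]byte(nil), i.mem...) }
func (i *Interpreter) RestoreMemory(snap []byte) { i.mem = append(i.mem[:0], snap...) }

// Pool distributes reads across replicas of a leader interpreter.
type Pool struct {
	mu       sync.RWMutex      // write lock for the leader, read lock for replicas
	leader   *Interpreter
	replicas chan *Interpreter // buffered channel of idle replicas
}

func NewPool(size int) *Pool {
	p := &Pool{
		leader:   NewInterpreter(),
		replicas: make(chan *Interpreter, size),
	}
	for i := 0; i < size; i++ {
		p.replicas <- NewInterpreter()
	}
	return p
}

// Write runs a goal on the leader under the write lock, then copies
// the leader's memory into every replica.
func (p *Pool) Write(ctx context.Context, goal string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if err := p.leader.Run(ctx, goal); err != nil {
		return err
	}
	snap := p.leader.SnapshotMemory()
	// All replicas are idle here: holding the write lock excludes readers,
	// and readers return their replica before releasing the read lock.
	for i := 0; i < cap(p.replicas); i++ {
		r := <-p.replicas
		r.RestoreMemory(snap)
		p.replicas <- r
	}
	return nil
}

// Read grabs the read lock and an idle replica, runs the query, and
// puts the replica back when it's done.
func (p *Pool) Read(ctx context.Context, goal string) (string, error) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	r := <-p.replicas
	defer func() { p.replicas <- r }()
	return r.QueryOnce(ctx, goal)
}
```

The buffered channel doubles as a free list: receiving blocks until a replica is idle, so it handles both load balancing and backpressure with no extra bookkeeping.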
The interesting part about all of this is that none of the Pool code relies on adding functionality to the interpreter. The implementation is a naive one that can work with any kind of Wasm module and should survive upstream changes to Trealla. Of course, we could reduce the amount of memory we need to copy by adding specialization to the interpreter, but I think that it's pretty cool we can get decent results with such a simple approach. Perhaps in the future we can take advantage of multiple memories in Wasm and put the "good stuff" in a memory separate from the rest of the program and only copy that.
Maybe in the future we'll have some kind of layered memory, copy-on-write situation. I am curious if that could be made efficient.
The Numbers
Each benchmark iteration awaits 100 concurrent queries. Pool of 12 vs single interpreter.
```
BenchmarkPoolCPUs-12          258    4574233 ns/op
BenchmarkContendedMutex-12     60   19036911 ns/op
```
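The shape of the pool benchmark is roughly the following. This is a sketch reusing the `Pool` stand-in from above, not the repo's actual benchmark code, and the Prolog goal is just an illustrative query.

```go
package main

import (
	"context"
	"runtime"
	"sync"
	"testing"
)

// BenchmarkPoolCPUs fires 100 concurrent read queries per iteration
// and waits for all of them to finish.
func BenchmarkPoolCPUs(b *testing.B) {
	pool := NewPool(runtime.NumCPU())
	ctx := context.Background()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var wg sync.WaitGroup
		for j := 0; j < 100; j++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				if _, err := pool.Read(ctx, "between(1,100,X)."); err != nil {
					b.Error(err)
				}
			}()
		}
		wg.Wait()
	}
}
```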
I look forward to exploring further optimizations in later posts.
P.S. I am looking for Wasm-related jobs (remote). Please get in contact if you're hiring.