To get the best performance, I recommend:
- enable vnethdr
- enable offloads (TSO and USO)
- consider spreading the load across multiple queues and CPUs with multi-queue
- consider syscall batching for an additional gain of perhaps 10%; io_uring may be worth trying for this
- consider customizing the steering algorithm
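
As a rough illustration of the first three points, here is a minimal sketch that assumes the device in question is a Linux TUN device opened through `/dev/net/tun`. The constants are copied by hand from `<linux/if_tun.h>`, since Python's standard library does not export them. The helper names (`tun_flags`, `offload_flags`, `pack_ifreq`, `open_queue`) and the interface name `tun0` are made up for this example.

```python
import fcntl
import os
import struct

# Values copied from <linux/if_tun.h>.
TUNSETIFF = 0x400454CA      # _IOW('T', 202, int)
TUNSETOFFLOAD = 0x400454D0  # _IOW('T', 208, unsigned int)
IFF_TUN = 0x0001
IFF_MULTI_QUEUE = 0x0100
IFF_NO_PI = 0x1000
IFF_VNET_HDR = 0x4000
TUN_F_CSUM = 0x01
TUN_F_TSO4 = 0x02
TUN_F_TSO6 = 0x04
TUN_F_USO4 = 0x20  # USO needs a recent kernel (6.2 or newer)
TUN_F_USO6 = 0x40


def tun_flags(multi_queue=True):
    """Device flags: TUN mode, no packet-info header, vnethdr enabled."""
    flags = IFF_TUN | IFF_NO_PI | IFF_VNET_HDR
    if multi_queue:
        flags |= IFF_MULTI_QUEUE
    return flags


def offload_flags(uso=True):
    """Offload bits; the kernel only honours TSO/USO alongside TUN_F_CSUM."""
    flags = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6
    if uso:
        # The kernel expects USO4 and USO6 to be requested together.
        flags |= TUN_F_USO4 | TUN_F_USO6
    return flags


def pack_ifreq(name, flags):
    """Build a 40-byte struct ifreq: 16-byte name, 16-bit flags, padding."""
    return struct.pack("16sH22x", name.encode(), flags)


def open_queue(name="tun0", uso=True):
    """Open one queue of the device; call once per queue (needs CAP_NET_ADMIN)."""
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, pack_ifreq(name, tun_flags()))
        fcntl.ioctl(fd, TUNSETOFFLOAD, offload_flags(uso))
    except OSError:
        os.close(fd)
        raise
    return fd
```

With multi-queue enabled, calling `open_queue("tun0")` several times should yield one file descriptor per queue, which can then be handed to separate threads or processes. Once vnethdr is on, every packet read from or written to the descriptor is prefixed with a `struct virtio_net_hdr`, so the packet-handling code has to account for it.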
```