I implemented a distributed messaging system/API a long time (10+ years) ago on SMP, AMP, and x86 CPUs that was completely lock-free and non-blocking. The APIs/system ran in both userspace and Linux kernel space.

One thing the APIs depended on was atomic-add. I was trying to reach 10 million msgs/second between processes/threads within an SMP CPU group at the time. At 10 million msgs/s, the APIs had 100ns to route and distribute each msg. The main issue was uncached memory access latency, especially for the atomic-add variables. Uncached memory latency was 50+ns on DDR2 back then, measured on a 1.2GHz Xeon, so it was hard to hit that target.
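
To make the pattern concrete, here is a minimal C11 sketch of that kind of atomic-add dispatch: a shared counter hands out ring-buffer slots to producers with a single fetch-add and no locks. The names (msg_t, ring, publish) are illustrative, not from the original system, and a real implementation would also have to handle wrap-around and consumer synchronization.

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SIZE 4096                /* power of two */

    typedef struct { uint64_t payload; } msg_t;

    static msg_t ring[RING_SIZE];
    static atomic_uint_fast64_t head;     /* shared across threads */

    /* Each producer claims a unique slot with one atomic fetch-add.
       That fetch-add is the ~50ns hot spot: the cache line holding
       `head` bounces between cores on every increment. */
    void publish(msg_t m)
    {
        uint64_t slot = atomic_fetch_add_explicit(&head, 1,
                                                  memory_order_relaxed);
        ring[slot & (RING_SIZE - 1)] = m; /* real code must handle overrun */
    }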

I even considered adding an FPGA on PCI/PCIe that could be mmap'd to a physical/virtual address and would auto-increment on every read access, giving a very high-performance atomic_add.
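
As a hedged sketch of that idea, assuming the FPGA exposes its read-incrementing counter through a UIO device (the path /dev/uio0 and the one-page mapping are hypothetical), the userspace side could look like this: every core maps the same physical register, and a plain load becomes a fetch-and-add performed by the device.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical UIO node exposing the FPGA BAR. */
        int fd = open("/dev/uio0", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        /* Map the register page shared; all cores map the same
           physical address, so the device serializes the adds. */
        volatile uint64_t *ctr = mmap(NULL, 4096,
                                      PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if (ctr == MAP_FAILED) { perror("mmap"); return 1; }

        /* Each read returns a unique, monotonically increasing
           ticket: fetch-and-add with no cache-line bouncing. */
        uint64_t ticket = *ctr;
        printf("ticket = %llu\n", (unsigned long long)ticket);

        munmap((void *)ctr, 4096);
        close(fd);
        return 0;
    }

The trade-off is that each read costs a PCIe round-trip, but unlike a contended cache line, that cost does not grow as more cores join.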

If that same FPGA register were mapped into 128, 256, or 1024 cores, one could easily build a very high-speed distributed sync message system, hopefully sustaining 10+ million msgs/second across 1024 cores.

That would be cool!
