The title is a bit misleading. This nice product is basically a programmatic Wireshark in Lua, that is, a packet processor, so you are getting "40 million packets per second".
Once you do some meaningful work (say, HTTP protocol decoding), this figure will be a lot lower.
If it's easy enough to use, it might be interesting for others too. To give an example: my own project is a message processing system as a service, intended for (among others) the massive amounts of data that come from IoT gateways or devices. While we're not quite there yet, we do intend to be able to handle loads that would require this kind of performance. In some applications, millions of messages per second aren't out of the ordinary.
If this kind of library can help us do that, that could save us a lot of time and effort. Time we could use to work on the next bottleneck :)
It is pretty easy to use, as it is largely scripted in Lua, and the code is small and easy to follow. It does depend on what you want to do with the data once you receive it, though.
That sounds like a pretty easy application then: just rewrite some addresses and put the packets back out on the wire (hardware should do checksum offload). Probably a bit more to it, but definitely worth giving Snabb a go for that.
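To make "rewrite some addresses" concrete, here is a minimal sketch of the kind of per-packet work that implies -- plain C over a raw Ethernet/IPv4 frame buffer, not Snabb code, with the new MAC/IP supplied by the caller and the checksum left to hardware offload as suggested above:

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>       /* ntohs */
    #include <net/ethernet.h>    /* struct ether_header, ETHERTYPE_IP */
    #include <netinet/ip.h>      /* struct iphdr */

    /* Rewrite the destination MAC and IPv4 address in place, after which
     * the frame can go straight back out on the wire. Checksum
     * recomputation is assumed to be offloaded to the NIC. */
    static void rewrite_destination(uint8_t *frame, size_t len,
                                    const uint8_t new_mac[6], uint32_t new_ip)
    {
        if (len < sizeof(struct ether_header) + sizeof(struct iphdr))
            return;                              /* too short to be IPv4 */

        struct ether_header *eth = (struct ether_header *)frame;
        memcpy(eth->ether_dhost, new_mac, 6);    /* new next-hop MAC */

        if (ntohs(eth->ether_type) == ETHERTYPE_IP) {
            struct iphdr *ip = (struct iphdr *)(frame + sizeof(*eth));
            ip->daddr = new_ip;                  /* new destination IPv4 */
            /* ip->check left for hardware checksum offload */
        }
    }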
Quite a bit more to it, but it does sound like it could be a fit. We'll give it a try when we get to the point where our regular intake methods aren't enough :)
As usual when there is a word I recognise, I go looking for a Swedish reference in the docs/founders/etc. And sure enough the guy is currently a Swedish citizen.
The author, Luke Gorrie, is a true hacker's hacker who built a good chunk of Emacs' SLIME environment for Common Lisp. He ended up in Sweden to be part of the first successful Erlang-based business, Bluetail, which was sold to Nortel, IIRC.
(I imagine "Snabb" is closely related to "snappy" in English.)
This is pretty common in the financial trading space. It's usually referred to as kernel bypass. It's how HFTs are able to achieve sub-100-microsecond tick-to-trade response times.
There are a lot of parallels between serving hundreds of thousands of small HTTP requests a second and HFT. This, and the ability to instrument the onboard FPGA to mitigate DDoS attacks, is one of the reasons I specced SolarFlare for CloudFlare. http://blog.cloudflare.com/a-tour-inside-cloudflares-latest-...
Thing is, userspace networking is a lot like the GPU business: often the software is free or even OSS, but the hardware is proprietary.
I know Intel has software for their NICs to enable userspace networking, but I don't think any HFT uses it. It may be all right for hobbyist experimentation. I left the industry about a year ago, so my information is a bit out of date, but my former employer used either Solarflare cards with the OpenOnload stack (very good cards and awesome software) or the Mellanox CX series with the VMA stack (amazing hardware with mediocre software; VMA is now open-source, so perhaps that situation has improved).
Note that these cards are a few thousand dollars each, so out-of-reach for hobbyists.
I can't comment on HFT because I have no experience on that, and their focus is latency rather than throughput.
For the latter (which matters in routers, for instance), netmap runs on everything (either natively or with some emulation). The Intel 10G cards (which DPDK of course supports) are around $300-400, I think; the 1G cards start at a few tens of dollars (not that you need userspace networking at 1G).
This stack may be open source, but it is handcrafted to work only with the Solarflare NICs. Having looked at it seriously, we decided that it wasn't worth trying to port this to our own devices and therefore wrote yet another one. This time, without the pretence of "open".
There are one or two. The most prominent is Netmap (https://code.google.com/p/netmap/); another is the Intel Data Plane Development Kit (DPDK, http://www.dpdk.org/). Both of these allow you to handle millions of real packets per second from standard software. No Lua coding required.
The 10G interface is presented as 64 bits of data every clock cycle, so if you can express your algorithm entirely as a byte-by-byte state machine, you can parse messages with minimal delay.
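(To show the byte-by-byte idea in software terms, here is a tiny C sketch of a state machine that consumes one byte at a time from a made-up 8-byte message layout -- 4 bytes of symbol ID, 4 bytes of price -- and has its answer ready the instant the last byte goes by. The layout is invented for illustration; the same structure maps naturally onto an FPGA datapath that sees the wire as it arrives.)

    #include <stdint.h>

    /* Byte-at-a-time parser state; caller zero-initializes. */
    struct parser {
        int      pos;       /* bytes of the current message seen so far */
        uint32_t symbol;    /* accumulated big-endian symbol field */
        uint32_t price;     /* accumulated big-endian price field */
    };

    /* Feed one byte; returns 1 when a complete message has been parsed. */
    static int feed(struct parser *p, uint8_t byte)
    {
        if (p->pos < 4)
            p->symbol = (p->symbol << 8) | byte;     /* symbol bytes */
        else
            p->price = (p->price << 8) | byte;       /* price bytes  */

        if (++p->pos == 8) {                         /* end of message */
            p->pos = 0;
            return 1;                                /* decision point */
        }
        return 0;
    }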
Extremely well; 170ns is the length of the incoming market data packet, and the reply starts immediately at the end of the incoming packet. It takes longer than that just to DMA a packet into main memory from a PC gigabit interface.
That is very cool. Where would you go from there? It seems if you're already so fast that you can start replying as soon as all the data is received, there isn't much more you can do. Or are there techniques for going beyond that?
The bit of the video where he talks about writing the order to the wire and then flubbing the checksum if he doesn't actually want to send the order was nuts. I work in the industry and I had heard of people using FPGAs on switches but not that particular technique. Thanks for posting that.
This is usually the problem, almost every single option I've seen is based on dedicated hardware. This is fine for many, but not for those of us running in cloud hosts and other general-purpose locations.
You can't possibly have a de-layered system on virtualised cloud hardware!
Besides, dedicated hosting for a 1U or 2U rack machine is not expensive. If you have enough traffic that it's worth building your own TCP solution, you're already spending a considerable amount of money on servers.
Sure you can: as a random example, AWS supports SR-IOV today, so you can communicate directly with a NIC from a userland Linux process if you set everything up right.
The general idea is to have hardware with direct virtualization support (which is increasingly available on commodity hardware), then have a 'control plane' of layered, virtualized syscall APIs that configure a 'data plane' of virtualization-aware hardware. Permitted I/O operations occur just as if they were on bare metal, with asymptotically zero performance overhead, because you can process an arbitrary amount of data without invoking any code at the OS/hypervisor/cloud provider layer.
For example, my rented, virtualization-aware CPU allows me to run any non-privileged code that stays within a certain block of address space; my rented, virtualization-aware NIC allows me to send and receive any ethernet frames that match certain header bits; and my rented, virtualization-aware disk allows me to read and write to a certain range of LBAs. The nth-layer OS or cloud host or whatever can come in and alter these permissions at will but it need not examine every single syscall to see if it conforms to policy.
I'm aware of that, but if we can reduce dedicated hosting to a single machine through simpler code, we can feed the information from there into a cloud system. It helps a lot if the machine can be a general-purpose one, since that reduces hosting costs, maintenance overhead and risk.
Definitely -- +1 to a general-purpose machine and doing this on normal hardware without special NICs. However, as soon as you dump the OS, the machine is no longer suitable as a shared-tenancy/cloud host. It might be useful for some sort of dedicated service offering (that would be a cool AWS feature and would allow things like Vyatta that are tied right into the kernel and SDN), but not for general-purpose cloud hosting. The hypervisor/container/OS are still needed to enforce roles, manage resources, etc.
I agree, though as bcoates indicated, cloud hosts are adapting to this demand. While bare-metal access is indeed very unlikely to happen in a shared tenancy environment, there will be at least some efforts towards lower-level access.
Snabb Switch has been working on zero-copy drivers for KVM, for OpenStack. But you are unlikely to see that on generic cloud VMs soon. If you want performance, you need control of the hardware.
Also, just in case the author of Snabb Switch is reading this, it would be really helpful to provide a short example program showing how to use this. A brief look over the documentation didn't show anything. (It's possible I just missed it, in which case it would be great if you linked to it in the README or otherwise made it prominent.)
loadgen is a good first example program. It is a load generator that can transmit an arbitrarily large amount of traffic (hundreds of Gbps) from a trace file using a negligible amount of CPU. It is practical and is being used every day.
This is a very strange article, since the quoted text appears to have nothing at all to do with Snabb except for linking to the website. Instead it's describing some other closed source system that's using a similar networking layer, and that's also written in Lua.
One of the serious drawbacks of Netmap is that it amortizes, but does not eliminate, system call latency. This means that although it can handle high packet rates, processing latencies are dominated by the system call overhead (typically about 5us). This may not sound like much, but on a 10Gb/s network (1 bit = 0.1ns), 5us = 50,000 bits, or roughly 6KB. For small packets, this means that you have to batch very aggressively (about 150 packets at a time) in order to keep up.
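(The same amortization idea exists even in the ordinary socket API. As a rough sketch -- plain Linux sockets, not Netmap's ring interface, and with the BATCH value of 150 simply echoing the figure above -- recvmmsg() pulls in a whole batch of datagrams with one system call, so the fixed per-syscall cost is shared across the batch:)

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH    150       /* roughly the batch size the arithmetic above implies */
    #define PKT_MAX  2048

    /* Receive up to BATCH datagrams with a single system call, so the
     * fixed per-syscall cost is spread across the whole batch. */
    static int receive_batch(int fd, char bufs[BATCH][PKT_MAX])
    {
        struct mmsghdr msgs[BATCH];
        struct iovec   iovs[BATCH];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iovs[i].iov_base = bufs[i];
            iovs[i].iov_len  = PKT_MAX;
            msgs[i].msg_hdr.msg_iov    = &iovs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }
        /* One syscall, up to BATCH packets; returns how many arrived. */
        return recvmmsg(fd, msgs, BATCH, 0, NULL);
    }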
Netmap provides an efficient syscall API and mostly reuses existing kernel drivers. DPDK and Snabb bypass the kernel and just take control of the NIC from userspace.
I'd like to know how the 26ns wire-to-wire latency was measured. As far as I know, just handling the PHY layer on NIC takes at least ~300ns. Likely the authors are inferring latency by using 1/<throughput>, which is mistaken at these levels.
Typical capture cards have a resolution of 6-7ns. 300ns PHY is an out-of-date value. These guys can do ~425ns wire-to-application (or 850ns round trip to software): http://exablaze.com/exanic-x4
I watched the short introduction video [1], where they say that to get the fastest throughput, you need to pull everything up from the kernel and into user space. This is what they are doing with the network drivers. Won't you still need to talk to the kernel when you actually want to read/write from the network? Or is part of this that you take complete control of the NIC?
(Disclaimer: All I know of Snabb is from reading the web site and a few discussion group posts)
Because Snabb is specialized around a few specific NICs and virtual I/O interfaces with specific features, it is able to do things like set up the network hardware to be memory-mapped. As in, after some setup, Snabb can bang on the bits of a specific memory range and that will change bits in the network card's state without the OS or even a kernel driver being involved. This means you are not calling send() or select(); you are dealing directly with the Intel NIC's hardware interface.
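(To make the "memory-mapped" part concrete in general terms -- this is not Snabb's actual code: on Linux you can mmap a PCI device's first BAR through sysfs and then read or write device registers directly from userspace, with no syscall per access. The PCI address 0000:01:00.0 and the 0x0100 offset below are placeholders, not real Intel register definitions.)

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Placeholder PCI address; resource0 is the device's first BAR. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0",
                      O_RDWR | O_SYNC);
        if (fd < 0)
            return 1;

        /* Map 64KB of device registers into this process's address space. */
        volatile uint32_t *regs = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED)
            return 1;

        /* Loads and stores now hit the device directly -- no syscall and no
         * kernel driver in the data path. 0x0100 is a placeholder offset. */
        uint32_t value = regs[0x0100 / 4];
        (void)value;

        munmap((void *)regs, 0x10000);
        close(fd);
        return 0;
    }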
It sounds a lot more like game console engine programming than node.js programming. As a game console engine programmer, that's pretty interesting to me :)
The project is new: I and other open source contributors are currently under contract to build a Network Functions Virtualization platform for Deutsche Telekom's TeraStream project [1] [2]. This is called Snabb NFV [3] and it's going to be totally open source and integrated with OpenStack.
Currently we are doing a lot of virtualization work from the "outside" of the VM: implementing Intel VMDq hardware acceleration and providing zero-copy Virtio-net to the VMs. So the virtual machine will see normal Virtio-net but we will make that operate really fast.
Inside the VMs we can either access a hardware NIC directly (via IOMMU "PCI passthrough") or we can write a device driver for the Virtio-net device.
So, early days, first major product being built, and lots of potential both inside and outside VMs, lots of fantastic products to build with nobody yet building them :-)
How does Snabb Switch compare to the Linux kernel's openvswitch/openflow combination? Are there any advantages to Snabb for forwarding/counting appliances or is it just for packet-consuming applications?
The disparity between what the OS can deliver and what the hardware is capable of delivering is a few orders of magnitude right now. It's downright ridiculous how much performance we're giving up for supposed "convenience" today.
The operating system APIs (sockets) basically cannot deliver to/from userspace anything like what the hardware can do. The classic example is 10Gb Ethernet, where if you want to do something like packet switching, you cannot get anywhere near line rate, due to the cost of context switches, memory copies, scheduling, cache misses in the stack layers, and so on.
Given that servers are getting 10Gb, 40Gb or more, this is becoming more and more of an issue. The hardware is capable of it, but the abstractions are getting in the way.
There is an OS overhead for sure (although you can mitigate that with things like zero-copy solutions), but it generally imposes a couple of percentage points of overhead. I was thrown off by the several-orders-of-magnitude claim.
Reading it again, it sounds more holistic -- that a simplified, monolithic application with a local, lightweight database is much higher performance than, for instance, an n-tier, distributed database type solution.
It is not just a few percent for applications where userspace needs to look at each packet. There are some performance comparisons on the netmap page for example http://info.iet.unipi.it/~luigi/netmap/
That comparison is of questionable merit. It compares bulk, unmetered packet generation (in an OS, worth noting, just as a kernel module) against netsend, which is intentionally a rate-limited generator (using busy waits, as an aside, which means the CPU was probably at 100%... doing nothing), with the significant overhead that entails.
> where userspace needs to look at each packet
That benchmark is like those naive "look how fast my web server is when it returns just status code 200" comments. Even if we accepted that the overhead was anywhere close to the linked figures, which it isn't, the moment you actually do something with the packets those savings disappear into rounding errors.
netsend and the other apps used in that comparison were not rate limited. Sure, that test only measures the system's overheads, but I put that disclaimer very clearly in all papers and talks (btw, "which it isn't" suggests that you have different numbers, so please let us know). Sure, sometimes these per-packet savings are irrelevant, but there are a number of use cases where this kind of saving matters a lot. This is true for netmap, DPDK, DNA and all other network-stack-bypass frameworks.
In passing: people typically use the term "OS bypass", but netmap relies on the OS for protection, synchronization, memory management, etc. -- all things that the OS does well and that I find no reason to reinvent.
Unless you used a specialized, customized version of netsend, yes, it was rate limited -- by default it waits on the clock interval, and it warns you if you try to call it in defiance of that. Further, it is endlessly calculating times and calling system functions to get the current time.
As a measure of maximum throughput it is a horrible test. I don't have the motivation to prove it, but I would be surprised if more than 5% of the CPU load actually went towards networking, with the rest going to time calculations and interval tests.
Yeah, I'm not sure that's a good/accurate bashing. This is a highly optimized tool for a very specific task; OSes have a very, very wide range of use cases they have to support.
It's a good and accurate bashing, irrespective of what this tool can do. Even more so once you remember that the networking equivalent of Moore's law tapered off about 10 years ago.