The title is a bit misleading. This nice product is basically a programmatic Wireshark in Lua, that is, a packet processor, so you are getting "40 million packets per second".
Once you do some meaningful work (say, HTTP protocol decoding), this figure will be a lot lower.
If it's easy enough to use, it might be interesting for others too. To give an example: my own project is a message processing system as a service, intended for (among others) the massive amounts of data that come from IoT gateways or devices. While we're not quite there yet, we do intend to be able to handle loads that would require this kind of performance. In some applications, millions of messages per second aren't out of the ordinary.
If this kind of library can help us do that, that could save us a lot of time and effort. Time we could use to work on the next bottleneck :)
It is pretty easy to use, as it is largely scripted in Lua, and the code is small and easy to follow. It does depend on what you want to do with the data once you receive it, though.
That sounds like a pretty easy application then: just rewrite some addresses and put the packets back out on the wire (hardware should do checksum offload). Probably a bit more to it, but definitely worth giving Snabb a go for that.
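To make "rewrite some addresses" concrete, here is a minimal sketch of the kind of per-packet work that implies -- plain C over a raw Ethernet/IPv4 frame buffer, not Snabb code, with the new MAC/IP supplied by the caller and the checksum left to hardware offload as suggested above:

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>       /* ntohs */
    #include <net/ethernet.h>    /* struct ether_header, ETHERTYPE_IP */
    #include <netinet/ip.h>      /* struct iphdr */

    /* Rewrite the destination MAC and IPv4 address in place, after which
     * the frame can go straight back out on the wire. Checksum
     * recomputation is assumed to be offloaded to the NIC. */
    static void rewrite_destination(uint8_t *frame, size_t len,
                                    const uint8_t new_mac[6], uint32_t new_ip)
    {
        if (len < sizeof(struct ether_header) + sizeof(struct iphdr))
            return;                              /* too short to be IPv4 */

        struct ether_header *eth = (struct ether_header *)frame;
        memcpy(eth->ether_dhost, new_mac, 6);    /* new next-hop MAC */

        if (ntohs(eth->ether_type) == ETHERTYPE_IP) {
            struct iphdr *ip = (struct iphdr *)(frame + sizeof(*eth));
            ip->daddr = new_ip;                  /* new destination IPv4 */
            /* ip->check left for hardware checksum offload */
        }
    }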
Quite a bit more to it, but it does sound like it could be a fit. We'll give it a try when we get to the point where our regular intake methods aren't enough :)
As usual when there is a word I recognise, I go looking for a Swedish reference in the docs/founders/etc. And sure enough the guy is currently a Swedish citizen.
The author, Luke Gorrie, is a true hacker's hacker who built a good chunk of Emacs' SLIME environment for Common Lisp. He ended up in Sweden to be part of the first successful Erlang-based business, Bluetail, which was sold to Nortel, IIRC.
(I imagine "Snabb" is closely related to "snappy" in English.)
This is pretty common in the financial trading space. It's usually referred to as kernel bypass. It's how HFTs are able to achieve sub-100-microsecond tick-to-trade response times.
There are a lot of parallels between serving hundreds of thousands of small HTTP requests a second and HFT. This, and the ability to instrument the onboard FPGA to mitigate DDoS attacks, is one of the reasons I specced SolarFlare for CloudFlare. http://blog.cloudflare.com/a-tour-inside-cloudflares-latest-...
Thing is, userspace networking is a lot like the GPU business: often the software is free or even OSS, but the hardware is proprietary.
I know Intel has software for their NICs to enable userspace networking, but I don't think any HFT uses it. It may be all right for hobbyist experimentation. I left the industry about a year ago, so my information is a bit out of date, but my former employer used either Solarflare cards with the OpenOnload stack (very good cards and awesome software) or the Mellanox CX series with the VMA stack (amazing hardware with mediocre software; VMA is now open-source, so perhaps that situation has improved).
Note that these cards are a few thousand dollars each, so out-of-reach for hobbyists.
I can't comment on HFT because I have no experience on that, and their focus is latency rather than throughput.
For the latter (which matters in routers, for instance), netmap runs on everything (either natively or with some emulation). The Intel 10G cards (which DPDK of course supports) are around $300-400, I think; the 1G cards start at a few tens of dollars (not that you need userspace networking at 1G).
This stack may be open source, but it is handcrafted to work only with the Solarflare NICs. Having looked at it seriously, we decided that it wasn't worth trying to port this to our own devices and therefore wrote yet another one. This time, without the pretence of "open".
There are one or two. The most prominent is Netmap (https://code.google.com/p/netmap/); another is the Intel Data Plane Development Kit (DPDK, http://www.dpdk.org/). Both of these allow you to handle millions of real packets per second from standard software. No Lua coding required.
The 10G interface is presented as 64 bits of data every clock cycle, so if you can express your algorithm entirely as a byte-by-byte state machine, you can parse messages with minimal delay.
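(To show the byte-by-byte idea in software terms, here is a tiny C sketch of a state machine that consumes one byte at a time from a made-up 8-byte message layout -- 4 bytes of symbol ID, 4 bytes of price -- and has its answer ready the instant the last byte goes by. The layout is invented for illustration; the same structure maps naturally onto an FPGA datapath that sees the wire as it arrives.)

    #include <stdint.h>

    /* Byte-at-a-time parser state; caller zero-initializes. */
    struct parser {
        int      pos;       /* bytes of the current message seen so far */
        uint32_t symbol;    /* accumulated big-endian symbol field */
        uint32_t price;     /* accumulated big-endian price field */
    };

    /* Feed one byte; returns 1 when a complete message has been parsed. */
    static int feed(struct parser *p, uint8_t byte)
    {
        if (p->pos < 4)
            p->symbol = (p->symbol << 8) | byte;     /* symbol bytes */
        else
            p->price = (p->price << 8) | byte;       /* price bytes  */

        if (++p->pos == 8) {                         /* end of message */
            p->pos = 0;
            return 1;                                /* decision point */
        }
        return 0;
    }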
Extremely well; 170ns is the length of the incoming market data packet, and the reply starts immediately at the end of the incoming packet. It takes longer than that just to DMA a packet into main memory from a PC gigabit interface.
That is very cool. Where would you go from there? It seems if you're already so fast that you can start replying as soon as all the data is received, there isn't much more you can do. Or are there techniques for going beyond that?
The bit of the video where he talks about writing the order to the wire and then flubbing the checksum if he doesn't actually want to send the order was nuts. I work in the industry and I had heard of people using FPGAs on switches but not that particular technique. Thanks for posting that.
This is usually the problem, almost every single option I've seen is based on dedicated hardware. This is fine for many, but not for those of us running in cloud hosts and other general-purpose locations.
You can't possibly have a de-layered system on virtualised cloud hardware!
Besides, dedicated hosting for a 1U or 2U rack machine is not expensive. If you have enough traffic that it's worth building your own TCP solution, you're already spending a considerable amount of money on servers.
Sure you can: as a random example, AWS supports SR-IOV today, so you can communicate directly with a NIC from a userland Linux process if you set everything up right.
The general idea is to have hardware with direct virtualization support (which is increasingly available on commodity hardware), then have a 'control plane' of layered, virtualized syscall APIs that configure a 'data plane' of virtualization-aware hardware. Permitted I/O operations occur just as if they were on bare metal, with asymptotically zero performance overhead, because you can process an arbitrary amount of data without invoking any code at the OS/hypervisor/cloud provider layer.
For example, my rented, virtualization-aware CPU allows me to run any non-privileged code that stays within a certain block of address space; my rented, virtualization-aware NIC allows me to send and receive any ethernet frames that match certain header bits; and my rented, virtualization-aware disk allows me to read and write to a certain range of LBAs. The nth-layer OS or cloud host or whatever can come in and alter these permissions at will but it need not examine every single syscall to see if it conforms to policy.
I'm aware of that, but if we can reduce dedicated hosting to a single machine through simpler code, we can feed the information from there into a cloud system. It helps a lot if the machine can be a general-purpose one, since that reduces hosting costs, maintenance overhead and risk.
Definitely -- +1 to a general-purpose machine and doing this on normal hardware without special NICs. However, as soon as you dump the OS, the machine is no longer suitable as a shared-tenancy/cloud host. It might be useful for some sort of dedicated service offering (that would be a cool AWS feature and would allow things like Vyatta that are tied right into the kernel and SDN), but not for general-purpose cloud hosting. The hypervisor/container/OS are still needed to enforce roles, manage resources, etc.
I agree, though as bcoates indicated, cloud hosts are adapting to this demand. While bare-metal access is indeed very unlikely to happen in a shared tenancy environment, there will be at least some efforts towards lower-level access.
Snabb Switch has been working on zero-copy drivers for KVM, for OpenStack. But you are unlikely to see that on generic cloud VMs soon. If you want performance, you need control of the hardware.
Also, just in case the author of Snabb Switch is reading this, it would be really helpful to provide a short example program showing how to use this. A brief look over the documentation didn't show anything. (It's possible I just missed it, in which case it would be great if you linked to it in the README or otherwise made it prominent.)
loadgen is a good first example program. It is a load generator that can transmit an arbitrarily large amount of traffic (hundreds of Gbps) from a trace file using a negligible amount of CPU. It is practical and is being used every day.
This is a very strange article, since the quoted text appears to have nothing at all to do with Snabb except for linking to the website. Instead it's describing some other closed source system that's using a similar networking layer, and that's also written in Lua.
One of the serious drawbacks of Netmap is that it amortizes, but does not eliminate, system call latency. This means that although it can handle high packet rates, processing latencies are dominated by the system call overhead (typically about 5us). This may not sound like much, but on a 10Gb/s network (1 bit = 0.1ns), 5us = 50,000 bits, or roughly 6KB. For small packets, this means that you have to batch very aggressively (about 150 packets at a time) in order to keep up.
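(The same amortization idea exists even in the ordinary socket API. As a rough sketch -- plain Linux sockets, not Netmap's ring interface, and with the BATCH value of 150 simply echoing the figure above -- recvmmsg() pulls in a whole batch of datagrams with one system call, so the fixed per-syscall cost is shared across the batch:)

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH    150       /* roughly the batch size the arithmetic above implies */
    #define PKT_MAX  2048

    /* Receive up to BATCH datagrams with a single system call, so the
     * fixed per-syscall cost is spread across the whole batch. */
    static int receive_batch(int fd, char bufs[BATCH][PKT_MAX])
    {
        struct mmsghdr msgs[BATCH];
        struct iovec   iovs[BATCH];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iovs[i].iov_base = bufs[i];
            iovs[i].iov_len  = PKT_MAX;
            msgs[i].msg_hdr.msg_iov    = &iovs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }
        /* One syscall, up to BATCH packets; returns how many arrived. */
        return recvmmsg(fd, msgs, BATCH, 0, NULL);
    }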
Netmap provides an efficient syscall API and mostly reuses existing kernel drivers. DPDK and Snabb bypass the kernel and just take control of the NIC from userspace.
I'd like to know how the 26ns wire-to-wire latency was measured. As far as I know, just handling the PHY layer on NIC takes at least ~300ns. Likely the authors are inferring latency by using 1/<throughput>, which is mistaken at these levels.
Typical capture cards have a resolution of 6-7ns. 300ns PHY is an out-of-date value. These guys can do ~425ns wire-to-application (or 850ns round trip to software): http://exablaze.com/exanic-x4
I watched the short introduction video [1], where they say that to get the fastest throughput, you need to pull everything up from the kernel and into user space. This is what they are doing with the network drivers. Won't you still need to talk to the kernel when you actually want to read/write from the network? Or is part of this that you take complete control of the NIC?
(Disclaimer: All I know of Snabb is from reading the web site and a few discussion group posts)
Because Snabb is specialized around a few specific NICs and virtual I/O interfaces with specific features, it is able to do things like set up the network hardware to be memory-mapped. As in, after some setup, Snabb can bang on the bits of a specific memory range and that will change bits in the network card's state without the OS or even a kernel driver being involved. This means you are not calling send() or select(); you are dealing directly with the Intel NIC's hardware interface.
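(To make the "memory-mapped" part concrete in general terms -- this is not Snabb's actual code: on Linux you can mmap a PCI device's first BAR through sysfs and then read or write device registers directly from userspace, with no syscall per access. The PCI address 0000:01:00.0 and the 0x0100 offset below are placeholders, not real Intel register definitions.)

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Placeholder PCI address; resource0 is the device's first BAR. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0",
                      O_RDWR | O_SYNC);
        if (fd < 0)
            return 1;

        /* Map 64KB of device registers into this process's address space. */
        volatile uint32_t *regs = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED)
            return 1;

        /* Loads and stores now hit the device directly -- no syscall and no
         * kernel driver in the data path. 0x0100 is a placeholder offset. */
        uint32_t value = regs[0x0100 / 4];
        (void)value;

        munmap((void *)regs, 0x10000);
        close(fd);
        return 0;
    }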
It sounds a lot more like game console engine programming than node.js programming. As a game console engine programmer, that's pretty interesting to me :)
The project is new: I and other open source contributors are currently under contract to build a Network Functions Virtualization platform for Deutsche Telekom's TeraStream project [1] [2]. This is called Snabb NFV [3] and it's going to be totally open source and integrated with OpenStack.
Currently we are doing a lot of virtualization work from the "outside" of the VM: implementing Intel VMDq hardware acceleration and providing zero-copy Virtio-net to the VMs. So the virtual machine will see normal Virtio-net but we will make that operate really fast.
Inside the VMs we can either access a hardware NIC directly (via IOMMU "PCI passthrough") or we can write a device driver for the Virtio-net device.
So, early days, first major product being built, and lots of potential both inside and outside VMs, lots of fantastic products to build with nobody yet building them :-)
How does Snabb Switch compare to the Linux kernel's openvswitch/openflow combination? Are there any advantages to Snabb for forwarding/counting appliances or is it just for packet-consuming applications?
The disparity between what the OS can deliver and what the hardware is capable of delivering is a few orders of magnitude right now. It's downright ridiculous how much performance we're giving up for supposed "convenience" today.
The operating system APIs (sockets) basically cannot deliver to/from userspace anything like what the hardware can do. The classic example is 10Gb Ethernet, where if you want to do something like packet switching, you cannot get anywhere near line rate, due to the cost of context switches, memory copies, scheduling, cache misses in the stack layers, and so on.
Given that servers are getting 10Gb, 40Gb or more, this is becoming more and more of an issue. The hardware is capable of it, but the abstractions are getting in the way.
There is an OS overhead for sure (although you can mitigate that with things like zero-copy solutions), but it generally imposes a couple of percentage points of overhead. I was thrown off by the several-orders-of-magnitude claim.
Reading it again, it sounds more holistic -- that a simplified, monolithic application with a local, lightweight database is much higher performance than, for instance, an n-tier, distributed database type solution.
It is not just a few percent for applications where userspace needs to look at each packet. There are some performance comparisons on the netmap page for example http://info.iet.unipi.it/~luigi/netmap/
That comparison is of questionable merit. It compares bulk, unmetered packet generation (in an OS, worth noting, just as a kernel module) against netsend, which is intentionally a rate-limited generator (using busy waits, as an aside, which means the CPU was probably at 100%... doing nothing), with the significant overhead that entails.
> where userspace needs to look at each packet
That benchmark is like those naive "look how fast my web server is when it returns just status code 200" comments. Even if we accepted that the overhead was anywhere close to the linked figures, which it isn't, the moment you actually do something with the packets those savings disappear into rounding errors.
netsend and the other apps used in that comparison were not rate limited. Sure, that test only measures the system's overheads, but I put that disclaimer very clearly in all papers and talks (btw, "which it isn't" suggests that you have different numbers, so please let us know). Sure, sometimes these per-packet savings are irrelevant, but there are a number of use cases where this kind of saving matters a lot. This is true for netmap, DPDK, DNA and all other network-stack-bypass frameworks.
In passing: people typically use the term "OS bypass", but netmap relies on the OS for protection, synchronization, memory management, etc. -- all things that the OS does well and that I find no reason to reinvent.
Unless you used a specialized, customized version of netsend, yes, it was rate limited -- by default it waits on the clock interval, and it warns you if you try to call it in defiance of that. Further, it is endlessly calculating times and calling system functions to get the current time.
As a measure of maximum throughput it is a horrible test. I don't have the motivation to prove it, but I would be surprised if more than 5% of the CPU load actually went towards networking, with the rest going to time calculations and interval tests.
Yeah, I'm not sure that's a good/accurate bashing. This is a highly optimized tool for a very specific task; OSes have a very, very wide range of use cases they have to support.
It's a good and accurate bashing, irrespective of what this tool can do. Even more so once you remember that the networking equivalent of Moore's law tapered off about 10 years ago.