It's always TCP_NODELAY

ironman1478 · 2024-05-09T19:58:17 1715284697

I've fixed multiple latency issues due to nagle's multiple times in my career. It's the first thing I jump to. I feel like the logic behind it is sound, but it just doesn't work for some workloads. It should be something that an engineer needs to be forced to set while creating a socket, instead of letting the OS choose a default. I think that's the main issue. Not that it's a good / bad option but that there is a setting that people might not know about that manipulates how data is sent over the wire so aggressively.

nh2 · 2024-05-09T23:51:25 1715298685

Same here. I have a hobby that on any RPC framework I encounter, I file a Github issue "did you think of TCP_NODELAY or can this framework do only 20 calls per second?".

So far, it's found a bug every single time.

Some examples: https://cloud-haskell.atlassian.net/browse/DP-108 or https://github.com/agentm/curryer/issues/3

I disagree on the "not a good / bad option" though.

It's a kernel-side heuristic for "magically fixing" badly behaved applications.

As the article states, no sensible application does 1-byte network write() syscalls. Software that does that should be fixed.

It makes sense only in the case when you are the kernel sysadmin and somehow cannot fix the software that runs on the machine, maybe for team-political reasons. I claim that's pretty rare.

For all other cases, it makes sane software extra complicated: You need to explicitly opt-out of odd magic that makes poorly-written software have slightly more throughput, and that makes correctly-written software have huge, surprising latency.

John Nagle says here and in linked threads that Delayed Acks are even worse. I agree. But the Send/Send/Receive receive pattern that Nagle's Algorithm degrades is a totally valid and common use case, including anything that does pipelined RPC over TCP.

Both Delayed Acks and Nagle's Algorithm should be opt-in, in my opinion. It should be called TCP_DELAY, which you can opt-into if you can't be asked to implement basic userspace buffering.

People shouldn't /need/ to know about these. Make the default case be the unsurprising one.

pzs · 2024-05-10T04:36:58 1715315818

"As the article states, no sensible application does 1-byte network write() syscalls." - the problem that this flag was meant to solve was that when a user was typing at a remote terminal, which used to be a pretty common use case in the 80's (think telnet), there was one byte available to send at a time over a network with a bandwidth (and latency) severely limited compared to today's networks. The user was happy to see that the typed character arrived to the other side. This problem is no longer significant, and the world has changed so that this flag has become a common issue in many current use cases.

Was terminal software poorly written? I don't feel comfortable to make such judgement. It was designed for a constrained environment with different priorities.

Anyway, I agree with the rest of your comment.

SoftTalker · 2024-05-10T04:58:36 1715317116

> when a user was typing at a remote terminal, which used to be a pretty common use case in the 80's

Still is for some. I’m probably working in a terminal on an ssh connection to a remote system for 80% of my work day.

underdeserver · 2024-05-10T06:13:43 1715321623

If you're working on a distributed system, most of the traffic is not going to be your SSH session though.

dgoldstein0 · 2024-05-10T05:07:59 1715317679

sure, but we do so with much better networks than in the 80s. The extra overhead is not going to matter when even a bad network nowadays is measured in megabits per second per user. The 80s had no such luxury.

SomeoneFromCA · 2024-05-11T13:17:02 1715433422

First world thinking.

CamperBob2 · 2024-05-13T17:09:24 1715620164

Not really. Buildout in less-developed areas tends to be done with newer equipment. (E.g., some areas in Africa never got a POTS network, but went straight to wireless.)

atq2119 · 2024-05-11T16:12:57 1715443977

Yes, but isn't the effect on the network a different one now? With encryption and authentication, your single character input becomes amplified significantly long before it reaches the TCP stack. Extra overhead from the TCP header is still there, but far less significant in percentage terms, so it's best to address the problem at the application layer.

adgjlsfhk1 · 2024-05-10T16:29:37 1715358577

the difference is that with kb/s speed, 40x of 10 characters per second overhead mattered. now, humans aren't nearly fast enough to contest a network.

admax88qqq · 2024-05-11T02:43:19 1715395399

Why? What do you do?

gallier2 · 2024-05-11T19:39:35 1715456375

It was not just a bandwidth issue. I remember my first encounter with the Internet was on a HP workstation in Germany connected to South-Africa with telnet. The connection went over a Datex-P (X25) 2400 Baud line. The issue with X25 nets was that it was expensive. The monthly rent was around 500 DM and each packet sent also had to been paid a few cents. You would really try to optimize the use of the line and interactive rsh or telnet trafic was definitely not ideal.

klabb3 · 2024-05-10T01:47:13 1715305633

> As the article states, no sensible application does 1-byte network write() syscalls. Software that does that should be fixed.

Yes! And worse, those that do are not gonna be “fixed” by delays either. In this day and age with fast internets, a syscall per byte will bottleneck the CPU way before it’ll saturate the network path. The cpu limit when I’ve been tuning buffers have been somewhere in the 4k-32k range for 10Gbps ish.

> Both Delayed Acks and Nagle's Algorithm should be opt-in, in my opinion.

Agreed, it causes more problems than it solves and is very outdated. Now, the challenge is rolling out such a change as smoothly as possible, which requires coordination and a lot of trivia knowledge of legacy systems. Migrations are never trivial.

oefrha · 2024-05-10T04:07:23 1715314043

I doubt the libc default in established systems can change now, but newer languages and libraries can learn the lesson and do the right thing. For instance, Go sets TCP_NODELAY by default: https://news.ycombinator.com/item?id=34181846

jandrese · 2024-05-10T01:01:40 1715302900

The problem with making it opt in is that the point of the protocol was to fix apps that, while they perform fine for the developer on his LAN, would be hell on internet routers. So the people who benefit are the ones who don't know what they are doing and only use the defaults.

kazinator · 2024-05-10T17:10:38 1715361038

It's very easy to end up with small writes. E.g.

  1. Write four bytes (length of frame)
  2. Write the frame (write the frame itself)

The easiest fix in C code, with the least chance of introduce a buffer overflow or bad performance is to keep these two pieces of information in separate buffers, and use writev. (How portable is that compared to send?)

If you have to combine the two into one flat frame, you're looking at allocating and copying memory.

Linux has something called corking: you can "cork" a socket (so that it doesn't transmit), write some stuff to it multiple times and "uncork". It's extra syscalls though, yuck.

You could use a buffered stream where you control flushes: basically another copying layer.

a_t48 · 2024-05-10T00:38:48 1715301528

Thanks for the reminder to set this on the new framework I’m working on. :)

imp0cat · 2024-05-10T11:57:06 1715342226

     I have a hobby that on any RPC framework I encounter, I file a Github issue "did you think of TCP_NODELAY or can this framework do only 20 calls per second?".

So true. Just last month we had to apply the TCP_NODELAY fix to one of our libraries. :)

hgomersall · 2024-05-10T06:43:15 1715323395

Would one not also get clobbered by all the sys calls for doing many small packets? It feels like coalescing in userspace is a much better strategy all round if that's desired, but I'm not super experienced.

carterschonwald · 2024-05-10T00:05:56 1715299556

Oh hey! It’s been a while how’re you?!

Sebb767 · 2024-05-09T22:06:30 1715292390

> It should be something that an engineer needs to be forced to set while creating a socket, instead of letting the OS choose a default.

If the intention is mostly to fix applications with bad `write`-behavior, this would make setting TCP_DELAY a pretty exotic option - you would need a software engineer to be both smart enough to know to set this option, but not smart enough to distribute their write-calls well and/or not go for writing their own (probably better fitted) application-specific version of Nagles.

Bluecobra · 2024-05-09T20:20:49 1715286049

I agree, it has been fairly well known to disable Nagle's Algorithm in HFT/low latency trading circles for quite some time now (like > 15 years). It's one of the first things I look for.

Scubabear68 · 2024-05-09T20:39:50 1715287190

I was setting TCP_NODELAY at Bear Stearns for custom networking code circa 1994 or so.

kristjansson · 2024-05-10T04:04:08 1715313848

This is why I love this place

Reason077 · 2024-05-10T03:25:54 1715311554

Surely serious HFT systems bypass TCP altogether now days. In that world, every millisecond of latency can potentially cost a lot of money.

These are the guys that use microwave links to connect to exchanges because fibre-optics have too much latency.

jm__87 · 2024-05-11T22:17:40 1715465860

They still need to send their orders to an exchange, which is often done with FIX protocol over TCP (some exchanges have binary protocols which are faster than FIX, but the ones I'm aware of still use TCP)

mcoliver · 2024-05-09T21:13:28 1715289208

Same in M&E / vfx

hinkley · 2024-05-09T20:13:09 1715285589

What you really want is for the delay to be n microseconds, but there’s no good way to do that except putting your own user space buffering in front of the system calls (user space works better, unless you have something like io_uring amortizing system call times)

mjevans · 2024-05-10T02:24:05 1715307845

It'd probably be amazing how many poorly coded games would work better if something like...

TCP_60FPSBUFFER

Would wait for ~16mS after the first packet is queued and batch the data stream up.

dishsoap · 2024-05-10T03:36:44 1715312204

Most games use UDP.

Chaosvex · 2024-05-10T07:25:24 1715325924

Adding delay to multiplayer games? That's worse.

jnordwick · 2024-05-10T18:10:18 1715364618

linux has auto-corking (and I know of no way to disable it) that will do these short delays on small packets even if the dev doesn't want it

bobmcnamara · 2024-05-10T00:05:06 1715299506

I'd rather have portable TCP_CORK

hinkley · 2024-05-10T16:46:04 1715359564

Cork is probably how you’d implement this in userspace so why not both?

nsguy · 2024-05-09T21:11:23 1715289083

The logic is really for things like Telnet sessions. IIRC that was the whole motivation.

bobmcnamara · 2024-05-10T00:04:08 1715299448

And for block writes!

The Nagler turns a series of 4KB pages over TCP into a stream of MTU sized packets, rather than a short packet aligned to the end of each page.

inopinatus · 2024-05-10T01:16:34 1715303794

With some vendors you have to solve it like a policy problem, via a LD_PRELOAD shim.

kazinator · 2024-05-10T17:02:10 1715360530

> that an engineer needs to be forced to set while creating a socket

Because there aren't enough steps in setting up sockets! Haha.

I suspect that what would happen is that many of the programming language run-times in the world which have easier-to-use socket abstractions would pick a default and hide it from the programmer, so as not to expose an extra step.

pjc50 · 2024-05-10T09:22:36 1715332956

> I feel like the logic behind it is sound, but it just doesn't work for some workloads.

The logic is only sound for interactive plaintext typing workloads. It should have been turned off by default 20 years ago, let alone now.

p_l · 2024-05-10T10:42:25 1715337745

Remember that IPv4 original "target replacement date" (as it was only an "experimental" protocol) was 1990...

And a common thing in many more complex/advanced protocols was to explicitly delineate "messages", which avoids the issue of Nagle's algorithm altogether.

ww520 · 2024-05-10T02:40:45 1715308845

Same here. My first job out of college was at a database company. Queries at the client side of the client-server based database were slow. It was thought the database server was slow as hardware back then was pretty pathetic. I traced it down to the network driver and found out the default setting of TCP_NODELAY was off. I looked like a hero when turning on that option and the db benchmarks jumped up.

immibis · 2024-05-10T15:38:15 1715355495

Not when creating a socket - when sending data. When sending data, you should indicate whether this data block prefers high throughput or low latency.

nailer · 2024-05-09T21:52:14 1715291534

You’re right re: making delay explicit, but also crappy use the space networking tools don’t show whether no_delay is enabled on sockets.

Last time I had to do some Linux stuff, maybe 10 years ago you had to write a systemtap program. I guess it’s EBNF now. But I bet the userspace tools still suck.

nailer · 2024-05-10T01:47:34 1715305654

> use the space

Userspace. Sorry, was using voice dictation.

0xbadcafebee · 2024-05-09T23:34:12 1715297652

The takeaway is odd. Clearly Nagle's Algorithm was an attempt at batched writes. It doesn't matter what your hardware or network or application or use-case or anything is; in some cases, batched writes are better.

Lots of computing today uses batched writes. Network applications benefit from it too. Newer higher-level protocols like QUIC do batching of writes, effectively moving all of TCP's independent connection and error handling into userspace, so the protocol can move as much data into the application as fast as it can, and let the application (rather than a host tcp/ip stack, router, etc) worry about the connection and error handling of individual streams.

Once our networks become saturated the way they were in the old days, Nagle's algorithm will return in the form of a QUIC modification, probably deeper in the application code, to wait to send a QUIC packet until some criteria is reached. Everything in technology is re-invented once either hardware or software reaches a bottleneck (and they always will as their capabilities don't grow at the same rate).

(the other case besides bandwidth where Nagle's algorithm is useful is if you're saturating Packets Per Second (PPS) from tiny packets)

p_l · 2024-05-10T10:54:56 1715338496

The difference between QUIC and TCP is the original sin of TCP (and its predecessor) - that of emulating an async serial port connection, with no visible messaging layer.

It meant that you could use a physical teletypewriter to connect to services (simplified description - slap a modem on a serial port, dial into a TIP, write host address and port number, voila), but it also means that TCP has no idea of message boundaries, and while you can push some of that knowledge now the early software didn't.

In comparison, QUIC and many other non-TCP protocols (SCTP, TP4) explicitly provide for messaging boundaries - your interface to the system isn't based on emulated serial ports but on messages that might at most get reassembled.

utensil4778 · 2024-05-10T17:22:13 1715361733

It's kind of incredible to think how many things in computers and electronics turn out to just be a serial port.

One day, some future engineer is going to ask why their warp core diagnostic port runs at 9600 8n1.

rini17 · 2024-05-13T13:14:07 1715606047

Because of Roman chariot horses ass width or something like that XD

Spivak · 2024-05-10T03:22:54 1715311374

Yes but it seems this particular implementation is using a heuristic for how to batch that made some assumptions that didn't pan out.

adgjlsfhk1 · 2024-05-10T17:23:49 1715361829

batching needs to be application controlled rather than protocol controlled. the protocol doesn't have enough context to batch correctly.

somat · 2024-05-09T20:21:52 1715286112

What about the opposite, disable delayed acks.

The problem is the pathological behavior when tinygram prevention interacts with delayed acks. There is an exposed option to turn off tinygram prevention(TCP_NODELAY), how would you tun off delayed acks instead? Say if you wanted to benchmark all four combinations and see what works best.

doing a little research I found:

linux has the TCP_QUICKACK socket option but you have to set it every time you receive. there is also /proc/sys/net/ipv4/tcp_delack_min and /proc/sys/net/ipv4/tcp_ato_min

freebsd has net.inet.tcp.delayed_ack and net.inet.tcp.delacktime

mjb · 2024-05-09T21:13:33 1715289213

TCP_QUICKACK does fix the worst version of the problem, but doesn't fix the entire problem. Nagles algorithm will still wait for up to one round-trip time before sending data (at least as specified in the RFC), which is extra latency with nearly no added value.

Animats · 2024-05-09T20:36:24 1715286984

> linux has the TCP_QUICKACK socket option but you have to set it every time you receive

Right. What were they thinking? Why would you want it off only some of the time?

batmanthehorse · 2024-05-09T20:51:45 1715287905

In CentOS/RedHat you can add `quickack 1` to the end of a route to tell it to disable delayed acks for that route.

rbjorklin · 2024-05-10T01:23:22 1715304202

And with systemd >= 253 you can set it as part of the network config to have it be applied automatically. https://github.com/systemd/systemd/issues/25906

Culonavirus · 2024-05-10T07:14:34 1715325274

Apparently you have time to "do a little research" but not to read the entire article you're reacting to? It specifically mentions TCP_QUICKACK.

pclmulqdq · 2024-05-09T19:19:35 1715282375

In a world where bandwidth was limited, and the packet size minimum was 64 bytes plus an inter-frame gap (it still is for most Ethernet networks), sending a TCP packet for literally every byte wasted a huge amount of bandwidth. The same goes for sending empty acks.

On the other hand, my general position is: it's not TCP_NODELAY, it's TCP.

metadaemon · 2024-05-09T19:24:54 1715282694

I'd just love a protocol that has a built in mechanism for realizing the other side of the pipe disconnected for any reason.

toast0 · 2024-05-09T19:52:30 1715284350

That's possible in circuit switched networking with various types of supervision, but packet switched networking has taken over because it's much less expensive to implement.

Attempts to add connection monitoring usually make things worse --- if you need to reroute a cable, and one or both ends of the cable will detect a cable disconnection and close user sockets, that's not great, now you do a quick change with a small period of data loss but otherwise minor interruption; all of the established connections will be dropped.

01HNNWZ0MV43FF · 2024-05-09T22:37:04 1715294224

To re-word everyone else's comments - "Disconnected" is not well-defined in any network.

dataflow · 2024-05-10T00:15:02 1715300102

> To re-word everyone else's comments - "Disconnected" is not well-defined in any network.

Parent said disconnected pipe, not network. It's sufficiently well-definable there.

Spivak · 2024-05-10T03:27:32 1715311652

I think it's a distinction without a difference in this case. You can't know if the reason your water stopped is because the water is shut off, the pipe broke, or it's just slow.

When all you have to go on is "I stopped getting packets" the best you can do is give up after a bit. TCP keepsalives do kinda suck and are full of interesting choices that don't seem to have passed the test of time. But they are there and if you control both sides of the connection you can be sure they work.

dataflow · 2024-05-10T03:37:50 1715312270

There's a crucial difference in fact, which is that the peer you're defining connectedness to is a single well-defined peer that is directly connected to you, which "The Network" is not.

As for the analogy, uh, this ain't water. Monitor the line voltage or the fiber brightness or something, it'll tell you very quickly if the other endpoint is disconnected. It's up to the physical layer to provide a mechanism to detect disconnection, but it's not somehow impossible or rocket science...

Spivak · 2024-05-10T14:18:21 1715350701

In a packet-switched network there isn't one connection between you and your peer. Even if you had line monitoring that wouldn't be enough on its own to tell you that your packet can't get there -- "routing around the problem" isn't just a turn of phrase. On the opposite end networks are best-effort so even if the line is up you might get stuck in congestion which to you might as well be dropped.

You can get the guarantees you want with a circuit switched network but there's a lot of trade-offs namely bandwidth and self-healing.

umanwizard · 2024-05-10T04:57:50 1715317070

Well, isn't that already how it works? If I physically unplug my ethernet cable, won't TCP-related syscalls start failing immediately?

pjc50 · 2024-05-10T09:31:16 1715333476

Last time I looked the behavior differed; some OSs will immediately reset TCP connections which were using an interface when it goes away, others will wait until a packet is attempted.

iudqnolq · 2024-05-10T14:14:36 1715350476

I ran into this with websockets. At least under certain browser/os pairs you won't ever receive a close event if you disconnect from wifi. I guess you need to manually monitor ping/pong messages and close it yourself after a timeout?

dataflow · 2024-05-10T05:11:00 1715317860

Probably, but I don't know how the physical layers work underneath. But regardless, it's trivial to just monitor something constantly to ensure the connection is still there, you just need the hardware and protocol support.

pjc50 · 2024-05-10T09:29:55 1715333395

Modern Ethernet has what it calls "fast link pulses"; every 16ms there's some traffic to check. It's telephones that use voltage for hook detection.

However, that only applies to the two ends of that cable, not between you and the datacentre on the other side of the world.

oasisaimlessly · 2024-05-12T21:34:04 1715549644

What if I don't want all my SSH connections to drop when my WiFi stutters for a second when I open my microwave?

remram · 2024-05-10T13:44:13 1715348653

> I don't know how it works... it's trivial

Come on now...

And it is easy to monitor, it is just an application concern not a L3-L4 one.

pjc50 · 2024-05-10T09:27:18 1715333238

See https://news.ycombinator.com/item?id=40316922 : "pipe" is L3, the network links are L2.

sophacles · 2024-05-09T20:06:54 1715285214

That's really really hard. For a full, guaranteed way to do this we'd need circuit switching (or circuit switching emulation). It's pretty expensive to do in packet networks - each flow would need to be tracked by each middle box, so a lot more RAM at every hop, and probably a lot more processing power. If we go with circuit establishment, its also kind of expensive and breaks the whole "distributed, decentralized, self-healing network" property of the Internet.

It's possible to do better than TCP these days, bandwidth is much much less constrained than it was when TCP was designed, but it's still a hard problem to do detection of pipe disconnected for any reason other than timeouts (which we already have).

pclmulqdq · 2024-05-09T22:35:51 1715294151

Several of the "reliable UDP" protocols I have worked on in the past have had a heartbeat mechanism that is specifically for detecting this. If you haven't sent a packet down the wire in 10-100 milliseconds, you will send an extra packet just to say you're still there.

It's very useful to do this in intra-datacenter protocols.

jallmann · 2024-05-09T22:45:09 1715294709

These types of keepalives are usually best handled at the application protocol layer where you can design in more knobs and respond in different ways. Otherwise you may see unexpected interactions between different keepalive mechanisms in different parts of the protocol stack.

koverstreet · 2024-05-09T19:29:20 1715282960

Like TCP keepalives?

mort96 · 2024-05-09T20:20:36 1715286036

If the feature already technically exists in TCP, it's either broken or disabled by default, which is pretty much the same as not having it.

voxic11 · 2024-05-09T20:43:22 1715287402

keepalives are an optional TCP feature so they are not necessarily supported by all TCP implementations and therefor default to off even when supported.

dilyevsky · 2024-05-09T23:28:34 1715297314

Where is it off? Most linux distros have it on it’s just the default kickoff timer is ridiculously long (like 2 hours iirc). Besides, TCP keepalives won't help with the issue at hand and were put in for totally different purpose (gc'ing idle connections). Most of the time you don't even need them because the other side will send RST packet if it already closed the socket.

halter73 · 2024-05-10T01:06:46 1715303206

AFAIK, all Linux distros plus Windows and macOS have TCP keepalives off by default as mandated by the RFC 1122. Even when they are optionally turned on using SO_KEEPALIVE, the interval defaults to two hours because that is the minimum default interval allowed by spec. That can then be optionally reduced with something like /proc/sys/net/ipv4/tcp_keepalive_time (system wide) or TCP_KEEPIDLE (per socket).

By default, completely idle TCP connections will stay alive indefinitely from the perspective of both peers even if their physical connection is severed.

            Implementors MAY include "keep-alives" in their TCP
            implementations, although this practice is not universally
            accepted.  If keep-alives are included, the application MUST
            be able to turn them on or off for each TCP connection, and
            they MUST default to off.

            Keep-alive packets MUST only be sent when no data or
            acknowledgement packets have been received for the
            connection within an interval.  This interval MUST be
            configurable and MUST default to no less than two hours.

[0]: https://datatracker.ietf.org/doc/html/rfc1122#page-101

dilyevsky · 2024-05-10T01:17:57 1715303877

OK you're right - it's coming back to me now. I've been spoiled by software that enables keep-alive on sockets.

mort96 · 2024-05-10T07:17:10 1715325430

So we need a protocol with some kind of non-optional default-enabled keepalive.

josefx · 2024-05-10T07:43:50 1715327030

Now your connections start to randomly fail in production because the implementation defaults to 20ms and your local tests never caught that.

mort96 · 2024-05-10T08:50:33 1715331033

I'm sure there's some middle ground between "never time out" and "time out after 20ms" that works reasonably well for most use cases

hi-v-rocknroll · 2024-05-10T05:02:33 1715317353

You're conflating all optional TCP features of all operating systems, network devices, and RFCs together. This lack of nuance fails to appreciate that different applications have different needs for how they use TCP: ( server | client ) x ( one way | chatty bidirectional | idle tinygram | mixed ). If a feature needs to be used on a particular connection, then use it. ;)

the8472 · 2024-05-09T19:36:58 1715283418

If a socket is closed properly there'll be a FIN and the other side can learn about it by polling the socket.

If the network connection is lost due to external circumstances (say your modem crashes) then how would that information propagate from the point of failure to the remote end on an idle connection? Either you actively probe (keepalives) and risk false positives or you wait until you hear again from the other side, risking false negatives.

sophacles · 2024-05-09T20:09:01 1715285341

It gets even worse - routing changes causing traffic to blackhole would still be undetectable without a timeout mechanism, since probes and responses would be lost.

dataflow · 2024-05-10T00:25:15 1715300715

> If the network connection is lost due to external circumstances (say your modem crashes) then how would that information propagate from the point of failure to the remote end on an idle connection?

Observe the line voltage? If it gets cut then you have a problem...

> Either you actively probe (keepalives) and risk false positives

What false positives? Are you thinking there's an adversary on the other side?

pjc50 · 2024-05-10T09:26:40 1715333200

This is a L2 vs L3 thing.

Most network links absolutely will detect that the link has gone away; the little LED will turn off and the OS will be informed on both ends of that link.

But one of the link ends is a router, and these are (except for NAT) stateless. The router does not know what TCP connections are currently running through it, so it cannot notify them - until a packet for that link arrives, at which point it can send back an ICMP packet.

A TCP link with no traffic on it does not exist on the intermediate routers.

(Direct contrast to the old telecom ATM protocol, which was circuit switched and required "reservation" of a full set of end-to-end links).

ncruces · 2024-05-10T13:53:05 1715349185

For a given connection, (most) packages might go through (e.g.) 10 links. If one link goes down (or is saturated and dropping packets) the connection is supposed to route around it.

So, except for the links on either of end going down (one end really, if the other is on a “data center” the TCP connection is likely terminated in a “server” with redundant networking) you wouldn't want to have a connection terminated just because a link died.

That's explicitly against the goal of a packed switched network.

noselasd · 2024-05-09T20:05:10 1715285110

SCTP has hearbeats to detect that.

drb999 · 2024-05-10T11:47:13 1715341633

What you’re looking for is: https://datatracker.ietf.org/doc/html/rfc5880

BFD, it’s used for millisecond failure detection and typically combined with BGP sessions (tcp based) to ensure seamless failover without packet drops.

niutech · 2024-05-09T21:36:12 1715290572

Shouldn't QUIC (https://en.wikipedia.org/wiki/QUIC) solve the TCP issues like latency?

klabb3 · 2024-05-09T23:48:50 1715298530

As someone who needed high throughput and looked to QUIC because of control of buffers, I recommend against it at this time. It’s got tons of performance problems depending on impl and the API is different.

I don’t think QUIC is bad, or even overengineered, really. It delivers useful features, in theory, that are quite well designed for the modern web centric world. Instead I got a much larger appreciation for TCP, and how well it works everywhere: on commodity hardware, middleboxes, autotuning, NIC offloading etc etc. Never underestimate battletested tech.

In that sense, the lack of TCP_NODELAY is an exception to the rule that TCP performs well out of the box (golang is already doing this by default). As such, I think it’s time to change the default. Not using buffers correctly is a programming error, imo, and can be patched.

supriyo-biswas · 2024-05-10T06:55:42 1715324142

Was this ever implemented though? I found [1] but it was frozen due to age and was never worked on, it seems.

(Edit: doing some more reading, it seems TCP_NODELAY was always the default in Golang. Enable TCP_NODELAY => "disable Nagle's algorithm")

[1] https://github.com/golang/go/issues/57530

bboreham · 2024-05-10T07:14:51 1715325291

Yes. That issue is confusingly titled, but consists solely of a quote from the author of the code talking about what they were thinking at the time they did it.

jallmann · 2024-05-09T23:35:44 1715297744

The specific issues that this article discusses (eg Nagle's algorithm) will be present in most packet-switched transport protocols, especially ones that rely on acknowledgements for reliability. The QUIC RFC mentions this: https://datatracker.ietf.org/doc/html/rfc9000#section-13

Packet overhead, ack frequency, etc are the tip of the iceberg though. QUIC addresses some of the biggest issues with TCP such as head-of-line blocking but still shares the more finicky issues, such as different flow and congestion control algorithms interacting poorly.

djha-skin · 2024-05-09T23:04:27 1715295867

Quic is mostly used between client and data center, but not between two datacenter computers. TCP is the better choice once inside the datacenter.

Reasons:

Security Updates

Phones run old kernels and new apps. So it makes a lot of sense to put something that needs updated a lot like the network stack into user space, and quic does well here.

Data center computers run older apps on newer kernels, so it makes sense to put the network stack into the kernel where updates and operational tweaks can happen independent of the app release cycle.

Encryption Overhead

The overhead of TLS is not always needed inside a data center, where it is always needed on a phone.

Head of Line Blocking

Super important on a throttled or bad phone connection, not a big deal when all of your datacenter servers have 10G connections to everything else.

In my opinion TCP is a battle hardened technology that just works even when things go bad. That it contains a setting with perhaps a poor default is a small thing in comparison to its good record for stability in most situations. It's also comforting to know I can tweak kernel parameters if I need something special for my particular use case.

mjb · 2024-05-09T23:39:32 1715297972

Many performance-sensitive in-datacenter applications have moved away from TCP to reliable datagram protocols. Here's what that looks like at AWS: https://ieeexplore.ieee.org/document/9167399

theamk · 2024-05-09T18:20:11 1715278811

I don't by the reasoning for never needing Nagle anymore. Sure, telnet isn't a thing today, but I bet there are still plenty of apps which do equivalent of:

     write(fd, "Host: ")
     write(fd, hostname)
     write(fd, "\r\n")
     write(fd, "Content-type: ")
     etc...

this may not be 40x overhead, but it'd still 5x or so.

temac · 2024-05-09T18:43:24 1715280204

Fix the apps. Nobody expect magical perf if you do that when writing to files, even though the OS also has its own buffers. There is no reason to expect otherwise when writing to a socket and actually nagle already doesn't save you from syscall overhead.

toast0 · 2024-05-09T18:55:54 1715280954

Nagle doesn't save the derpy side from syscall overhead, but it would save the other side.

It's not just apps doing this stuff, it also lives in system libraries. I'm still mad at the Android HTTPS library for sending chunked uploads as so many tinygrams. I don't remember exactly, but I think it's reasonable packetization for the data chunk (if it picked a reasonable size anyway), then one packet for \r\n, one for the size, and another for another \r\n. There's no reason for that, but it doesn't hurt the client enough that I can convince them to avoid the system library so they can fix it and the server can manage more throughput. Ugh. (It might be that it's just the TLS packetization that was this bogus and the TCP packetization was fine, it's been a while)

If you take a pcap for some specific issue, there's always so many of these other terrible things in there. </rant>

citrin_ru · 2024-05-10T08:03:39 1715328219

I agree that such code should be fixed but having hard time persuading developers to fix their code. Many of them don't know what is a syscall, how making a syscall triggers sending of an IP packet, how a library call translates to a syscall e. t. c. Worse they don't want to know this, they write say Java code (or some other high level language) and argue that libraries/JDK/kernel should handle all 'low level' stuff.

To get optimal performance for request-response protocols like HTTP one should send a full request which includes a request line, all headers and a POST body using a single write syscall (unless POST body is large and it make sense to write it in chunks). Unfortunately not all HTTP libraries work this way and a library user cannot fix this problem without switching a library which is: 1. not always easy 2. it is not widely known which libraries are efficient and which are not. Even if you have an own HTTP library it's not always trivial to fix: e. g. in Java a way to fix this problem while keeping code readable and idiomatic is too wrap socket into BufferedOutputStream which adds one more memory-to-memory copy for all data you are sending on top of at least one memory-to-memory copy you already have without a buffered stream; so it's not an obvious performance win for an application which already saturates memory bandwidth.

bjourne · 2024-05-09T23:04:33 1715295873

> Fix the apps. Nobody expect magical perf if you do that when writing to files,

We write to files line-by-line or even character-by-character and expect the library or OS to "magically" buffer it into fast file writes. Same with memory. We expect multiple small mallocs to be smartly coalesced by the platform.

PaulDavisThe1st · 2024-05-10T05:15:49 1715318149

If you expect a POSIX-y OS to buffer write(2) calls, you're sadly misguided. Whether or not that happens depends on nature of the device file you're writing to.

OTOH, if you're using fwrite(3), as you likely should be actual file I/O, then your expectation is entirely reasonable.

Similarly with memory. If you expect brk(2) to handle multiple small allocations "sensibly" you're going to be disappointed. If you use malloc(3) then your expectation is entirely reasonable.

bjourne · 2024-05-10T14:35:48 1715351748

Whether buffering is part of POSIX or not is beside the point. Any modern OS you'll find will buffer write calls in one way or the other. Similarly with memory. Linux waits until accesses page faults before reserving any memory pages for you. My point is that various forms of buffering is everywhere and in practice we do rely on it a whole lot.

PaulDavisThe1st · 2024-05-10T14:49:11 1715352551

> Any modern OS you'll find will buffer write calls in one way or the other.

This is simply not true as a general rule. It depends on the nature of the file descriptor. Yes, if the file descriptor refers to the file system, it will in all likelihood be buffered by the OS (not with O_DIRECT, however). But on "any modern OS", file descriptors can refer to things that are not files, and the buffering situation there will vary from case to case.

bjourne · 2024-05-10T20:00:05 1715371205

You're right, Linux does not buffer writes to file descriptors for which buffering has no performance benefit...

_carbyau_ · 2024-05-10T02:21:55 1715307715

True to a degree. But that is a singular platform wholly controlled by the OS.

Once you put packets out into the world you're in a shared space.

I assume every conceivable variation of argument has been made both for and against Nagles at this point but it essentially revolves around a shared networking resource and what policy is in place for fair use.

Nagles fixes a particular case but interferes overall. If you fix the "particular case app" the issue goes away.

eru · 2024-05-10T00:52:28 1715302348

Yes, your libraries should fix that. The OS (as in the kernel) should not try to do any abstraction.

Alas, kernels really like to offer abstractions.

jeroenhd · 2024-05-10T01:52:30 1715305950

Everybody expects magical perf if you do that when writing files. We have RAM buffers and write caches for a reason, even on fast SSDs. We expect it so much that macOS doesn't flush to disk even when you call fsync() (files get flushed to the disk's write buffer instead).

There's some overhead to calling write() in a loop, but it's certainly not as bad as when a call to write() would actually make the data traverse whatever output stream you call it on.

meinersbur · 2024-05-09T19:23:16 1715282596

Those are the apps are quickly written and do not care if they unnecessarily congest the network. The ones that do get properly maintained can set TCP_NODELAY. Seems like a reasonable default to me.

blahgeek · 2024-05-10T01:19:20 1715303960

We actually have the similar behavior when writing to files: contents are buffered in page cache and are written to disk later in batch, unless user explicitly call "sync".

ale42 · 2024-05-09T20:26:35 1715286395

Apps can always misbehave, you never know what people implement, and you don't always have source code to patch. I don't think the role of the OS is to let the apps do whatever they wish, but it should give the possibility of doing it if it's needed. So I'd rather say, if you know you're properly doing things and you're latency sensitive, just TCP_NODELAY on all your sockets and you're fine, and nobody will blame you about doing it.

josefx · 2024-05-10T10:24:00 1715336640

I would love to fix the apps, can you point me to the github repo with all the code written the last 30 years so I can get started?

rwmj · 2024-05-09T18:42:44 1715280164

The comment about telnet had me wondering what openssh does, and it sets TCP_NODELAY on every connection, even for interactive sessions. (Confirmed by both reading the code and observing behaviour in 'strace').

c0l0 · 2024-05-09T18:53:09 1715280789

Especially for interactive sessions, it absolutely should! :)

syncsynchalt · 2024-05-09T21:36:50 1715290610

Ironic since Nagle's Algorithm (which TCP_NODELAY disables) was invented for interactive sessions.

It's hard to imagine interactive sessions making more than the tiniest of blips on a modern network.

eru · 2024-05-10T00:52:59 1715302379

Isn't video calling an interactive session?

semi · 2024-05-10T03:19:15 1715311155

I think that's more two independent byte streams. You want low latency but what is transfered doesnt really impact the other side, you just constantly want to push the next frame

eru · 2024-05-10T03:22:02 1715311322

Thanks, that makes sense!

It's interesting that it's very much an interactive experience for the end-user. But for the logic of the computer, it's not interactive at all.

You can make the contrast even stronger: if both video streams are transmitted over UDP, you don't even need to sent ACKs etc. To be truly one-directional from a technical point of view.

Then compare that to transferring a file via TCP. For the user this is as one-directional and non-interactive as it gets, but the computers constantly talk back and forth.

TorKlingberg · 2024-05-10T16:10:01 1715357401

Video calls indeed almost always use UDP. TCP retransmission isn't really useful since by the time a retransmitted packet arrives it's too old to display. Worse, a single lost packet will block a TCP stream. Sometimes TCP is the only way to get through a firewall, but the experience is bad if there's any packet loss at all.

VC systems do constantly send back packet loss statistics and adjust the video quality to avoid saturating a link. Any buffering in routers along the way will add delay, so you want to keep the bitrate low enough to keep buffers empty.

saagarjha · 2024-05-13T12:15:06 1715602506

You don't want ACKs on every frame but video streaming can be made more efficient if you do checkpoints (which informs the encoder that the other side has some data, which means it can compress more efficiently).

syncsynchalt · 2024-05-10T22:01:34 1715378494

You're right! (I'm ignoring the reply thread).

I'm so used to a world where "interactive" was synonymous with "telnet" and "person on keyboard".

asveikau · 2024-05-09T19:32:53 1715283173

I don't think that's actually super common anymore when you consider that doing asynchronous I/O, the only sane way to do that is put it into a buffer rather than blocking at every small write(2).

Then you consider that asynchronous I/O is usually necessary both on server (otherwise you don't scale well) and client (because blocking on network calls is terrible experience, especially in today's world of frequent network changes, falling out of network range, etc.)

silisili · 2024-05-09T18:55:27 1715280927

We shouldn't penalize the internet at large because some developers write terrible code.

littlestymaar · 2024-05-09T19:40:13 1715283613

Isn't it how SMTP is working though?

leni536 · 2024-05-09T21:19:14 1715289554

grishka · 2024-05-09T18:49:16 1715280556

And they really shouldn't do this. Even disregarding the network aspect of it, this is still bad for performance because syscalls are kinda expensive.

otterley · 2024-05-09T18:39:53 1715279993

Marc addresses that: “That’s going to make some “write every byte” code slower than it would otherwise be, but those applications should be fixed anyway if we care about efficiency.”

jrockway · 2024-05-09T18:53:11 1715280791

Does this matter? Yes, there's a lot of waste. But you also have a 1Gbps link. Every second that you don't use the full 1Gbps is also waste, right?

tedunangst · 2024-05-09T19:00:51 1715281251

This is why I always pad out the end of my html files with a megabyte of  . A half empty pipe is a half wasted pipe.

arp242 · 2024-05-09T21:06:55 1715288815

I am finally starting to understand some of these OpenOffice/LibreOffice commit messages like https://github.com/LibreOffice/core/commit/a0b6744d3d77

dessimus · 2024-05-09T21:05:51 1715288751

Just be sure HTTP Compression is off though, or you're still half-wasting the pipe.

Better to just dump randomized uncompressible data into html comments.

jrockway · 2024-05-11T06:16:39 1715408199

I think that's an unfair comparison. By using Nagle's algorithm for interactive work, you save bytes, but the software you're interacting with is that much less responsive. (If the client was responsible for echoing typed characters, then it wouldn't matter. But ssh and telnet don't work like that, unfortunately.)

So by saving bytes and leaving your pipe empty, you just suffer in user experience. Why not use something you're already paying for to make your life better?

(In the end, it seems like SSH agrees with me, and just wastes the bytes by enabling TCP_NODELAY.)

Arnt · 2024-05-09T18:40:32 1715280032

Those aren't the ones you debug, so they won't be seen by OP. Those are the ones you don't need to debug because Nagle saves you.

jabl · 2024-05-10T09:08:50 1715332130

Even if you do nothing 'fancy' like Nagle, corking, or userspace building up the complete buffer before writing etc., at the very least the above should be using a vectored write (writev() ).

sophacles · 2024-05-09T20:19:44 1715285984

TCP_CORK handles this better than nagle tho.

Too · 2024-05-10T09:28:37 1715333317

Shouldn’t that go through some buffer? Unless you fflush() between each write?

eatonphil · 2024-05-09T18:54:00 1715280840

I imagine the write calls show up pretty easily as a bottleneck in a flamegraph.

wbl · 2024-05-09T18:59:54 1715281194

They don't. Maybe if you're really good you notice the higher overhead but you expect to be spending time writing to the network. The actual impact shows up when the bandwidth consumption is way up on packet and TCP headers which won't show on a flamegraph that easily.

tptacek · 2024-05-09T19:13:19 1715281999

The discussion here mostly seems to miss the point. The argument is to change the default, not to eliminate the behavior altogether.

the8472 · 2024-05-09T19:31:34 1715283094

shouldn't autocorking help with even without nagle?

loopdoend · 2024-05-09T19:05:07 1715281507

Ah yeah I fixed this exact bug in net-http in Ruby core a decade ago.

batmanthehorse · 2024-05-09T18:56:28 1715280988

Does anyone know of a good way to enable TCP_NODELAY on sockets when you don't have access to the source for that application? I can't find any kernel settings to make it permanent, or commands to change it after the fact.

I've been able to disable delayed acks using `quickack 1` in the routing table, but it seems particularly hard to enable TCP_NODELAY from outside the application.

I've been having exactly the problem described here lately, when communicating between an application I own and a closed source application it interacts with.

coldpie · 2024-05-09T19:02:10 1715281330

Would some kind of LD_PRELOAD interception for socket(2) work? Call the real function, then do setsockopt or whatever, and return the modified socket.

cesarb · 2024-05-09T19:17:28 1715282248

> Would some kind of LD_PRELOAD interception for socket(2) work?

That would only work if the call goes through libc, and it's not statically linked. However, it's becoming more and more common to do system calls directly, bypassing libc; the Go language is infamous for doing that, but there's also things like the rustix crate for Rust (https://crates.io/crates/rustix), which does direct system calls by default.

zbowling · 2024-05-09T19:50:56 1715284256

And go is wrong for doing that, at least on Linux. It bypasses optimizations in the vDSO in some cases. On Fuchsia, we made direct syscalls not through the vDSO illegal and it was funny the hacks to go that required. The system ABI of Linux really isn't the syscall interface, its the system libc. That's because the C ABI (and the behaviors of the triple it was compiled for) and its isms for that platform are the linga franca of that system. Going around that to call syscalls directly, at least for the 90% of useful syscalls on the system that are wrapped by libc, is asinine and creates odd bugs, makes crash reporters heuristical unwinders, debuggers, etc all more painful to write. It also prevents the system vendor from implementing user mode optimizations that avoid mode and context switches when necessary. We tried to solve these issues in Fuchsia, but for Linux, Darwin, and hell, even Windows, if you are making direct syscalls and it's not for something really special and bespoke, you are just flat-out wrong.

JoshTriplett · 2024-05-09T19:58:15 1715284695

> The system ABI of Linux really isn't the syscall interface, its the system libc.

You might have reasons to prefer to use libc; some software has reason to not use libc. Those preferences are in conflict, but one of them is not automatically right and the other wrong in all circumstances.

Many UNIX systems did follow the premise that you must use libc and the syscall interface is unstable. Linux pointedly did not, and decided to have a stable syscall ABI instead. This means it's possible to have multiple C libraries, as well as other libraries, which have different needs or goals and interface with the system differently. That's a useful property of Linux.

There are a couple of established mechanism on Linux for intercepting syscalls: ptrace, and BPF. If you want to intercept all uses of a syscall, intercept the syscall. If you want to intercept a particular glibc function in programs using glibc, or for that matter a musl function in a program using musl, go ahead and use LD_PRELOAD. But the Linux syscall interface is a valid and stable interface to the system, and that's why LD_PRELOAD is not a complete solution.

zbowling · 2024-05-09T20:04:01 1715285041

It's true that Linux has a stable-ish syscall table. What is funny is that this caused a whole series of Samsung Android phones to reboot randomly with some apps because Samsung added a syscall at the same position someone else did in upstream linux and folks staticly linking their own libc to avoid boionc libc were rebooting phones when calling certain functions because the Samsung syscall causing kernel panics when called wrong. Goes back to it being a bad idea to subvert your system libc. Now, distro vendors do give out multiple versions of a libc that all work with your kernel. This generally works. When we had to fix ABI issues this happened a few times. But I wouldn't trust building our libc and assuming that libc is portable to any linux machine to copy it to.

cesarb · 2024-05-09T20:43:00 1715287380

> It's true that Linux has a stable-ish syscall table.

It's not "stable-ish", it's fully stable. Once a syscall is added to the syscall table on a released version of the official Linux kernel, it might later be replaced by a "not implemented" stub (which always returns -ENOSYS), but it will never be reused for anything else. There's even reserved space on some architectures for the STREAMS syscalls, which were AFAIK never on any released version of the Linux kernel.

The exception is when creating a new architecture; for instance, the syscall table for 32-bit x86 and 64-bit x86 has a completely different order.

withinboredom · 2024-05-09T22:27:09 1715293629

I think what they meant (judging by the example you ignored) is that the table changes (even if append-only) and you don't know which version you actually have when you statically compile your own version. Thus, your syscalls might be using a newer version of the table but it a) not actually be implemented, or b) implemented with something bespoke.

cesarb · 2024-05-10T03:11:08 1715310668

> Thus, your syscalls might be using a newer version of the table but it a) not actually be implemented,

That's the same case as when a syscall is later removed: it returns -ENOSYS. The correct way is to do the call normally as if it were implemented, and if it returns -ENOSYS, you know that this syscall does not exist in the currently running kernel, and you should try something else. That is the same no matter whether it's compiled statically or dynamically; even a dynamic glibc has fallback paths for some missing syscalls (glibc has a minimum required kernel version, so it does not need to have fallback paths for features introduced a long time ago).

> or b) implemented with something bespoke.

There's nothing you can do to protect against a modified kernel which does something different from the upstream Linux kernel. Even going through libc doesn't help, since whoever modified the Linux kernel to do something unexpected could also have modified the C library to do something unexpected, or libc could trip over the unexpected kernel changes.

One example of this happening is with seccomp filters. They can be used to make a syscall fail with an unexpected error code, and this can confuse the C library. More specifically, a seccomp filter which forces the clone3 syscall to always return -EPERM breaks newer libc versions which try the clone3 syscall first, and then fallback to the older clone syscall if clone3 returned -ENOSYS (which indicates an older kernel that does not have the clone3 syscall); this breaks for instance running newer Linux distributions within older Docker versions.

withinboredom · 2024-05-10T07:10:01 1715325001

Every kernel I’ve ever used has been different from an upstream kernel, with custom patches applied. It’s literally open source, anyone can do anything to it that they want. If you are using libc, you’d have a reasonable expectation not to need to know the details of those changes. If you call the kernel directly via syscall, then yeah, there is nothing you can do about someone making modifications to open source software.

saagarjha · 2024-05-13T12:20:43 1715602843

Generally, if you're patching the kernel you should coordinate with upstream if you care about your ABI.

withinboredom · 2024-05-13T13:23:03 1715606583

I don’t think anyone is asking Linus for permission to backport security fixes to their own kernels… that’s ridiculous.

saagarjha · 2024-05-13T13:40:55 1715607655

If your security fix alters the ABI it is almost always the case that this is done

tedunangst · 2024-05-09T20:27:44 1715286464

The complication with the linux syscall interface is that it turns the worse is better up to 11. Like setuid works on a per thread basis, which is seriously not what you want, so every program/runtime must do this fun little thread stop and start and thunk dance.

JoshTriplett · 2024-05-09T20:59:31 1715288371

Yeah, agreed. One of the items on my long TODO list is adding `setuid_process` and `setgid_process` and similar, so that perhaps a decade later when new runtimes can count on the presence of those syscalls, they can stop duplicating that mechanism in userspace.

pie_flavor · 2024-05-09T22:43:31 1715294611

You seem to be saying 'it was incorrect on Fuchsia, so it's incorrect on Linux'. No, it's correct on Linux, and incorrect on every other platform, as each platform's documentation is very clear on. Go did it incorrectly on FreeBSD, but that's Go being Go; they did it in the first place because it's a Linux-first system and it's correct on Linux. And glibc does not have any special privilege, the vdso optimizations it takes advantage of are just as easily taken advantage of by the Go compiler. There's no reason to bucket Linux with Windows on the subject of syscalls when the Linux manpages are very clear that syscalls are there to be used and exhaustively documents them, while MSDN is very clear that the system interface is kernel32.dll and ntdll.dll, and shuffles the syscall numbers every so often so you don't get any funny ideas.

toast0 · 2024-05-09T20:00:54 1715284854

> The system ABI of Linux really isn't the syscall interface, its the system libc.

Which one? The Linux Kernel doesn't provide a libc. What if you're a static executable?

Even on Operating Systems with a libc provided by the kernel, it's almost always allowed to upgrade the kernel without upgrading the userland (including libc); that works because the interface between userland and kernel is syscalls.

That certainly ties something that makes syscalls to a narrow range of kernel versions, but it's not as if dynamically linking libc means your program will be compatible forever either.

jimmaswell · 2024-05-09T20:27:56 1715286476

> That certainly ties something that makes syscalls to a narrow range of kernel versions

I don't think that's right, wouldn't it be the earliest kernel supporting that call and onwards? The Linux ABI intentionally never breaks userland.

toast0 · 2024-05-09T21:08:11 1715288891

In the case where you're running an Operating System that provides a libc and is OK with removing older syscalls, there's a beginning and an end to support.

Looking at FreeBSD under /usr/include/sys/syscall.h, there's a good number of retired syscalls.

On Linux under /usr/include/x86_64-linux-gnu/asm/unistd_32.h I see a fair number of missing numbers --- not sure what those are about, but 222, 223, 251, 285, and 387-392 are missing. (on Debian 12.1 with linux-image-6.1.0-12-amd64 version 6.1.52-1, if it matters)

LegionMammal978 · 2024-05-09T23:04:15 1715295855

> And go is wrong for doing that, at least on Linux. It bypasses optimizations in the vDSO in some cases.

Go's runtime does go through the vDSO for syscalls that support it, though (e.g., [0]). Of course, it won't magically adapt to new functions added in later kernel versions, but neither will a statically-linked libc. And it's not like it's a regular occurrence for Linux to new functions to the vDSO, in any case.

[0] https://github.com/golang/go/blob/master/src/runtime/time_li...

asveikau · 2024-05-09T22:52:01 1715295121

Linux doesn't even have consensus on what libc to use, and ABI breakage between glibc and musl is not unheard of. (Probably not for syscalls but for other things.)

assassinator42 · 2024-05-09T20:18:10 1715285890

The proliferation of Docker containers seems to go against that. Those really only work well since the kernel has a stable syscall ABI. So much so that you see Microsoft switching to a stable syscall ABI with Windows 11.

vilunov · 2024-05-12T17:48:25 1715536105

Source about Microsoft switching to stable syscall ABI due to containers?

assassinator42 · 2024-05-13T21:08:13 1715634493

https://learn.microsoft.com/en-us/virtualization/windowscont...

"Decoupling the User/Kernel boundary in Windows is a monumental task and highly non-trivial, however, we have been working hard to stabilize this boundary across all of Windows to provide our customers the flexibility to run down-level containers"

sophacles · 2024-05-09T20:18:37 1715285917

Linux is also weird because there are syscalls not supported in most (any?) libc - things like io_uring, and netlink fall into this.

gpderetta · 2024-05-09T20:46:35 1715287595

Futex for a very long time was only accessible via syscall.

saagarjha · 2024-05-13T12:21:21 1715602881

To be fair for users in glibc itself this wasn't much of a problem.

leni536 · 2024-05-09T21:12:31 1715289151

It should be possible to use vDSO without libc, although probably a lot of work.

LegionMammal978 · 2024-05-09T23:08:07 1715296087

It's not that much work; after all, every libc needs to have its own implementation. The kernel maps the vDSO into memory for you, and gives you the base address as an entry in the auxiliary vector.

But using it does require some basic knowledge of the ELF format on the current platform, in order to parse the symbol table. (Alongside knowledge of which functions are available in the first place.)

intelVISA · 2024-05-09T23:53:57 1715298837

It's hard work to NOT have the damn vDSO invade your address space. Only kludge part of Linux, well, apart from Nagle's, dlopen, and that weird zero copy kernel patch that mmap'd -each- socket recv(!) for a while.

LegionMammal978 · 2024-05-11T01:03:20 1715389400

It's possible, but tedious: if you disable ASLR to put the stack at the top of virtual memory, then use ELF segments to fill up everything from the mmap base downward (or upward, if you've set that), then the kernel will have nowhere left to put the vDSO, and give up.

(I investigated vDSO placement quite a lot for my x86-64 tiny ELF project: I had to rule out the possibility of a tiny ELF placing its entry point in the vDSO, to bounce back out somewhere else in the address space. It can be done, but not in any ELF file shorter than one that enters its own mapping directly.)

Curiously, there are undocumented arch_prctl() commands to map any of the three vDSOs (32, 64, x32) into your address space, if they are not already mapped, or have been unmapped.

Thaxll · 2024-05-09T20:30:27 1715286627

Those are very strong words...

jdadj · 2024-05-09T21:36:51 1715290611

Depending on the specifics, you might be able to add socat in the middle.

Instead of: your_app —> server

you’d have: your_app -> localhost_socat -> server

socat has command line options for setting tcp_nodelay. You’d need to convince your closed source app to connect to localhost, though. But if it’s doing a dns lookup, you could probably convince it to connect to localhost with an /etc/hosts entry

Since your app would be talking to socat over a local socket, the app’s tcp_nodelay wouldn’t have any effect.

praptak · 2024-05-09T19:15:22 1715282122

Attach debugger (ptrace), call setsockopt?

the8472 · 2024-05-09T19:19:11 1715282351

opening `/proc/<pid>/fd/<fd number>` and setting the socket option may work (not tested)

tedunangst · 2024-05-09T18:58:42 1715281122

LD_PRELOAD.

batmanthehorse · 2024-05-09T19:16:53 1715282213

Thank you, found this: https://github.com/sschroe/libnodelay

Too · 2024-05-10T09:33:08 1715333588

Is it possible to set it as a global OS setting, inside a container?

tuetuopay · 2024-05-09T21:06:15 1715288775

you could try ebpf and hook on the socket syscall. might be harder than LD_PRELOAD as suggested by other commenters though

resonious · 2024-05-10T07:59:16 1715327956

~15 years ago I played an MMO that was very real-time, and yet all of the communication was TCP. Literally you'd click a button, and you would not even see your action play out until a response packet came back.

All of the kids playing this game (me included) eventually figured out you could turn on TCP_NODELAY to make the game buttery smooth - especially for those in California close to the game servers.

jonathanlydall · 2024-05-10T10:57:28 1715338648

Not sure if you're talking about WoW, but around that time ago an update to the game did exactly this change (and possibly more).

An interesting side-effect of this was that before the change if something stalled the TCP stream, the game would hang for a while then very quickly replay all the missed incoming events (which was very often you being killed). After the change you'd instead just be disconnected.

mst · 2024-05-10T17:47:16 1715363236

I think I have a very vague memory of the "hang, hang, hang, SURPRISE! You're dead" thing happening in Diablo II but it's been so long I wouldn't bet on having remembered correctly.

zengid · 2024-05-09T19:21:18 1715282478

Relevant Oxide and Friends podcast episode https://www.youtube.com/watch?v=mqvVmYhclAg

matthavener · 2024-05-09T20:42:52 1715287372

This was a great episode and the really drove home the importance of visualization.

rsc · 2024-05-09T19:44:54 1715283894

Not if you use a modern language that enables TCP_NODELAY by default, like Go. :-)

andrewfromx · 2024-05-09T19:50:06 1715284206

https://news.ycombinator.com/item?id=34179426

https://github.com/golang/go/issues/57530

huh, TIL.

silverwind · 2024-05-10T00:36:58 1715301418

Node.js also does this since at least 2020.

Sammi · 2024-05-10T12:05:58 1715342758

Since 2022 v.18.

PR: https://github.com/nodejs/node/pull/42163

Changelog entry: https://github.com/nodejs/node/blob/main/doc/changelogs/CHAN...

eru · 2024-05-10T00:59:04 1715302744

Why do you need a whole language for that? Couldn't you just use a 'modern' networking library?

rsc · 2024-05-10T01:20:59 1715304059

Sure, like the one in https://9fans.github.io/plan9port/. :-)

obelos · 2024-05-09T18:43:16 1715280196

Not every time. Sometimes it's DNS.

p_l · 2024-05-09T19:11:51 1715281911

Once it was a failing line card in router zeroing last bit in IPv4 addresses, resulting in ticket about "only even IPv4 addresses are accessible" ...

jcgrillo · 2024-05-09T19:21:34 1715282494

For some reason this reminded me of the "500mi email" bug [1], maybe a similar level of initial apparent absurdity?

[1] https://www.ibiblio.org/harris/500milemail.html

chuckadams · 2024-05-09T19:43:57 1715283837

The most absurd thing to me about the 500 mile email situation is that sendmail just happily started up and soldiered on after being given a completely alien config file. Could be read as another example of "be liberal in what you accept" going awry, but sendmail's wretched config format is really a volume of war stories all its own...

rincebrain · 2024-05-09T23:13:31 1715296411

My favorite example of that was a while ago, "vixie-cron will read a cron stanza from a core dump written to /etc/cron.d" when you could convince it to write a core dump there. The other crons wouldn't touch that, but vixie-cron happily chomped through the core dump for "* * * * * root chmod u+s /tmp/uhoh" etc.

jcgrillo · 2024-05-09T20:11:49 1715285509

Configuration changes are one of those areas where having some kind of "are you sure? (y/n)" check can really pay off. It wouldn't have helped in this case, because there wasn't really any change management process to speak of, but we haven't fully learned the lesson yet.