
I'd just love a protocol that has a built in mechanism for realizing the other side of the pipe disconnected for any reason.



That's possible in circuit switched networking with various types of supervision, but packet switched networking has taken over because it's much less expensive to implement.

Attempts to add connection monitoring usually make things worse. If you need to reroute a cable and one or both ends detect the disconnection and close user sockets, that's not great: a quick swap that would otherwise mean a small period of data loss and a minor interruption now drops all of the established connections.


To re-word everyone else's comments - "Disconnected" is not well-defined in any network.


> To re-word everyone else's comments - "Disconnected" is not well-defined in any network.

Parent said disconnected pipe, not network. It's sufficiently well-definable there.


I think it's a distinction without a difference in this case. You can't know whether your water stopped because the supply was shut off, the pipe broke, or it's just slow.

When all you have to go on is "I stopped getting packets", the best you can do is give up after a bit. TCP keepalives do kinda suck and are full of interesting choices that don't seem to have passed the test of time, but they are there, and if you control both sides of the connection you can be sure they work.
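
To make "give up after a bit" concrete, here's a minimal sketch in Python; the host, port, and the 30-second figure are placeholders, not a recommendation:

    import socket

    # Treat prolonged silence as "the connection is probably dead".
    sock = socket.create_connection(("example.com", 443))
    sock.settimeout(30.0)

    try:
        data = sock.recv(4096)
    except socket.timeout:
        # 30 seconds of silence: the peer may be gone, the path may be
        # broken, or it may just be slow. From here they all look the same.
        sock.close()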


There's a crucial difference, in fact: the peer you're defining connectedness to is a single, well-defined peer that is directly connected to you, which "The Network" is not.

As for the analogy, uh, this ain't water. Monitor the line voltage or the fiber brightness or something; it'll tell you very quickly if the other endpoint is disconnected. It's up to the physical layer to provide a mechanism to detect disconnection, but it's not somehow impossible or rocket science...


In a packet-switched network there isn't one connection between you and your peer. Even if you had line monitoring, that wouldn't be enough on its own to tell you that your packet can't get there -- "routing around the problem" isn't just a turn of phrase. On top of that, networks are best-effort, so even if the line is up you might get stuck in congestion, which from your point of view might as well be a dropped connection.

You can get the guarantees you want with a circuit-switched network, but there are a lot of trade-offs, namely bandwidth and self-healing.


Well, isn't that already how it works? If I physically unplug my ethernet cable, won't TCP-related syscalls start failing immediately?


Last time I looked the behavior differed; some OSes will immediately reset TCP connections that were using an interface when it goes away, others will wait until a packet send is attempted.


I ran into this with WebSockets. At least under certain browser/OS pairs you won't ever receive a close event if you disconnect from Wi-Fi. I guess you need to manually monitor ping/pong messages and close it yourself after a timeout?
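
Roughly, yes. A sketch of that watchdog pattern with Python's asyncio; the ws object, its send_ping()/close() methods, and the pong callback are hypothetical stand-ins for whatever your websocket client actually exposes:

    import asyncio
    import time

    async def watchdog(ws, interval=15.0, timeout=45.0):
        # Close the connection ourselves if no pong arrives within `timeout`,
        # since the OS may never tell us the link is gone.
        last_pong = time.monotonic()

        def on_pong(_frame):
            nonlocal last_pong
            last_pong = time.monotonic()

        ws.on_pong = on_pong          # hypothetical hook
        while True:
            await ws.send_ping()      # hypothetical method
            await asyncio.sleep(interval)
            if time.monotonic() - last_pong > timeout:
                await ws.close()      # the close event we never got
                break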


Probably, but I don't know how the physical layers work underneath. Regardless, it's trivial to just monitor something constantly to ensure the connection is still there; you just need the hardware and protocol support.


Modern Ethernet has what it calls "fast link pulses"; every 16ms there's some traffic to check. It's telephones that use voltage for hook detection.

However, that only applies to the two ends of that cable, not between you and the datacentre on the other side of the world.


What if I don't want all my SSH connections to drop when my WiFi stutters for a second when I open my microwave?


> I don't know how it works... it's trivial

Come on now...

And it is easy to monitor; it is just an application concern, not an L3-L4 one.


See https://news.ycombinator.com/item?id=40316922 : "pipe" is L3, the network links are L2.


That's really, really hard. For a full, guaranteed way to do this we'd need circuit switching (or circuit-switching emulation). It's pretty expensive to do in packet networks - each flow would need to be tracked by each middlebox, so a lot more RAM at every hop, and probably a lot more processing power. If we go with circuit establishment, it's also kind of expensive and breaks the whole "distributed, decentralized, self-healing network" property of the Internet.

It's possible to do better than TCP these days, since bandwidth is much, much less constrained than it was when TCP was designed, but detecting a disconnected pipe by any means other than timeouts (which we already have) is still a hard problem.


Several of the "reliable UDP" protocols I have worked on in the past have had a heartbeat mechanism that is specifically for detecting this. If you haven't sent a packet down the wire in 10-100 milliseconds, you will send an extra packet just to say you're still there.

It's very useful to do this in intra-datacenter protocols.
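
As a rough illustration of the pattern (not any particular protocol; the peer address, payload, and 50 ms threshold are made up for the sketch):

    import socket
    import time

    HEARTBEAT = b""     # placeholder payload; real protocols tag these frames
    IDLE_MS = 50        # somewhere in the 10-100 ms range mentioned above

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    peer = ("10.0.0.2", 9000)          # hypothetical peer
    last_send = time.monotonic()

    def send(payload: bytes) -> None:
        global last_send
        sock.sendto(payload, peer)
        last_send = time.monotonic()

    def maybe_heartbeat() -> None:
        # Called from the protocol's main loop. The receiver treats a few
        # missed heartbeats in a row as "the other side is gone".
        if (time.monotonic() - last_send) * 1000 >= IDLE_MS:
            send(HEARTBEAT)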


These types of keepalives are usually best handled at the application protocol layer where you can design in more knobs and respond in different ways. Otherwise you may see unexpected interactions between different keepalive mechanisms in different parts of the protocol stack.


Like TCP keepalives?


If the feature already technically exists in TCP, it's either broken or disabled by default, which is pretty much the same as not having it.


Keepalives are an optional TCP feature, so they are not necessarily supported by all TCP implementations, and they default to off even when supported.


Where is it off? Most Linux distros have it on; it's just that the default kickoff timer is ridiculously long (like 2 hours, IIRC). Besides, TCP keepalives won't help with the issue at hand and were put in for a totally different purpose (GC'ing idle connections). Most of the time you don't even need them, because the other side will send an RST packet if it has already closed the socket.


AFAIK, all Linux distros plus Windows and macOS have TCP keepalives off by default, as mandated by RFC 1122. Even when they are optionally turned on using SO_KEEPALIVE, the interval defaults to two hours, because that is the minimum default interval allowed by the spec. It can then be optionally reduced with something like /proc/sys/net/ipv4/tcp_keepalive_time (system-wide) or TCP_KEEPIDLE (per socket).

By default, completely idle TCP connections will stay alive indefinitely from the perspective of both peers even if their physical connection is severed.

            Implementors MAY include "keep-alives" in their TCP
            implementations, although this practice is not universally
            accepted.  If keep-alives are included, the application MUST
            be able to turn them on or off for each TCP connection, and
            they MUST default to off.

            Keep-alive packets MUST only be sent when no data or
            acknowledgement packets have been received for the
            connection within an interval.  This interval MUST be
            configurable and MUST default to no less than two hours.
[0]: https://datatracker.ietf.org/doc/html/rfc1122#page-101
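
For reference, opting in and pulling the timers down per socket looks roughly like this on Linux (a sketch with Python's socket module; the TCP_KEEP* options are Linux-specific and the numbers are arbitrary):

    import socket

    sock = socket.create_connection(("example.com", 443))

    # Opt in (RFC 1122 requires keepalives to default to off)...
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # ...then override the two-hour default for this socket only.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # unanswered probes before reset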


OK you're right - it's coming back to me now. I've been spoiled by software that enables keep-alive on sockets.


So we need a protocol with some kind of non-optional default-enabled keepalive.


Now your connections start to randomly fail in production because the implementation defaults to 20ms and your local tests never caught that.


I'm sure there's some middle ground between "never time out" and "time out after 20ms" that works reasonably well for most use cases.


You're conflating all optional TCP features of all operating systems, network devices, and RFCs together. This lack of nuance fails to appreciate that different applications have different needs for how they use TCP: ( server | client ) x ( one way | chatty bidirectional | idle tinygram | mixed ). If a feature needs to be used on a particular connection, then use it. ;)


If a socket is closed properly there'll be a FIN and the other side can learn about it by polling the socket.

If the network connection is lost due to external circumstances (say your modem crashes) then how would that information propagate from the point of failure to the remote end on an idle connection? Either you actively probe (keepalives) and risk false positives or you wait until you hear again from the other side, risking false negatives.
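
A sketch of the difference, assuming an established (and currently idle) TCP socket in Python: a clean close shows up as readable-with-zero-bytes, while a silently dead path just never becomes readable.

    import select
    import socket

    def poll_for_close(sock: socket.socket, wait_s: float = 1.0) -> str:
        readable, _, _ = select.select([sock], [], [], wait_s)
        if not readable:
            # Nothing arrived: a healthy idle peer, a crashed modem, and a
            # cut cable are indistinguishable without probing or a timeout.
            return "unknown"
        data = sock.recv(4096)
        if data == b"":
            return "peer sent FIN (clean close)"
        return "data available"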


It gets even worse - routing changes causing traffic to blackhole would still be undetectable without a timeout mechanism, since probes and responses would be lost.


> If the network connection is lost due to external circumstances (say your modem crashes) then how would that information propagate from the point of failure to the remote end on an idle connection?

Observe the line voltage? If it gets cut then you have a problem...

> Either you actively probe (keepalives) and risk false positives

What false positives? Are you thinking there's an adversary on the other side?


This is a L2 vs L3 thing.

Most network links absolutely will detect that the link has gone away; the little LED will turn off and the OS will be informed on both ends of that link.

But one of the link ends is a router, and these are (except for NAT) stateless. The router does not know what TCP connections are currently running through it, so it cannot notify them - until a packet for that link arrives, at which point it can send back an ICMP packet.

A TCP connection with no traffic on it simply does not exist as far as the intermediate routers are concerned.

(Direct contrast to the old telecom ATM protocol, which was circuit switched and required "reservation" of a full set of end-to-end links).


For a given connection, (most) packets might go through (e.g.) 10 links. If one link goes down (or is saturated and dropping packets), the connection is supposed to route around it.

So, except for the links at either end going down (one end really, since if the other end is in a "data center" the TCP connection is likely terminated on a "server" with redundant networking), you wouldn't want a connection terminated just because a link died.

That's explicitly against the goals of a packet-switched network.


SCTP has heartbeats to detect that.


What you’re looking for is: https://datatracker.ietf.org/doc/html/rfc5880

BFD. It's used for millisecond failure detection and is typically combined with BGP sessions (TCP-based) to ensure seamless failover without packet drops.



