id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc
28	need to improve duplicate-connection handling	Brian Warner		"At Allmydata we've been running into a significant problem with Foolscap's
duplicate-connection handling. In our scenario, there is a central server
(called the ""queen"") and a number of clients who connect to it (using a
Reconnector to handle connection failures). These clients want to have a
connection to the queen up all of the time.

The clients are frequently stuck behind NAT boxes.

The symptom we're seeing is that a client will connect ok, then the client
gets rebooted or there's some sort of network flap. After that point, the
client is unable to connect to the queen for a long time (where ""long"" is
defined as 15 minutes). Each time the Reconnector fires, the queen accepts
their connection, then hangs up on them. The server-side logs indicate that
it is dropping a ""duplicate connection"" each time.

The root of the problem is that the queen and the client disagree about the
connection state. The client has rebooted, so it *knows* that it does not
have an active connection open. The queen, on the other hand, still thinks it
has a valid connection from this client: the reboot or network flap was so
sudden that the old client didn't have a chance to send a FIN packet to
terminate the TCP connection. When a client silently disappears like this,
the queen will not be able to detect the loss until it tries to send data to
the client. (foolscap brokers send a PING message to the far end if they
haven't heard from it in a while, and PINGs are supposed to provoke PONGs in
response: this mechanism ensures that at least some data is flowing every 10
minutes or so).
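
The keepalive idea described above can be sketched roughly as follows. This is
a minimal illustration only: the class name, methods, and bookkeeping are
assumptions for this sketch, not Foolscap's actual broker internals.

```python
import time

# Sketch of the PING-if-quiet mechanism; the 10-minute constant and all
# names here are illustrative, not Foolscap's real code.
KEEPALIVE_INTERVAL = 10 * 60  # seconds of silence before sending a PING

class KeepaliveBroker:
    def __init__(self, now=time.time):
        self._now = now
        self._last_heard = now()
        self.pings_sent = 0  # stand-in for actually writing a PING message

    def data_received(self):
        # any inbound traffic (including a PONG) proves the peer is alive
        self._last_heard = self._now()

    def check(self):
        # called periodically: send a PING if the link has gone quiet, so
        # that at least some data is flowing every ~10 minutes
        if self._now() - self._last_heard >= KEEPALIVE_INTERVAL:
            self.pings_sent += 1
            self._last_heard = self._now()  # wait a full interval before re-PINGing
```

The important property for this ticket is that a silently-vanished peer only
gets noticed when one of these PINGs tries to traverse the dead connection and
TCP eventually gives up.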

In some circumstances, TCP will tell the queen that the connection is closed
very quickly. The best case is that the remote host is fully connected to the
internet, and was merely rebooted. Once the host comes back up, the queen's
TCP data packets will be delivered to that host, which will notice that they
are for a socket that isn't in use. In this case, the client host will
immediately send back a TCP ""RST"" message (telling the sending end to ReSeT
the connection), and the queen-side TCP stack will inform the application of
the closed connection.

Another fast-close scenario is if the last-hop router notices that the packet
cannot be delivered (usually due to an ARP timeout), or if a major routing
change results in the client machine's IP address becoming completely
unrouteable. In this case, the router will send an ICMP ""Host Unreachable""
message, which also tells TCP to shut down the connection.

But unfortunately, there are a number of fairly common slow-close scenarios.
When the client reboots behind a NAT box, the client may get a new IP
address, resulting in a stale NAT table entry. When the queen sends packets
through this entry, the NAT box forwards the packets to the old IP address,
on which nobody is listening. Nobody sends an error back, and eventually TCP
gives up. But TCP waits a *long* time, on the order of 5 to 15 minutes
(because it was designed to be robust against intermittent failures in
intermediate hardware, such as cables being pulled out).

In the meanwhile, the new client instance is trying to connect to the queen,
who thinks it has a valid (although mysteriously quiet) connection to them.
Foolscap has code to handle duplicate connections, which is designed to make
sure that there is never more than one connection to any given peer (to make
sure that our ordering guarantees can be maintained). This code grants the
side with the higher TubID the right to decide which connection gets used;
this side is called the ""master"". When a new connection is being negotiated,
the master looks in their connection table for a pre-existing one to the same
TubID. The algorithm used in 0.1.7 says that the first connection wins, so
any duplicate connections are dropped. The master is supposed to send an
error message over the connection that's about to be dropped, but I suspect
that this message doesn't make it all the way into the client's logfile.
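
The 0.1.7 behaviour described above amounts to something like the following
sketch. The class and attribute names are invented for illustration and do not
match Foolscap's actual negotiation code.

```python
# Simplified sketch of the 0.1.7 first-connection-wins rule on the
# master side; names here are illustrative, not Foolscap's real classes.
class Master:
    def __init__(self):
        self.connections = {}  # TubID -> currently-accepted connection

    def negotiate(self, tubid, new_conn):
        old = self.connections.get(tubid)
        if old is not None:
            # a connection to this TubID already exists: the first one
            # wins, so reject the newcomer with an explanatory error
            # (which may never make it into the client's logfile)
            new_conn.reject('duplicate connection')
            return old
        self.connections[tubid] = new_conn
        return new_conn
```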

In the problem that we're seeing, the queen winds up as the master, and it
refuses to accept new connections from the recently-rebooted (or
network-flapped) client until the PING-provoked traffic triggers the TCP
timeout.
This can prevent clients from connecting for upwards of 15 minutes, which is
really annoying.

== Solutions ==

Our idea to make this time shorter is to use the existence of the second
(""duplicate"") connection as an indication that perhaps the first connection
is suspect. Instead of always rejecting the new connection in favor of the
old one, we could do something different:

 * Option A: the old connection is always dropped, and the new connection is
   always accepted.

 * Option B: the old connection is marked as ""suspect"", and the new
   connection is dropped. If any data arrives over the old connection, the
   ""suspect"" flag is cleared. If a third connection is established and
   discovers an existing ""suspect"" connection, *then* the suspect connection
   is dropped and the new connection accepted.

 * Option C: both connections are dropped. The client is expected to use
   a Reconnector to trigger a new connection attempt.

To avoid a race condition between the old connection going away and the new
connection being established, an approach which drops both connections is
probably safer.

Multiple nearly-simultaneous connections are fairly common when the origin
and destination of the connection are two processes running on the same host.
It is common for FURLs to contain both a 127.0.0.1 connection hint and a hint
for a globally-routeable address. If both connection hints are useable, then
the recipient will see two connections occurring in rapid sequence. As a
result, simply dropping both connections will cause a continuously-failing
retry loop.

The option I'm considering is instead:

 * Option D: if the old connection is less than 60 seconds old, drop the new
   connection and stick with the old one (as in 0.1.7). If the old connection
   is older than 60 seconds, drop both connections.

This approach should avoid the make-before-break race condition (which would
probably violate the expectations of code which uses notifyOnDisconnect), and
still handle normal connection establishment properly.
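
The Option D decision reduces to a single age check, sketched below. The
60-second constant is the one proposed above; the function name and return
values are illustrative only.

```python
OLD_ENOUGH = 60  # seconds: threshold separating 'racing' from 'stale'

def handle_duplicate(old_conn_age):
    # Sketch of the Option D rule: a young existing connection probably
    # means a near-simultaneous connect (e.g. multiple connection hints
    # for the same host), so keep it as 0.1.7 did; an older connection
    # is suspect, so drop both sides and let the client's Reconnector
    # establish a fresh one.
    if old_conn_age < OLD_ENOUGH:
        return 'drop-new'
    return 'drop-both'
```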

I'm a bit uneasy about the arbitrary 60-second timeout, but it is worth
noting that there are two existing 60-second connection-setup timers already
present in Foolscap. The first is TubConnector.CONNECTION_TIMEOUT: if an
outbound TCP session has not been resolved into a negotiated connection
within 60 seconds, the client abandons it. The other is
Negotiation.SERVER_TIMEOUT: once you connect to a server, you have 60 seconds
to complete negotiation before they hang up on you.

Testing this is a nuisance:
 1. set up two computers, one as the client, the other as the server. Make
    sure the server gets the higher TubID, so it winds up being the master.
 1. have the client use a Reconnector to establish a connection to the server
 1. Configure iptables to silently drop packets for the established connection.
    Simply matching on the TCP port of the client side should be sufficient.
    Make sure that packets in both directions are dropped.
 1. Allow the connection to exist for at least 60 seconds.
 1. Stop the client. The FIN packet that gets sent by the kernel when the
    now-dead client's sockets are closed should be discarded by iptables.
    The activity-based timeout is now running.
 1. Start a new instance of the client. The Reconnector should attach to
    the server, and the server should either hang up on them or accept the
    new connection.
 1. Watch to see how long the new client instance's Reconnector takes to
    establish a new connection. It should be reasonably quick (<10 seconds).

I don't know how to automate this sort of test.
"	defect	closed	major	0.2.0	negotiation	0.1.7	fixed	negotiation	
