Opened 17 years ago

Closed 16 years ago

#28 closed defect (fixed)

need to improve duplicate-connection handling

Reported by: Brian Warner
Owned by:
Priority: major
Milestone: 0.2.0
Component: negotiation
Version: 0.1.7
Keywords: negotiation
Cc:

Description

At Allmydata we've been running into a significant problem with Foolscap's duplicate-connection handling. In our scenario, there is a central server (called the "queen") and a number of clients who connect to it (using a Reconnector to handle connection failures). These clients want to have a connection to the queen up all of the time.

The clients are frequently stuck behind NAT boxes.

The symptom we're seeing is that a client will connect ok, then the client gets rebooted or there's some sort of network flap. After that point, the client is unable to connect to the queen for a long time (where "long" is defined as 15 minutes). Each time the Reconnector fires, the queen accepts their connection, then hangs up on them. The server-side logs indicate that it is dropping a "duplicate connection" each time.

The root of the problem is that the queen and the client disagree about the connection state. The client has rebooted, so it *knows* that it does not have an active connection open. The queen, on the other hand, still thinks it has a valid connection from this client: the reboot or network flap was so sudden that the old client didn't have a chance to send a FIN packet to terminate the TCP connection. When a client silently disappears like this, the queen will not be able to detect the loss until it tries to send data to the client. (foolscap brokers send a PING message to the far end if they haven't heard from it in a while, and PINGs are supposed to provoke PONGs in response: this mechanism ensures that at least some data is flowing every 10 minutes or so).
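
For illustration, here is a rough sketch of that sort of activity-based keepalive, using Twisted's LoopingCall. The class, method names, and the sendPING hook are made up for the sketch; this is not Foolscap's actual broker code.

    import time
    from twisted.internet.task import LoopingCall

    class KeepaliveSketch:
        KEEPALIVE_INTERVAL = 600  # seconds; the "every 10 minutes or so" mentioned above

        def startKeepalives(self):
            self._last_heard = time.time()
            self._loop = LoopingCall(self._checkActivity)
            self._loop.start(self.KEEPALIVE_INTERVAL, now=False)

        def noteTraffic(self):
            # called whenever any data arrives from the peer
            self._last_heard = time.time()

        def _checkActivity(self):
            if time.time() - self._last_heard >= self.KEEPALIVE_INTERVAL:
                self.sendPING()  # (assumed hook) the peer should answer with a PONG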

In some circumstances, TCP will tell the queen that the connection is closed very quickly. The best case is that the remote host is fully connected to the internet, and was merely rebooted. Once the host comes back up, the queen's TCP data packets will be delivered to that host, which will notice that they are for a socket that isn't in use. In this case, the client host will immediately send back a TCP "RST" message (telling the sending end to ReSeT the connection), and the queen-side TCP stack will inform the application of the closed connection.

Another fast-close scenario is if the last-hop router notices that the packet cannot be delivered (usually due to an ARP timeout), or if a major routing change results in the client machine's IP address becoming completely unrouteable. In this case, the router will send an ICMP "Host Unreachable" message, which also tells TCP to shut down the connection.

But unfortunately, there are a number of fairly common slow-close scenarios. When the client reboots behind a NAT box, the client may get a new IP address, resulting in a stale NAT table entry. When the queen sends packets through this entry, the NAT box forwards the packets to the old IP address, on which nobody is listening. Nobody sends an error back, and eventually TCP gives up. But TCP waits a *long* time, on the order of 5 to 15 minutes (because it was designed to be robust against intermittent intermediate hardware failures, like cables being pulled out and the like).

In the meanwhile, the new client instance is trying to connect to the queen, who thinks it has a valid (although mysteriously quiet) connection to them. Foolscap has code to handle duplicate connections, which is designed to make sure that there is never more than one connection to any given peer (to make sure that our ordering guarantees can be maintained). This code grants the side with the higher TubID the right to decide which connection gets used; this side is called the "master". When a new connection is being negotiated, the master looks in their connection table for a pre-existing one to the same TubID. The algorithm used in 0.1.7 says that the first connection wins, so any duplicate connections are dropped. The master is supposed to send an error message over the connection that's about to be dropped, but I suspect that this message doesn't make it all the way into the client's logfile.
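
As a sketch, the 0.1.7 behavior described above amounts to something like this (the names and methods are made up for illustration, not the actual negotiation code):

    def choose_master(my_tubid, their_tubid):
        # the side with the higher TubID gets to decide which connection is used
        return "us" if my_tubid > their_tubid else "them"

    def handle_offer_0_1_7(brokers, their_tubid, new_connection):
        # first connection wins: any later duplicate is rejected
        if their_tubid in brokers:
            new_connection.sendError("duplicate connection")  # may never reach the client's log
            new_connection.drop()
        else:
            brokers[their_tubid] = new_connection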

In the problem that we're seeing, the queen winds up as the master, and it refuses to accept new connections from the recently-rebooted (or network-flapped) client until the PING-provoked traffic triggers the TCP timeout. This can prevent clients from connecting for upwards of 15 minutes, which is really annoying.

Solutions

Our idea to make this time shorter is to use the existence of the second ("duplicate") connection as an indication that perhaps the first connection is suspect. Instead of always rejecting the new connection in favor of the old one, we could do something different:

  • Option A: the old connection is always dropped, and the new connection is always accepted.
  • Option B: the old connection is marked as "suspect", and the new connection is dropped. If any data arrives over the old connection, the "suspect" flag is cleared. If a third connection is established and discovers an existing "suspect" connection, *then* the suspect connection is dropped and the new connection accepted.
  • Option C: both connections are dropped. The client is expected to use a Reconnector to trigger a new connection attempt.

To avoid a race condition between the old connection going away and the new connection being established, an approach which drops both connections is probably safer.

Multiple nearly-simultaneous connections are fairly common when the origin and destination of the connection are two processes running on the same host. It is common for FURLs to contain both a 127.0.0.1 connection hint and a hint for a globally-routeable address. If both connection hints are useable, then the recipient will see two connections occurring in rapid sequence. As a result, simply dropping both connections will cause a continuously-failing retry loop.

The option I'm considering is instead:

  • Option D: if the old connection is less than 60 seconds old, drop the new connection and stick with the old one (as in 0.1.7). If the old connection is older than 60 seconds, drop both connections.

This approach should avoid the make-before-break race condition (which would probably violate the expectations of code which uses notifyOnDisconnect), and still handle normal connection establishment properly.
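
A minimal sketch of the Option D rule, assuming the existing Broker remembers when its connection was established (the attribute and method names are illustrative):

    import time

    OLD_CONNECTION_GRACE = 60  # seconds

    def handle_offer_option_d(existing_broker, new_connection):
        age = time.time() - existing_broker.creation_time
        if age < OLD_CONNECTION_GRACE:
            # probably two hints from the same connection attempt:
            # keep the old connection, as 0.1.7 does
            new_connection.drop()
        else:
            # the old connection is old enough to be suspect: drop both
            # and let the client's Reconnector try again
            existing_broker.shutdown()
            new_connection.drop()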

I'm a bit uneasy about the arbitrary 60-second timeout, but it is worth noting that there are two existing 60-second connection-setup timers already present in Foolscap. The first is TubConnector.CONNECTION_TIMEOUT: if an outbound TCP session has not been resolved into a negotiated connection within 60 seconds, the client abandons it. The other is Negotiation.SERVER_TIMEOUT: once you connect to a server, you have 60 seconds to complete negotiation before they hang up on you.

Testing this is a nuisance:

  1. set up two computers, one as the client, the other as the server. Make sure the server gets the higher TubID, so it winds up being the master.
  2. have the client use a Reconnector to establish a connection to the server
  3. Configure iptables to silently drop packets for the established connection. Simply matching on the TCP port of the client side should be sufficient. Make sure that packets in both directions are dropped.
  4. Allow the connection to exist for at least 60 seconds.
  5. Stop the client. The FIN packet that gets sent by the kernel when the now-dead client's sockets are closed should be discarded by iptables. The activity-based timeout is now running.
  6. Start a new instance of the client. The Reconnector should attach to the server, and the server should either hang up on them or accept the new connection.
  7. Watch to see how long the new client instance's Reconnector takes to establish a new connection. It should be reasonably quick (<10 seconds).

I don't know how to automate this sort of test.

Attachments (1)

implementation.diff (13.2 KB) - added by Brian Warner 16 years ago.
first pass at implementing this approach


Change History (19)

comment:1 Changed 17 years ago by Zooko

What's the problem with option A? And what's the potential "race condition between the old connection going away and the new connection being established"?

Thanks,

Zooko

comment:2 Changed 16 years ago by Zooko

I can guess that a race condition would be if a message that was sent over the old connection was accepted by the recipient *after* a message that was sent over the new connection, where the sender had intended that all messages on the new connection were newer than all messages on the old connection.

This race condition seems easily solvable on the receiver side -- make sure that the old connection is completely dead and can't deliver any more message bytes to you before you start accepting message bytes from the new connection.

What am I missing?

comment:3 Changed 16 years ago by Brian Warner

Rob reminded me of a good reason to prefer the first connection over any later ones: quality of the resulting connection. One general motivation of the use-all-connection-hints-in-parallel plus first-one-wins approach is that some connection hints will be more direct than others, and we'd like to use the most direct one that works. For example, once we get relayers implemented, many FURLs will have an on-the-same-host 127.0.0.1 hint, plus a behind-the-same-NAT-box 10.net hint, plus maybe a UPnP-tunnelled-maybe-it-will-work hint, plus a world-visible-relayer hint that ought to be used only as a last resort.

A duplicate-suppression strategy that always drops the old connection in favor of the newest one works exactly counter to this first-one-is-probably-best rule. Worse yet, the application would probably see the faster connection succeed, send a message or two on it, then see that connection dropped, then see a new connection appear (the slower one), etc. The lame-duck connection is just going to waste time and cause confusion.

OTOH, we need to figure out an answer to this thing.. it sounds like more and more use cases involve hosts dropping off the net suddenly, and without warning. Laptops get powered off faster than kernels can advertise their departure. As Zooko and I have often discussed, connections are a (useful) fiction.. and now it seems like actual network behavior is starting to catch up to this state of affairs.

comment:4 Changed 16 years ago by Zooko

Notes from talks with Brian and with Amber this evening:

  • It is easy for foolscap to guarantee ordering within the context of a given reference (by relying on SSL ordering guarantees and making sure that a given reference never sends messages over more than one SSL connection in the reference's lifetime).
  • It is hard for foolscap to guarantee ordering across SSL connections. When Brian and I consciously noticed the question of that guarantee, Brian quickly decided he didn't want to even try.
  • It is not too hard for foolscap to guarantee ordering across multiple references that all use the same SSL connection, but it is too hard for programmers to understand how to use that guarantee safely. (For one thing, it is easy to mistake it for the aforementioned across-connections ordering guarantee. For another thing, it is easy to be mistaken about which foolscap refs point to objects in the same tub.)
  • If you are willing to allow multiple SSL connections to exist, even between the same pair of peers, then it is easy to ensure the per-reference ordering guarantee, it is easy to avoid unnecessarily breaking references (of which more below), and it is fairly easy to overcome all sorts of weird networking problems. Examples of weird networking problems include:
    • one side thinking that connection 1 is still fine while the other side has closed connection 1 and forgotten all of its associated state
    • "race conditions" or surprising orderings of events when peers simultaneously try to connect to each other, possibly using multiple addresses in parallel
    • and much more...
  • What if you are not willing to allow multiple SSL connections to exist? Assume that foolscap is going to at least provide the ordering guarantee with respect to a given reference. This means that you are going to break references whenever you close an SSL connection that the references have sent (application-layer) messages on. If you are willing to break references that didn't have to die, then you can always successfully cut down to at most one connection (the master can always close connections until there is only one left).
  • If you are not willing to allow multiple SSL connections to exist and you are not willing to break references unnecessarily, and you want to handle real-world networking weirdness, then the issue becomes tricky. The part about "not breaking references unnecessarily" means, in the context of foolscap connection negotiation, that foolscap delays sending the first message on a reference (the first message that has ever been sent on that particular live ref) until it has decided that the connection is not about to be superseded by another connection. There is a sense in which this is only heuristic -- you can't know that the connection you currently have to Bob is not going to soon be superseded by a different SSL connection to Bob -- even when your current SSL connection to Bob is going to continue to appear "up" from your point of view. So from that argument you should always delay sending the first message on a new reference, just in case. :-) I guess the question that Brian and I were asking each other earlier tonight was how foolscap should handle the case that you've already sent or received a packet indicating that you or Bob might be in the process of setting up a new connection before you've sent the first message (or at least the first one for a given live ref) on the current SSL connection.

Frankly, these two conversations (one with Brian, then one with Amber), followed by the process of writing down this comment for this ticket, have persuaded me that there is a high cost of trying to constrain to at most one simultaneous connection between two tubs. This cost comes in some combination of complexity, unnecessarily broken live refs, and/or other aberrant behavior in the face of real-world network weirdness.

Note that it is not hard to implement nice optimizations which 'reduce' the chances of having extra connections. It is also not too hard to implement other optimizations such as preferring to use faster connections over slower ones, and doing things that minimize the chances of getting stuck due to "real world network weirdness". It only gets hard if you try to simultaneously guarantee at most one active connection, zero unnecessarily broken references, and good operation in the face of real world network weirdness. Choose two.

One final note from our earlier conversations: Brian mentioned that if foolscap broadens its outlook on networking to deal with multiple live connections to the same peer, this might make it easier to later add things like prioritization, relay, and so on.

comment:5 Changed 16 years ago by Brian Warner

I had another thought over dinner last night: we could have each connection attempt (specifically each instance of negotiate.TubConnector) get a sequence number. Each negotiation offer will include the seqnum. If a new connection arrives and it has the same seqnum as the current one, we stick with the old one. If the new connection has a higher seqnum than the old one, we drop the old connection and switch to the new one.

I think this would give us the desired behavior without requiring an arbitrary timeout to distinguish between "new" and "old" connection attempts.

The offer would actually need to have a (program-incarnation, seqnum) tuple, where each time the program is started, it gets a new unique (random) incarnation number (I added the incarnation number already, for the logging code). This is needed to make sure that a client can be shut down and restarted and the new process has a chance of establishing a new connection.

The seqnum comparison should be: if the incarnation numbers are different, prefer the new connection. If they're the same: prefer the higher seqnum. If the seqnums are the same: prefer the older connection.
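
A sketch of that comparison (a hypothetical helper, not the actual patch):

    def prefer_new_connection(old, new):
        """Return True if the new offer should displace the existing connection.
        Both arguments are (incarnation, seqnum) tuples taken from the offers."""
        old_incarnation, old_seqnum = old
        new_incarnation, new_seqnum = new
        if new_incarnation != old_incarnation:
            return True   # the peer restarted: the old connection must be dead
        if new_seqnum > old_seqnum:
            return True   # a deliberately newer attempt from the same process
        return False      # same or older attempt: stick with what we have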

One problem with the incarnation number is that if anyone violates the rules and sets up a pair of identical Tubs (with the same private key), and those two Tubs both attempt to Tub.connectTo() to a third one, then the connections will keep displacing each other and cause an infinite flurry of reconnections. To mitigate this, the server could remember a brief history of the incarnation numbers that have connected recently and cry foul if it ever sees an interleaving:

    earlier_incarnations = self.incarnations[remote_tubid]  # list of IRs seen so far
    if not earlier_incarnations:
        earlier_incarnations.append(new_incarnation)
        return
    # an incarnation that reappears after a *different* one was seen means two
    # live processes are sharing the same TubID and displacing each other
    if new_incarnation in earlier_incarnations[:-1]:
        raise ReconnectorDeathMatch
    if earlier_incarnations[-1] != new_incarnation:
        earlier_incarnations.append(new_incarnation)
    return

However I'm not sure the complexity of implementing and testing this check is worth it.

I think that this "connection attempt number" would make it easier to implement the other multi-connection things we talked about: the negotiation offer would explain to the receiving end that this particular connection attempt is trying to establish the "high-priority" connection, so that it ought to be treated differently than an existing "low-priority" connection.

Implementation details:

  • negotiation has three phases:
    • client-vs-server: the client initiates the TCP connection, while the server accepts it. The client sends the HTTP-like tubid request, and the server sends the "change protocols" message
    • symmetric offers: both ends send an rfc822-formatted dictionary of negotiation offer messages
    • master-vs-slave: the end with the higher tubid (the "master") decides upon the connection parameters and sends the decision to the slave. The slave accepts the parameters or disconnects.
  • the incarnation number should be sent in both offers.. it might eventually be useful in the server-to-client direction too
  • the connection attempt number will only appear in one offer, that sent by the client side (since the server side doesn't even have a TubConnector object, and didn't initiate a connection).
  • this number will only be used by the master side, when comparing it against any existing connections. This means that it might not be used at all, if the client happens to have the higher tubid.
  • disconnecting the old connection is easy. Accepting a new connection is easy. Doing both is hard. It's very important to make sure that all side-effects of dropping the old connection (notifyOnDisconnect, probably Reconnector even though that'd be pretty weird) are run or at least scheduled before the side-effects of accepting the new connection (mainly the getReference callback) are run or scheduled. Ordering is important here. Calling transport.loseConnection doesn't cause protocol.connectionLost to be fired synchronously; in fact, it won't fire until all lingering data has been written out the socket, or the TCP timeouts expire and the socket is marked as closed. Therefore it isn't sufficient to do eventually(acceptNewConnection). Worst case we might need to reject the new connection, drop the old one, and assume that a Reconnector will try again.

comment:6 Changed 16 years ago by Brian Warner

I'm trying to understand more about what Zooko meant by "breaking refs unnecessarily".

For reference, here's the sequence of events when Alice does tub.getReference(FURL):

  • Alice's Tub establishes a connection to Bob's Tub
    • This gives Alice a Broker instance that holds the TCP transport object
  • Alice synthesizes a RemoteReference to Bob's Tub (using CLID=0)
  • Alice does tub_rref.callRemote("getReferenceByName", swissnum)
  • Bob does a lookup, returns a Referenceable, which gets serialized with a CLID
  • Alice receives the CLID, and creates a RemoteReference instance (actually a RemoteReferenceTracker) that remembers the Broker and the CLID

Once Alice creates the RemoteReference, it is bound to the connection. If the connection goes away, the RemoteReference must die.

So the note about "delay sending the first message" would really need to mean "delay sending the getReferenceByName", since once Alice makes the RemoteReference, it's too late.
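
For reference, the application-level view of that sequence looks roughly like this (a sketch; the FURL is a placeholder, and the import path has varied across releases):

    from foolscap import Tub

    tub = Tub()
    tub.startService()
    d = tub.getReference("pb://abc123@example.com:12345/swissnum")

    def got_reference(rref):
        # rref is now bound to the Broker (and thus the TCP connection) that
        # was used to fetch it; if that connection goes away, rref must die
        return rref.callRemote("some_method")
    d.addCallback(got_reference)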

comment:7 Changed 16 years ago by Brian Warner

Some random notes gathered from the linux kernel networking documentation (linux-2.6.22.6/Documentation/networking/ip-sysctl.txt)

  • /proc/sys/net/ipv4/tcp_keepalive_time = 7200 (two hours)
  • tcp_keepalive_probes = 9 (send 9 probes before killing the connection)
  • tcp_keepalive_intvl = 75 (wait 75 seconds between probes)

i.e. if you turn on SO_KEEPALIVE, then once every two hours, the TCP stack will send a keepalive. If they can't get through for 9*75 = ~11 minutes, the connection will be dropped.
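
For comparison, here is how those knobs look from Python; the per-socket overrides are Linux-specific:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-only per-socket overrides of the /proc/sys/net/ipv4 defaults:
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120)  # idle seconds before first probe
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # seconds between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # probes before giving up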

Also:

  • tcp_retries1 = 3

How many times to retry before deciding that something is wrong and it is necessary to report this suspicion to the network layer. The minimal RFC value is 3, which is the default; it corresponds to roughly 3 seconds to 8 minutes depending on RTO.

  • tcp_retries2 = 15

How many times to retry before killing a live TCP connection. RFC 1122 says that the limit should be longer than 100 seconds, which is too small a number. The default value of 15 corresponds to roughly 13-30 minutes depending on RTO.

I don't understand either of these two.

Experimental results (using iptables to drop packets between two hosts) show that a client doing callRemote every 5 seconds notices the lost connection 15m46s after the last successful call. The server (which was doing callRemote at a similar interval) saw the connection go away two minutes later.

comment:8 Changed 16 years ago by Zooko

(summarized from IRC discussion)

Here's a story to illustrate what I meant about breaking RemoteReferences unnecessarily.

Suppose you have an SSL connection S1 and a RemoteReference r1, and you have bound r1 to S1.

Now suppose your peer tries to set up another SSL connection S2 with you.

If you're allowed to have at most one SSL connection, then you have to decide now whether to break r1 or to reject your peer's new connection attempt.

If you're allowed to (at least temporarily or rarely) have more than one, then you can go ahead and accept the new connection, and decide later what to do about S1/r1, such as timing S1/r1 out.

Brian explained that this sort of laziness about breaking connections and their RRs wouldn't be too hard to implement, as each RR has a Broker, and the Broker maintains the SSL connection, but an RR object would not notice if it had a different Broker than another RR object had.

The Tub has a table mapping from peer-tub-id to Broker, and there is at most one Broker per peer-tub-id in that table. This table could be interpreted as "the known-good broker to use for newly created RRs", instead of "the only broker that can be used for this peer by any RR".
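
A sketch of that reinterpretation (the class and attribute names are made up): the table only decides which Broker newly created RemoteReferences get, while existing ones keep whatever Broker they were bound to:

    class TubSketch:
        def __init__(self):
            self.brokers = {}  # tubid -> the "known-good" Broker for new references

        def broker_for_new_references(self, tubid):
            # newly created RemoteReferences are bound to this Broker ...
            return self.brokers[tubid]

        def accept_new_connection(self, tubid, new_broker):
            # ... but replacing the table entry does not touch RemoteReferences
            # already bound to the old Broker; that one can be torn down (or
            # timed out) separately
            self.brokers[tubid] = new_broker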

It is my intuition that this change would make it easier to handle weird cases such as connections that are dead from A's perspective but live from B's perspective, and cross-connect race-conditions, and so on, but neither Brian nor I came up with an example of where this would make something work better.

Conceptually, at least, it seems nice if you can consider the requirement of "make sure there is at least one working connection between the two of you" separately from the requirement of "tear down connections which are no good".

comment:9 Changed 16 years ago by Brian Warner

So one realization that rob had while we were talking about this was that if we allow multiple connections to the same remote Tub to exist, then we can wind up with multiple RemoteReferences to the same object (each using a different Broker).

This impacts EQ and reference exchange. Should these two RemoteReferences appear to be the same? They can both be used to send messages to the same place, but of course the ordering guarantees are only upheld for messages sent through a single RemoteReference, not between multiple rrefs.

If we have rrefs A and B that point to the same object (through different brokers), and you pass me an object through the return value of one of them (call this Ca), I might send it back to you through B, if I think A and B are basically interchangeable. But this delivery is a third-party reference, since Ca is tied to A's broker. My Tub will compute the FURL of Ca and send that to B, who will then establish a connection to.. themselves, and ask themselves for the named reference. The target will wind up with a RemoteReference to their own object, whereas if they'd received the object through rref A they would have gotten the original Referenceable. So A and B aren't quite as interchangeable as we might hope.

I'm not sure this is a critical problem, but it'd be nice to avoid the confusion somehow.

comment:10 Changed 16 years ago by Brian Warner

I've got two potential solutions in my notebook right now. In both cases, each Tub creates a per-instance incarnation number (IN), as described earlier.

  • A: the approach described above: each connection request (the offer) includes an (IN, seqnum) tuple, such that each TubConnector gets a higher seqnum than the last one (the Tub could store a single counter shared among TubConnectors for all remote tubs; there's no need to keep separate values for each one). Inbound connections have their tuple compared against the tuple for the existing connection, and they're accepted if the INs are different, or if the INs are the same but the incoming seqnum is higher.
  • B: vector clocks. Each Tub maintains a seqnum, which is incremented each time it is read. Each Tub also remembers the other tub's seqnums: it maintains a mapping, keyed by tubid, which contains the seqnum that was used in the last established connection to that tub. The offer contains the (IN, my-seqnum, your-seqnum) 3-tuple. The new offer is accepted if the INs are different, or if the your-seqnum is equal to that of the existing connection. It is rejected (in favor of the existing connection) if the your-seqnum is lower than that of the existing connection.
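
Sketches of the two acceptance rules (hypothetical helpers; "existing" is the connection we already have, and the attributes mirror the tuples described above):

    def accept_offer_A(existing, offer):
        # offer carries the initiating side's (IN, seqnum)
        if offer.incarnation != existing.incarnation:
            return True                        # peer restarted
        return offer.seqnum > existing.seqnum  # newer attempt from the same process

    def accept_offer_B(existing, offer):
        # offer carries (IN, my-seqnum, your-seqnum); your-seqnum is our own
        # seqnum as last seen by the sender
        if offer.incarnation != existing.incarnation:
            return True                        # peer restarted
        # the sender knows about (and wants to replace) the existing connection
        return offer.your_seqnum == existing.my_seqnum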

In A, it isn't clear how to compare an existing connection that was made from Alice to Bob with a potential new connection being made from Bob to Alice. The TubConnector seqnum is only present in the initiating-side offer, since the receiving side wasn't using a TubConnector at all. So when Bob sees a new inbound connection from Alice that proposes to replace his existing connection (which Bob initiated), he doesn't know if Alice knew about the existing connection (and she intends to replace it), or if she didn't (and this is just a race).

In B, the intentionality is more clear: if Alice sends an offer that mentions the existing connection (by using a bob-seqnum equal to that of the existing connection), then she knew about that connection and is making a deliberate effort to connect anyway: Bob can conclude that she feels the old connection is no longer viable, so Bob ought to accept the new connection. If Alice's bob-seqnum is lower than the existing connection, then it means Alice doesn't know about that connection: perhaps there is a cross-connect race and Alice just has not yet received the recent decision, or Alice's network broke (perhaps some time ago) in the middle of negotiation and she never heard about the decision. In the former case we want to stick with the existing connection, in the latter we'd prefer to take the new connection, but we have no way to tell the difference. The result is that this sort of failure will make Bob unreachable for the 15 minutes it takes his Tub to TCP-timeout the old (now-broken) connection.

Still pondering..

comment:11 Changed 16 years ago by Zooko

Okay, I agree that having at-most-one connection per tub is doable, and desirable (for simplicity).

comment:12 Changed 16 years ago by Brian Warner

Long chat with robk, here's the current plan:

Each Tub has a (random and unique) Incarnation Record, aka "IR". There might be a portion of this that is sequential, if the Tub is given a place to increment it, but maybe not. The important thing is that it's unique to each run of the program, such that we can assume that two messages with the same IR are coming from the same process with the same state. There are sequence numbers that get reset each time the program is restarted, so we use the IR to decide if we can meaningfully compare these sequence numbers or not.

Each Tub maintains two tables, both keyed by tubid. The "master table" records the sequence number we last used for a connection for which we were the master: this is updated each time we send a decision to the slave side, by incrementing the number and storing+sending the new value. The "slave table" records the (master-IR,master-seqnum) record we last used for a connection in which we were the slave: it is updated each time we receive a decision from the master, by simply storing the value in the dictionary.

When we send a negotiation offer, if we see a value in the slave table, we send it, otherwise we don't (we just omit that key from the offer). We also send our Incarnation Record. Old clients don't know about incarnation records and don't send them.

OFFER:
 my-tub-id: abc123
 my-incarnation: 789bfe
 last-connection: 456def,8   # master-IR,master-seqnum

Each master-side Broker records the slave-IR value and the current-connection master-seqnum value that was used in the decision.

When we're the master, and we receive an offer for which we already have a Broker, we do the following tests:

  • does the inbound offer have a different my-incarnation value than the slave-IR recorded in the existing Broker? If so, the offer wins. This occurs when the slave has restarted since our last known connection: clearly they can't still be using the old connection that we know about.
    • When the offer wins: we shut down the Broker, then accept the offer. Every time we accept an offer, we increment the seqnum in our master table, send the new seqnum in the decision, and record it in the new Broker.
    • if the offer does not have a my-incarnation value, treat it as None. This means that we cannot distinguish between different runs of the remote program.
  • does the inbound offer's last-connection.master-IR value differ from our current IR? If so, the offer wins: this occurs when a decision message went missing, such that we decided upon a connection but the slave never heard about it, and the last connection that the slave *did* hear about was from a previous run of the master.
    • If the inbound offer doesn't have a last-connection header, skip this step. Old clients do not pay attention to our IR value, so this test is meaningless.
  • (compatibility improvement): if the inbound offer does not have an IR and the handle-old-clients flag is on, and the Broker was created more than a minute ago, then accept the offer.
  • (compatibility): if the inbound offer does not have an IR, reject the offer. This could be caused by several indistinguishable situations, but rejecting the offer at least gets us the same behavior as 0.1.7.
  • if both IRs are the same, compare the offer's last-connection.master-seqnum against the current Broker's master-seqnum. If they are the same, accept the offer. This indicates that the client knows about the current connection and wants a new connection anyway.
  • if the offer's last-connection.master-seqnum is smaller than the Broker's master-seqnum, reject the offer. This happens when the offer was delayed, or was part of a set of offers that were sent to multiple connection-hints at the same time, one of which has already won. To avoid connection flap, we reject the connection.
  • if the offer's last-connection.master-seqnum is larger than the Broker's, then there is some sort of bug: this should not be able to happen. Log it loudly, and reject the offer.

Note that if we're the master and there is no existing Broker, we always accept the new connection. Also, if we're the slave, we always replace any existing Broker with the new connection.
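
Putting those checks together, the master-side logic would look roughly like this (a sketch of the rules above, with illustrative names and attributes, not the attached patch):

    import time

    def master_decide(self, offer, broker):
        """Return True to accept the new offer, False to keep the existing Broker."""
        offer_IR = offer.get("my-incarnation")    # None for old clients
        last_conn = offer.get("last-connection")  # (master-IR, master-seqnum) or None

        if offer_IR != broker.slave_IR:
            return True   # the slave restarted since the existing connection
        if last_conn is not None and last_conn[0] != self.my_IR:
            return True   # the slave's last known decision came from a previous run of us
        if offer_IR is None:
            # old client: fall back to the age heuristic, or to 0.1.7 behavior
            if self.handle_old_clients and time.time() - broker.creation_time > 60:
                return True
            return False
        master_seqnum = last_conn[1] if last_conn else None
        if master_seqnum == broker.master_seqnum:
            return True   # the slave knows about the current connection, wants a new one
        if master_seqnum is not None and master_seqnum > broker.master_seqnum:
            self.log("BUG: offer's master-seqnum is ahead of our own")  # should be impossible
        return False      # delayed or racing offer: keep the existing connection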

The decision contains the new-connection data:

DECISION:
 params..: ...
 current-connection: 456def,9     # master-IR,master-seqnum

The (compatibility) step is to help the AllMyData MV case in which the central queen is using the latest foolscap but there are still clients running an old version (which does not have this change, and thus does not send my-incarnation or last-connection headers in the offer). In the original situation that prompted the creation of this ticket, laptop clients with low tubids connect to the queen, then get shut down abruptly, then are brought back up a few minutes later with a different IP address, and get Duplicate Connection Rejected messages until the queen's keepalives/TCP-timeouts finally mark the connection as down (which can take up to 35 minutes).

Updating both the clients and the queen will fix this, but (as with any deployed application) there will be plenty of old clients remaining, and it would be nice to handle them well. Old clients will continue to omit the my-incarnation and last-connection headers, so the master will record None for these values, making it look like both sides are never restarting, and removing our ability to compare sequence numbers. The best we can do is to guess (based upon how long the existing Broker has been established) whether the offer is related to the existing Broker, or if it's related to a brand new connection attempt. The heuristic is based upon the age of the Broker: if it was created more than a minute ago, we say that the offer is new (and should therefore be accepted), otherwise we say the offer is from the same connection attempt that got us the Broker (and should therefore be rejected).

comment:13 Changed 16 years ago by Zooko

Nice! Good work you two!

comment:14 Changed 16 years ago by Brian Warner

More experimental results: using iptables to silently drop packets results in TCP attempting retries (when it doesn't hear the ACK back quickly enough). The packet timestamps and inter-packet delays are:

  • 13:37:39.153611
  • 13:37:39.359778 (206ms)
  • 13:37:39.773700 (413ms)
  • 13:37:40.601547 (829ms)
  • 13:37:42.257241 (1.65s)
  • 13:37:45.568631 (3.3s)
  • 13:37:52.191409 (6.6s)
  • 13:38:05.436959 (13.2s)
  • 13:38:31.928060 (26.5s)
  • 13:39:24.910262 (53s)
  • 13:41:10.874660 (106s)

There were probably some more retries, but I went out to lunch and missed them. The client eventually succeeded at reconnecting 25 minutes after the network was disabled.

comment:15 Changed 16 years ago by Brian Warner

I have a first-pass implementation of this, and it appears to work. Patch is attached. I don't have any new tests for it, though, just a manual test using iptables to simulate silent network failure.

Changed 16 years ago by Brian Warner

Attachment: implementation.diff added

first pass at implementing this approach

comment:16 Changed 16 years ago by Brian Warner

I just pushed the code for this, along with unit tests that probably cover about 60% of the code paths.

Still todo:

  • cover the other code paths
  • cover the reverse case (I only wrote tests for client==slave, but not for client==master)
  • document the approach in docs/

I had to add the TubConnector-id derived from approach "A" above, to allow the server to tell the difference between 1) an offer that is part of the same batch of connections as the existing one, and 2) the second connection attempt from a client who almost-but-not-quite succeeded on their first attempt (such that the master sent a decision but the message got lost).

In the current scheme, the client's offer looks like:

OFFER:
 my-incarnation: client-IR
 this-connection: connection-ID
 last-connection: master-IR master-seqnum

and each TubConnector gets a unique "this-connection" string. The resulting decision has:

DECISION:
 current-connection: master-IR master-seqnum

and the master-side Broker records:

BROKER (master):
 current-slave-IR: copy of offer['my-incarnation']
 current-seqnum: copy of decision['current-connection'][1]
 current-attempt-id: copy of offer['this-connection']

comment:17 Changed 16 years ago by Brian Warner

Component: unknown → negotiation
Milestone: undecided → 0.2.0

comment:18 Changed 16 years ago by Brian Warner

Resolution: fixed
Status: new → closed

Ok, I'm satisfied with the new code. The seqnum is pretty much unused, since it turned out there is a lost-decision-message case that means it's slightly better to accept an old seqnum rather than reject it. The current algorithm compares incarnation-records first, then compares connection-attempt-ids. Pretty much the only way to get an offer rejected is for it to have the same attempt-id as the current one.
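
A condensed sketch of that final rule (illustrative names, not the exact code that landed):

    def should_accept(self, offer, broker):
        # compare incarnation records first ...
        if offer.get("my-incarnation") != broker.current_slave_IR:
            return True   # a different run of the client: the old connection is dead
        # ... then connection-attempt ids: only an offer from the very same
        # TubConnector attempt as the existing Broker gets rejected
        if offer.get("this-connection") == broker.current_attempt_id:
            return False
        return True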

I haven't yet implemented the 60-second-old thing to better handle old clients.. that's in a different ticket (#34).
