Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#104 closed defect (fixed)

'reference' tokens are occasionally broken

Reported by: Brian Warner Owned by:
Priority: critical Milestone: 0.4.0
Component: banana Version: 0.3.0
Keywords: Cc:

Description

While investigating Tahoe#541, it seems to me that the 'reference' token is just plain broken. Specifically, the calls to setObject are showing up with counter values which don't match up.

One test client did a remote call with the sequence:

  • OPEN(4) "call" 3 1 "allocate_buckets" OPEN(5) "arguments" 5 "arg1" "arg2" OPEN(6) "set" 1 CLOSE(6) ...

and on the server side I saw setObject(5) called with a set, and no call to setObject(6).

Something is screwy, it looks like somehow the counter value is corrupted or stale when it is passed in to setObject(). A quick glance suggests that setObject is being called at the right time and with the right objects, it's just that the counter= value is wrong (usually 1 smaller than it should be)

Change History (4)

comment:1 Changed 12 years ago by Brian Warner

Ah, I think I found it. So, for review:

  • OPEN tokens can carry a counter value. This is optional, and is used for debugging: the receiver should tolerate their absence. Each OPEN token gets a sequential number.
  • CLOSE tokens can also carry a counter, which is also optional, and helps a debugger correlate OPENs and CLOSEs.
  • 'reference' sequences use a number to indicate which original object they are referring to. For example, if the same tuple is sent three times in a row, you'll get a sequence like:
    • OPEN, "tuple", "a", "b", CLOSE, OPEN, "reference", 4, CLOSE, OPEN, "reference", 4, CLOSE
  • the 'reference' token's number is called the "object count", and it is subtly different than the OPEN/CLOSE token's number (which is known as the "open count"). Well, they ought to always be the same. But since the OPEN/CLOSE counter is optional, the receiving broker keeps track of the object counter by counting OPEN tokens: it increments a per-connection counter each time an OPEN token arrives.
  • the setObject() call is made using the object counter, and any subsequent getObject() call is made using the reference token's number (which ought to match)

Now, the problem I'm seeing here is due to the Violation that occurred earlier (in the case of Tahoe #541, it was because we started using a versioning scheme which invokes a method that doesn't exist on older servers). When a Violation is raised, Foolscap discards all tokens until the current unslicer frame is exited (this prevents a resource-exhaustion attack: discarding the tokens minimizes the memory and CPU that they can consume).

The specific problem is that this token-discarding also discards the intervening OPEN tokens, which means the object counter is not incremented. For every OPEN token that was in the discarded stream, the object counter and the open counter will drift by one. All subsequent setObject() calls will get the wrong number and will put the object in the wrong place. Any 'reference' tokens will call getObject() with the correct value, and since it won't match the setObject() value, getObject() will either get the wrong thing or it won't get anything at all.

So, the fix will be to count OPEN tokens (i.e. increment the 'object counter') even when we're discarding tokens because of a Violation. The unit test needs to do:

  • a call which triggers a Violation (due to a far-side remote interface which doesn't include the method being called), with a few layers of nested objects as arguments. Count the number of arguments to figure out the erroneous drift value. No arguments is probably a drift of one.
  • a second call which uses references. In particular, it should look like:
 t = ("tuple", 1,2)
 rref.callRemote("foo", [["list", 3,4], t, t])
  • the second 't' should use a 'reference' token, that ought to point at the previous 't'. If the counters are off by one, it will point at the ["list",3,4] list instead, and this will either trigger a Violation (if the schema is tight enough), or the code in remote_foo can detect that it gets a third argument of ["list",3,4] instead of the expected ("tuple",1,2)

comment:2 Changed 12 years ago by Brian Warner

Milestone: undecided0.3.3
Priority: majorcritical

comment:3 Changed 12 years ago by Brian Warner

Resolution: fixed
Status: newclosed

Fixed, in [a59482b3d45368220234f824d28d692c7ee343b7]. The deserialization code is gnarly, but I decided to refrain from a big refactoring project and instead just fix the bug with as little code as possible.

comment:4 Changed 12 years ago by Brian Warner

Milestone: 0.3.30.4.0
Note: See TracTickets for help on using tickets.