Opened 17 years ago

Closed 17 years ago

Last modified 17 years ago

#25 closed defect (fixed)

slow remote_foo methods stall all further deliveries

Reported by: Brian Warner Owned by:
Priority: critical Milestone: 0.1.7
Component: unknown Version: 0.1.6
Keywords: Cc:

Description

Much to my embarrassment, foolscap-0.1.6 has a major bug in the call-delivery code. This bug effectively breaks any application which takes a long time to execute a remote_ method (i.e. it returns a Deferred instead of completing synchronously).

The problem is in Broker.doNextCall, where we have a queue of InboundDelivery? objects, one per remote call. This queue exists to make sure that messages which take a long time to deserialize (because they contain gifts that must be claimed) do not race ahead of other messages which were sent later. If the top-most message is not yet ready (meaning that we've received all of the tokens for it, but we haven't finished deserializing if because of the presence of not-yet-claimed gifts), then we stall the queue until it becomes ready.

The specific bug is that this queue is also stalling inbound messages until the previous call has finished *running*. Each remote_foo() method is allowed to either return synchronously, or to return a Deferred that will fire with the message's result at some later point in the future. The caller (the remote side which runs rref.callRemote) always gets their own deferred, and that one doesn't fire until the message has been delivered and the response comes back.

But with the current queue logic, the remote_foo's Deferred also prevents subsequent inbound messages from being delivered. Allmydata had a method that failed to complete (or stalled for several hours, we're not sure which), and that stall caused significant secondary damage, since no further messages could be delivered. This also makes it a lot easier to deadlock, if two Tubs are sending slow messages at each other.

The fix will be to change the queueing logic, using the queue for the ready_deferred, but once that fires, dispatch each message+response onto its own chain. And of course the first step is always to write a unit test which reproduces the problem, in test_call.py

Change History (2)

comment:1 Changed 17 years ago by Brian Warner

Resolution: fixed
Status: newclosed

comment:2 Changed 17 years ago by Brian Warner

Milestone: undecided0.1.7
Note: See TracTickets for help on using tickets.