#25 closed defect (fixed)
slow remote_foo methods stall all further deliveries
Reported by: | Brian Warner | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | 0.1.7 |
Component: | unknown | Version: | 0.1.6 |
Keywords: | | Cc: | |
Description
Much to my embarrassment, foolscap-0.1.6 has a major bug in the call-delivery code. This bug effectively breaks any application which takes a long time to execute a remote_ method (i.e. it returns a Deferred instead of completing synchronously).
The problem is in Broker.doNextCall, where we have a queue of InboundDelivery objects, one per remote call. This queue exists to make sure that messages which take a long time to deserialize (because they contain gifts that must be claimed) do not race ahead of other messages which were sent later. If the top-most message is not yet ready (meaning that we've received all of its tokens, but we haven't finished deserializing it because of the presence of not-yet-claimed gifts), then we stall the queue until it becomes ready.
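To make the intended behavior concrete, here is a minimal sketch of that stall-until-ready queue. This is not foolscap's actual code: it uses stdlib asyncio instead of Twisted, and the names (`InboundDelivery`, `drain_queue`) are illustrative stand-ins.

```python
import asyncio

class InboundDelivery:
    """One queued remote call; `ready` fires when deserialization finishes."""
    def __init__(self, name):
        self.name = name
        self.ready = asyncio.Event()  # set once all gifts have been claimed

async def drain_queue(queue, delivered):
    # Deliveries happen strictly in arrival order: if the head of the
    # queue is not yet ready, everything behind it waits.
    for delivery in queue:
        await delivery.ready.wait()
        delivered.append(delivery.name)

async def demo():
    first, second = InboundDelivery("first"), InboundDelivery("second")
    second.ready.set()          # "second" deserialized immediately
    delivered = []
    task = asyncio.create_task(drain_queue([first, second], delivered))
    await asyncio.sleep(0)      # drainer runs, blocks on the unready head
    assert delivered == []      # "second" must not race ahead of "first"
    first.ready.set()           # the gift for "first" is finally claimed
    await task
    return delivered

print(asyncio.run(demo()))      # ['first', 'second']
```

Note that even though "second" was ready first, it is delivered after "first": the queue exists precisely to preserve send order.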
The specific bug is that this queue also stalls inbound messages until the previous call has finished *running*. Each remote_foo() method is allowed either to return synchronously, or to return a Deferred that will fire with the message's result at some later point in the future. The caller (the remote side which runs rref.callRemote) always gets its own Deferred, and that one doesn't fire until the message has been delivered and the response comes back.
But with the current queue logic, the remote_foo's Deferred also prevents subsequent inbound messages from being delivered. Allmydata had a method that failed to complete (or stalled for several hours; we're not sure which), and that stall caused significant secondary damage, since no further messages could be delivered. This also makes it much easier to deadlock, if two Tubs are sending slow messages at each other.
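The buggy pattern can be reduced to this asyncio sketch (again illustrative, not foolscap's code): the dispatch loop awaits each handler's *result* before delivering the next message, so one slow handler stalls the whole connection.

```python
import asyncio

async def buggy_dispatch(handlers):
    # Bug pattern: the queue waits for each handler's result (its
    # Deferred), not just its deserialization, so one slow remote_foo
    # blocks every later delivery on the same connection.
    results = []
    for h in handlers:
        results.append(await h())   # stalls the whole queue on a slow call
    return results

async def demo():
    order = []
    async def slow():
        order.append("slow-start")
        await asyncio.sleep(0.05)   # stands in for an hours-long stall
        order.append("slow-done")
        return "slow"
    async def fast():
        order.append("fast")
        return "fast"
    await buggy_dispatch([slow, fast])
    return order

print(asyncio.run(demo()))  # ['slow-start', 'slow-done', 'fast']
```

The "fast" call is not delivered until the slow one finishes, which is exactly the stall Allmydata hit.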
The fix will be to change the queueing logic: use the queue only for the ready_deferred, and once that fires, dispatch each message+response onto its own chain. And of course the first step is always to write a unit test which reproduces the problem, in test_call.py.
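A sketch of the fixed dispatch, under the same illustrative asyncio assumptions as above: messages are still handed to their handlers in order, but each response then completes on its own chain, so a slow handler no longer blocks later deliveries.

```python
import asyncio

async def fixed_dispatch(handlers):
    # Fix pattern: preserve ordering only up to the point of delivery,
    # then give each message+response its own chain. A slow handler can
    # no longer stall deliveries queued behind it.
    tasks = [asyncio.create_task(h()) for h in handlers]  # deliver in order
    await asyncio.gather(*tasks)    # responses complete independently

async def demo():
    order = []
    async def slow():
        order.append("slow-start")
        await asyncio.sleep(0.05)   # a long-running remote_foo
        order.append("slow-done")
    async def fast():
        order.append("fast")
    await fixed_dispatch([slow, fast])
    return order

print(asyncio.run(demo()))  # ['slow-start', 'fast', 'slow-done']
```

Here "fast" runs as soon as "slow" yields, even though "slow" was delivered first and has not yet produced its result.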
Change History (2)
comment:1 Changed 17 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:2 Changed 17 years ago by
Milestone: | undecided → 0.1.7 |
---|---|
fixed, with [bc745eddbf90fd4cbdbc4ca113edd439dfdec5be].