Opened 16 years ago

Closed 16 years ago

#72 closed defect (fixed)

survive logging overloads

Reported by: Brian Warner
Owned by:
Priority: critical
Milestone: 0.2.9
Component: logging
Version: 0.2.5
Keywords:
Cc:

Description

We isolated a problem on Tahoe to one node emitting a lot of log messages, and the log-gatherer being unable to keep up with the flow. The node was creating (and queueing) messages about six times as fast as the gatherer could accept them. The overload was piling up in RAM (probably in the Transport's write queue, waiting for the socket to accept the write() calls). In the span of three hours, the node's RSS had grown to 3GB and Python started throwing MemoryErrors. Eventually one of them hit the reactor, which wrote a stack trace to TWISTD-CRASH.log and tried to exit (unfortunately it looks like the threadpool failed to shut down, so the reactor thread quit but the other threads did not, and the headless node kept running).

The root problem is that foolscap's logport has no flow-control mechanism (no backpressure), largely because foolscap doesn't make one easy to implement (callRemote is mostly send-and-forget). For logging (which is defined to be less important than regular operations), the response to this backpressure should be to drop log messages (and replace them with some lower-rate "log messages have been lost" message, perhaps implemented simply as a gap in the message sequence numbers).

So the task is to rework the logport to impose a limit on its memory footprint. This probably means creating a size-bounded queue (above the callRemote layer) and randomly discarding log messages when the queue gets close to full. Other code should use a windowing protocol to pull messages from this queue and callRemote them to the subscriber, allowing some number to be in flight at any given time and ignoring errors. When a message is retired, it should be removed from the queue, allowing more messages to be added.
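
As a rough sketch of that shape (class and method names here are hypothetical, and a Twisted-style subscriber whose callRemote returns a Deferred is assumed; this is not foolscap's actual publisher):

{{{
#!python
import random
from collections import deque

class BoundedLogPublisher:
    """Sketch only: a size-bounded queue above the callRemote layer,
    plus a windowed sender that ignores delivery errors."""

    def __init__(self, subscriber, maxlen=10000, window=10):
        self.subscriber = subscriber  # RemoteReference to the log observer
        self.maxlen = maxlen          # cap on queued (not-yet-sent) messages
        self.window = window          # max messages in flight at once
        self.queue = deque()
        self.in_flight = 0
        self.dropped = 0

    def add(self, entry):
        if len(self.queue) >= self.maxlen:
            # random discard, as suggested above (comment:3 later switches
            # to discard-new)
            del self.queue[random.randrange(len(self.queue))]
            self.dropped += 1
        self.queue.append(entry)
        self._maybe_send()

    def _maybe_send(self):
        while self.queue and self.in_flight < self.window:
            entry = self.queue.popleft()
            self.in_flight += 1
            d = self.subscriber.callRemote("msg", entry)
            d.addBoth(self._retired)  # logging is best-effort: ignore errors

    def _retired(self, _result):
        self.in_flight -= 1
        self._maybe_send()
}}}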

It might also be useful to create a "log overload gatherer", with a separate FURL, to which a message is sent (at most once per hour) when a log message must be dropped for lack of space.
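
For the once-per-hour notification, something as simple as the following would do; the gatherer reference and its "overloaded" remote method are assumptions for illustration, not an existing foolscap interface:

{{{
#!python
import time

class OverloadReporter:
    """Sketch of the 'log overload gatherer' idea: report dropped messages
    to a separate gatherer at most once per hour."""

    MIN_INTERVAL = 3600  # seconds: at most one report per hour

    def __init__(self, gatherer):
        self.gatherer = gatherer      # RemoteReference obtained from the FURL
        self.last_report = 0
        self.dropped_since_report = 0

    def message_dropped(self):
        self.dropped_since_report += 1
        now = time.time()
        if now - self.last_report >= self.MIN_INTERVAL:
            self.last_report = now
            count, self.dropped_since_report = self.dropped_since_report, 0
            d = self.gatherer.callRemote("overloaded", count)
            d.addErrback(lambda f: None)  # best-effort: ignore failures
}}}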

Change History (3)

comment:1 Changed 16 years ago by Brian Warner

Milestone: undecided → 0.2.9

Another thought is to pull log messages from the different priority queues separately, and have the overload message survive by virtue of being placed in the WEIRD or CURIOUS queue. But this is at odds with the goal that subscribers see messages arrive in order of generation.
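
For reference, a minimal sketch of that per-priority-queue alternative (severity names follow a subset of foolscap's log levels; the draining logic is hypothetical):

{{{
#!python
from collections import deque

# Highest severity first.
LEVELS = ["BAD", "SCARY", "WEIRD", "CURIOUS", "UNUSUAL", "OPERATIONAL", "NOISY"]

class PriorityQueues:
    def __init__(self, maxlen_per_level=1000):
        self.queues = dict((level, deque()) for level in LEVELS)
        self.maxlen = maxlen_per_level

    def add(self, level, entry):
        q = self.queues[level]
        if len(q) < self.maxlen:
            q.append(entry)  # a flood of NOISY traffic cannot crowd out WEIRD

    def next(self):
        # Drain highest severity first, so the overload notice survives; but
        # this breaks generation-order delivery, as noted above.
        for level in LEVELS:
            if self.queues[level]:
                return self.queues[level].popleft()
        return None
}}}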

comment:2 Changed 16 years ago by Brian Warner

This is probably fixed by [940654f85ad3cc02fe7c78583b1252e04e8179a0], which adds bounded queues to the publishing mechanism. It has unit tests, of course, but I need to do a real-world test in the same situations that we originally observed, and see how many messages get dropped. We should be able to figure out a typical maximum messages-per-second rate, and of course if the sustained rate goes above this, the bounded queue will eventually drop messages to cope.
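
As a back-of-the-envelope check (not the unit tests mentioned above): a bounded queue fed about six times faster than it is drained, as in the original incident, ends up dropping roughly five out of every six messages once it fills. The numbers below are illustrative, not measured:

{{{
#!python
from collections import deque

def simulate(produce_rate=600, drain_rate=100, seconds=3600, maxlen=10000):
    """Count deliveries vs. drops for a bounded queue with a fixed
    producer/consumer rate mismatch."""
    queue, delivered, dropped = deque(), 0, 0
    for _ in range(seconds):
        for _ in range(produce_rate):
            if len(queue) >= maxlen:
                dropped += 1
            else:
                queue.append(None)
        for _ in range(min(drain_rate, len(queue))):
            queue.popleft()
            delivered += 1
    return delivered, dropped

delivered, dropped = simulate()
print(dropped / float(dropped + delivered))  # roughly 0.83 for a 6:1 mismatch
}}}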

comment:3 Changed 16 years ago by Brian Warner

Resolution: fixed
Status: new → closed

Updated to use discard-new instead of discard-random, in [6859c79f11b0aefb7f9570c159246fa366a7b7b2]. My thinking is that Random-Early-Discard is really about keeping TCP's congestion-control algorithm from going insane (if you make big gaps in the TCP sequence numbers, the sender will believe that the link is really fast, then really slow, then really fast, etc.; but if you discard a constant n% of the traffic, it will throttle down smoothly). Since we don't have anything that sophisticated in the sending algorithm, and since we're far away from the underlying TCP stream (our congestion is in the CPU, not the network), I don't think we need RED for this purpose.

We might want RED for other reasons: to preserve a different set of messages (if we care about finding out why we got overloaded, is it better to keep the early messages or the later ones?). But I don't have a good answer for that yet, so discard-new seems the most likely to be useful. This is easy to change later if we learn differently (in particular, if we want to switch to RED, just apply the inverse of [6859c79f11b0aefb7f9570c159246fa366a7b7b2]).
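
To make the two policies concrete, here is a sketch on a plain bounded list, not foolscap's actual queue code; discard-new is what the fix above uses, and discard-random stands in for RED:

{{{
#!python
import random

def add_discard_new(queue, maxlen, entry):
    # What the fix does: once the queue is full, the newest message is lost,
    # so the oldest part of the backlog is preserved intact.
    if len(queue) >= maxlen:
        return False
    queue.append(entry)
    return True

def add_discard_random(queue, maxlen, entry):
    # RED-style alternative: evict a random queued entry to make room,
    # spreading the gaps across the backlog instead of truncating its tail.
    if len(queue) >= maxlen:
        del queue[random.randrange(len(queue))]
    queue.append(entry)
    return True
}}}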
