﻿id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc
72	survive logging overloads	Brian Warner		"We isolated a problem on Tahoe to one node emitting a lot of log messages, and the log-gatherer being unable to keep up with the flow. The node was creating (and queueing) messages about six times as fast as the gatherer could accept them. The overload was piling up in RAM (probably in the Transport's write queue, waiting for the socket to accept the write() calls). In the span of three hours, the node's RSS had grown to 3GB and Python started throwing {{{MemoryErrors}}}. Eventually one of them hit the reactor, which wrote a stack trace to TWISTD-CRASH.log and tried to exit (unfortunately it looks like the threadpool failed to get shut down, so the reactor thread quit but the others did not, and the headless node kept running).

The root problem is that foolscap's logport has no flow-control mechanism (no backpressure), mostly because foolscap doesn't make it easy to implement (callRemote is mostly send-and-forget). For logging (which is defined to be less important than regular operations), the response to this backpressure should be to drop log messages (and replace them with some lower-rate ""log messages have been lost"" message, perhaps simply implemented as a gap in the message sequence numbers).

So the task is to rework the logport to impose a limit on the memory footprint. This probably means creating a size-bounded queue (above the callRemote layer), and randomly discarding log messages when the queue gets close to full. Other code should use a windowing protocol to pull messages from this queue and callRemote them to the subscriber, allowing some number to be in flight at any given time, ignoring errors. When a message is retired, it should be removed from the queue, allowing more messages to be added.
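
A minimal sketch of the queue described above, in plain Python with no foolscap dependency. Every name here (BoundedLogQueue, send, retire, max_queued, window, discard_above) is hypothetical and not part of foolscap. In a real logport, send would presumably wrap callRemote, and retire would be called from the resulting Deferred on both success and failure.

```python
import random
from collections import deque


class BoundedLogQueue:
    '''Hypothetical size-bounded log queue with a fixed send window.'''

    def __init__(self, send, max_queued=1000, window=10,
                 discard_above=0.9, rng=None):
        self.send = send                # assumed callable(seqnum, msg)
        self.max_queued = max_queued    # cap on pending + in-flight messages
        self.window = window            # max messages in flight at once
        self.discard_above = discard_above  # fill fraction where drops begin
        self.rng = rng or random.Random()
        self.pending = deque()          # (seqnum, msg) not yet sent
        self.in_flight = set()          # seqnums sent but not yet retired
        self.next_seqnum = 0
        self.dropped = 0
        self._pumping = False

    def add(self, msg):
        '''Queue a message; return False if it was discarded.'''
        seqnum = self.next_seqnum
        self.next_seqnum += 1           # always advance, so drops leave gaps
        used = len(self.pending) + len(self.in_flight)
        if used >= self.max_queued:
            self.dropped += 1
            return False
        threshold = self.discard_above * self.max_queued
        if used >= threshold:
            # Drop probability rises linearly from 0 at the threshold
            # towards 1 as the queue approaches full.
            p_drop = (used - threshold) / (self.max_queued - threshold)
            if self.rng.random() < p_drop:
                self.dropped += 1
                return False
        self.pending.append((seqnum, msg))
        self._pump()
        return True

    def retire(self, seqnum):
        '''Call when a send finishes, whether it succeeded or failed.'''
        self.in_flight.discard(seqnum)
        self._pump()

    def _pump(self):
        # Guard against re-entry in case send() retires synchronously.
        if self._pumping:
            return
        self._pumping = True
        try:
            while self.pending and len(self.in_flight) < self.window:
                seqnum, msg = self.pending.popleft()
                self.in_flight.add(seqnum)
                self.send(seqnum, msg)
        finally:
            self._pumping = False
```

Because the sequence number advances even for discarded messages, the subscriber can detect loss as a gap in the numbers it receives, without any extra message being sent.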

It might also be useful to create a ""log overload gatherer"", with a separate FURL, to which a message is sent (at most once per hour) when a log message must be dropped for lack of space.
"	defect	closed	critical	0.2.9	logging	0.2.5	fixed		
