Opened 14 years ago

Closed 13 years ago

Last modified 13 years ago

#144 closed defect (fixed)

incident writer must survive disk-full situations

Reported by: Brian Warner
Owned by:
Priority: major
Milestone: 0.6.0
Component: logging
Version: 0.4.1
Keywords:
Cc:

Description

tahoe#871 is about Tahoe surviving a disk-full condition gracefully. One requirement for that is for Foolscap to handle it too. Specifically, Tahoe wants to respond to a soft-limit excursion (i.e. space-available drops below reserved_limit) by emitting a high-severity foolscap log event, which will show up as an Incident and be available for operator inspection later.

We just need to make sure that the incident writer (log.setLogDir()) will survive IOError when it attempts to write an incident to disk. The incident should remain in memory, but the on-disk copy should be abandoned. Further logging calls (especially connections from flogtool tail) must continue to work normally.
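As an illustration of the requirement above, here is a minimal sketch of a writer that tolerates a failed disk write. It is not Foolscap's code: the function name `write_incident`, its arguments, and the "return None on failure" convention are all assumptions made for this example.

```python
import os


def write_incident(logdir, name, data):
    """Try to save an incident to disk (hypothetical helper, not Foolscap API).

    Returns the path on success, or None if the write failed, in which
    case the on-disk copy is abandoned and the caller carries on.
    """
    path = os.path.join(logdir, name)
    try:
        with open(path, "wb") as f:
            f.write(data)
    except (IOError, OSError):
        # Disk full, missing directory, etc.: give up on the on-disk copy
        # and remove any partial file, ignoring a second failure.
        try:
            os.unlink(path)
        except OSError:
            pass
        return None
    return path
```

The point of the sketch is only that the exception stops at the writer, so later logging calls are unaffected.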

Change History (3)

comment:1 Changed 14 years ago by Brian Warner

A survey of the code (both foolscap's and twisted.python.log) suggests that it should properly tolerate exceptions that occur when the logging code cannot write to disk. Exceptions during logging are logged, but the observer who caused the exception is disabled while its own exception is logged. So it should be ok.
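The behaviour described above can be sketched roughly as follows. This is a simplified stand-in written for this ticket, not the actual twisted.python.log or Foolscap code; the `Publisher` class and its method names are assumptions.

```python
class Publisher:
    """Toy log publisher (hypothetical; loosely modelled on the behaviour
    described above, not copied from Twisted or Foolscap)."""

    def __init__(self):
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def msg(self, event):
        # Iterate over a copy, since the list is modified on failure.
        for observer in list(self.observers):
            try:
                observer(event)
            except Exception as e:
                # Disable the failing observer while its own exception is
                # logged, so it cannot fail recursively on that event.
                self.observers.remove(observer)
                try:
                    self.msg({"message": "observer failed", "error": repr(e)})
                finally:
                    # Re-enable it afterwards so later events still reach it.
                    self.observers.append(observer)
```

With this shape, an observer that raises IOError on a full disk does not prevent other observers (such as a network subscriber) from receiving either the original event or the error report.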

But I'm not going to close this ticket until I've created a small+full filesystem and run a foolscap-based process in it to see what happens.

comment:2 Changed 13 years ago by Brian Warner

Resolution: fixed
Status: new → closed

Looks good. I used OS-X's "Disk Utility" to create a 40MB disk image, created a Tahoe node inside it, started it, attached a flogtool tail, then used dd if=/dev/zero of=FILLER to fill the disk, then hit Tahoe's "report an Incident" button. The tail reports the disk-full error, and a small portion of it makes it into logs/twistd.log. Subsequent flogtool tail connections still work, and the Tahoe node itself keeps working (when it doesn't need to write to disk).

Foolscap's logging doesn't actually keep Incidents in memory, so that's not an issue.

An incident-gatherer (created with flogtool create-incident-gatherer) does not fetch Incidents after the disk-full situation, although a log-gatherer (created with flogtool create-gatherer) collects all the individual events correctly. The incident-delivery code reads Incidents from disk, so when they cannot be written to disk, there are no completed incidents to publish.

It'd be nice if the incident-gatherer still did something. I'll add a new ticket for that enhancement: keep the last Incident (as a single string) in RAM, and make it available to the incident-gatherer interface.
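A rough sketch of what that enhancement might look like, assuming a hypothetical `LastIncidentHolder` class and JSON as the serialization (neither is taken from Foolscap; the real design belongs to the follow-up ticket):

```python
import json


class LastIncidentHolder:
    """Keeps only the most recent Incident in RAM as a single string.

    Hypothetical illustration; class and method names are assumptions.
    """

    def __init__(self):
        self._last = None  # (name, serialized_text) or None

    def record(self, name, events):
        # Serialize the whole incident to one string and replace any
        # earlier incident, so memory use is bounded to one Incident.
        self._last = (name, json.dumps(events))

    def get_last(self):
        # What a gatherer-facing interface could hand out when the
        # on-disk copy is unavailable.
        return self._last
```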

But this ticket can be closed: applications and logging will survive disk-full situations.

comment:3 Changed 13 years ago by Brian Warner

#168 is the one-Incident-in-RAM enhancement ticket.
