#144 closed defect (fixed)
incident writer must survive disk-full situations
Reported by: | Brian Warner | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 0.6.0 |
Component: | logging | Version: | 0.4.1 |
Keywords: | | Cc: | |
Description
tahoe#871 is about Tahoe surviving a disk-full condition gracefully. One requirement for that is for Foolscap to handle it too. Specifically, Tahoe wants to respond to a soft-limit excursion (i.e. space-available drops below reserved_limit) by emitting a high-severity foolscap log event, which will show up as an Incident and be available for operator inspection later.
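The soft-limit check described above can be sketched roughly as follows. This is only an illustration, not Tahoe's code: the function name `check_reserved_space` and the `emit` callback are hypothetical, and `emit` stands in for whatever high-severity Foolscap log call the application would actually make.

```python
import shutil


def check_reserved_space(path, reserved_limit, emit):
    """Report a soft-limit excursion for the filesystem holding `path`.

    `reserved_limit` is in bytes. `emit` is a hypothetical callback that
    stands in for a high-severity log call. Returns True if the free space
    was below the limit and an event was emitted, otherwise False.
    """
    free = shutil.disk_usage(path).free
    if free < reserved_limit:
        emit("disk space low: %d bytes free, %d bytes reserved"
             % (free, reserved_limit))
        return True
    return False
```

Passing the reporting function in as a parameter keeps the check free of any logging dependency, so it can be exercised without a running node.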
We just need to make sure that the incident writer (`log.setLogDir()`) will survive `IOError` when it attempts to write an incident to disk. The incident should remain in memory, but the on-disk copy should be abandoned. Further logging calls (especially connections from `flogtool tail`) must continue to work normally.
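A minimal sketch of the requested behaviour, assuming a simplified writer that stores one file per incident. The class `IncidentRecorder` and its methods are hypothetical and are not Foolscap's actual incident writer; the point is only that a failed write is caught, the partial file is abandoned, and later calls keep working.

```python
import os


class IncidentRecorder:
    """Hypothetical recorder used for illustration only."""

    def __init__(self, logdir):
        self.logdir = logdir
        self.failed_writes = 0

    def record(self, name, body):
        """Try to persist one incident.

        Returns the file path on success, or None if the on-disk copy
        had to be abandoned. The caller still holds `body` in memory.
        """
        path = os.path.join(self.logdir, name)
        try:
            with open(path, "w") as f:
                f.write(body)
        except OSError:  # IOError is an alias of OSError on Python 3
            self.failed_writes += 1
            try:
                os.remove(path)  # drop any partially written file
            except OSError:
                pass  # nothing was created, or it cannot be removed
            return None
        return path
```

Counting failures instead of raising keeps the logging path alive, which is what lets subsequent log calls and subscribers continue to function after the disk fills up.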
Change History (3)
comment:1 Changed 15 years ago by
A survey of the code (both foolscap's and twisted.python.log) suggests that it should properly tolerate exceptions that occur when the logging code cannot write to disk. Exceptions during logging are logged, but the observer who caused the exception is disabled while its own exception is logged. So it should be ok.
But I'm not going to close this ticket until I've created a small+full filesystem and run a foolscap-based process in it to see what happens.
comment:2 Changed 14 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Looks good. I used OS-X's "Disk Utility" to create a 40MB disk image, created a Tahoe node inside it, started it, attached a `flogtool tail`, then used `dd if=/dev/zero of=FILLER` to fill the disk, then hit Tahoe's "report an Incident" button. The tail reports the disk-full error, a small portion of it makes it into `logs/twistd.log`. Subsequent `flogtool tail` connections still work, and the Tahoe node itself keeps working (when it doesn't need to write to disk).
Foolscap's logging doesn't actually keep Incidents in memory, so that's not an issue.
An incident-gatherer (created with `flogtool create-incident-gatherer`) does not fetch Incidents after the log-full situation, although a log-gatherer (created with `flogtool create-gatherer`) collects all the individual events correctly. The incident-delivery code pulls them off of disk, so with no disk, there are no completed incidents to publish.
It'd be nice if the incident-gatherer still did something. I'll add a new ticket for that enhancement: keep the last Incident (as a single string) in RAM, and make it available to the incident-gatherer interface.
But this ticket can be closed: applications and logging will survive disk-full situations.
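The observer handling described in the code survey (comment:1) can be illustrated with a small stand-alone model. This is a sketch of the general pattern only: the `Publisher` class below is hypothetical and is not the implementation in twisted.python.log or Foolscap.

```python
class Publisher:
    """Hypothetical log publisher used to illustrate the pattern."""

    def __init__(self):
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def msg(self, event):
        # Iterate over a snapshot so the list can change during delivery.
        for observer in list(self.observers):
            try:
                observer(event)
            except Exception as exc:
                # Disable the failing observer while its own error is
                # reported, so the report cannot re-enter it and recurse.
                self.observers.remove(observer)
                try:
                    self.msg({"error": repr(exc)})
                finally:
                    # Re-enable it afterwards (at the end of the list).
                    self.observers.append(observer)
```

For example, an observer that always raises a disk-full error does not stop a second, in-memory observer from receiving both the error report and the original event:

```python
seen = []

def memory_observer(event):
    seen.append(event)

def disk_observer(event):
    raise IOError("No space left on device")

pub = Publisher()
pub.add_observer(disk_observer)
pub.add_observer(memory_observer)
pub.msg({"text": "hello"})
```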