Opened 17 years ago

Closed 17 years ago

#61 closed task (fixed)

logging: implement Incident handling

Reported by: Brian Warner
Owned by:
Priority: major
Milestone: 0.2.6
Component: logging
Version: 0.2.5
Keywords:
Cc:

Description

It is time to finally implement the "triggered logging" feature that is the whole purpose of foolscap logging: dump the circular event buffers when something serious happens.

My plan is centered around the idea of an "Incident". There will be an "Incident Qualifier" which watches the event stream and gets to declare when an incident has occurred: the default one will fire when events above a certain severity are witnessed, but this can be overridden. Then there will be an "Incident Reporter" class which is instantiated when the qualifier fires, and is responsible for pulling events out of the buffers and writing them to a file.
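As a rough illustration of the qualifier idea (not the actual foolscap code), a severity-based qualifier might look like the sketch below. The class name `SeverityQualifier`, the `check_event`/`event` method names, the `level` key on the event dict, and the threshold value of 30 are all assumptions made for this example.

```python
class SeverityQualifier:
    """Hypothetical qualifier: declares an incident at or above a severity threshold."""

    def __init__(self, threshold=30, on_incident=None):
        # threshold=30 is an assumed default, not a value taken from the ticket.
        self.threshold = threshold
        # on_incident is a callable that would create the Incident Reporter.
        self.on_incident = on_incident

    def check_event(self, event):
        # Events are assumed to be dicts with an optional numeric "level" key.
        # Override this method to change what counts as an incident.
        return event.get("level", 0) >= self.threshold

    def event(self, event):
        # Called for every event in the stream; returns True if it fired.
        if self.check_event(event) and self.on_incident is not None:
            self.on_incident(event)
            return True
        return False
```

Overriding the default would then amount to subclassing and replacing `check_event`, for example to fire only on events from a particular facility.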

The reporter needs access to a working directory. It will populate this with "TIMESTAMP.incident.flog" files, each of which is a pickled list of event dicts (just like the .flog files created by flogtool tail --save-to and the log-gatherer).

Things I haven't figured out yet:

  • remote access: I think that the flogport protocol (RILogPublisher) should have a way to ask about existing incidents, fetch their contents, and subscribe to hear about new ones
  • marking the triggering event: Each incident.flog file could have the event that triggered the incident marked specially. In general I think the flogfiles need an extension mechanism: perhaps we should declare that the first pickled object in the file is a dictionary, with contents that are currently ignored. This would be a compatibility-breaking change, but perhaps better now than later.
    • A non-breaking approach would be to put a synthetic event as the first one in the file, but I'm concerned that this would confuse tools that want to use the time or number of the first event to summarize the contents of the file.
    • an extension dict like this could also help with another problem: giving the remote incident publisher a way to describe the incidents to the subscriber: they might only wish to fetch recent incidents, rather than old ones that they already know about. Without some sort of metadata, the incident publisher has only the filename to go on (or it must examine the full contents of the flogfile to produce this summary).
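To make the "only the filename to go on" case concrete, here is a minimal sketch of a publisher-side helper that lists incidents newer than one the subscriber already knows about. It assumes the TIMESTAMP part of the filename sorts lexicographically in time order and that compressed reports carry a `.bz2` suffix; `list_incidents`, `incident_name`, and the `since` parameter are hypothetical names, not part of the flogport protocol.

```python
import os

# Assumed filename suffixes for uncompressed and compressed incident reports.
SUFFIXES = (".incident.flog", ".incident.flog.bz2")

def incident_name(filename):
    """Return the TIMESTAMP part of an incident filename, or None if it is not one."""
    for suffix in SUFFIXES:
        if filename.endswith(suffix):
            return filename[:-len(suffix)]
    return None

def list_incidents(logdir, since=""):
    """List incident names in logdir that sort strictly after `since`."""
    names = set()
    for filename in os.listdir(logdir):
        name = incident_name(filename)
        if name is not None and name > since:
            names.add(name)
    return sorted(names)
```

A subscriber would remember the last name it fetched and pass it as `since` on the next request. Anything richer than that (severity, triggering event) would need metadata stored in the file itself.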

The Incident Reporter should be able to record some number of events that occur after the trigger: this might capture the application's response to the problem. I'm thinking 100 events or 5 seconds, whichever comes first. I'd like the incident file to be compressed, and I don't want to depend upon external /usr/bin/bzip2 tools. This is complicated by the fact that we can't be sure that the app will continue running for much longer. So my plan is:

  • open two files: one compressed, one uncompressed.
  • sort the existing events, pickle them, dump them into both files, flush, but leave the filehandles open
  • the Reporter will stick around for 100 events or 5 seconds, subscribed to hear about subsequent events. Each event is written into the files.
  • when the post-trigger window closes, close both files. If the .bz2 close is successful, delete the non-compressed file.

That will give us an uncompressed file that will survive the app quitting quickly, or a compressed file if the app lasts long enough.
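The two-file plan can be sketched with the standard-library `bz2` module, which avoids any dependency on an external bzip2 binary. This is only an illustration under stated assumptions: the class name `DualFileReporter`, the `num` key used for sorting, the timestamp format, and writing each event as its own pickle are choices made for the example. The 5-second limit is checked only when an event arrives here; a real reporter would also call `finish()` from a timer.

```python
import bz2
import os
import pickle
import time

class DualFileReporter:
    """Hypothetical reporter that writes every event to a plain and a .bz2 file."""

    def __init__(self, basedir, buffered_events, max_events=100,
                 max_seconds=5.0, clock=time.time):
        stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(clock()))
        self.plain_path = os.path.join(basedir, stamp + ".incident.flog")
        self.bz2_path = self.plain_path + ".bz2"
        self.plain = open(self.plain_path, "wb")
        self.compressed = bz2.BZ2File(self.bz2_path, "wb")
        self.clock = clock
        self.deadline = clock() + max_seconds
        self.remaining = max_events
        self.closed = False
        # Dump the already-buffered events in order (assumed "num" key).
        for ev in sorted(buffered_events, key=lambda e: e["num"]):
            self._write(ev)
        # Only the plain file is useful before close, so that is what we flush.
        self.plain.flush()

    def _write(self, ev):
        data = pickle.dumps(ev)
        self.plain.write(data)
        self.compressed.write(data)

    def trailing_event(self, ev):
        """Record one post-trigger event; close the window when a limit is hit."""
        if self.closed:
            return
        self._write(ev)
        self.plain.flush()
        self.remaining -= 1
        if self.remaining <= 0 or self.clock() >= self.deadline:
            self.finish()

    def finish(self):
        """Close both files; drop the plain copy only if the .bz2 closed cleanly."""
        if self.closed:
            return
        self.closed = True
        self.plain.close()
        try:
            self.compressed.close()
        except OSError:
            return  # keep the uncompressed file as the surviving report
        os.remove(self.plain_path)
```

If the process dies before `finish()` runs, the flushed plain file is what remains on disk, and the partial `.bz2` file should be treated as unusable.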

Change History (3)

comment:1 Changed 17 years ago by Brian Warner

We have a solution for the "marking the triggering event" issue. I just made a series of changes that add a "header dictionary" at the beginning of the pickle file, where metadata can live. This will be used to distinguish the incident report from other logfiles, and provides a place where we can stash the triggering event.

This causes a backwards compatibility break, but not a forwards one: new logfiles will probably break older tools, but newer tools will handle old logfiles just fine. This seems to be the important compatibility direction to maintain.
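A minimal sketch of that compatibility direction follows, assuming the file is a sequence of pickled objects and that the header is recognized by a top-level `"header"` key. The key name and the `write_flog`/`read_flog` helpers are illustrative assumptions, not the actual on-disk format. The reader accepts files with or without a header; an older reader that treats every object as an event would be confused by the new first object.

```python
import io
import pickle

def write_flog(f, events, header=None):
    """Write an optional header dict, then one pickle per event."""
    if header is not None:
        pickle.dump({"header": header}, f)
    for ev in events:
        pickle.dump(ev, f)

def read_flog(f):
    """Return (header, events); header is None for old-style files."""
    header = None
    events = []
    first = True
    while True:
        try:
            obj = pickle.load(f)
        except EOFError:
            break
        # Only the first object may be a header (assumed "header" key).
        if first and isinstance(obj, dict) and "header" in obj:
            header = obj["header"]
        else:
            events.append(obj)
        first = False
    return header, events

# New-style file: the header carries the report type and the triggering event.
buf = io.BytesIO()
write_flog(buf, [{"num": 1}], header={"type": "incident", "trigger": {"num": 1}})
buf.seek(0)
header, events = read_flog(buf)
```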

Next step is to define the IncidentQualifier.

comment:2 Changed 17 years ago by Brian Warner

[a92e8e2d00f932d2efc3dd21f490c95968d42792] gets most of the work done. We still need to define remote access: the current code records the "incident" into a logfile inside the logdir, but there is no way to subscribe to hear about them, or to retrieve them remotely.

Those might not be necessary for the 0.2.6 release, but they'd be nice to have.

comment:3 Changed 17 years ago by Brian Warner

Resolution: fixed
Status: new → closed

[58b32593c3eaaadae1f4ca77b3673d61f5165929] adds the remote access pieces. It is only polling (not subscription-oriented), and it may need some cleanup in the future, but I think it's good enough to close this ticket for now.
