Opened 17 years ago
Closed 17 years ago
#61 closed task (fixed)
logging: implement Incident handling
Reported by: | Brian Warner | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 0.2.6 |
Component: | logging | Version: | 0.2.5 |
Keywords: | | Cc: |
Description
It is time to finally implement the "triggered logging" feature that is the whole purpose of foolscap logging: dump the circular event buffers when something serious happens.
My plan is centered around the idea of an "Incident". There will be an "Incident Qualifier" which watches the event stream and gets to declare when an incident has occurred: the default one will fire when events above a certain severity are witnessed, but this can be overridden. Then there will be an "Incident Reporter" class which is instantiated when the qualifier fires, and is responsible for pulling events out of the buffers and writing them to a file.
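Very roughly, the division of labor might look like the following sketch. The class and method names, the "level" key, and the threshold value are illustrative assumptions, not a committed API:

```python
# Illustrative sketch only: names, keys, and thresholds are assumptions.

class IncidentQualifier:
    """Watches the event stream and declares when an Incident has occurred."""
    TRIGGER_LEVEL = 30  # e.g. anything at WEIRD severity or above

    def event(self, event_dict):
        # 'event_dict' is one entry from the circular event buffers
        if event_dict.get("level", 0) >= self.TRIGGER_LEVEL:
            self.declare_incident(event_dict)

    def declare_incident(self, triggering_event):
        reporter = IncidentReporter(basedir="logs", trigger=triggering_event)
        reporter.start()

class IncidentReporter:
    """Created when the qualifier fires: pulls events out of the buffers and
    writes them into an incident file in the working directory."""
    def __init__(self, basedir, trigger):
        self.basedir = basedir
        self.trigger = trigger

    def start(self):
        pass  # gather buffered events and write the TIMESTAMP.incident.flog file
```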
The reporter needs access to a working directory. It will populate this with "TIMESTAMP.incident.flog" files, each of which is a pickled list of event dicts (just like the .flog files created by flogtool tail --save-to and the log-gatherer).
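Concretely, that amounts to something like this sketch (the helper name is made up, and the exact pickle layout may change once the extension-mechanism question below is settled):

```python
import os, pickle, time

def save_incident(logdir, events):
    # 'events' is the list of event dicts pulled out of the circular buffers
    name = "%d.incident.flog" % time.time()   # TIMESTAMP.incident.flog
    path = os.path.join(logdir, name)
    with open(path, "wb") as f:
        pickle.dump(events, f)                # one pickled list of event dicts
    return path
```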
Things I haven't figured out yet:
- remote access: I think that the flogport protocol (RILogPublisher) should have a way to ask about existing incidents, fetch their contents, and subscribe to hear about new ones
- marking the triggering event: each incident.flog file could have the event that triggered the incident marked specially. In general I think the flogfiles need an extension mechanism: perhaps we should declare that the first pickled object in the file is a dictionary, with contents that are currently ignored. This would be a compatibility-breaking change, but perhaps better now than later.
- A non-breaking approach would be to put a synthetic event as the first one in the file, but I'm concerned that this would confuse tools that want to use the time or number of the first event to summarize the contents of the file.
- an extension dict like this could also help with another problem: giving the remote incident publisher a way to describe the incidents to the subscriber: they might only wish to fetch recent incidents, rather than old ones that they already know about. Without some sort of metadata, the incident publisher has only the filename to go on (or it must examine the full contents of the flogfile to produce this summary).
The Incident Reporter should be able to record some number of events that occur after the trigger: this might capture the application's response to the problem. I'm thinking 100 events or 5 seconds, whichever comes first. I'd like the incident file to be compressed, and I don't want to depend upon external /usr/bin/bzip2 tools. This is complicated by the fact that we can't be sure that the app will continue running for much longer. So my plan is:
- open two files: one compressed, one uncompressed.
- sort the existing events, pickle them, dump them into both files, flush, but leave the filehandles open
- the Reporter will stick around for 100 events or 5 seconds, subscribed to hear about subsequent events. Each event is written into the files.
- when the post-trigger window closes, close both files. If the .bz2 close is successful, delete the non-compressed file.
That will give us an uncompressed file that will survive the app quitting quickly, or a compressed file if the app lasts long enough.
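A minimal sketch of that two-file trick follows. The class name and file naming are illustrative, and it assumes the events are serialized with pickle as described above:

```python
import bz2, os, pickle

class DualWriter:
    """Write the incident to an uncompressed file and a bz2-compressed file,
    keeping the plain copy only if the compressed one never closes cleanly."""
    def __init__(self, basename):
        self.plain_name = basename + ".flog"
        self.bz2_name = basename + ".flog.bz2"
        self.plain = open(self.plain_name, "wb")
        self.compressed = bz2.BZ2File(self.bz2_name, "wb")

    def write(self, data):
        self.plain.write(data)
        self.plain.flush()           # survive the app exiting abruptly
        self.compressed.write(data)

    def close(self):
        self.plain.close()
        try:
            self.compressed.close()
        except EnvironmentError:
            return                   # bz2 close failed: keep the plain copy
        os.remove(self.plain_name)   # compressed copy is complete: drop the plain one

# usage: dump the sorted buffered events up front, keep appending post-trigger
# events for ~100 events or ~5 seconds, then call close()
events = [{"num": 1, "time": 0.0, "level": 30, "message": "something weird"}]
w = DualWriter("1203459000.incident")
w.write(pickle.dumps(events))
w.close()
```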
Change History (3)
comment:1 Changed 17 years ago by
comment:2 Changed 17 years ago by
[a92e8e2d00f932d2efc3dd21f490c95968d42792] gets most of the work done. We still need to define remote access: the current code records the "incident" into a logfile inside the logdir, but there is no way to subscribe to hear about them, or to retrieve them remotely.
Those might not be necessary for the 0.2.6 release, but they'd be nice to have.
comment:3 Changed 17 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
[58b32593c3eaaadae1f4ca77b3673d61f5165929] adds the remote access pieces. It is only polling (not subscription-oriented), and it may need some cleanup in the future, but I think it's good enough to close this ticket for now.
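For reference, polling from the subscriber's side might look roughly like this; the "list_incidents" and "get_incident" method names are assumptions for illustration, not a documented API:

```python
from twisted.internet import defer

@defer.inlineCallbacks
def poll_incidents(publisher, already_seen):
    # 'publisher' is a remote reference to the log publisher reachable
    # through the flogport; the remote method names are assumed.
    names = yield publisher.callRemote("list_incidents")
    for name in names:
        if name in already_seen:
            continue
        events = yield publisher.callRemote("get_incident", name)
        already_seen.add(name)
        print("new incident %s with %d events" % (name, len(events)))
```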
We have a solution for the "marking the triggering event" issue. I just made a series of changes that add a "header dictionary" at the beginning of the pickle file, where metadata can live. This will be used to distinguish the incident report from other logfiles, and provides a place where we can stash the triggering event.
This causes a backwards compatibility break, but not a forwards one: new logfiles will probably break older tools, but newer tools will handle old logfiles just fine. This seems to be the more important compatibility direction to maintain.
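A sketch of what writing the new layout might look like, with the header dictionary prepended to the pickled event list; the exact header keys are whatever the changeset defines, so the ones below are assumptions:

```python
import pickle, time

def write_incident(f, buffered_events, trigger):
    # Assumed header contents: the real keys are defined by the changeset.
    header = {"type": "incident",
              "trigger": trigger,              # the event that fired the qualifier
              "incident_started": time.time()}
    pickle.dump(header, f)                     # header dict goes first...
    pickle.dump(buffered_events, f)            # ...then the event dicts, as before
```

An old tool that expects the event list as the first pickled object will choke on the header dict (the backwards break noted above), while a new tool can peek at the first object and fall back to treating it as events if it is not a header.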
Next step is to define the IncidentQualifier.