Opened 9 years ago
Closed 9 years ago
#244 closed defect (fixed)
infinite incident loop when logging non-serializable event values
Reported by: | Brian Warner | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | 0.10.0 |
Component: | logging | Version: | 0.9.1 |
Keywords: | Cc: |
Description
A recent change to Twisted caused the loglevels passed into twisted.python.log.msg() to use special NamedConstant
objects instead of simple integers. When Foolscap is configured to bridge twisted logs into foolscap, it attempts to serialize these objects, which fails, which provokes another error, which fails, etc.
The consequence is that any error causes the process to go into an infinite loop and spam the logging system forever. If it is configured to write incidents to a directory, this will also eventually fill up the disk.
I think we need to translate these constants into the old-style integers in the bridge function.
Change History (10)
comment:1 Changed 9 years ago by
Version: | 0.7.0 → 0.9.1 |
---|
comment:3 Changed 9 years ago by
Digging in more:
- The first exception was the introducer "does not offer publish" one. This triggered an incident.
- While trying to dump the buffered log events for this incident (while inside
incident_declared()
), something unserializable was encountered. This exception happens after the triggering event has been written, and after an observer has been added to collect trailing events, but *before* we set a timer to stop recording trailing events. So the Incident does not yet have a.timer
attribute. - Some time later, another event is logged, and dispatched to all observers, including
Incident.trailing_event
.- If the event is unserializable, trailing_event raises an exception.
- Once TRAILING_EVENT_LIMIT (100) events are delivered, we don't try to serialize any more, and call
stop_recording(). However because the
.timerattribute was not set, this raises an exception, and fails to remove the observer, so subsequent events will still be delivered to
trailing_event`.
- If an exception happens in
trailing_event
, which is dispatched with an eventual-send (which is really inside a twisted callLater()), that exception is logged by Twisted log.err as an Unhandled Error.- It appears that
log.failure()
is used, because the resulting log_level is 50, which is equal toLogLevel.critical
- tahoe is bridging logs from twisted into foolscap, so this error gets logged by foolscap too.
- Moreover, the
LogLevel.critical
(50) is above the threshold (WEIRD=30) for declaring Incidents, so a second Incident is triggered - Since the original unserializable event is still in the buffer, the second incident is doomed too: it will fail in the same way, triggering a third Incident, and the process will repeat
- It might terminate at some point, if enough well-formed events are logged (into the same buffer) to push the unserializable event out. But if new instances of the unserializable one are created, the process will not terminate.
- It appears that
So the result is a infinite series of Incidents, none of which are ever closed, all of which are subscribed to hear about new events, each with some number of early events written to disk, but then stopping before the unserializable one.
comment:4 Changed 9 years ago by
Summary: | infinite failure loop when logging twisted NamedConstant loglevels → infinite incident loop when logging non-serializable event values |
---|
I think I understand this now:
- the root cause is actually Twisted-15.3.0, which changed
twisted.internet.factory.Factory.doStart
to use some new fancy logging - the new fancy log call uses
"Starting factory {factory!r}"
and passes the Factory instance asfactory=
- the Factory is not picklable, it probably references the _FDWaker described above
- when the WAPI server is started, a NevowSite instance starts up, and logs this unpicklable message
- since tahoe enables twisted-to-foolscap log bridging, the unpicklable "Starting factory" message is put into the foolscap buffer, awaiting any Incident to cause an unending loop.
The original attempt to reproduce used a misconfigured tahoe client node, which was told to connect its IntroducerClient to the introducer's logport.furl
rather its introducer.furl
: the logport has no version data (so the introducerclient fell back to trying a v1 publish()
request), and has a known schema (RILogPublisher
) which does offer a publish()
method, causing a local Violation. The Violation is weird enough to trigger an incident.
The loop can also be provoked by pushing the "Report An Incident" button on the WUI welcome page.
So basically we're going to have to deal with the fact that code outside our control is going to call twisted's logger with all sorts of random junk, expecting that nobody will look at it except the formatting string.
I think we need to change the twisted-to-foolscap log bridge to render the message down to text, and then throw out all keys except for the ones we know about. I think (hope?) we can keep allowing arbitrary keys for foolscap-native logging, with a rule that you aren't allowed to use non-serializable values.
comment:5 Changed 9 years ago by
I'm also going to switch Incidents (and the saved output of flogtool tail --save-to=
and flogtool filter
) to use lines of JSON instead of pickles. I should have never used pickles in the first place, but I was lazy and they were easy.
The rule will be that you can put arbitrary objects in arbitrary keys in your (native) foolscap log.msg()
arguments, and:
- anything you put there will stick around (refcounted / won't be garbage-collected) for an arbitrary period of time (until the circular buffer wraps and kicks it out), so think about about memory usage
- it won't be serialized unless/until someone reads it: a
flogtool tail
or a saved Incident report - at that time,
json.dumps(event)
will be used to serialize the event. If there's anything which can't be JSON-serialized, the entire event will be replaced with a lame error message (like{serialization_error: "unable to serialize event data"}
, with maybe a stringified form of the dictionary to give you a hint)
Instead of a sequence of pickle strings, the file will contain a magic header and a series of newline-terminated JSON strings. The header will probably be # foolscap flogfile v1\n
, and tools will be modified to look for the header. If the header is not found, but the first few characters suggest that we've got a pickled dictionary, the tool will complain and tell the user to re-run with --allow-dangerous-pickles
to enable the pickle-loading code.
comment:6 Changed 9 years ago by
Even more understanding. First the twisted pieces:
twistd
sets up a logger which replacessys.stdout/err
with LoggingFile objects (viaLogBeginner.beginLoggingTo()
). Notably, stderr goes to a stream which usesLogLevel.error
, which is above foolscap's Incident threshold.- the
LoggingFile
callsself.log.emit()
for each (newline-separated) line of the text that was written to stdout/err Logger.emit()
addslog_logger=self
(the Logger instance) to the event dict, so trying to pickle the event will include the Logger (which includes theglobalLogPublisher
, which includes any observers that have been addedtwisted.python.log.msg()
does not appear to add anything significant to the event dictionarytwisted.python.log.err()
doesn't either, although it *does* (maybe?) set the log level high enough to trigger an Incident
So any print
statements in a daemonized program will cause several unpicklable incident-causing events to be sent into the foolscap logging system, one for each line that was printed.
Foolscap's logging code has two significant print statements:
log.msg()
, ifformat_message(event)
failslog.add_event()
(called insidelog.msg
), ifself.active_incident_qualifier.event(event)
fails
IncidentQualifier.event()
calls FoolscapLogger.declare_incident()
for any event that is WEIRD or worse. declare_incident()
either passes the event to IncidentReporter.new_trigger()
if a reporter is already active, or creates a reporter and calls IncidentReporter.incident_declared()
if not.
So, I think the full sequence was:
- Tahoe starts, configures twisted-to-foolcap bridging, a Factory is started, a the first unpickleable event is added to the buffer
- something logs an Incident-triggering event
- twlog.msg() calls flog.TwistedLogBridge?.observer, which calls flog.msg, which calls flog.add_event, which queues the message for any observers (none so far), then calls
IncidentQualifier.event()
, which callsdeclare_incident
, which callsincident_declared
incident_declared
fails when it tries to pickle the buffered event, leavingself.trailing_event()
as an observer but not settingself.timer
- the exception is caught by the try/except in flog.add_event, a Failure is printed (to stdout), and the function eats the exception
- for each line of that Failure, the write to stdout (since we're daemonzed with twistd) causes an unpickleable event to be written to twlog, which sends it to flog, which adds it to the buffer, and queues an eventual-send to the IncidentReporter
trailing_event
observer
- for each line of that Failure, the write to stdout (since we're daemonzed with twistd) causes an unpickleable event to be written to twlog, which sends it to flog, which adds it to the buffer, and queues an eventual-send to the IncidentReporter
- finally the triggering event finishes dispatching. The results so far are:
- twlog has the triggering event
- IncidentReporter is open and active, and wrote the triggering event and 0 or more pickleable buffered events to the open .flog file. No
self.timer
was set. - a few dozen eventual-send calls are pending for the
trailing_incident
observer
- twlog.msg() calls flog.TwistedLogBridge?.observer, which calls flog.msg, which calls flog.add_event, which queues the message for any observers (none so far), then calls
- on the next turn, the first of the pending eventual-sends is delivered, with an unpickleable stdout event
trailing_event
fails when it tries to pickle the event- since the eventual-send is really a reactor.callLater(), twisted logs an Unhandled Error to twlog
- this calls flog.msg with a pickleable event which is added to the buffer
- the event would trigger an Incident if one weren't already active. Since one *is* active, it instead calls new_trigger, which ignores it.
- results: twlog gets an Unhandled Error, new message added to buffers,
IncidentReporter.remaining_events
is decremented
- eventually (after 100 events),
remaining_events
drops to zero, andtrailing_event
callsstop_recording
instead of trying to pickle the eventstop_recording
clearsself.still_recording
andself.active
, then fails becauseself.timer
wasn't set- this leaves the observer in place, and doesn't close the .flog files
- the error is again caught by twisted's callLater, and delivers an Unhandled Error to twlog
- this time, since there is no longer an active IncidentReporter, a new one is created, which fails immediately because of the unpickleable events in the buffer from before (either the original Factory message, or any one of the stdout messages)
I'm still not quite sure where the infinite (or merely very-large) loop appears; if the first-unpickleable buffered event's Failure can be printed with fewer than 100 lines, then it seems like the process would stop.
In any case, there are a lot of things that should be improved:
- store events with JSON, not pickle
- the TwistedLogBridge should stringify the Twisted event dict and build a flog.msg dict with only known-serializable keys
- serialization should use a helper function that catches errors and replaces the dictionary with a "oops sorry" stub
IncidentReporter.incident_declared
should start the timer before trying to serialize anything- trailing_event and stop_recording should be cleaned up
I was able to reproduce the infinite(ish?) messages by running the following short program under twistd -y
:
import sys from twisted.application import service from twisted.python import log as tw_log from twisted.internet import defer, reactor from foolscap.api import Tub from foolscap.logging import log as flog from foolscap.logging.levels import BAD class S(service.Service): def startService(self): print "started my service", sys.stdout self.t = Tub() self.t.setOption("bridge-twisted-logs", True) flog.setLogDir("logdir") reactor.callLater(1.0, self.unloggable) reactor.callLater(2.0, self.boom) def unloggable(self): #flog.msg("unloggable", unloggable=lambda: "nope") #tw_log.msg("hi from twisted log") print "hi from stdout" def boom(self): flog.msg("boom", level=BAD) application = service.Application("logtest") S().setServiceParent(application)
(note that running it --nodaemon
doesn't cause the infinite logging, because then writing the Failure to stdout doesn't produce more log messages)
I'm not yet sure how to write a test that captures the stdout problem, since trial does not daemonize or replace sys.stdout
(for good reason). I think maybe the foolscap logging tests should create a twisted.logger.LoggingFile
(glued to a Logger that is bridged to a flog which is watching for Incidents), pretend it's sys.stdout, write something to it, and make sure the results are visible from the flog. Then check the incidents and make sure there aren't any. Or trigger a single incident and look to see if extra ones show up.
comment:7 Changed 9 years ago by
More notes about twisted's updated logging framework:
twisted.python.log.msg()
is now legacy. The new approach is to instantiate atwisted.logger.Logger
(which remembers the function/class where it was born, and uses it as a "namespace"), and calllogger.info/warn/error/critical()
.- the new twisted.logger module first appeared in 15.2.0
twisted.python.log.addObserver()
adds a legacy observer. Both new and old observers are single-argument callable that get an event dictionary, but the contents of the dictionary differ. The legacy observer always gets "message" (an empty tuple), "time", and "system" ("-"). Non-formatted events (ones with "message" instead of "format") get funky lazy-stringifying "format" and "log_legacy" keys, and failures are mapped to "failure", "isError", and "why" keys.- twisted.logger.ILogObserver describes the new observer interface, and the
log_*
event keys that might be added by the system
- twisted.logger.ILogObserver describes the new observer interface, and the
- The new singleton log object is at
twisted.logger.globalLogPublisher
, but you don't use that to publish events: you create a Logger instead. I think it might be ok to dotwisted.logger.globalLogPublisher.addObserver()
with a new-styleILogObserver
callable.
To maintain pre-15.2.0 compatibility, we need to stick with t.p.log.addObserver
. We could try a protected import, and swap in a different observer function if it fails, but that sounds like twice the effort for not very much benefit.
comment:8 Changed 9 years ago by
I've split the pickle-vs-JSON piece out to ticket #247, since it's a bit trickier than I first thought.
comment:9 Changed 9 years ago by
This is affecting debian, and the foolscap package is scheduled to be removed altogether because a test case fails during the build (FTBFS): https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=795848 . I haven't seen these errors in foolscap's buildbot, so it'll take some investigation to figure out what's different about the debian/testing build environment.
In any case, I'm going to try to get a new release out this week, which means it probably won't include #247.
comment:10 Changed 9 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
This should be fixed in the patches I just landed (finishing with [11ede132]).
I reproduced this in a tahoe server with foolscap-0.9.1 and twisted-15.4.0, simply doing a "tahoe start". The server wrote out 4MB of log files in a few seconds before I managed to kill it off.
The first problem in the logs is an inability to pickle a lambda expression:
followed by a pair of Unhandled Error (probably in a Deferred, but maybe in a reactor.callLater) where we try to make v1 calls to an introducer that doesn't handle them. One of those was:
Then, a moment later, the log shows the foolscap.logging.incident trailing_event() function being activated, which tries and fails to pickle a weakref object:
This pickling error repeats until the process is terminated.
Pickles are still used to create foolscap "incident files", and Tahoe configures foolscap to write incident files into a local directory, so any serious error message will trigger the writing process.
The log suggests four problems:
It's possible that the incident-logging mechanism does not tolerate errors that occur as it runs, which could lead to the loop.
Incidentally, nothing in this points blame at
twisted.logger.LogLevel.info
being aNamedConstant
. I think something in one of the logs I've seen recently made me think this was a problem, but it's not this particular reproduction case. Twisted-15.2.0 was the first version that started using NamedConstant.