Opened 9 years ago

Closed 9 years ago

#244 closed defect (fixed)

infinite incident loop when logging non-serializable event values

Reported by: Brian Warner Owned by:
Priority: critical Milestone: 0.10.0
Component: logging Version: 0.9.1
Keywords: Cc:

Description

A recent change to Twisted caused the loglevels passed into twisted.python.log.msg() to use special NamedConstant objects instead of simple integers. When Foolscap is configured to bridge twisted logs into foolscap, it attempts to serialize these objects, which fails, which provokes another error, which fails, etc.

The consequence is that any error causes the process to go into an infinite loop and spam the logging system forever. If it is configured to write incidents to a directory, this will also eventually fill up the disk.

I think we need to translate these constants into the old-style integers in the bridge function.
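For illustration, a minimal sketch of that translation, assuming the bridge sees the new-style event dict and that twisted.logger.LogLevel is importable (Twisted >= 15.2); the mapping to stdlib logging integers and the helper name are my own, not foolscap's actual code:

import logging
from twisted.logger import LogLevel

# illustrative mapping from NamedConstant levels to plain integers; the exact
# numeric scheme is an assumption, not foolscap's
LEVEL_TO_INT = {
    LogLevel.debug: logging.DEBUG,
    LogLevel.info: logging.INFO,
    LogLevel.warn: logging.WARNING,
    LogLevel.error: logging.ERROR,
    LogLevel.critical: logging.CRITICAL,
}

def downgrade_loglevel(event):
    """Hypothetical bridge helper: swap a NamedConstant log level for an int."""
    level = event.get("log_level")
    if level in LEVEL_TO_INT:
        event = dict(event, log_level=LEVEL_TO_INT[level])
    return event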

Change History (10)

comment:1 Changed 9 years ago by Brian Warner

Version: 0.7.0 → 0.9.1

comment:2 Changed 9 years ago by Brian Warner

I reproduced this in a tahoe server with foolscap-0.9.1 and twisted-15.4.0, simply doing a "tahoe start". The server wrote out 4MB of log files in a few seconds before I managed to kill it off.

The first problem in the logs is an inability to pickle a lambda expression:

2015-10-13 18:14:23-0700 [-] [Failure instance: Traceback: <class 'pickle.PicklingError'>: Can't pickle <function <lambda> at 0x7f22387629b0>: it's not found as twisted.internet.posixbase.<lambda>
2015-10-13 18:14:23-0700 [-] /home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/logger/_observer.py:131:__call__
2015-10-13 18:14:23-0700 [-] /home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/logger/_legacy.py:93:__call__
2015-10-13 18:14:23-0700 [-] /home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/foolscap-0.9.1-py2.7.egg/foolscap/logging/log.py:397:_new_twisted_log_observer
2015-10-13 18:14:23-0700 [-] /home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/foolscap-0.9.1-py2.7.egg/foolscap/logging/log.py:225:msg
2015-10-13 18:14:23-0700 [-] --- <exception caught here> ---
2015-10-13 18:14:23-0700 [-] /home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/foolscap-0.9.1-py2.7.egg/foolscap/logging/log.py:280:add_event
2015-10-13 18:14:23-0700 [-] /home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/foolscap-0.9.1-py2.7.egg/foolscap/logging/incident.py:38:event
2015-10-13 18:14:23-0700 [-] /home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/foolscap-0.9.1-py2.7.egg/foolscap/logging/log.py:293:declare_incident
2015-10-13 18:14:23-0700 [-] /home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/foolscap-0.9.1-py2.7.egg/foolscap/logging/incident.py:112:incident_declared
2015-10-13 18:14:23-0700 [-] /usr/lib/python2.7/pickle.py:1370:dump
2015-10-13 18:14:23-0700 [-] /usr/lib/python2.7/pickle.py:224:dump
2015-10-13 18:14:23-0700 [-] /usr/lib/python2.7/pickle.py:286:save
2015-10-13 18:14:23-0700 [-] /usr/lib/python2.7/pickle.py:649:save_dict
...

followed by a pair of Unhandled Errors (probably in a Deferred, but maybe in a reactor.callLater) where we try to make v1 calls to an introducer that doesn't handle them. One of those was:

2015-10-13 18:14:23-0700 [Negotiation,client] Unhandled Error
	Traceback (most recent call last):
	  File "/home/warner/stuff/tahoe/tahoe/src/allmydata/introducer/client.py", line 142, in _got_versioned_introducer
	    self._maybe_publish()
	  File "/home/warner/stuff/tahoe/tahoe/src/allmydata/introducer/client.py", line 248, in _maybe_publish
	    d = self._handle_v1_publisher(ann_t) # for_v1
	  File "/home/warner/stuff/tahoe/tahoe/src/allmydata/introducer/client.py", line 260, in _handle_v1_publisher
	    d = self._publisher.callRemote("publish", ann_v1)
	  File "/home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/foolscap-0.9.1-py2.7.egg/foolscap/referenceable.py", line 413, in callRemote
	    return defer.maybeDeferred(self._callRemote, _name, *args, **kwargs)
	--- <exception caught here> ---
	  File "/home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 150, in maybeDeferred
	    result = f(*args, **kw)
	  File "/home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/foolscap-0.9.1-py2.7.egg/foolscap/referenceable.py", line 461, in _callRemote
	    methodSchema) = self._getMethodInfo(_name)
	  File "/home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/foolscap-0.9.1-py2.7.egg/foolscap/referenceable.py", line 578, in _getMethodInfo
	    (interfaceName, self, name))
	foolscap.tokens.Violation: Violation: ('RILogPublisher.foolscap.lothar.com(<RemoteReference at 0x7f22342936d0 [pb://kiqpg6q6zoaznovvwkxkbv6edog7hznx@10.0.1.55:34967,127.0.0.1:34967/7mgg72sujtanlqswunlgwwpyvjaztz76]>) does not offer publish',)

Then, a moment later, the log shows the foolscap.logging.incident trailing_event() function being activated, which tries and fails to pickle a weakref object:

2015-10-13 18:14:23-0700 [-] Unhandled Error
	Traceback (most recent call last):
	  File "/home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/application/app.py", line 285, in runReactorWithLogging
	    reactor.run()
	  File "/home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 1194, in run
	    self.mainLoop()
	  File "/home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 1203, in mainLoop
	    self.runUntilCurrent()
	  File "/home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/Twisted-15.4.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 825, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	--- <exception caught here> ---
	  File "/home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/foolscap-0.9.1-py2.7.egg/foolscap/eventual.py", line 26, in _turn
	    cb(*args, **kwargs)
	  File "/home/warner/stuff/tahoe/tahoe/support/lib/python2.7/site-packages/foolscap-0.9.1-py2.7.egg/foolscap/logging/incident.py", line 135, in trailing_event
	    pickle.dump(wrapper, self.f1)
	  File "/usr/lib/python2.7/pickle.py", line 1370, in dump
	    Pickler(file, protocol).dump(obj)
	  File "/usr/lib/python2.7/pickle.py", line 224, in dump
	    self.save(obj)
	  File "/usr/lib/python2.7/pickle.py", line 286, in save
	    f(self, obj) # Call unbound method with explicit self
	  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
	    self._batch_setitems(obj.iteritems())
	  File "/usr/lib/python2.7/pickle.py", line 663, in _batch_setitems
	    save(v)
	  File "/usr/lib/python2.7/pickle.py", line 286, in save
	    f(self, obj) # Call unbound method with explicit self
	  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
	    self._batch_setitems(obj.iteritems())
          ...

	  File "/usr/lib/python2.7/pickle.py", line 663, in _batch_setitems
	    save(v)
	  File "/usr/lib/python2.7/pickle.py", line 306, in save
	    rv = reduce(self.proto)
	  File "/home/warner/stuff/tahoe/tahoe/ve/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
	    raise TypeError, "can't pickle %s objects" % base.__name__
	exceptions.TypeError: can't pickle weakref objects

This pickling error repeats until the process is terminated.

Pickles are still used to create foolscap "incident files", and Tahoe configures foolscap to write incident files into a local directory, so any serious error message will trigger the writing process.

The log suggests four problems:

  • something tried to log a lambda. I notice that twisted.internet.posixbase._FDWaker.fileno is a lambda, and maybe if the _FDWaker gets logged, it'll try to log the fileno
  • we're making old introducer calls. Not a foolscap problem, but maybe tahoe's handle-older-introducer logic is missing some catch-and-ignore calls
  • something tried to log a weakref
  • something went into an infinite loop, in which each loop tried to log a weakref

It's possible that the incident-logging mechanism does not tolerate errors that occur as it runs, which could lead to the loop.

Incidentally, nothing in this points blame at twisted.logger.LogLevel.info being a NamedConstant. I think something in one of the logs I've seen recently made me think this was a problem, but it's not this particular reproduction case. The NamedConstant appeared in twisted-15.2.0.


comment:3 Changed 9 years ago by Brian Warner

Digging in more:

  • The first exception was the introducer "does not offer publish" one. This triggered an incident.
  • While trying to dump the buffered log events for this incident (while inside incident_declared()), something unserializable was encountered. This exception happens after the triggering event has been written, and after an observer has been added to collect trailing events, but *before* we set a timer to stop recording trailing events. So the Incident does not yet have a .timer attribute.
  • Some time later, another event is logged, and dispatched to all observers, including Incident.trailing_event.
    • If the event is unserializable, trailing_event raises an exception.
    • Once TRAILING_EVENT_LIMIT (100) events are delivered, we don't try to serialize any more, and call stop_recording(). However, because the .timer attribute was not set, this raises an exception, and fails to remove the observer, so subsequent events will still be delivered to trailing_event.
  • If an exception happens in trailing_event, which is dispatched with an eventual-send (which is really inside a twisted callLater()), that exception is logged by Twisted log.err as an Unhandled Error.
    • It appears that log.failure() is used, because the resulting log_level is 50, which is equal to LogLevel.critical
    • tahoe is bridging logs from twisted into foolscap, so this error gets logged by foolscap too.
    • Moreover, the LogLevel.critical (50) is above the threshold (WEIRD=30) for declaring Incidents, so a second Incident is triggered
    • Since the original unserializable event is still in the buffer, the second incident is doomed too: it will fail in the same way, triggering a third Incident, and the process will repeat
    • It might terminate at some point, if enough well-formed events are logged (into the same buffer) to push the unserializable event out. But if new instances of the unserializable one are created, the process will not terminate.

So the result is an infinite series of Incidents, none of which are ever closed, all of which are subscribed to hear about new events, each with some number of early events written to disk, but then stopping before the unserializable one.
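To make the ordering problem concrete, here is a stripped-down toy version of the reporter/publisher interaction (ToyPublisher and ToyReporter are hypothetical stand-ins, not foolscap's classes); the point is only that the observer is registered and the fallible pickling happens before .timer exists, so a later stop_recording() blows up and never removes the observer:

import pickle

class ToyPublisher(object):
    """Hypothetical event publisher (stand-in for the logger's observer list)."""
    def __init__(self):
        self.observers = []
    def add_observer(self, obs):
        self.observers.append(obs)
    def remove_observer(self, obs):
        self.observers.remove(obs)

class ToyReporter(object):
    """Hypothetical stand-in for IncidentReporter; only the ordering bug is modeled."""
    def __init__(self, publisher):
        self.publisher = publisher
        self.remaining = 2
    def incident_declared(self, buffered_events):
        self.publisher.add_observer(self.trailing_event)  # observer registered first
        for e in buffered_events:
            pickle.dumps(e)       # raises on an unpicklable event ...
        self.timer = object()     # ... so .timer is never set
    def trailing_event(self, event):
        self.remaining -= 1
        if self.remaining <= 0:
            self.stop_recording()  # AttributeError: no .timer
            return
        pickle.dumps(event)
    def stop_recording(self):
        self.timer.cancel()
        self.publisher.remove_observer(self.trailing_event)  # never reached

pub = ToyPublisher()
reporter = ToyReporter(pub)
try:
    reporter.incident_declared([{"bad": lambda: None}])  # lambdas are unpicklable
except pickle.PicklingError:
    pass  # add_event() eats the exception and carries on
# the trailing_event observer is still registered, and every future call to
# stop_recording() will fail, so it can never be removed
assert pub.observers == [reporter.trailing_event]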

comment:4 Changed 9 years ago by Brian Warner

Summary: infinite failure loop when logging twisted NamedConstant loglevels → infinite incident loop when logging non-serializable event values

I think I understand this now:

  • the root cause is actually Twisted-15.3.0, which changed twisted.internet.protocol.Factory.doStart to use some new fancy logging
  • the new fancy log call uses "Starting factory {factory!r}" and passes the Factory instance as factory=
  • the Factory is not picklable; it probably references the _FDWaker described above
  • when the WAPI server is started, a NevowSite instance starts up, and logs this unpicklable message
  • since tahoe enables twisted-to-foolscap log bridging, the unpicklable "Starting factory" message is put into the foolscap buffer, awaiting any Incident to cause an unending loop.

The original attempt to reproduce used a misconfigured tahoe client node, which was told to connect its IntroducerClient to the introducer's logport.furl rather than its introducer.furl: the logport has no version data (so the IntroducerClient fell back to trying a v1 publish() request), and has a known schema (RILogPublisher) which does not offer a publish() method, causing a local Violation. The Violation is weird enough to trigger an incident.

The loop can also be provoked by pushing the "Report An Incident" button on the WUI welcome page.

So basically we're going to have to deal with the fact that code outside our control is going to call twisted's logger with all sorts of random junk, expecting that nobody will look at it except through the format string.

I think we need to change the twisted-to-foolscap log bridge to render the message down to text, and then throw out all keys except for the ones we know about. I think (hope?) we can keep allowing arbitrary keys for foolscap-native logging, with a rule that you aren't allowed to use non-serializable values.
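A rough sketch of what that scrubbing could look like on the bridge side; the whitelist of keys and the function name are assumptions for illustration, and twlog.textFromEventDict() is used as a convenient way to render the legacy event to text:

from twisted.python import log as twlog

# keys we are willing to copy through from a twisted event; everything else
# gets rendered to text (this whitelist is an assumption, for illustration)
SAFE_KEYS = ("time", "system", "isError", "why")

def bridge_observer(event):
    """Hypothetical twisted-to-foolscap bridge observer: flatten the event
    to a plain string plus a few known-serializable keys."""
    text = twlog.textFromEventDict(event) or ""
    safe = dict((k, event[k]) for k in SAFE_KEYS if k in event)
    safe["message"] = text
    # hand the scrubbed dict to foolscap's native logger, e.g. flog.msg(**safe)
    return safe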

comment:5 Changed 9 years ago by Brian Warner

I'm also going to switch Incidents (and the saved output of flogtool tail --save-to= and flogtool filter) to use lines of JSON instead of pickles. I should have never used pickles in the first place, but I was lazy and they were easy.

The rule will be that you can put arbitrary objects in arbitrary keys in your (native) foolscap log.msg() arguments, and:

  • anything you put there will stick around (refcounted / won't be garbage-collected) for an arbitrary period of time (until the circular buffer wraps and kicks it out), so think about memory usage
  • it won't be serialized unless/until someone reads it: a flogtool tail or a saved Incident report
  • at that time, json.dumps(event) will be used to serialize the event. If there's anything which can't be JSON-serialized, the entire event will be replaced with a lame error message (like {serialization_error: "unable to serialize event data"}, with maybe a stringified form of the dictionary to give you a hint); a sketch of this fallback follows the list
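A minimal sketch of that serialize-or-stub behaviour, reusing the stub key suggested above (the helper name and the repr truncation are assumptions, not the shipped code):

import json

def serialize_event(event):
    """Try to JSON-encode a log event; fall back to an apologetic stub."""
    try:
        return json.dumps(event)
    except (TypeError, ValueError):
        stub = {"serialization_error": "unable to serialize event data",
                "repr": repr(event)[:1000]}   # truncated hint, as suggested above
        return json.dumps(stub)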

Instead of a sequence of pickle strings, the file will contain a magic header and a series of newline-terminated JSON strings. The header will probably be # foolscap flogfile v1\n, and tools will be modified to look for the header. If the header is not found, but the first few characters suggest that we've got a pickled dictionary, the tool will complain and tell the user to re-run with --allow-dangerous-pickles to enable the pickle-loading code.
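Roughly, that layout and header check might look like this (the function names are invented; serialize_event() is the fallback helper sketched above):

import json

MAGIC = "# foolscap flogfile v1\n"

def write_flogfile(f, events):
    """Write the magic header, then one JSON-encoded event per line."""
    f.write(MAGIC)
    for event in events:
        f.write(serialize_event(event) + "\n")

def read_flogfile(f, allow_dangerous_pickles=False):
    first = f.readline()
    if first == MAGIC:
        return [json.loads(line) for line in f if line.strip()]
    if not allow_dangerous_pickles:
        raise RuntimeError("no flogfile header found; if this is an old "
                           "pickle-based file, re-run with --allow-dangerous-pickles")
    raise NotImplementedError("pickle-loading fallback omitted from this sketch")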

comment:6 Changed 9 years ago by Brian Warner

Even more understanding. First the twisted pieces:

  • twistd sets up a logger which replaces sys.stdout/err with LoggingFile objects (via LogBeginner.beginLoggingTo()). Notably, stderr goes to a stream which uses LogLevel.error, which is above foolscap's Incident threshold.
  • the LoggingFile calls self.log.emit() for each (newline-separated) line of the text that was written to stdout/err
  • Logger.emit() adds log_logger=self (the Logger instance) to the event dict, so trying to pickle the event will include the Logger (which includes the globalLogPublisher, which includes any observers that have been added)
  • twisted.python.log.msg() does not appear to add anything significant to the event dictionary
  • twisted.python.log.err() doesn't either, although it *does* (maybe?) set the log level high enough to trigger an Incident

So any print statements in a daemonized program will cause several unpicklable incident-causing events to be sent into the foolscap logging system, one for each line that was printed.

Foolscap's logging code has two significant print statements:

  • log.msg(), if format_message(event) fails
  • log.add_event() (called inside log.msg), if self.active_incident_qualifier.event(event) fails

IncidentQualifier.event() calls FoolscapLogger.declare_incident() for any event that is WEIRD or worse. declare_incident() either passes the event to IncidentReporter.new_trigger() if a reporter is already active, or creates a reporter and calls IncidentReporter.incident_declared() if not.

So, I think the full sequence was:

  • Tahoe starts, configures twisted-to-foolscap bridging, a Factory is started, and the first unpickleable event is added to the buffer
  • something logs an Incident-triggering event
    • twlog.msg() calls flog.TwistedLogBridge.observer, which calls flog.msg, which calls flog.add_event, which queues the message for any observers (none so far), then calls IncidentQualifier.event(), which calls declare_incident, which calls incident_declared
    • incident_declared fails when it tries to pickle the buffered event, leaving self.trailing_event() as an observer but not setting self.timer
    • the exception is caught by the try/except in flog.add_event, a Failure is printed (to stdout), and the function eats the exception
      • for each line of that Failure, the write to stdout (since we're daemonized with twistd) causes an unpickleable event to be written to twlog, which sends it to flog, which adds it to the buffer, and queues an eventual-send to the IncidentReporter trailing_event observer
    • finally the triggering event finishes dispatching. The results so far are:
      • twlog has the triggering event
      • IncidentReporter is open and active, and wrote the triggering event and 0 or more pickleable buffered events to the open .flog file. No self.timer was set.
      • a few dozen eventual-send calls are pending for the trailing_event observer
  • on the next turn, the first of the pending eventual-sends is delivered, with an unpickleable stdout event
    • trailing_event fails when it tries to pickle the event
    • since the eventual-send is really a reactor.callLater(), twisted logs an Unhandled Error to twlog
    • this calls flog.msg with a pickleable event which is added to the buffer
    • the event would trigger an Incident if one weren't already active. Since one *is* active, it instead calls new_trigger, which ignores it.
    • results: twlog gets an Unhandled Error, new message added to buffers, IncidentReporter.remaining_events is decremented
  • eventually (after 100 events), remaining_events drops to zero, and trailing_event calls stop_recording instead of trying to pickle the event
    • stop_recording clears self.still_recording and self.active, then fails because self.timer wasn't set
      • this leaves the observer in place, and doesn't close the .flog files
    • the error is again caught by twisted's callLater, and delivers an Unhandled Error to twlog
    • this time, since there is no longer an active IncidentReporter, a new one is created, which fails immediately because of the unpickleable events in the buffer from before (either the original Factory message, or any one of the stdout messages)

I'm still not quite sure where the infinite (or merely very-large) loop appears; if the first unpickleable buffered event's Failure can be printed with fewer than 100 lines, then it seems like the process would stop.

In any case, there are a lot of things that should be improved:

  • store events with JSON, not pickle
  • the TwistedLogBridge should stringify the Twisted event dict and build a flog.msg dict with only known-serializable keys
  • serialization should use a helper function that catches errors and replaces the dictionary with an "oops sorry" stub
  • IncidentReporter.incident_declared should start the timer before trying to serialize anything
  • trailing_event and stop_recording should be cleaned up

I was able to reproduce the infinite(ish?) messages by running the following short program under twistd -y:

import sys
from twisted.application import service
from twisted.python import log as tw_log
from twisted.internet import defer, reactor
from foolscap.api import Tub
from foolscap.logging import log as flog
from foolscap.logging.levels import BAD

class S(service.Service):
    def startService(self):
        # under "twistd -y", sys.stdout has been replaced by a logging stream,
        # so this print already becomes a log event
        print "started my service", sys.stdout
        self.t = Tub()
        self.t.setOption("bridge-twisted-logs", True)  # twisted-to-foolscap bridge
        flog.setLogDir("logdir")                       # enable incident writing
        reactor.callLater(1.0, self.unloggable)
        reactor.callLater(2.0, self.boom)

    def unloggable(self):
        # the print goes to the replaced stdout and produces an event that
        # foolscap cannot pickle (the commented-out lines are other variants)
        #flog.msg("unloggable", unloggable=lambda: "nope")
        #tw_log.msg("hi from twisted log")
        print "hi from stdout"

    def boom(self):
        # a BAD-level event crosses the incident threshold and starts the loop
        flog.msg("boom", level=BAD)

application = service.Application("logtest")
S().setServiceParent(application)

(note that running it with --nodaemon doesn't cause the infinite logging, because then writing the Failure to stdout doesn't produce more log messages)

I'm not yet sure how to write a test that captures the stdout problem, since trial does not daemonize or replace sys.stdout (for good reason). I think maybe the foolscap logging tests should create a twisted.logger.LoggingFile (glued to a Logger that is bridged to a flog which is watching for Incidents), pretend it's sys.stdout, write something to it, and make sure the results are visible from the flog. Then check the incidents and make sure there aren't any. Or trigger a single incident and look to see if extra ones show up.
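One possible shape for such a test, using twisted.logger.LoggingFile directly; the list observer here is a stand-in for the real bridge-to-foolscap plumbing, so this only demonstrates the stdout-to-event step:

from twisted.trial import unittest
from twisted.logger import Logger, LoggingFile, LogLevel

class FakeStdoutTest(unittest.TestCase):
    def test_print_becomes_event(self):
        events = []
        logger = Logger(observer=events.append)
        # stand-in for the sys.stdout replacement that twistd installs
        fake_stdout = LoggingFile(logger, level=LogLevel.error)

        fake_stdout.write("hi from stdout\n")

        self.assertTrue(events, "the write should have produced a log event")
        self.assertEqual(events[0]["log_level"], LogLevel.error)
        # a real test would route these events through the twisted-to-foolscap
        # bridge instead, then assert that no extra Incidents were declared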

comment:7 Changed 9 years ago by Brian Warner

More notes about twisted's updated logging framework:

  • twisted.python.log.msg() is now legacy. The new approach is to instantiate a twisted.logger.Logger (which remembers the function/class where it was born, and uses it as a "namespace"), and call logger.info/warn/error/critical().
  • the new twisted.logger module first appeared in 15.2.0
  • twisted.python.log.addObserver() adds a legacy observer. Both new and old observers are single-argument callables that get an event dictionary, but the contents of the dictionary differ. The legacy observer always gets "message" (an empty tuple), "time", and "system" ("-"). Non-formatted events (ones with "message" instead of "format") get funky lazy-stringifying "format" and "log_legacy" keys, and failures are mapped to "failure", "isError", and "why" keys.
    • twisted.logger.ILogObserver describes the new observer interface, and the log_* event keys that might be added by the system
  • The new singleton log object is at twisted.logger.globalLogPublisher, but you don't use that to publish events: you create a Logger instead. I think it might be ok to do twisted.logger.globalLogPublisher.addObserver() with a new-style ILogObserver callable.

To maintain pre-15.2.0 compatibility, we need to stick with t.p.log.addObserver. We could try a protected import, and swap in a different observer function if it fails, but that sounds like twice the effort for not very much benefit.
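For the record, the protected-import idea being declined here would look roughly like this (both observer functions are placeholders):

def new_style_observer(event):
    """Placeholder ILogObserver-shaped callable; would hand the event to foolscap."""

def legacy_observer(event):
    """Placeholder legacy observer for pre-15.2 Twisted."""

try:
    from twisted.logger import globalLogPublisher
    globalLogPublisher.addObserver(new_style_observer)
except ImportError:
    from twisted.python import log as twlog
    twlog.addObserver(legacy_observer)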

comment:8 Changed 9 years ago by Brian Warner

I've split the pickle-vs-JSON piece out to ticket #247, since it's a bit trickier than I first thought.

comment:9 Changed 9 years ago by Brian Warner

This is affecting debian, and the foolscap package is scheduled to be removed altogether because a test case fails during the build (FTBFS): https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=795848 . I haven't seen these errors in foolscap's buildbot, so it'll take some investigation to figure out what's different about the debian/testing build environment.

In any case, I'm going to try to get a new release out this week, which means it probably won't include #247.

comment:10 Changed 9 years ago by Brian Warner

Resolution: fixed
Status: new → closed

This should be fixed in the patches I just landed (finishing with [11ede132]).
