Opened 16 years ago

Last modified 15 years ago

#101 new defect

traceback from remote host doesn't come with identifying information about which remote host

Reported by: Zooko Owned by:
Priority: major Milestone: undecided
Component: error-handling Version: 0.3.0
Keywords: Cc: Zooko, zooko@…

Description

ax5avvay#1693941923 21:24:12.428: error during query: [CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/foolscap/eventual.py", line 26, in _turn
    cb(*args, **kwargs)
  File "/usr/lib/python2.5/site-packages/foolscap/broker.py", line 536, in doNextCall
    d.addCallback(lambda res: self._doCall(delivery))
  File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 191, in addCallback
    callbackKeywords=kw)
  File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 182, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 317, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/usr/lib/python2.5/site-packages/foolscap/broker.py", line 536, in <lambda>
    d.addCallback(lambda res: self._doCall(delivery))
  File "/usr/lib/python2.5/site-packages/foolscap/broker.py", line 571, in _doCall
    return obj.doRemoteCall(delivery.methodname, args, kwargs)
  File "/usr/lib/python2.5/site-packages/foolscap/referenceable.py", line 59, in doRemoteCall
    res = meth(*args, **kwargs)
  File "/usr/lib/python2.5/site-packages/allmydata/storage.py", line 1202, in remote_slot_readv
    msf = MutableShareFile(filename, self)
  File "/usr/lib/python2.5/site-packages/allmydata/storage.py", line 383, in __init__
    data = f.read(self.HEADER_SIZE)
exceptions.IOError: [Errno 5] Input/output error
] [Errno 5] Input/output error

I would really like to know which storage server is having I/O errors. Too bad this stack trace doesn't tell me. Any identifying information would work -- tubid or IP address would be fine.

Change History (2)

comment:1 Changed 16 years ago by Brian Warner

So, for background, the following sequence of events occurred:

  • node ax5avvay (which we'll call Alice) sent a callRemote to some node Bob
  • Bob raised an exception during the processing of this message, so a Failure was sent back to Alice
  • Alice had an addErrback handler, which uses foolscap.logging.log.msg to record the error, with a bit of code that looks something like:
def _query_failed(self, f, peerid):
  ...
  self.log(format="error during query: %(f_value)s",
           f_value=str(f.value), failure=f,
           level=level, umid="IHXuQg")
  • The log event was sent to a Log Gatherer, and written into a .flog file
  • Zooko ran 'flogtool dump' on that .flog file, narrowed it down to the time period of interest, and copied the result.

My usual practice is to throw enough information into the log.msg() arguments that we can figure out the context of the message later on, but obviously I failed to do that here. The sort of thing that I would add would either be the peerid that we sent the query to, or a parent= which referenced an earlier message which had that same information (and which would mention the storageindex we were querying, etc). Using parent= would mean less extra context that needed to be passed to the errback handler, but also a higher risk that the parent message might be pruned (if it were at a lower severity level). So something like this:

def _query_failed(self, f, peerid):
  ...
  self.log(format="error during query to %(peerid)s: %(f_value)s",
           peerid=idlib.nodeid_b2a(peerid), f_value=str(f.value), failure=f,
           level=level, umid="IHXuQg")

But clearly, a more general solution would be to see if we could get this information embedded in the CopiedFailure. Now, since processes can have multiple Tubs, it wouldn't be appropriate to put this in the original Failure, but the act of serializing the Failure involves a specific Tub, so that first serialization would be a good place to stuff in the tubid. Failures are serialized into bits, those bits are deserialized into CopiedFailures. The CopiedFailure can be shipped around after the fact (CopiedFailure to bits to a new CopiedFailure), so it's just the initial serialization from Failure to bits where this should be performed.

Also, we need a way to get the nodeid into the str() or repr() of the CopiedFailure. At the moment we're just subclassing Failure, and not overriding the stringification code at all, so that would need to be changed.

comment:2 Changed 15 years ago by Brian Warner

Component: loggingerror-handling
Note: See TracTickets for help on using tickets.