Ticket #105: failures.2.xhtml

File failures.2.xhtml, 21.0 KB (added by Brian Warner, 15 years ago)

2nd version of proposed API

<html xmlns="http://www.w3.org/1999/xhtml">
<title>Foolscap Failure Reporting</title>
<style src="stylesheet-unprocessed.css"></style>
<h1>Foolscap Failure Reporting</h1>
<h2>Signalling Remote Exceptions</h2>
<p>The <code>remote_</code>-prefixed methods which Foolscap invokes, just
like their local counterparts, can either return a value or raise an
exception. Foolscap callers can use the normal Twisted conventions for
handling asynchronous failures: <code>callRemote</code> returns a Deferred
object, which will eventually either fire its callback function (if the
remote method returned a normal value), or its errback function (if the
remote method raised an exception).</p>
<p>There are several reasons that the Deferred returned
by <code>callRemote</code> might fire its errback:</p>
<ul>
 <li>local outbound schema violation: the outbound method arguments did not
     match the <code>RemoteInterface</code> that is in force. This is an
     optional form of typechecking for remote calls, and is activated when
     the remote object describes itself as conforming to a named
     <code>RemoteInterface</code> which is also declared in a local class.
     The local constraints are checked before the message is transmitted over
     the wire. A constraint violation is indicated by
     raising <code>foolscap.schema.Violation</code>, which is delivered
     through the Deferred's errback.</li>
 <li>network partition: if the underlying TCP connection is lost before the
     response has been received, the Deferred will errback with
     a <code>foolscap.ipb.DeadReferenceError</code> exception. Several things
     can cause this: the remote process shutting down (intentionally or
     otherwise), a network partition or timeout, or the local process
     shutting down (<code>Tub.stopService</code> will terminate all
     outstanding remote messages before shutdown).</li>
 <li>remote inbound schema violation: as the serialized method arguments were
     unpacked by the remote process, one of them violated that process's
     inbound <code>RemoteInterface</code>. This check serves to protect each
     process from incorrect types which might either confuse the subsequent
     code or consume a lot of memory. These constraints are enforced as the
     tokens are read off the wire, and are signalled with the
     same <code>Violation</code> exception as above (but this may be wrapped
     in a <code>RemoteException</code>: see below).</li>
 <li>remote method exception: if the <code>remote_</code> method raises an
     exception, or returns a Deferred which subsequently fires its errback,
     the remote side will tell the caller that an exception occurred, and may
     attempt to provide some information about this exception. The caller
     will see an errback that may or may not attempt to replicate the remote
     exception. This may be wrapped in a <code>RemoteException</code>. See
     below for more details.</li>
 <li>remote outbound schema violation: as the remote method's return value is
     serialized and put on the wire, the values are compared against the
     return-value constraint (if a <code>RemoteInterface</code> is in
     effect). If it does not match the constraint, a Violation will be raised
     (but may be wrapped in a <code>RemoteException</code>).</li>
 <li>local inbound schema violation: when the serialized return value arrives
     on the original caller's side of the wire, the return-value constraint
     of any effective <code>RemoteInterface</code> will be applied. This
     protects the caller's response code from unexpected values. Any
     mismatches will be signalled with a Violation exception.</li>
</ul>
<h2>Distinguishing Remote Exceptions</h2>
<p>When a remote call fails, what should you do about it? There are several
factors to consider. Raising exceptions may be part of your remote API:
easy-to-use exceptions are a big part of Python's success, and Foolscap
provides the tools to use them in a remote-calling environment as well.
Exceptions which are not meant to be part of the API frequently indicate
bugs, sometimes as precondition assertions (of which schema Violations are a
subset). It might be useful to react to the specific type of remote
exception, and/or it might be important to log as much information as
possible so a programmer can find out what went wrong, and in either case it
might be appropriate to react by falling back to some alternative code
path.</p>
<p>Good debuggability frequently requires at least one side of the connection
to get lots of information about errors that indicate possible bugs. Note
that the <code>Tub.setOption("logLocalFailures", True)</code>
and <code>Tub.setOption("logRemoteFailures", True)</code> options are
relevant: when these options are enabled, exceptions that are sent over the
wire (in one direction or the other) are recorded in the Foolscap log stream.
If you use exceptions as part of your regular remote-object API, you may want
to consider disabling both options. Otherwise the logs may be cluttered with
perfectly harmless exceptions.</p>
<p>Should your code pay attention to the details of a remote exception (other
than the fact that an exception happened at all)? There are roughly two
schools of thought:</p>
<ul>
  <li>Distrust Outsiders: assume, like any sensible program which connects to
  the internet, that the entire world is out to get you. Use external
  services to the extent you can, but don't allow them to confuse you or
  trick you into some code path that will expose a vulnerability. Treat all
  remote exceptions as identical.</li>

  <li>"E" mode: treat external code with the same level of trust or distrust
  that you would apply to local code. In the "E" programming language (which
  inspires much of Foolscap's feature set), each object is a separate trust
  domain, and the only distinction made between "local" and "remote" objects
  is that the former may be called synchronously, while the latter may become
  partitioned. Treat remote exceptions just like local ones, interpreting
  their type as best you can.</li>
</ul>
<p>From Foolscap's point of view, what we care about is how to handle
exceptions raised by the remote code. When operating in the first mode,
Foolscap will merge all remote exceptions into a single exception type
named <code>foolscap.api.RemoteException</code>, which cannot be confused
with regular Python exceptions like <code>KeyError</code>
and <code>AttributeError</code>. In the second mode, Foolscap will try to
convert each remote exception into a corresponding local object, so that
error-handling code can catch e.g. <code>KeyError</code> and use it as part
of the remote API.</p>
<p>To tell Foolscap which mode you want to use,
call <code>tub.setOption("expose-remote-exception-types", BOOL)</code>, where
BOOL is either True (for the "E mode") or False (for the "Distrust Outsiders"
mode). The default is True.</p>
<p>In "Distrust Outsiders" mode, a remote exception will cause the caller's
errback handler to be called with a regular <code>Failure</code> object which
contains a <code>foolscap.api.RemoteException</code>, effectively hiding all
information about the nature of the problem except that it was caused by some
other system. Caller code can test for this with <code>f.check</code>
and <code>f.trap</code> as usual. If the caller's code decides to investigate
further, it can use <code>f.value.failure</code> to obtain
the <code>CopiedFailure</code> (see below) that arrived from the remote
system. Note that schema Violations which are caught on the local system are
reported normally, whereas Violations which are caught on the remote system
are reported as RemoteExceptions.</p>
<p>In "E mode", a remote exception will cause the errback handler to be
called with a <code>CopiedFailure</code> object.
This <code>CopiedFailure</code> will behave as much as possible like the
corresponding Failure from the remote side, given the limitations of the
serialization process (see below for details). In particular, if the remote
side raises e.g. a standard Python <code>IndexError</code>, the local side
can use <code>f.trap(IndexError)</code> to catch it. However, this same
f.trap call would also catch locally-generated IndexErrors, which could be
confusing.</p>
<h3>Examples: Distrust Outsiders</h3>
<p>Since Deferreds can be chained, it is quite common to see remote calls
sandwiched in the middle of two (possibly asynchronous) local calls. The
following snippet performs a local processing step, then asks a remote server
for information, then adds that information into a local database. All three
steps are asynchronous.</p>
<pre class="python">
# Example 1
def get_and_store_record(name):
    d = local_db.getIDNumber(name)
    d.addCallback(lambda idnum: rref.callRemote("get_record", idnum))
    # pass the fetched record along so that it is actually stored
    d.addCallback(lambda record: local_db.storeRecord(name, record))
    return d
</pre>
<p>To motivate an examination of error handling, we'll extend this example to
use two separate servers for the record: if one of them doesn't have it, we
ask the other. The first server might raise <code>KeyError</code> to tell us
it can't find the record, or it might experience some other internal error,
or we might lose the connection to that server before it can get us an
answer: all three cases should prompt us to talk to the second server.</p>
<pre class="python">
# Example 2
from foolscap.api import Tub, RemoteException
t = Tub()
t.setOption("expose-remote-exception-types", False) # Distrust Outsiders

def get_and_store_record(name):
    d = local_db.getIDNumber(name)
    def get_record(idnum):
        d2 = server1.callRemote("get_record", idnum) # could raise KeyError
        def maybe_try_server2(f):
            f.trap(RemoteException) # re-raises anything that is not remote
            return server2.callRemote("get_record", idnum) # or KeyError
        d2.addErrback(maybe_try_server2)
        return d2
    d.addCallback(get_record)
    d.addCallback(lambda record: local_db.storeRecord(name, record))
    return d
</pre>
<p>In this example, only a failure that occurs on server1 will cause the code
to attempt to use server2. A locally-triggered error will be re-raised by
the <code>f.trap</code> call on the first line
of <code>maybe_try_server2</code> and will not proceed to the
second <code>callRemote</code>. This allows a more complex control flow like
the following:</p>
<pre class="python">
# Example 3
from foolscap.api import RemoteException

def get_and_store_record(name):
    d = local_db.getIDNumber(name) # could raise IndexError

    def get_record(idnum):
        d2 = server1.callRemote("get_record", idnum) # or KeyError
        def maybe_try_server2(f):
            f.trap(RemoteException)
            return server2.callRemote("get_record", idnum) # or KeyError
        d2.addErrback(maybe_try_server2)
        return d2
    d.addCallback(get_record)

    d.addCallback(lambda record: local_db.storeRecord(name, record))

    def ignore_unknown_names(f):
        f.trap(IndexError)
        print("Couldn't get ID for name, ignoring")
        return None
    d.addErrback(ignore_unknown_names)

    def failed(f):
        print("didn't get data!")
        if f.check(RemoteException):
            if f.value.failure.check(KeyError):
                print("both servers claim to not have the record")
            else:
                print("both servers had error")
        else:
            print("local error")
        print("error details: %s" % (f,))
    d.addErrback(failed)

    return d
</pre>
<p>The final <code>failed</code> method will catch any unexpected error: this
is the place where you want to log enough information to diagnose a code bug.
For example, if the database fetch had returned a string, but the
RemoteInterface had declared <code>get_record</code> as taking an integer,
then the <code>callRemote</code> would signal a (local) Violation exception,
causing control to drop directly to the <code>failed()</code> error handler.
On the other hand, if the first server decided to throw a Violation on its
inbound argument, the <code>callRemote</code> would signal a RemoteException
(wrapping a Violation), and control would flow to
the <code>maybe_try_server2</code> fallback.</p>
<p>It is usually best to put the errback as close as possible to the call
which might fail, since this provides the highest "signal to noise ratio"
(i.e. it reduces the number of possibilities that the error-handler code must
handle). But it is frequently more convenient to place the errback later in
the Deferred chain, so it can be useful to distinguish between the
local <code>IndexError</code> and a remote exception of the same type. This
is the same decision that needs to be made with synchronous code: whether to
use lots of <code>try:/except:</code> blocks wrapped around individual method
calls, or to use one big block around a whole sequence of calls. Smaller
blocks will catch an exception sooner, but larger blocks are less effort to
write, and can be more appropriate, especially if you do not expect
exceptions to happen very often.</p>
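<p>The synchronous version of this trade-off can be sketched in plain Python.
The following is only an illustration: <code>lookup</code>,
<code>fine_grained</code> and <code>coarse</code> are hypothetical names that
do not come from Foolscap, and dictionaries stand in for the database and the
remote server.</p>

```python
def lookup(table, key):
    """Hypothetical stand-in for one step that may raise KeyError."""
    return table[key]


def fine_grained(ids, records, name):
    # One small try block per call: each handler knows which step failed.
    try:
        idnum = lookup(ids, name)
    except KeyError:
        return "unknown name"
    try:
        return lookup(records, idnum)
    except KeyError:
        return "no record"


def coarse(ids, records, name):
    # One big block: less code, but the handler cannot tell the steps apart.
    try:
        return lookup(records, lookup(ids, name))
    except KeyError:
        return "lookup failed somewhere"
```

<p>With the coarse version, a missing name and a missing record produce the
same answer, which is the synchronous counterpart of a late errback that
cannot tell a local exception from a remote one of the same type.</p>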
<p>Note that if this example had used "E mode" and the first remote server
decided (perhaps maliciously) to raise <code>IndexError</code>, then the
client could be tricked into following the same ignore-unknown-names code
path that was meant to be reserved for a local database miss.</p>
<p>To examine the type of failure more closely, the error-handling code
should access the wrapped <code>CopiedFailure</code> through the
Failure's <code>.value.failure</code> attribute (<code>f.value</code> is
the <code>RemoteException</code>). By making the following change
to <code>maybe_try_server2</code>,
the behavior is changed to only query the second server in the specific case
of a remote <code>KeyError</code>. Other remote exceptions (and all local
exceptions) will skip the second query and signal an error
to <code>failed()</code>. You might want to do this if you believe that a
remote failure like <code>AttributeError</code> is worthy of error-logging
rather than fallback behavior.</p>
<pre class="python">
# Example 4
        def maybe_try_server2(f):
            f.trap(RemoteException)
            if f.value.failure.check(KeyError):
                return server2.callRemote("get_record", idnum) # or KeyError
            return f # any other remote exception is passed down the chain
</pre>
<p>Note that you should probably not use <code>f.value.failure.trap</code>,
since if the exception type does not match, that will raise the inner
exception (i.e. the <code>KeyError</code>) instead of
the <code>RemoteException</code>, potentially confusing subsequent
error-handling code.</p>
<h3>Examples: E Mode</h3>
<p>Systems which use a lot of remote exceptions as part of their
inter-process API can reduce the size of the remote-error-handling code by
switching modes, at the expense of risking confusion between local and remote
occurrences of the same exception type. In the following example, we use "E
Mode" and look for <code>KeyError</code> to indicate a
remote <code>get_record</code> miss.</p>
<pre class="python">
# Example 5
from foolscap.api import Tub
t = Tub()
t.setOption("expose-remote-exception-types", True) # E Mode

def get_and_store_record(name):
    d = local_db.getIDNumber(name)

    def get_record(idnum):
        d2 = server1.callRemote("get_record", idnum) # or KeyError
        def maybe_try_server2(f):
            f.trap(KeyError)
            return server2.callRemote("get_record", idnum) # or KeyError
        d2.addErrback(maybe_try_server2)
        return d2
    d.addCallback(get_record)

    d.addCallback(lambda record: local_db.storeRecord(name, record))

    def ignore_unknown_names(f):
        f.trap(IndexError)
        print("Couldn't get ID for name, ignoring")
        return None
    d.addErrback(ignore_unknown_names)

    def failed(f):
        print("didn't get data!")
        if f.check(KeyError):
            # don't bother showing details
            print("both servers claim to not have the record")
        else:
            # show details by printing "f", the Failure instance
            print("other error %s" % (f,))
    d.addErrback(failed)

    return d
</pre>
<p>In this example, <code>KeyError</code> is part of the
remote <code>get_record</code> method's API: it either returns the data, or
it raises KeyError, and anything else indicates a bug. The caller explicitly
catches KeyError and responds by either falling back to the second server
(the first time) or announcing a servers-have-no-record error (if the
fallback failed too). But if something else goes wrong, the client indicates
a different error, along with the exception that triggered it, so that a
programmer can investigate.</p>
<p>The remote error-handling code is slightly simpler, relative to the
identical behavior expressed in Example 4,
since <code>maybe_try_server2</code> only needs to
use <code>f.trap(KeyError)</code>, instead of needing to unwrap
a <code>RemoteException</code> first. But when this error-handling code is at
the end of a larger block (such as the <code>f.trap(IndexError)</code>
in <code>ignore_unknown_names()</code>, or the <code>f.check(KeyError)</code>
in <code>failed()</code>), it is vulnerable to confusion:
if <code>local_db.getIDNumber</code> raised <code>KeyError</code> (instead of
the expected <code>IndexError</code>), or if the remote server
raised <code>IndexError</code> (instead of <code>KeyError</code>), then the
error-handling logic would follow the wrong path.</p>
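<p>The confusion described above can be reproduced with ordinary Python
exceptions. This is a sketch under stated assumptions, not Foolscap code:
the <code>RemoteException</code> class below is a local stand-in for the
real one, <code>call_remote</code> and <code>handle</code> are hypothetical
helpers, and the wrapped object is a plain exception rather than
a <code>CopiedFailure</code>.</p>

```python
class RemoteException(Exception):
    """Stand-in for foolscap.api.RemoteException (illustration only)."""

    def __init__(self, failure):
        Exception.__init__(self, "remote exception")
        # In Foolscap this would hold a CopiedFailure; here it is simply
        # the exception object, to keep the sketch self-contained.
        self.failure = failure


def call_remote(remote_error, expose_types):
    """Hypothetical stand-in for a remote call that always fails."""
    if expose_types:
        raise remote_error               # "E mode": the remote type is exposed
    raise RemoteException(remote_error)  # "Distrust Outsiders": always wrapped


def handle(expose_types):
    try:
        call_remote(IndexError("sent by the server"), expose_types)
    except IndexError:
        # Intended only for a local database miss.
        return "ignored as unknown name"
    except RemoteException as e:
        return "remote failure: " + type(e.failure).__name__
```

<p>Here <code>handle(True)</code> takes the branch meant for a local miss,
while <code>handle(False)</code> reports a remote failure and still lets the
caller inspect the inner type on request.</p>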
<h3>Default Mode</h3>
<p>Exception modes were introduced in Foolscap-0.4.0. Releases before that
only offered "E mode". The default in 0.4.0 is "E mode"
(expose-remote-exception-types=True), to retain compatibility with the
exception-handling code in existing applications. A future release of
Foolscap may change the default mode to expose-remote-exception-types=False,
since it seems likely that apps written in this style are less likely to be
confused by remote exceptions of unexpected types.</p>
<p>Twisted uses the <code>twisted.python.failure.Failure</code> class to
encapsulate Python exceptions in an instance which can be passed around,
tested, and examined in an asynchronous fashion. It does this by copying much
of the information out of the original exception context (including a stack
trace and the exception instance itself) into the <code>Failure</code>
instance. When an exception is raised during a Deferred callback function, it
is converted into a Failure instance and passed to the next errback handler
in the chain.</p>
<p>When <code>RemoteReference.callRemote</code> needs to transport
information about a remote exception over the wire, it uses the same
convention. However, Failure objects cannot be cleanly serialized and sent
over the wire, because they contain references to local state which cannot be
precisely replicated on a different system (stack frames and exception
classes). So, when an exception happens on the remote side of
a <code>callRemote</code> invocation, and the exception-handling mode passes
the remote exception back to the calling code somehow, that code will receive
a <code>CopiedFailure</code> instance instead.</p>
<p>In "E mode", the <code>callRemote</code>'s errback function will receive
a <code>CopiedFailure</code> in response to a remote exception, and will
receive a regular <code>Failure</code> in response to locally-generated
exceptions. In "Distrust Outsiders" mode, the errback will always receive a
regular <code>Failure</code>, but
if <code>f.check(foolscap.api.RemoteException)</code> is True, then
the <code>CopiedFailure</code> can be obtained
with <code>f.value.failure</code> and examined further.</p>
<p><code>CopiedFailure</code> is designed to behave very much like a
regular <code>Failure</code> object. The <code>check</code>
and <code>trap</code> methods work on <code>CopiedFailure</code>s just like
they do on <code>Failure</code>s.</p>
<p>However, all of the Failure's attributes must be converted into strings
for serialization. As a result, the original <code>.value</code> attribute
(which contains the exception instance, which might contain additional
information about the problem) is replaced by a stringified representation,
which tends to lose information. The frames of the original stack trace are
also replaced with a string, so they can be printed but not examined. The
exception class is also passed as a string (using
Twisted's <code>reflect.qual</code> fully-qualified-name utility),
but <code>check</code> and <code>trap</code> both compare by string name
instead of object equality, so most applications won't notice the
difference.</p>
<p>The default behavior of CopiedFailure is to include a string copy of the
stack trace, generated with <code>printTraceback()</code>, which will include
lines of source code when available. To reduce the amount of information sent
over the wire, stack trace strings larger than about 2000 bytes are truncated
in a fashion that tries to preserve the top and bottom of the stack.</p>
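<p>One way to picture this kind of truncation is the sketch below. It is an
assumption-laden illustration rather than Foolscap's actual
algorithm: <code>truncate_traceback</code> and its marker text are
hypothetical, the limit is counted in characters rather than bytes, and the
limit is assumed to be larger than the marker.</p>

```python
def truncate_traceback(text, limit=2000):
    """Hypothetical helper: keep the head and tail of an over-long traceback."""
    if len(text) > limit:
        marker = "\n-- TRACEBACK ELIDED --\n"  # made-up marker text
        keep = limit - len(marker)             # characters left for real content
        head = keep // 2
        tail = keep - head
        # Keep the first `head` and the last `tail` characters.
        return text[:head] + marker + text[len(text) - tail:]
    return text
```

<p>Keeping both ends preserves the outermost caller and the frame that
actually raised, which are usually the most useful parts when reading a
traceback.</p>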
<p>Applications which consider their lines of source code or their
exceptions' list of (filename, line number) tuples to be sensitive
information can set the "unsafeTracebacks" flag in their Tub to False; the
server will then remove stack information from the CopiedFailure objects it
sends to other systems.</p>
<pre class="python">
from foolscap.api import Tub
t = Tub()
t.unsafeTracebacks = False
</pre>
<p>When unsafeTracebacks is False, the <code>CopiedFailure</code> will only
contain the stringified exception type, value, and parent class names.</p>
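<p>Since only names survive serialization, a name-based <code>check</code>
can be pictured roughly as follows. This is a minimal sketch, not
the <code>CopiedFailure</code> implementation: <code>NameOnlyFailure</code>
is a hypothetical class, and <code>qual</code> is a local helper written in
the spirit of Twisted's <code>reflect.qual</code>.</p>

```python
def qual(cls):
    """Fully-qualified class name, in the spirit of Twisted's reflect.qual."""
    return cls.__module__ + "." + cls.__name__


class NameOnlyFailure(object):
    """Hypothetical stand-in: knows its exception classes only by name."""

    def __init__(self, exc):
        # Record the names of the exception class and all of its parents.
        self.parents = [qual(c) for c in type(exc).__mro__]
        self.value = str(exc)  # stringified, as after serialization

    def check(self, *error_types):
        # Compare by name rather than by class identity.
        for error_type in error_types:
            if qual(error_type) in self.parents:
                return error_type
        return None
```

<p>A failure built from a <code>KeyError</code> matches
both <code>KeyError</code> and its parent <code>LookupError</code>, but
not <code>IndexError</code>, even though no exception class object was ever
transmitted.</p>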