Opened 13 years ago

Closed 13 years ago

#190 closed defect (fixed)

log gatherer fails to process more incidents when processing a given incident fails (e.g. due to serialization problems)

Reported by: davidsarah
Owned by: Brian Warner
Priority: major
Milestone: 0.6.4
Component: logging
Version: 0.6.1
Keywords: incident
Cc: davidsarah

Description

LAE has its Tahoe storage servers configured with a log gatherer. Some of the incidents being sent by servers included unserializable objects (due to a bug in the Tahoe S3 backend which has been fixed). This caused the 'latest' file maintained by the log gatherer for each affected storage server to be stuck at or just before (the latter, I think) the unserializable incident, so subsequent incidents on that server were not requested, even though they had no serialization problem. IRC discussion:

(01:22:16) davidsarah: (01:20:23) zooko: There was an instance of TahoeS3Error that foolscap was asked to serialize.
(01:22:16) davidsarah: (01:20:35) zooko: And it raised a Violation exception saying "cannot serialize".
(01:22:44) davidsarah: it's the causing subsequent incidents not to be sent that is confusing
(01:22:55) zooko: Yes, being able to withhold write-access while granting read-access is frequently nice.
(01:23:05) zooko: davidsarah: I posted a link to the foolscap github...
(01:23:17) zooko: https://github.com/warner/foolscap/blob/3fd4331b67abf307aa38e898e7d1e7fd37fc0b3d/foolscap/logging/gatherer.py#L343
(01:23:26) ***davidsarah looks
(01:23:27) zooko: So, that violation exception is happening on the incident reporter side -- the tahoe-lafs storage server.
(01:23:47) zooko: But, over on the incident *gatherer* side, it attempted to fetch the incident, and got instead a message from foolscap saying something like "Error -- couldn't send you the thing you wanted"
(01:23:54) zooko: and the errback for that doesn't proceed to try the next one.
(01:23:58) zooko: I think. Am I right?
(01:24:08) ***davidsarah looks at the code
(01:25:18) davidsarah: I see, so _got_incident doesn't get called and doesn't recurse to maybe_fetch_incident
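The failure mode discussed above can be illustrated with a minimal, self-contained sketch (plain Python, not the actual foolscap code; `maybe_fetch_incident` and `_got_incident` echo the gatherer's method names, everything else here is hypothetical). A fetch loop that only continues from its success callback stalls permanently at the first incident whose serialization fails; the fix is to also continue from the error path:

```python
class IncidentObserver:
    """Toy model of the gatherer's fetch loop (not the real foolscap code)."""

    def __init__(self, remote_incidents):
        # name -> serialized data, or None to simulate a foolscap Violation
        self.remote = remote_incidents
        self.wanted = sorted(remote_incidents)   # incidents still to fetch
        self.fetched = []                        # successfully stored incidents

    def maybe_fetch_incident(self):
        while self.wanted:
            name = self.wanted.pop(0)
            try:
                # stands in for callRemote("get_incident", name)
                data = self.fetch(name)
            except ValueError:
                # Pre-fix behavior: the errback did NOT continue the loop,
                # so one bad incident stalled all later ones.  The fix is
                # to log the failure and move on to the next name.
                continue
            self._got_incident(name, data)

    def fetch(self, name):
        data = self.remote[name]
        if data is None:
            # the remote "cannot serialize" Violation surfaces here
            raise ValueError("cannot serialize %s" % name)
        return data

    def _got_incident(self, name, data):
        # in the real gatherer this saves the incident and updates 'latest'
        self.fetched.append(name)
```

With the `continue` in place, a bad incident in the middle of the list no longer blocks the ones after it.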

Attachments (1)

foolscap-190-fix.diff (2.8 KB) - added by davidsarah 13 years ago.
Fix for logging/gatherer.py


Change History (3)

Changed 13 years ago by davidsarah

Attachment: foolscap-190-fix.diff added

Fix for logging/gatherer.py

comment:1 Changed 13 years ago by davidsarah

Cc: davidsarah added
Owner: set to Brian Warner

comment:2 Changed 13 years ago by Brian Warner

Milestone: undecided → 0.6.4
Resolution: fixed
Status: new → closed

This *seems* like a good idea... just need to think through what happens in the different error cases. An unserializable Incident causes a remote error during callRemote("get_incident"), after we've popped the incident name off the wanted stack, but before we save any data or call update_latest(). With the patch, we'll proceed to fetch the other incidents from the remote side; if any subsequent ones are fetched successfully we'll call update_latest(), after which we'll never again try to get the erroring ones. If we lose the connection (say, the server is rebooted) during the fetch, the whole IncidentObserver will be stale (all calls will fail, nothing will ever get updated) until we reconnect and create a new IncidentObserver. If the unserializable incident was the last one, or if the last N incidents were all unserializable, and no valid incidents arrive before the gatherer is restarted, then we'll try to fetch them again next time (and they'll fail again, and be ignored again).

I think that's safe: we'll lose unserializable data, but never good data. The worst-case behavior is a server that emits a whole stream of bad incidents (say it's got buggy code), which takes longer and longer on each reconnect to attempt and fail. Someday, if it gets fixed and produces a good incident, the next reconnect will (if it stays up long enough to get through all the bad ones) finally be able to ignore them properly.
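The watermark behavior described above can be made concrete with a small sketch (hypothetical names; assumes, per the comment, that 'latest' is a single marker that only advances when a fetch succeeds). Trailing bad incidents leave the marker behind them and are retried on every reconnect, while a bad incident followed by a good one is skipped for good:

```python
def incidents_to_fetch(all_incidents, latest):
    """Incident names newer than the saved watermark, in order."""
    return [name for name in sorted(all_incidents) if name > latest]

def run_gatherer(all_incidents, bad, latest):
    """One connection's worth of fetching; returns the new watermark.

    Incidents in 'bad' fail serialization and are skipped (the patched
    behavior); the watermark advances past a name only once some fetch
    at or after it succeeds."""
    for name in incidents_to_fetch(all_incidents, latest):
        if name in bad:
            continue        # remote error; skip and keep going (the fix)
        latest = name       # update_latest(): only on success
    return latest
```

This shows both halves of the comment's analysis: leading bad incidents get masked by a later success, but a run of bad incidents at the end stays in the to-fetch list across reconnects.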

Landed in [3dc5bc5]
