Opened 13 years ago
Closed 13 years ago
#190 closed defect (fixed)
log gatherer fails to process more incidents when processing a given incident fails (e.g. due to serialization problems)
Reported by: | davidsarah | Owned by: | Brian Warner |
---|---|---|---|
Priority: | major | Milestone: | 0.6.4 |
Component: | logging | Version: | 0.6.1 |
Keywords: | incident | Cc: | davidsarah |
Description
LAE has its Tahoe storage servers configured with a log gatherer. Some of the incidents being sent by servers included unserializable objects (due to a bug in the Tahoe S3 backend which has been fixed). This caused the 'latest' file maintained by the log gatherer for each affected storage server to be stuck at or just before (the latter, I think) the unserializable incident, so subsequent incidents on that server were not requested, even though they had no serialization problem. IRC discussion:
(01:22:16) davidsarah: (01:20:23) zooko: There was an instance of TahoeS3Error that foolscap was asked to serialize.
(01:22:16) davidsarah: (01:20:35) zooko: And it raised a Violation exception saying "cannot serialize".
(01:22:44) davidsarah: it's the causing subsequent incidents not to be sent that is confusing
(01:22:55) zooko: Yes, being able to withhold write-access while granting read-access is frequently nice.
(01:23:05) zooko: davidsarah: I posted a link to the foolscap github...
(01:23:17) zooko: https://github.com/warner/foolscap/blob/3fd4331b67abf307aa38e898e7d1e7fd37fc0b3d/foolscap/logging/gatherer.py#L343
(01:23:26) ***davidsarah looks
(01:23:27) zooko: So, that violation exception is happening on the incident reporter side -- the tahoe-lafs storage server.
(01:23:47) zooko: But, over on the incident *gatherer* side, it attempted to fetch the incident, and got instead a message from foolscap saying something like "Error -- couldn't send you the thing you wanted"
(01:23:54) zooko: and the errback for that doesn't proceed to try the next one.
(01:23:58) zooko: I think. Am I right?
(01:24:08) ***davidsarah looks at the code
(01:25:18) davidsarah: I see, so _got_incident doesn't get called and doesn't recurse to maybe_fetch_incident
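To make the failure mode concrete, here is a minimal sketch of a fetch loop shaped like the one discussed above. It is not the actual foolscap source: the names maybe_fetch_incident, _got_incident, callRemote("get_incident"), update_latest, and IncidentObserver come from this ticket, while the constructor and save_incident are hypothetical stand-ins. Because only a callback is attached to the Deferred, a remote Violation produces an errback that nothing handles, and the loop never resumes:

```python
# Sketch of the pre-fix incident fetch loop (assumed structure, not the
# real foolscap gatherer code).

class IncidentObserver:
    def __init__(self, publisher):
        self.publisher = publisher   # remote reference to the log publisher
        self.wanted = []             # incident names still to be fetched

    def maybe_fetch_incident(self):
        if not self.wanted:
            return
        name = self.wanted.pop(0)
        d = self.publisher.callRemote("get_incident", name)
        # Only a callback is attached: if the remote side hits a Violation
        # while serializing the incident, the Deferred errbacks instead,
        # _got_incident never runs, and no further incidents are requested.
        d.addCallback(self._got_incident, name)

    def _got_incident(self, incident, name):
        self.save_incident(name, incident)   # write the incident to disk
        self.update_latest(name)             # remember the newest fetched name
        self.maybe_fetch_incident()          # recurse to fetch the next one

    def save_incident(self, name, incident):
        pass  # (elided) persist the incident data on the gatherer

    def update_latest(self, name):
        pass  # (elided) record 'name' in the per-server 'latest' file
```

Since the recursion only happens inside _got_incident, one unserializable incident leaves the 'latest' file stuck and blocks every later incident from that server, which is exactly the behavior reported above.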
Attachments (1)
Change History (3)
Changed 13 years ago by
Attachment: | foolscap-190-fix.diff added |
---|---|
comment:1 Changed 13 years ago by
Cc: | davidsarah added |
---|---|
Owner: | set to Brian Warner |
comment:2 Changed 13 years ago by
Milestone: | undecided → 0.6.4 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
This *seems* like a good idea... just need to think about what happens in the different error cases. An unserializable Incident causes a remote error during callRemote("get_incident"), after we've popped the incident name off the wanted stack but before we save any data or call update_latest(). With the patch, we'll proceed to fetch the other incidents from the remote side; if any subsequent ones are fetched successfully we'll call update_latest(), after which we'll never again try to get the erroring ones. If we lose the connection (say, the server is rebooted) during the fetch, the whole IncidentObserver will be stale (all calls will fail, nothing will ever get updated) until we reconnect and create a new IncidentObserver. If the unserializable incident was the last one, or if the last N incidents were all unserializable, and no valid incidents arrive before the gatherer is restarted, then we'll try to fetch them again next time (and they'll fail again, and be ignored again).
I think that's safe: we'll lose unserializable data, but never good data. The worst-case behavior is a server that emits a whole stream of bad incidents (say it's got buggy code), which takes longer and longer on each reconnect to attempt and fail. Some day, if it gets fixed and produces a good incident, the next reconnect will (if it stays up long enough to get through all the bad ones) finally be able to ignore them permanently.
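A sketch of what the patched loop ends up doing, continuing the hypothetical IncidentObserver sketched in the description (this is an illustration of the behavior described above, not the contents of foolscap-190-fix.diff; the _failed_incident name and the logging are assumptions):

```python
from twisted.python import log  # log.msg just to record the failure

class FixedIncidentObserver(IncidentObserver):
    """The sketched IncidentObserver with an errback added to the fetch."""

    def maybe_fetch_incident(self):
        if not self.wanted:
            return
        name = self.wanted.pop(0)
        d = self.publisher.callRemote("get_incident", name)
        d.addCallback(self._got_incident, name)
        # The fix: a remote serialization failure no longer stalls the loop.
        d.addErrback(self._failed_incident, name)

    def _failed_incident(self, failure, name):
        log.msg("could not fetch incident %s: %s" % (name, failure))
        # Deliberately do not call update_latest(name).  If a later incident
        # is fetched successfully, update_latest() advances past this one and
        # it is never retried; if no good incident arrives before a reconnect
        # or gatherer restart, it will be attempted (and fail) again.
        self.maybe_fetch_incident()
```

This matches the trade-off described above: erroring incidents are skipped rather than saved, and they are only re-attempted until some later incident succeeds and moves 'latest' past them.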
Landed in [3dc5bc5]
Fix for logging/gatherer.py