One of my cron tasks got itself stuck.
The underlying cause was actually my own fault: I accidentally removed the Docker container I was using as an InfluxDB output, and forgot to create a retention policy when I brought the new one up.
```
2023-02-08 23:15:39.056887: Failed to write back to: websites/analytics, exception: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Request-Id': '80150e03-a806-11ed-802a-0242ac110002', 'X-Influxdb-Build': 'OSS', 'X-Influxdb-Error': 'retention policy not found: analytics', 'X-Influxdb-Version': '1.8.10', 'X-Request-Id': '80150e03-a806-11ed-802a-0242ac110002', 'Date': 'Wed, 08 Feb 2023 23:15:39 GMT', 'Content-Length': '50'})
HTTP response body: {"error":"retention policy not found: analytics"}
```
But, shit goes down sometimes; the downsampler needs to be able to handle it.
I'm guessing the issue has prevented `close()` from completing on the output.
Activity
09-Feb-23 00:00
assigned to @btasker
09-Feb-23 00:02
I've created the retention policy now; I want to see whether the container ever actually manages to complete (I assume not).
09-Feb-23 00:09
If I `strace` it, I just get:

I'm guessing that that's `close()` backing off on checking whether the queue has been flushed yet. We can be reasonably sure it's not a write retry backoff, because the loglines we have are only printed when all tries have been exhausted.
09-Feb-23 00:13
Yeah, so looking at the code, `close()` is defined here, but simply calls `__del__` (defined here). That function is:

I'd be willing to bet we're stuck in that `while` loop.
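As a rough, self-contained illustration of that pattern (hypothetical code, not the client's): `close()` polls a flag that only a background worker flips, so if that never happens, the poll never ends.

```python
# Hypothetical illustration of the "wait for the queue to be flushed" loop
# described above (not the influxdb-client code itself): close() polls a
# disposed/flushed flag that only the background worker ever sets.
import threading
import time


class FakeBatchingWriter:
    def __init__(self):
        self.is_disposed = False
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Pretend to flush the batch queue; if this thread never reached the
        # last line, is_disposed would never flip and close() would spin.
        time.sleep(0.5)
        self.is_disposed = True

    def close(self):
        # The sort of back-off loop we're probably stuck in
        while not self.is_disposed:
            time.sleep(0.1)


FakeBatchingWriter().close()  # returns after ~0.5s in the happy path
```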
09-Feb-23 08:48
No movement overnight
09-Feb-23 09:04
OK, quick notes
The attribute `_disposable` is populated when in batching mode (here) and is a `reactivex.subject` (details on that here, but basically, an observable).

Going back to the InfluxDB client then, the call chain is:

Which calls this function:

Then:

Calling this function; this is where `is_disposed` should get flipped.

`self.lock` is a threading RLock, so perhaps we're stuck waiting for that lock?
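For reference, a minimal `reactivex` snippet (not the InfluxDB client's code) showing the `is_disposed`/lock behaviour in question:

```python
# Minimal reactivex example (not the InfluxDB client's code): a Disposable's
# is_disposed flag only flips once dispose() actually runs, and dispose()
# takes the object's internal RLock before flipping it.
from reactivex.disposable import Disposable

d = Disposable(lambda: print("cleanup ran"))
print(d.is_disposed)  # False until something disposes it
d.dispose()           # acquires d.lock, flips the flag, runs the action once
print(d.is_disposed)  # True
```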
09-Feb-23 09:10
I'm wondering if the implementation of #16 might somehow have contributed to this.
With the configuration that was in use for the broken run, there were two client objects created to point to the same InfluxDB endpoint: one for writes, and another for stats.
I wonder if that might have led to some non-thread-safe behaviour somewhere.
It'd make more sense than the client simply locking up when write failures happen; issue trackers would be full of complaints if that were the case.

If that's the case, then we should be able to repro this.

If the theory is correct, this should lock up. If we then turn stats off and re-run, it should still fail to write, but exit cleanly. Then, as a final check, we should create the output bucket and re-enable stats, just to prove it all works.
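For illustration, the client topology under suspicion is roughly this (a hypothetical sketch, not the downsampler's actual config or code; URL, token and org are placeholders):

```python
# Hypothetical sketch of the suspect topology: two independent batching
# clients pointing at the same InfluxDB endpoint, one for the downsampled
# writes and one for stats.
from influxdb_client import InfluxDBClient, WriteOptions

data_client = InfluxDBClient(url="http://127.0.0.1:8086", token="", org="")
data_writer = data_client.write_api(write_options=WriteOptions(batch_size=500))

stats_client = InfluxDBClient(url="http://127.0.0.1:8086", token="", org="")
stats_writer = stats_client.write_api(write_options=WriteOptions(batch_size=500))
```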
09-Feb-23 09:19
OK, repro setup.
config.yaml
InfluxDB setup
Running a downsample
We get the failure back nice and quickly
Worth noting, the status code is different this time. In this test, the database is missing, whereas the thing that started this was a missing RP. It shouldn't make a difference, but if this doesn't repro we'll need to correct that.
We've ended up in the same state. Repro successful
09-Feb-23 09:21
OK, test 2, same test, different config
config.yaml
Running the sample
The script exited correctly.
09-Feb-23 10:43
Third test - stats are being sent to a different URL (but the same underlying instance, it's just an aliased IP on the box)
Oooo, interesting... it locks up:
That's unexpected.
I'll have to do a run writing stats out to an entirely different instance to prove whether it's the existence of the stats object or something else
09-Feb-23 10:50
Sending the stats write to the source instance
Still hangs.
So, somehow it's the existence of the stats writer which is the issue.
09-Feb-23 10:56
I wonder if it's a timing issue?
If we step through the command output and look at where in the call chain we are:

We print the logline before calling `close()` and before closing the stats handler:

So we know that, at the time of the lockup, `close()` has only been called on output `outx`.
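Putting that ordering into a rough sketch (hypothetical names, inferred from the log lines rather than taken from the actual code):

```python
# Hypothetical sketch of the close ordering implied by the logs: each output's
# queue is flushed/closed first, and the stats queue is only flushed afterwards.
class FakeWriteApi:
    def close(self):
        print("  write_api closed")


def close_everything(outputs, stats_writer):
    for name, write_api in outputs.items():
        print(f"Flushing queue for {name}")
        write_api.close()   # blocks until that output's batch queue drains

    print("Flushing stats queue")
    stats_writer.close()


close_everything({"outx": FakeWriteApi()}, FakeWriteApi())
```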
We can add certainty to the second point though by disabling the failure stats write and seeing whether we can still repro
09-Feb-23 11:08
Updated to
Running (same config as above)
There we have it.
So to summarise: the lockup needs a failed write plus the failure-stat write to be enabled (that stat is written from within the output's `error_callback`).

It's obviously racey - the output won't close because we haven't finished handling its error, but we can't finish handling the error, because that involves writing a stat into another handler, and that's not possible whilst this one's in `close()`.
I assume it's not actually specific to the stats output, and that it's actually not possible to write into any output whilst another is being closed. There isn't an easy way to test that in this codebase though.
We should be able to prevent the issue by setting a flag within the downsample object to note that we're in a closing state, and then checking the truthiness of that flag in `recordFailure()`.
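Roughly what that guard could look like (a hypothetical sketch: only `in_closing` and `recordFailure` come from the notes above, the rest is illustrative stubbing):

```python
# Hypothetical sketch of the proposed guard. Only in_closing and recordFailure
# are taken from the notes above; the other names are illustrative stubs.
class DownsampleSketch:
    def __init__(self):
        self.in_closing = False

    def output_line(self, line, is_error=False):
        print(line)

    def write_stat(self, fields, tags):
        print(f"stat written: {fields} {tags}")

    def recordFailure(self, conf, exception):
        self.output_line(f"Failed to write back to: {conf[0]}, exception: {exception}", True)
        if self.in_closing:
            # We're mid-shutdown: pushing a stat into another handler here is
            # exactly the write that can deadlock against close(), so skip it.
            return
        self.write_stat({"failures": 1, "message": str(exception)}, {"bucket": conf[0]})

    def close(self):
        self.in_closing = True
        # ...flush and close each InfluxDB writer here...


ds = DownsampleSketch()
ds.close()
ds.recordFailure(("websites/analytics",), Exception("retention policy not found: analytics"))
```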
09-Feb-23 11:11
Implementing that change:

```diff
diff --git a/app/downsample.py b/app/downsample.py
index 0cefbb1..6e15d80 100755
--- a/app/downsample.py
+++ b/app/downsample.py
@@ -52,6 +52,8 @@ class Downsample:
         }
         self.dry_run = dry_run
         self.quiet = False
+        # Added for utilities/python_influxdb_downsample#17
+        self.in_closing = False
@@ -575,15 +577,17 @@ class Downsample:
         Now, it just logs an error
         '''
         self.output_line(f"Failed to write back to: {conf[0]}, exception: {exception}", True)
-        tags = {
-            "bucket" : conf[0]
-        }
-        fields = {
-            "failures" : 1,
-            "message" : exception
-        }
     def write_stat(self, fields, tags):
@@ -621,6 +625,9 @@ class Downsample:
         '''
             Iterate over the InfluxDB writers and ensure their buffers are flushed
         '''
```

```
ben@optimus:~/tmp/ds17$ docker run --name="ds17_test" --rm -h $HOSTNAME -v $PWD/config.yaml:/config.yaml bentasker12/downsample:0.2-rc1
2023-02-09 11:09:37.651988: Starting Job: Downsample Website Response Times
2023-02-09 11:09:37.652905: Running Query: from(bucket: "websites/autogen" )\n|> range(start: 2023-02-09T10:09:37.650843Z, stop: 2023-02-09T11:09:37.650843Z)\n|> filter(fn:(r) => r._measurement == "bunnycdn" or r._measurement == "website_response_times" )\n|> window(every: 15m)
2023-02-09 11:09:37.763448: Processing Results
2023-02-09 11:09:37.764622: Inserting 212 new rows
2023-02-09 11:09:37.770243: Job Complete: Downsample Website Response Times
2023-02-09 11:09:37.770272: Flushing queue for outx
2023-02-09 11:09:37.783855: Failed to write back to: websites/analytics, exception: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Request-Id': '3df50489-a86a-11ed-800a-0242ac110005', 'X-Influxdb-Build': 'OSS', 'X-Influxdb-Error': 'database not found: "websites"', 'X-Influxdb-Version': '1.8.10', 'X-Request-Id': '3df50489-a86a-11ed-800a-0242ac110005', 'Date': 'Thu, 09 Feb 2023 11:09:37 GMT', 'Content-Length': '45'})
HTTP response body: {"error":"database not found: \"websites\""}

2023-02-09 11:09:37.880352: Flushing stats queue
```
Awesome
09-Feb-23 11:12
mentioned in commit c9456395ebe07709b3c88445d1a65afd5779a8d7
Message
fix: Prevent failure stats being written when outputs are closing (utilities/python_influxdb_downsample#17)
This addresses a race condition which can lead to process lockup
09-Feb-23 18:37
I think I've now got to the bottom of the behaviour.
It's not actually caused by there being two client instances active, that was me misleading myself.
The crucial bit is that the `error_callback` needs to result in an exception being thrown. That exception will never be seen by the user, but will result in the writer thread dying in the background; as a result, this while loop will continue indefinitely.
I created a simple repro of the behaviour: repro.py
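Something along these lines (a hypothetical sketch rather than the actual repro.py; it assumes a local InfluxDB-compatible endpoint on 127.0.0.1:8086 and a bucket that doesn't exist):

```python
# Hypothetical sketch of the failure mode (not the actual repro.py): a batched
# write fails, the error_callback itself raises, the background writer thread
# dies, and close() can then wait forever for the batch to be marked as done.
from influxdb_client import InfluxDBClient, Point, WriteOptions


def error_callback(conf, data, exception):
    # The write has already failed at this point; raising here is what kills
    # the writer thread in the background.
    raise RuntimeError(f"failure handling for {conf} blew up") from exception


with InfluxDBClient(url="http://127.0.0.1:8086", token="", org="") as client:
    write_api = client.write_api(
        write_options=WriteOptions(batch_size=1, flush_interval=1_000),
        error_callback=error_callback,
    )
    # Write into a bucket that doesn't exist so the batched write fails
    write_api.write(bucket="does_not_exist", record=Point("m").field("value", 1))
    write_api.close()  # prior to the upstream fix, this could spin indefinitely
```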
I've raised https://github.com/influxdata/influxdb-client-python/issues/558 to improve handling in the client.
I do also need to work out what the exception was in this case, to ensure that it won't be thrown at other points.
09-Feb-23 20:19
Pull Request with fixes in the client lib is at https://github.com/influxdata/influxdb-client-python/pull/559
10-Feb-23 08:34
I'm no longer able to reproduce the original failure even with the fixes/workaround removed.
If I call `recordFailure()` directly, it completes successfully and doesn't throw an exception. Whatever the original failure was, it was either transient or has since been fixed. Either way, the new changes should protect against repeats.