project jira-projects / Miscellaneous avatar

jira-projects/MISC#33: Trailing space in content-length header renders POST body unavailable in Python's Flask



Issue Information

Issue Type: issue
Status: closed
Reported By: btasker
Assigned To: btasker

Created: 22-Jun-23 08:12



Description

This is quite an interesting one, raising this ticket so I can keep track of what I need to do to get this fixed.

Background

An issue was originally filed seeking advice in my Ecowitt_to_InfluxDB project

I'm also using Python/Flask to receive the POST data from the Ecowitt but I'm having some really weird issues with the request object and the way Flask parses out the request.form data.

Like you, I initially tried to access the POST data with: (request.form)

which worked absolutely fine for my "faked" data submitted using curl but was empty for real data from the Ecowitt. A packet dump shows all the form data is present and correct, just that Flask doesn't seem to access it.

A related issue was found in the Flask/Werkzeug repo

What happens is that in the route handler, request.form is empty depending on how the form request looks in detail. Now sending a POST request using a simple html form works fine as long as it looks like this: ith that request, request.form is filled and the data can be parsed normally. However, the Ecowitt station sends requests that have a different set of header fields, which looks like this:

POST /weather/report HTTP/1.1
HOST: 192.168.178.208
Connection: Close
Content-Type: application/x-www-form-urlencoded
Content-Length:502

PASSKEY=78D8DCE861B2164BAD38A473167ACFD2&stationtype=GW2000A_V2.2.4&runtime=1247&dateutc=2023-06-17+10:3

Sending that request results in an empty request.form. It's also no use to try and use request.get_data() or stuff like that. The method in both requests is POST.

There was initially some suggestion that the missing space in the Content-Length header might be the issue.

However, RFC 7230 states that whitespace around header field values is optional (and "ought to be excluded" from values when parsing).

It wasn't, however possible to repro the behaviour by omitting this space:

Example app (repro.py)

from flask import Flask, request

app = Flask(__name__)


@app.route('/', methods=["POST"])
def hello():
    return request.get_data(as_text=True)

app.run(host="0.0.0.0", port=8090, debug=True)    

repro attempt

printf "POST / HTTP/1.1\r\nHost:127.0.0.1:8090\r\nContent-Length:15\r\nContent-Type: application/x-www-form-urlencoded\r\n\r\nfoo=bar&sed=zoo" | nc 127.0.0.1 8090
HTTP/1.1 200 OK
Server: Werkzeug/2.3.6 Python/3.10.6
Date: Wed, 21 Jun 2023 13:35:50 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 15
Connection: close

foo=bar&sed=zoo

Failed - the expectation was that we wouldn't get the POST body returned to us (because the application's not supposed to have it).

To dig in further I took a packet capture of my weather station's writes and found the headers were formatted the same way as described in the ticket

Screenshot_20230622_085255

This suggested it should be capable of exhibiting the behaviour described in GH.

Further analysis of the packet capture helped to identify the cause of the issue


Cause

It's present but non-obvious in the screenshot above (and un-noticeable in something like a Github description), but looking at the packet bytes:

Screenshot_20230622_085405

The Content-Length header value has a trailing space (20 in hex) before the CRLF.

Again, this is legal under RFC 7230, but allowed a repro to be built (same command as before, but with a space after the content length value)

ben@optimus:~/tmp$ printf "POST / HTTP/1.1\r\nHost:127.0.0.1:8090\r\nContent-Length:15 \r\nContent-Type: application/x-www-form-urlencoded\r\n\r\nfoo=bar&sed=zoo" | nc 127.0.0.1 8090
HTTP/1.1 200 OK
Server: Werkzeug/2.3.6 Python/3.10.6
Date: Wed, 21 Jun 2023 16:11:30 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Connection: close

Repro!


Root Cause

As noted here Flask's debug server is a thin wrapper around Python's http.server.

When http.server receives a request it calls http.client.parse_headers() to parse the request headers.

http.client doesn't perform the header parsing itself, but instead imports Python's email and passes the headers (currently one long string) into email.parser.Parser.parsestr()

Eventually the call chain works down to _parsegen() which calls email.EmailPolicy.header_source_parse().

The function is quite simple:

    def header_source_parse(self, sourcelines):
        """+
        The name is parsed as everything up to the ':' and returned unmodified.
        The value is determined by stripping leading whitespace off the
        remainder of the first line, joining all subsequent lines together, and
        stripping any trailing carriage return or linefeed characters.

        """
        name, value = sourcelines[0].split(':', 1)
        value = value.lstrip(' \t') + ''.join(sourcelines[1:])
        return (name, value.rstrip('\r\n'))

The important bit is that middle line.

It's stripping any leading whitespace, but not any trailing whitespace.

We can show the effect of this

>>> header_source_parse(["content-length:15 "])
('content-length', '15 ')

The trailing space is left in place, and bubbles back up the callchain back into Flask.

Flask's form data parser looks like this

    def start_file_streaming(
        self, event: File, total_content_length: int | None
    ) -> t.IO[bytes]:
        content_type = event.headers.get("content-type")

        try:
            content_length = _plain_int(event.headers["content-length"])
        except (KeyError, ValueError):
            content_length = 0

        container = self.stream_factory(
            total_content_length=total_content_length,
            filename=event.filename,
            content_type=content_type,
            content_length=content_length,
        )
        return container

It calls _plain_int() to convert the content-length header to an integer. This is where the breakage happens.

_plain_int() tests the value against re.compile(r"-?\d+", re.ASCII), which doesn't permit spaces (and why should it?)

When the value doesn't match the regex (because of the trailing space), the content-length value is set to 0 by start_file_streaming(). As a result, the form data then won't be read in, so is not available via request.form and request.get_data().


RFC Conflict

There is a reason why header_source_parse() does not strip trailing whitespace

Email header formatting is governed by RFC 5322 which (arguably) considers trailing spaces to be part of the value. It doesn't make any special dispensation for stripping trailing spaces.

This is incompatible with how the HTTP spec believes header values should be parsed.


Fix

There are two fixes that are needed

  1. Have Werkzeug strip any trailing whitepace in _plain_int() (there's also a _plain_float() that should do it too): Werkzeug Pull 2738
  2. Get a PR into Python to strip the whitespace: Python issue 105973


Toggle State Changes

Activity


assigned to @btasker

changed the description

OK, so having already put the Werkzeug PR in to have Flask mitigate, the next thing to do is to put together a PR for Python itself.

Per the code-owner's instruction, we need to

  • Include regression tests
  • strip the values

I think in this case, it actually probably makes sense to write the test-case first and then implement the fix.

Although it could be tested via the HTTP server tests, it probably makes most sense to add to the HTTP Lib tests

There's a doc at writing tests for python

Installing build deps (I already have build-essentials etc)

sudo apt install libssl-dev

Running the existing tests

./configure
make
make test TESTOPTS="-v test_httplib"

The result is

Ran 127 tests in 1.407s

OK

== Tests result: SUCCESS ==

1 test OK.

Total duration: 1.7 sec
Tests result: SUCCESS

I'm out of lunch break, but have implemented the test (test_header_whitespace_removal).

  • Commit 598f791 introduces the new test and tests for the behaviour described here
  • Commit 8e65cf2 updates that test to test for other related failures (failing to strip leading, or stripping whitespace from the middle) to try and catch possible future failure conditions

Now I can see why the maintainer of email said that

So it might take some effort to clean it, which makes it a non-trivial change and the code owner's opinion is needed.

But, I believe I've got an initial implementation.

Some notes:

  • We cannot simply iterate with for header in headers. The HTTP spec allows multiple headers with the same name. __getitem__ cannot work with this
  • One option would have been to iterate in this manner and call message.get_all() to retrieve all values for the header
  • I felt it was cleaner to call email.raw_items() to retrieve a list of tuples of name + value
  • message.__setitem__ sets header and value, but does not overwrite existing headers with the same name
  • message.__delitem__ deletes all headers with that name
  • It was therefore necessary to cater to multi-headers by calling __delitem__ once before using __setitem__ to add the new header(s)

This currently takes the following form

    # Track to prevent duplicate deletions
    already_deleted = []

    # Iterate through and strip any trailing whitespace from header values.
    #
    # Note: We cannot simply use the _getitem_ to iterate over headers
    # because HTTP allows duplicate header names to be sent
    # (this is common, for example with `set-cookie`).
    for header in headers.raw_items():
        name = header[0]
        value = header[1].rstrip()

        # message.__delitem__ matches all headers with
        # the provided name so we only want to call it
        # the once
        if name not in already_deleted:
            del(headers[name])
            already_deleted.append(name)

        # Set the corrected header value
        # message.__setitem__ doesn't replace existing headers
        # so a second copy of this header will not overwrite the first
        headers[name] = value

Introduced in commit fb8fc15.

I've also added some additional tests:

  • Add test to ensure that multiple set-cookie headers are preserved (Commit 12ec939)
  • Add test to ensure that a lack of headers does not lead to an exception (Commit 13bf796)

The library specific tests pass, running full tests now to double check I've not introduced any wider issues

make test

PR 105994 is up and awaiting review.

One of the maintainers has asked that we go a different route and adjust email.policy.HTTP so that it's able to form the headers correctly.

It'll be more efficient as it removes the need for post-processing. email feels like the wrong place to be doing HTTP specific stuff, but, not my codebase or call :D

OK, so need to rebase the branch to remove the following commits

  • 3b09931f0be5f43a754876e4e6a60c0220a1f341
  • 3cce99d1342f808aaad642b2239606658473e97b
  • 13bf7964f7d5ac4f233f99febfc6fd5ba2f50f2e
  • fb8fc155343c39c230b12f63605f7d7d309ec083

Have synced the branch with the source to make sure we're up to date and rebasing to remove those

$ git rebase -i HEAD~9

Have force pushed to my branch

Oh, that was a mistake, that's pulled other people's commits into the PR need to fix that.

Rebasing again and only picking my own commits... muppet

I've made changes so the Policy classes are able to do this.

If I hardcode the value to False then my tests pass.

However, despite updating the HTTP clone to set this

HTTP = default.clone(linesep='\r\n', max_line_length=None, allow_trailing_whitespace=False)

the test fails.

Best guess, at the moment, is that http.client is not actually using the HTTP policy.

I'm guessing it's do to with the invocation here

    return email.parser.Parser(_class=_class).parsestr(hstring)

The default value of _class in http.client is HTTPMessage. Need to look at how the policy is set/specified though.

Yeah, it wasn't using HTTP.

Updated to

return email.parser.Parser(_class=_class, policy=policy.HTTP).parsestr(hstring)

and shazam!

Unfortunately that seems to have broken some tests

  • test_docxmlrpc
  • test_email
  • test_logging
  • test_urllibnet

Errors

FAIL: test_all_attributes_covered (test.test_email.test_policy.PolicyAPITests.test_all_attributes_covered) (policy=Compat32(), attr='allow_trailing_whitespace')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\cpython\cpython\Lib\test\test_email\test_policy.py", line 83, in test_all_attributes_covered
    self.assertIn(attr, expected,
AssertionError: 'allow_trailing_whitespace' not found in {'max_line_length': 78, 'linesep': '\n', 'cte_type': '8bit', 'raise_on_defect': False, 'mangle_from_': True, 'message_factory': None} : allow_trailing_whitespace is not fully tested

FAIL: test_all_attributes_covered (test.test_email.test_policy.PolicyAPITests.test_all_attributes_covered) (policy=EmailPolicy(), attr='allow_trailing_whitespace')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\cpython\cpython\Lib\test\test_email\test_policy.py", line 83, in test_all_attributes_covered
    self.assertIn(attr, expected,
AssertionError: 'allow_trailing_whitespace' not found in {'max_line_length': 78, 'linesep': '\n', 'cte_type': '8bit', 'raise_on_defect': False, 'mangle_from_': False, 'message_factory': <class 'email.message.EmailMessage'>, 'utf8': False, 'header_factory': <email.headerregistry.HeaderRegistry object at 0x04C37F88>, 'refold_source': 'long', 'content_manager': <email.contentmanager.ContentManager object at 0x04C3BAD8>} : allow_trailing_whitespace is not fully tested

FAIL: test_all_attributes_covered (test.test_email.test_policy.PolicyAPITests.test_all_attributes_covered) (policy=EmailPolicy(linesep='\r\n'), attr='allow_trailing_whitespace')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\cpython\cpython\Lib\test\test_email\test_policy.py", line 83, in test_all_attributes_covered
    self.assertIn(attr, expected,
AssertionError: 'allow_trailing_whitespace' not found in {'max_line_length': 78, 'linesep': '\r\n', 'cte_type': '8bit', 'raise_on_defect': False, 'mangle_from_': False, 'message_factory': <class 'email.message.EmailMessage'>, 'utf8': False, 'header_factory': <email.headerregistry.HeaderRegistry object at 0x04C37F88>, 'refold_source': 'long', 'content_manager': <email.contentmanager.ContentManager object at 0x04C3BAD8>} : allow_trailing_whitespace is not fully tested

FAIL: test_all_attributes_covered (test.test_email.test_policy.PolicyAPITests.test_all_attributes_covered) (policy=EmailPolicy(linesep='\r\n', utf8=True), attr='allow_trailing_whitespace')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\cpython\cpython\Lib\test\test_email\test_policy.py", line 83, in test_all_attributes_covered
    self.assertIn(attr, expected,
AssertionError: 'allow_trailing_whitespace' not found in {'max_line_length': 78, 'linesep': '\r\n', 'cte_type': '8bit', 'raise_on_defect': False, 'mangle_from_': False, 'message_factory': <class 'email.message.EmailMessage'>, 'utf8': True, 'header_factory': <email.headerregistry.HeaderRegistry object at 0x04C37F88>, 'refold_source': 'long', 'content_manager': <email.contentmanager.ContentManager object at 0x04C3BAD8>} : allow_trailing_whitespace is not fully tested

So we need to add a test for the new attribute - I should have done that anyway.

FAIL: test_get_css (test.test_docxmlrpc.DocXMLRPCHTTPGETServer.test_get_css)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\cpython\cpython\Lib\test\test_docxmlrpc.py", line 106, in test_get_css
    self.assertEqual(response.getheader("Content-type"), "text/css; charset=UTF-8")
AssertionError: 'text/css; charset="UTF-8"' != 'text/css; charset=UTF-8'
- text/css; charset="UTF-8"
?                   -     -
+ text/css; charset=UTF-8

Looks like we've gained quoting, will have to look at why.

There's a function called unquote so might be worth looking at where/why that gets called. Presumably it now isn't when it should be (test is self.assertEqual(response.getheader("Content-type"), "text/html; charset=UTF-8") so the test wants the unquoted version)

Either quoted or unquoted is legal under RFC 7231 Sec 3.1.1.1 so we definitely want to restore the original behaviour.

FAIL: test_data_header (test.test_urllibnet.urlretrieveNetworkTests.test_data_header)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\cpython\cpython\Lib\test\test_urllibnet.py", line 199, in test_data_header
    time.strptime(datevalue, dateformat)
  File "D:\a\cpython\cpython\Lib\_strptime.py", line 549, in _strptime_time
    tt = _strptime(data_string, format)[0]
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\a\cpython\cpython\Lib\_strptime.py", line 333, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data 'Fri, 23 Jun 2023 08:50:09 +0000' does not match format '%a, %d %b %Y %H:%M:%S GMT'

The date handler seems to have changed too.

It looks like formatdate() handles switching to using GMT so need to find where that's called and adjust the HTTP policy accordingly.

Taken a quick look at test_email.

It's relatively easy to fix - need to update defaults in the tests (which also means we need to update docstrings in the codebase).

The only catch is, the variable name allow_trailing_whitespace is quite long, so it fucks with the existing spacing.

Rather than respacing everything, I'm going to rename to rstrip_whitespace (and obviously swap out values)

There's a function called unquote so might be worth looking at where/why that gets called. Presumably it now isn't when it should be (test is self.assertEqual(response.getheader("Content-type"), "text/html; charset=UTF-8") so the test wants the unquoted version)

Actually, looking at the input, there are no quotes. Something's adding the quotes, so it's not unquote() we should be looking for but something that does quoting.

I dropped a print() into feedparser._parsegen()

        # Done with the headers, so parse them and figure out what we're
        # supposed to see in the body of the message.
        self._parse_headers(headers)
        print(self._cur._headers)

At that point, the quotes have not been added

[('Server', 'BaseHTTP/0.6 Python/3.13.0a0'), ('Date', 'Fri, 23 Jun 2023 11:55:39 GMT'), ('Content-Type', 'text/css; charset=UTF-8'), ('Content-length', '1325')]
FAIL
test_valid_get_response (test.test_docxmlrpc.DocXMLRPCHTTPGETServer.test_valid_get_response) ... [('Host', 'localhost:39143'), ('Accept-Encoding', 'identity')]
[('Server', 'BaseHTTP/0.6 Python/3.13.0a0'), ('Date', 'Fri, 23 Jun 2023 11:55:39 GMT'), ('Content-Type', 'text/html; charset=UTF-8'), ('Content-length', '3197')]

The function called by http.client is parsestr. At the point that that returns

    def parsestr(self, text, headersonly=False):
        """Create a message structure from a string.

        Returns the root of the message structure.  Optional headersonly is a
        flag specifying whether to stop parsing after reading the headers or
        not.  The default is False, meaning it parses the entire contents of
        the file.
        """
        s = self.parse(StringIO(text), headersonly=headersonly)
        print(s)
        return s

The value still isn't quoted

Host: localhost:32937
Accept-Encoding: identity


Server: BaseHTTP/0.6 Python/3.13.0a0
Date: Fri, 23 Jun 2023 12:32:57 GMT
Content-Type: text/html; charset=UTF-8
Content-length: 3197

OK, so jumping back into the test itself

        print(response.headers)
        print(response.getheader("Content-type"))

gives

Server: BaseHTTP/0.6 Python/3.13.0a0
Date: Fri, 23 Jun 2023 12:39:21 GMT
Content-Type: text/css; charset=UTF-8
Content-length: 1325


text/css; charset="UTF-8"

The quotes are being added as the result of calling http.client.getheader().

That, in turn calls Message.get_all(), which in turns calls self.policy.header_fetch_parse(), which then passes it into the relevant header factory (this, incidentally, is where the policies differ - without the change in policy the header would be passed through _sanitize but that's it).

Ultimately it ends up in parse_content_type_header() (should totally have gone looking for content_type...

The parameter is then passed into parse_mime_parameters() which ultimately inserts them into an instance of MimeParameters()

The quoting occurs in MimeParameters.__str__()

    def __str__(self):
        params = []
        for name, value in self.params:
            if value:
                params.append('{}={}'.format(name, quote_string(value)))
            else:
                params.append(name)

That's got no access to the policy.... how the fuck are we going to fix that?

In order to preserve existing behaviour whilst applying the trailing space fix, I adjusted email.policy.HTTP to be a clone of Compat32 (the class it was using before the http.client change) rather than of EmailPolicy. (Commit e42e290b57)

That change fixes the issue (but currently breaks the email tests).

I've posted a couple of options in the PR to see which the maintainer wants

  • Adjust Compat32 to fix the test (essentially adding some attributes that won't ever be used, and fixing docstrings)
  • Spin out a new HTTP specific class (allowing easier expansion/extension if needed later).

Turns out I was totally overthinking it.

There's a section in test_policy where it specifies where the defaults should be read from

diff --git a/Lib/test/test_email/test_policy.py b/Lib/test/test_email/test_policy.py
index 8bd352a7dd..00fe011b86 100644
--- a/Lib/test/test_email/test_policy.py
+++ b/Lib/test/test_email/test_policy.py
@@ -54,7 +54,7 @@ class PolicyAPITests(unittest.TestCase):
         email.policy.SMTPUTF8: make_defaults(policy_defaults,
                                              {'linesep': '\r\n',
                                               'utf8': True}),
-        email.policy.HTTP: make_defaults(policy_defaults,
+        email.policy.HTTP: make_defaults(compat32_defaults,
                                          {'linesep': '\r\n',
                                           'max_line_length': None,
                                           'rstrip_whitespace': True

That's fixed the email tests. Have pushed into the PR to check the other tests pass.

changed the description

changed the description

Woot, all tests passed.

Awaiting review by the maintainer. I'm guessing they'll probably want me to update the base branch to verify it's still mergeable, but I'll hold off on doing that until asked.

changed the description

Looks like my PR targetted the wrong branch (I went for main, but turns out there are release specific branches). The maintainer has corrected that in PR 2764) which is now merged.

So, my fix will be in 2.3.7 whenever it rolls out