Fix encoding error with non-prettified encoded responses #1168
Merged: BoboTiG merged 19 commits into master from mickael/oss-73-fix-encoding-error-with-non-prettified on Oct 6, 2021.
Commits (19):

- 0b5f4d6 — Fix encoding error with non-prettified encoded responses (BoboTiG)
- 491188d — Encoding refactoring (jkbrzt)
- 252fe02 — `test_unicode.py` → `test_encoding.py` (jkbrzt)
- d52a483 — Drop sequence length check (BoboTiG)
- a3bcd09 — Clean-up tests (BoboTiG)
- 8a031e3 — [skip ci] Tweaks (BoboTiG)
- 49ab59b — Use the compatible release clause for `charset_normalizer` requirement (BoboTiG)
- fd5be7d — Clean-up (BoboTiG)
- 93590b1 — Partially revert d52a4833e461e1b16b7961a112ea5c53e93cd643 (BoboTiG)
- fb92865 — Changelog (BoboTiG)
- 7390f84 — Tweak tests (BoboTiG)
- f4f132c — [skip ci] Better test name (BoboTiG)
- 115aef9 — Cleanup tests and add request body charset detection (jkbrzt)
- 6cbe2b2 — More test suite cleanups (jkbrzt)
- a655659 — Cleanup (jkbrzt)
- fb6a61f — Fix code style in test (BoboTiG)
- 5d37226 — Improve detect_encoding() docstring (BoboTiG)
- ea7a36c — Uniformize pytest.mark.parametrize() calls (BoboTiG)
- 743fc89 — [skip ci] Comment out TODOs (will be tackled in a specific PR) (BoboTiG)
New file (@@ -0,0 +1,50 @@):

```python
from typing import Union

from charset_normalizer import from_bytes
from charset_normalizer.constant import TOO_SMALL_SEQUENCE

UTF8 = 'utf-8'

ContentBytes = Union[bytearray, bytes]


def detect_encoding(content: ContentBytes) -> str:
    """
    We default to utf8 if text too short, because the detection
    can return a random encoding leading to confusing results:

    >>> too_short = ']"foo"'
    >>> detected = from_bytes(too_short.encode()).best().encoding
    >>> detected
    'utf_16_be'
    >>> too_short.encode().decode(detected)
    '崢景漢'

    """
    encoding = UTF8
    if len(content) > TOO_SMALL_SEQUENCE:
        match = from_bytes(bytes(content)).best()
        if match:
            encoding = match.encoding
    return encoding


def smart_decode(content: ContentBytes, encoding: str) -> str:
    """Decode `content` using the given `encoding`.
    If no `encoding` is provided, the best effort is to guess it from `content`.

    Unicode errors are replaced.

    """
    if not encoding:
        encoding = detect_encoding(content)
    return content.decode(encoding, 'replace')


def smart_encode(content: str, encoding: str) -> bytes:
    """Encode `content` using the given `encoding`.

    Unicode errors are replaced.

    """
    return content.encode(encoding, 'replace')
```
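To illustrate the decode behavior this file introduces, here is a stdlib-only sketch: it inlines the UTF-8 fallback instead of calling `charset_normalizer`, so the detection step is deliberately simplified (an assumption made to keep the sketch dependency-free).

```python
# Stdlib-only sketch of the smart_decode() behavior from the diff.
# The real detect_encoding() delegates to charset_normalizer; here we
# simply fall back to UTF-8 when no encoding is given.
def smart_decode(content: bytes, encoding: str) -> str:
    if not encoding:
        encoding = 'utf-8'
    # 'replace' turns undecodable bytes into U+FFFD instead of raising
    # UnicodeDecodeError.
    return content.decode(encoding, 'replace')


print(smart_decode(b'caf\xe9', 'latin-1'))  # café
print(smart_decode(b'caf\xe9', ''))         # caf\ufffd (replacement char)
```

The key design point is the `'replace'` error handler: a wrongly guessed encoding degrades to visible replacement characters rather than crashing the response-printing pipeline.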
Good to know! What has changed?
That’s exactly what we needed. Is there still some length threshold below which it’s unreasonable to rely on the detected encoding? Or a way to get some sort of confidence interval for the best match?
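The misdetection risk being discussed can be reproduced with the stdlib alone, using the short string from the new docstring: a 6-byte ASCII payload happens to be a valid UTF-16-BE byte sequence, so a detector can plausibly pick that encoding and produce CJK garbage.

```python
# Reproduces the docstring example: each pair of ASCII bytes forms one
# valid UTF-16-BE code unit, so the "detected" decoding succeeds silently.
too_short = ']"foo"'
decoded = too_short.encode().decode('utf_16_be')
print(decoded)  # 崢景漢
```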
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could require `charset_normalizer>=2.0.5` and drop our own `TOO_SMALL_SEQUENCE` check? I am not sure that `charset_normalizer>=2.0.5` is available on all OSes for our package, though.
I committed d52a483 just to see how it goes.
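For reference, commit 49ab59b later switched the requirement to the compatible release clause. Per standard PEP 440 semantics (the exact version pinned is not shown in this excerpt; `2.0.x` here is an assumption for illustration), such a requirement expands as:

```
charset_normalizer~=2.0.5
# equivalent to: charset_normalizer >= 2.0.5, == 2.0.*
```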
I'll revert d52a483 as it seems too new.