Encoding error with non-prettified encoded responses #1167

@BoboTiG

Description

Minimal reproduction code and steps

$ http -b --pretty=none https://raw.githubusercontent.com/Ousret/charset_normalizer/master/data/sample-chinese.txt

Current result

����j��]Wikipedia�^�̡A������¦�F���ѤU���B�|�����B���H�ӡA�Ѧʬ�j�C�l�@�̡A����C�����|�]�C

��F�G�����~�Q�G��ܤ@�A�Τv���~�G��Q�E�A���U��y���G�ʤ��Q�A�X�O�C�ʸU�ءF�G�Q�j�����K���A��^�����L�G�ʸU�C�x��D�ѤU���Ӧ@���Ӧ��F���N�U���A�X�����B�Hġ�@�A�j��_�j�C

����@���A�X�j�ոܺ���C�m���l�����n��J�u���Aô�]�v�A�m����n��J�u��A��l�]�v�A��H�X���A���Nô�����l�]�C��´�����A�H���ڥ��A�����]�C

�j��o���A�P�D�ݵo�F�p�s�D�B�Х��B�����B�ӾǡB�y���B��w���C����ئh�i�w�\�ש��s�������F�\�����׽֡A�ѱo�Pġ��A����_��ɨ�a��H�A�M�h��Ҹ��չ�A���K���G�C

�Z���򤧵��A�Ҿڪ̡A�ꭲ���ۥѤ��ɳ\�i��ij�A�G�i�ۥѼs�ǤѤU�աC

�娥����l������~�C�i�A�����o��G�d�@�ʤK�Q�E�C
   
commons:����
        �n�H�M�T�A��������@�ɡJ����j��C

Expected result

維基大典(Wikipedia)者,網路為礎;集天下知、四海言、眾人志,書百科焉。始作者,維基媒體基金會也。

典肇乎庚辰年十二月廿一,及己丑年二月十九,收各方語言二百五十,合逾七百萬目;二十大卷佔八成,單英文卷亦過二百萬。悉文乃天下有志共筆而成;有意助之,幾網路、隨纂作,大典茁焉。

維基一詞,出焉白話維基。《墨子閒詁》曰︰「維,繫也」,《說文》曰︰「基,牆始也」,其人合之,取意繫物之始也。維織網綱,以載根本,亦維基也。

大典得幸,同道兼發;如新聞、教本、爾雅、太學、語錄、文庫等。其卷目多可逕閱修於瀏覽之器;蓋眾不論誰,俱得與纂綴,弗制于其時其地其人,然則典所載謬實,未免爭辯。

凡維基之策,所據者,曰革奴自由文檔許可協議,故可自由廣傳天下耳。

文言維基始於丙戌年七夕,迄今得文二千一百八十九。
   
commons:卷首
        聲象映響,具錄於維基共享︰維基大典。

Debug output

$ http --debug  --pretty=none https://raw.githubusercontent.com/Ousret/charset_normalizer/master/data/sample-chinese.txt
HTTPie 2.6.0.dev0
Requests 2.26.0
Pygments 2.10.0
Python 3.9.7+ (heads/3.9:09390c837a, Sep 22 2021, 11:36:19) 
[GCC 10.3.0]
/home/tiger-222/projects/httpie/venv39/bin/python
Linux 5.10.0-8-amd64

<Environment {'colors': 256,
 'config': {'default_options': []},
 'config_dir': PosixPath('/home/tiger-222/.config/httpie'),
 'devnull': <property object at 0x7f55691a66d0>,
 'is_windows': False,
 'log_error': <function Environment.log_error at 0x7f55691ad3a0>,
 'program_name': '__main__.py',
 'stderr': <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>,
 'stderr_isatty': True,
 'stdin': <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>,
 'stdin_encoding': 'utf-8',
 'stdin_isatty': True,
 'stdout': <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>,
 'stdout_encoding': 'utf-8',
 'stdout_isatty': True}>

<PluginManager {'adapters': [],
 'auth': [<class 'httpie.plugins.builtin.BasicAuthPlugin'>,
          <class 'httpie.plugins.builtin.DigestAuthPlugin'>],
 'converters': [],
 'formatters': [<class 'httpie.output.formatters.headers.HeadersFormatter'>,
                <class 'httpie.output.formatters.json.JSONFormatter'>,
                <class 'httpie.output.formatters.xml.XMLFormatter'>,
                <class 'httpie.output.formatters.colors.ColorFormatter'>]}>

>>> requests.request(**{'auth': None,
 'data': RequestJSONDataDict(),
 'headers': {'User-Agent': b'HTTPie/2.6.0.dev0'},
 'method': 'get',
 'params': <generator object MultiValueOrderedDict.items at 0x7f5569058900>,
 'url': 'https://raw.githubusercontent.com/Ousret/charset_normalizer/master/data/sample-chinese.txt'})

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 626
Cache-Control: max-age=300
Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline'; sandbox
Content-Type: text/plain; charset=utf-8
ETag: W/"afe2305da1ba095a7b6cfc792fb066356cc5f3f571f6dfcdea533a41a9b5daa1"
Strict-Transport-Security: max-age=31536000
X-Content-Type-Options: nosniff
X-Frame-Options: deny
X-XSS-Protection: 1; mode=block
X-GitHub-Request-Id: 38A4:EAF4:D2C6A:ED8C9:6154225D
Content-Encoding: gzip
Accept-Ranges: bytes
Date: Wed, 29 Sep 2021 08:42:31 GMT
Via: 1.1 varnish
X-Served-By: cache-cdg20732-CDG
X-Cache: HIT
X-Cache-Hits: 1
X-Timer: S1632904951.350979,VS0,VE0
Vary: Authorization,Accept-Encoding,Origin
Access-Control-Allow-Origin: *
X-Fastly-Request-ID: cb5e11b4a5af2375903564d55226b463701ef104
Expires: Wed, 29 Sep 2021 08:47:31 GMT
Source-Age: 74


����j��]Wikipedia�^�̡A������¦�F���ѤU���B�|�����B���H�ӡA�Ѧʬ�j�C�l�@�̡A����C�����|�]�C

��F�G�����~�Q�G��ܤ@�A�Τv���~�G��Q�E�A���U��y���G�ʤ��Q�A�X�O�C�ʸU�ءF�G�Q�j�����K���A��^�����L�G�ʸU�C�x��D�ѤU���Ӧ@���Ӧ��F���N�U���A�X�����B�Hġ�@�A�j��_�j�C

����@���A�X�j�ոܺ���C�m���l�����n��J�u���Aô�]�v�A�m����n��J�u��A��l�]�v�A��H�X���A���Nô�����l�]�C��´�����A�H���ڥ��A�����]�C

�j��o���A�P�D�ݵo�F�p�s�D�B�Х��B�����B�ӾǡB�y���B��w���C����ئh�i�w�\�ש��s�������F�\�����׽֡A�ѱo�Pġ��A����_��ɨ�a��H�A�M�h��Ҹ��չ�A���K���G�C

�Z���򤧵��A�Ҿڪ̡A�ꭲ���ۥѤ��ɳ\�i��ij�A�G�i�ۥѼs�ǤѤU�աC

�娥����l������~�C�i�A�����o��G�d�@�ʤK�Q�E�C
   
commons:����
        �n�H�M�T�A��������@�ɡJ����j��C

Additional information, screenshots, or code examples

The issue occurs because the Content-Type header declares the charset as UTF-8, but the response body is actually encoded in Big5.
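The mismatch can be reproduced without any network access. The following is a minimal sketch, not HTTPie's actual code: it encodes the first four characters of the expected result as Big5, then decodes them as UTF-8, both strictly and leniently. Using `errors="replace"` for the lenient case is an assumption about how the garbled output above comes about.

```python
# Minimal sketch: Big5 bytes decoded as if they were UTF-8.
text = "維基大典"            # first characters of the expected result
raw = text.encode("big5")   # what the server actually sends

# Strict decoding with the declared charset fails outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print(f"strict decode failed: {exc}")

# Lenient decoding (assumed here to be errors="replace") hides the failure
# behind U+FFFD replacement characters.
garbled = raw.decode("utf-8", errors="replace")
print(garbled)

# Decoding with the real encoding recovers the text.
recovered = raw.decode("big5")
print(recovered)
```

The strict path corresponds to the `UnicodeDecodeError` shown in the traceback, while the lenient path produces replacement characters like those in the current result.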

Before #1110, the error was more obvious:

# headers (...)

__main__.py: error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

Traceback (most recent call last):
  File "/.../Lib/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/.../Lib/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "__main__.py", line 19, in <module>
    sys.exit(main())
  File "__main__.py", line 9, in main
    exit_status = main()
  File "core.py", line 70, in main
    exit_status = program(
  File "core.py", line 190, in program
    write_message(requests_message=message, env=env, args=args, with_headers=with_headers,
  File "output/writer.py", line 44, in write_message
    write_stream(**write_stream_kwargs)
  File "output/writer.py", line 66, in write_stream
    for chunk in stream:
  File "output/writer.py", line 108, in build_output_stream_for_message
    yield from stream_class(
  File "output/streams.py", line 69, in __iter__
    for chunk in self.iter_body():
  File "output/streams.py", line 118, in iter_body
    yield line.decode(self.msg.encoding) \
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

Solution

A potential fix would be to widen the scope of the --response-as option so that it applies to all encoded responses, not only to prettified ones.
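The idea can be sketched as a single decoding step shared by every output path. This is only an illustration under stated assumptions: `decode_body` and its `override` parameter are hypothetical names, not part of HTTPie, and `override` stands in for whatever value the user passes to --response-as.

```python
from typing import Optional


def decode_body(raw: bytes, declared: str, override: Optional[str] = None) -> str:
    """Decode a response body, preferring a user-supplied encoding.

    Hypothetical helper: `declared` is the charset from Content-Type,
    `override` is the encoding requested by the user, if any.
    """
    encoding = override or declared
    # errors="replace" keeps the output printable even if the chosen
    # encoding is still wrong (an assumption, not HTTPie's documented behavior).
    return raw.decode(encoding, errors="replace")


raw = "維基大典".encode("big5")

# Without an override, the wrong declared charset yields replacement characters.
print(decode_body(raw, declared="utf-8"))

# With an override, the body decodes correctly whether or not it is prettified.
print(decode_body(raw, declared="utf-8", override="big5"))
```

Routing both the prettified and the raw stream through one such step would let the override take effect regardless of the --pretty setting.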

Labels

bug: Something isn't working
