小能豆

encode method for UTF16 and UTF32 in Python

py

I want to investigate how UTF-16 and UTF-32 work by looking at specific characters in binary.

I know we can use "string".encode("(encoding name)") to check a string's hex value in a specific encoding, and it works fine with UTF-8.

But when it comes to UTF-16 or UTF-32, the result is different from the encoded value it is supposed to be.

For example, take the first letter of the Japanese syllabary, “あ”. According to https://www.compart.com/en/unicode/U+3042, its hex values in UTF-8, UTF-16, and UTF-32 are E38182, 3042, and 00003042.

So if I execute the following code

print("あ".encode('utf-8'))
print("あ".encode('utf-16BE'))
print("あ".encode('utf-32BE'))

I will get

b'\xe3\x81\x82'
b'0B'
b'\x00\x000B'

As you can see, the UTF-8 result matches the code table, but the UTF-16 and UTF-32 results look weird. I have no idea how 0B (or \x00\x000B) can correspond to 3042. Am I misunderstanding something about the encode method?


2023-12-24

1 answer

小能豆

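Your output is actually correct. The confusion comes from how Python displays bytes objects: any byte in the printable ASCII range is shown as its ASCII character rather than as a \x escape. Byte 0x30 is the character '0' and byte 0x42 is the character 'B', so b'0B' is exactly the two bytes 30 42, and b'\x00\x000B' is 00 00 30 42. Both match the code table. Calling .hex() on the result shows the hex digits directly.

Endianness and the byte order mark (BOM) are a separate matter. UTF-16 and UTF-32 code units are wider than one byte, so they can be serialized in two byte orders: big-endian (most significant byte first) or little-endian (least significant byte first).

When you name the byte order explicitly, as with 'utf-16BE' and 'utf-32BE', Python writes no BOM. The generic 'utf-16' and 'utf-32' codecs prepend a BOM that records the byte order and then use the machine's native order. Python's 'utf-8' codec does not write a BOM either.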
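To make the display issue concrete, here is a minimal sketch that prints the same encoded bytes both as a bytes repr and as hex. The variable names are only illustrative; the codec names and bytes.hex() are standard Python.

```python
# A bytes repr shows printable-ASCII bytes as characters,
# so 0x30 appears as '0' and 0x42 appears as 'B'.
ch = "あ"  # U+3042

utf8 = ch.encode("utf-8")
utf16be = ch.encode("utf-16-be")
utf32be = ch.encode("utf-32-be")

print(utf16be)         # b'0B'
print(utf8.hex())      # e38182
print(utf16be.hex())   # 3042
print(utf32be.hex())   # 00003042
print(list(utf16be))   # [48, 66], i.e. 0x30 and 0x42
```

The hex strings agree with the values listed in the code table, which suggests the big-endian encodings were right all along.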

For comparison, here is what the generic 'utf-16' and 'utf-32' codec names produce, BOM included:

# UTF-8 encoding
print("あ".encode('utf-8'))

# UTF-16 encoding (BOM + native byte order)
print("あ".encode('utf-16'))

# UTF-32 encoding (BOM + native byte order)
print("あ".encode('utf-32'))

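On a little-endian machine, this will output:

b'\xe3\x81\x82'
b'\xff\xfeB0'
b'\xff\xfe\x00\x00B0\x00\x00'

Here the BOMs are b'\xff\xfe' for UTF-16 and b'\xff\xfe\x00\x00' for UTF-32, both marking little-endian order. The character follows the BOM with its bytes reversed relative to the big-endian form: B0 is the bytes 42 30.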
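As a rough check of which BOM the generic codec writes, the sketch below compares the first two bytes against the constants in the standard codecs module. It assumes CPython's behavior of encoding 'utf-16' in the machine's native byte order; the variable names are hypothetical.

```python
import codecs
import sys

ch = "あ"
encoded = ch.encode("utf-16")  # BOM followed by native-order bytes

# Pick the BOM and explicit codec that match this machine's byte order.
if sys.byteorder == "little":
    expected_bom = codecs.BOM_UTF16_LE   # b'\xff\xfe'
    payload_codec = "utf-16-le"
else:
    expected_bom = codecs.BOM_UTF16_BE   # b'\xfe\xff'
    payload_codec = "utf-16-be"

bom, payload = encoded[:2], encoded[2:]
print(bom == expected_bom)                  # BOM matches native order
print(payload == ch.encode(payload_codec))  # rest equals the BOM-less form
print(encoded.decode("utf-16") == ch)       # decoder consumes the BOM
```

Decoding with the generic 'utf-16' name reads the BOM to choose the byte order, so the round trip works regardless of the machine that produced the bytes.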

If you want to remove the BOM from the encoded data, you can slice the result:

# Remove BOM from UTF-16 encoding
utf16_encoded = "あ".encode('utf-16')[2:]
print(utf16_encoded)  # Output on a little-endian machine: b'B0'

# Remove BOM from UTF-32 encoding
utf32_encoded = "あ".encode('utf-32')[4:]
print(utf32_encoded)  # Output on a little-endian machine: b'B0\x00\x00'

Keep in mind that UTF-16 and UTF-32 each come in little-endian and big-endian forms. If you need a specific byte order without a BOM, it is simpler to use the explicit codec names ('utf-16-be', 'utf-16-le', 'utf-32-be', 'utf-32-le') than to slice the BOM off.
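The effect of byte order can be sketched by turning the encoded bytes back into an integer with int.from_bytes. This is only an illustration for a character that fits in one UTF-16 code unit, such as “あ”; the variable names are made up for the example.

```python
ch = "あ"  # U+3042, a single UTF-16 code unit

le = ch.encode("utf-16-le")  # bytes 42 30
be = ch.encode("utf-16-be")  # bytes 30 42

# Reading the bytes with the matching byte order recovers the code point.
unit_le = int.from_bytes(le, "little")
unit_be = int.from_bytes(be, "big")
print(hex(unit_le), hex(unit_be), hex(ord(ch)))  # 0x3042 0x3042 0x3042

# Decoding with the wrong byte order silently yields a different character.
wrong = be.decode("utf-16-le")
print(hex(ord(wrong)))  # 0x4230
```

Because a mismatched byte order does not raise an error here, it is worth being explicit about which form the data uses.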

2023-12-24