I want to investigate how UTF-16 and UTF-32 work by looking at specific characters in binary.
I know we can use "string".encode("(encoding name)") to check a string's hex value in a specific encoding, and it works fine with UTF-8.
But when it comes to UTF-16 or UTF-32, I found the result is different from the encoded value it is supposed to be.
For example, take the first Japanese letter, “あ”. According to https://www.compart.com/en/unicode/U+3042, its hex values in UTF-8, UTF-16, and UTF-32 are E38182, 3042, and 00003042.
So if I execute the following code
print("あ".encode('utf-8'))
print("あ".encode('utf-16BE'))
print("あ".encode('utf-32BE'))
I will get
b'\xe3\x81\x82'
b'0B'
b'\x00\x000B'
As you can see, the UTF-8 result is identical to the code table, but the UTF-16 and UTF-32 results are weird… I have no idea how 000B can convert to 3042. Do I misunderstand something about the encode method?
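Your results are actually correct. Python's repr of a bytes object shows every byte that is printable ASCII as that character, and only the remaining bytes as \xNN escapes. The byte 0x30 is the character '0' and 0x42 is 'B', so b'0B' is the two bytes 30 42, which is exactly 3042, and b'\x00\x000B' is 00 00 30 42. The byte order mark (BOM) and endianness do not come into play here, because utf-16BE and utf-32BE fix the byte order to big-endian and write no BOM.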
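To see every byte as hex, whether or not it happens to be printable ASCII, bytes.hex() can be used. This is a minimal sketch; the variable names are only for illustration.

```python
# Encode the same character with the explicit big-endian codecs.
ch = "あ"  # U+3042

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

# bytes.hex() prints every byte as two hex digits, printable or not.
print(utf8.hex())   # e38182
print(utf16.hex())  # 3042
print(utf32.hex())  # 00003042

# The repr form b'0B' is the same two bytes: 0x30 is '0' and 0x42 is 'B'.
print(utf16 == bytes([0x30, 0x42]))  # True
```

These values match the table on the compart.com page.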
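UTF-16 and UTF-32 each have two byte orders: big-endian and little-endian. In Python, the codec name selects the byte order: the BE and LE variants fix it explicitly, while plain 'utf-16' and 'utf-32' encode in the machine's native byte order.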
When you encode with plain 'utf-16' or 'utf-32', Python prepends a BOM, which indicates the byte order used; the explicit BE and LE codecs do not add one. The 'utf-8' codec does not write a BOM either.
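The difference between the explicit and generic codecs can be checked with a short sketch like the one below. It uses the BOM constants from the standard codecs module and sys.byteorder so that the comparison holds on either kind of machine; the variable names are illustrative.

```python
import codecs
import sys

ch = "あ"  # U+3042

# Explicit byte order: no BOM is written.
be = ch.encode("utf-16-be")
le = ch.encode("utf-16-le")

# Generic codec: a BOM is written first, then the text in native byte order.
generic = ch.encode("utf-16")

little = sys.byteorder == "little"
native_bom = codecs.BOM_UTF16_LE if little else codecs.BOM_UTF16_BE
native_body = le if little else be

print(be.hex())       # 3042
print(le.hex())       # 4230
print(generic.hex())  # fffe4230 on a little-endian machine
print(generic == native_bom + native_body)  # True
```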
For comparison, here is your code with the generic 'utf-16' and 'utf-32' codecs, which do add a BOM:
# UTF-8 encoding (no BOM)
print("あ".encode('utf-8'))
# UTF-16 encoding (BOM + native byte order)
print("あ".encode('utf-16'))
# UTF-32 encoding (BOM + native byte order)
print("あ".encode('utf-32'))
On a little-endian machine, this will output:
b'\xe3\x81\x82'
b'\xff\xfeB0'
b'\xff\xfe\x00\x00B0\x00\x00'
Now you can see the BOMs: b'\xff\xfe' for UTF-16 and b'\xff\xfe\x00\x00' for UTF-32. Both mark the data as little-endian, and the encoded character follows the BOM.
If you want to remove the BOM from the encoded data, you can slice the result:
# Remove BOM from UTF-16 encoding (the BOM is 2 bytes)
utf16_encoded = "あ".encode('utf-16')[2:]
print(utf16_encoded)  # Output on a little-endian machine: b'B0'
# Remove BOM from UTF-32 encoding (the BOM is 4 bytes)
utf32_encoded = "あ".encode('utf-32')[4:]
print(utf32_encoded)  # Output on a little-endian machine: b'B0\x00\x00'
Keep in mind that UTF-16 and UTF-32 can have different endiannesses (little-endian or big-endian), and you may need to handle them accordingly based on your specific use case.
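As a rough sketch of handling both byte orders, the helper below inspects the BOM instead of slicing a fixed number of bytes. The name strip_utf16_bom is hypothetical, it covers UTF-16 only, and the big-endian fallback for data without a BOM is an assumption made for this example.

```python
import codecs

def strip_utf16_bom(data):
    """Hypothetical helper: return (payload, codec name) for UTF-16 bytes."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):], "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):], "utf-16-be"
    # No BOM found: this sketch assumes big-endian (an arbitrary choice).
    return data, "utf-16-be"

payload, codec = strip_utf16_bom("あ".encode("utf-16"))
print(codec, payload.hex())
print(payload.decode(codec))  # あ

# The generic codec also consumes the BOM by itself when decoding.
print("あ".encode("utf-16").decode("utf-16"))  # あ
```

If you only need the bytes without a BOM, encoding directly with 'utf-16-be' or 'utf-16-le' avoids the stripping step altogether.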