Using JavaScript's atob to decode base64 doesn't properly decode UTF-8 strings

一尘不染

javascript

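I'm using the JavaScript window.atob() function to decode a base64-encoded string (specifically, the base64-encoded content returned by the GitHub API). The problem is that I get back ASCII-encoded characters (such as â¢ instead of ™). How can I properly handle the incoming base64-encoded stream so that it is decoded as UTF-8?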

2020-04-25

1 answer

一尘不染

The MDN article describes exactly this problem:

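The "Unicode Problem": Since DOMStrings are 16-bit-encoded strings, in most browsers calling window.btoa on a Unicode string will cause a Character Out Of Range exception if a character exceeds the range of an 8-bit byte (0x00 to 0xFF). There are two possible methods to solve this problem: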

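* the first one is to escape the whole string as UTF-8 (see encodeURIComponent) and then encode it;
* the second one is to convert the UTF-16 DOMString to a UTF-8 array of characters and then encode it.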
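The second approach in the list above can be sketched with the standard TextEncoder and TextDecoder classes. This is only an illustration, not code from the MDN article: the helper names utf8ToBase64 and base64ToUtf8 are hypothetical, and the sketch assumes an environment where TextEncoder, TextDecoder, btoa and atob exist as globals (modern browsers, recent Node.js).

```javascript
// Hypothetical helper names; not taken from the answer or from MDN.
// Assumes TextEncoder, TextDecoder, btoa and atob are available as globals.

// btoa only accepts code units in the range 0x00-0xFF, so this call throws.
let btoaThrew = false;
try {
    btoa('✓');
} catch (e) {
    btoaThrew = true;
}

function utf8ToBase64(str) {
    // UTF-16 string -> UTF-8 bytes
    const bytes = new TextEncoder().encode(str);
    // bytes -> "binary string" with one character per byte, which btoa accepts
    let binary = '';
    for (const byte of bytes) {
        binary += String.fromCharCode(byte);
    }
    return btoa(binary);
}

function base64ToUtf8(b64) {
    // base64 -> binary string -> bytes -> UTF-16 string
    const binary = atob(b64);
    const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
    return new TextDecoder().decode(bytes);
}

utf8ToBase64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
base64ToUtf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
```

The loop builds the binary string one byte at a time rather than spreading the whole array into String.fromCharCode, which avoids argument-count limits on very long inputs.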

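A note on previous solutions: the MDN article originally suggested using unescape and escape to solve the Character Out Of Range exception problem, but they have since been deprecated. Some other answers here have suggested working around this with decodeURIComponent and encodeURIComponent alone, which has proven to be unreliable and unpredictable. The most recent update to this answer uses modern JavaScript functions to improve speed and modernize the code.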

Encoding UTF8 ⇢ base64

function b64EncodeUnicode(str) {
    // first we use encodeURIComponent to get percent-encoded UTF-8,
    // then we convert the percent encodings into raw bytes which
    // can be fed into btoa.
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
        function toSolidBytes(match, p1) {
            return String.fromCharCode('0x' + p1);
    }));
}

b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64EncodeUnicode('\n'); // "Cg=="

Decoding base64 ⇢ UTF8

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
b64DecodeUnicode('Cg=='); // "\n"
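To see why plain atob yields the garbled output from the question, it may help to trace the decoding steps by hand for the ™ character. The sketch below is illustrative only; the variable names are made up, and the input '4oSi' is the base64 form of the three UTF-8 bytes of ™ (E2 84 A2).

```javascript
// '4oSi' is the base64 encoding of the UTF-8 bytes E2 84 A2, i.e. '™'.
const rawBinary = atob('4oSi');

// atob returns one character per byte, not one character per code point,
// so the result has length 3 and displays roughly as "â¢".
const byteCodes = Array.from(rawBinary, (c) => c.charCodeAt(0).toString(16));
// byteCodes: ['e2', '84', 'a2']

// Percent-encode each byte so decodeURIComponent can interpret them as UTF-8.
const percentEncoded = byteCodes
    .map((hex) => '%' + hex.padStart(2, '0'))
    .join('');
// percentEncoded: '%e2%84%a2'

const decoded = decodeURIComponent(percentEncoded);
// decoded: '™'
```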

The pre-2018 solution (fully functional and likely better supported in older browsers, but not up to date)

Here is the recommendation taken directly from MDN, with some additional TypeScript compatibility contributed by @MA-Maddin:

// Encoding UTF8 ⇢ base64

function b64EncodeUnicode(str) {
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode(parseInt(p1, 16))
    }))
}

b64EncodeUnicode('✓ à la mode') // "4pyTIMOgIGxhIG1vZGU="
b64EncodeUnicode('\n') // "Cg=="

// Decoding base64 ⇢ UTF8

function b64DecodeUnicode(str) {
    return decodeURIComponent(Array.prototype.map.call(atob(str), function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2)
    }).join(''))
}

b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU=') // "✓ à la mode"
b64DecodeUnicode('Cg==') // "\n"

The original solution (deprecated)

This used escape and unescape (which are now deprecated, though they still work in all modern browsers):

function utf8_to_b64( str ) {
    return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
    return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

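One last thing: I first ran into this problem when calling the GitHub API. To get this to work properly on (Mobile) Safari, I actually had to strip all whitespace from the base64 source before I could even decode it. Whether this is still relevant in 2017, I don't know: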

function b64_to_utf8( str ) {
    str = str.replace(/\s/g, '');
    return decodeURIComponent(escape(window.atob( str )));
}
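As a hedged sketch, the same whitespace stripping can be combined with the percent-encoding decoder shown earlier, so that it does not rely on the deprecated escape. The function name decodeWrappedBase64 is hypothetical, and the line-wrapped sample input is an assumption made for illustration, not an actual API response.

```javascript
// Hypothetical helper: strips whitespace, then decodes base64 as UTF-8
// without using the deprecated escape/unescape functions.
function decodeWrappedBase64(wrapped) {
    // Remove line breaks and any other whitespace from the base64 source.
    const compact = wrapped.replace(/\s/g, '');
    // base64 -> binary string (one character per byte)
    const binary = atob(compact);
    // binary string -> percent-encoded bytes -> UTF-8 decoded string
    const percentEncoded = Array.from(binary, (c) =>
        '%' + c.charCodeAt(0).toString(16).padStart(2, '0')
    ).join('');
    return decodeURIComponent(percentEncoded);
}

// Assumed sample: the base64 string from the examples above, split across lines.
decodeWrappedBase64('4pyTIMOg\nIGxhIG1v\nZGU=\n'); // "✓ à la mode"
```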

2020-04-25