聘我网


A drawback of Unicode


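I have been doing web development for a long time, and UTF-8 has always been my first choice. Only recently did I notice one of its drawbacks: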

Security may also be impacted by a characteristic of several character encodings, including UTF-8: the "same thing" (as far as a user can tell) can be represented by several distinct character sequences. For instance, an e with acute accent can be represented by the precomposed U+00E9 E ACUTE character or by the canonically equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though UTF-8 provides a single byte sequence for each character sequence, the existence of multiple character sequences for "the same thing" may have security consequences whenever string matching, indexing, -- source

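Put simply, because Unicode includes precomposed characters, the mapping between code point sequences and the text a user sees is not a simple bijection. Some strings (mainly those containing letters with diacritics) can be written as more than one sequence of code points.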

JS example:

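/* ḱṷṓn: precomposed characters first, then base letters plus combining marks */
alert('\u1E31\u1E77\u1E53\u006E');
alert('\u006B\u0301\u0075\u032D\u006F\u0304\u0301\u006E');

/* Åström: precomposed characters first, then base letters plus combining marks */
alert('\u00C5\u0073\u0074\u0072\u00F6\u006D');
alert('\u0041\u030A\u0073\u0074\u0072\u006F\u0308\u006D');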
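Both alerts in each pair look identical on screen, which hides the actual problem. A minimal sketch of why this matters for matching follows; the variable names are only illustrative and are not part of the original post.

```javascript
// Two spellings of "Åström": precomposed vs. base letters + combining marks.
const precomposed = '\u00C5\u0073\u0074\u0072\u00F6\u006D';
const decomposed = '\u0041\u030A\u0073\u0074\u0072\u006F\u0308\u006D';

// === compares UTF-16 code units, so the two spellings are not equal.
const sameCodeUnits = precomposed === decomposed;

// The lengths differ as well: 6 code units vs. 8 code units.
const lengths = [precomposed.length, decomposed.length];

console.log(sameCodeUnits, lengths);
```

Any lookup keyed on the raw string (usernames, file names, search indexes) would treat these as two different values.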

One solution is normalization, but I have never actually applied it in a real project.
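As a rough sketch of what normalization could look like in JavaScript, the standard `String.prototype.normalize` method (ES2015) can bring both spellings to the same form before comparing. The helper name `sameText` is hypothetical, and the choice of NFC over NFD here is an assumption; either works as long as both sides use the same form.

```javascript
// Hypothetical helper: compare two strings after normalizing both to NFC,
// so canonically equivalent sequences are treated as equal.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

// The pairs from the example above: precomposed vs. decomposed spellings.
const kwonEqual = sameText(
  '\u1E31\u1E77\u1E53\u006E',
  '\u006B\u0301\u0075\u032D\u006F\u0304\u0301\u006E'
);
const astromEqual = sameText(
  '\u00C5\u0073\u0074\u0072\u00F6\u006D',
  '\u0041\u030A\u0073\u0074\u0072\u006F\u0308\u006D'
);

console.log(kwonEqual, astromEqual);
```

In practice it is probably simplest to normalize once at the input boundary (for example when a form value is received or before a value is stored), so that later comparisons and index lookups all see the same form.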

 
