Unicode in JavaScript

Learn how to work with Unicode in JavaScript, what emojis are made of, which improvements ES6 brought, and some pitfalls of handling Unicode in JS.

Unicode encoding of source files

If not specified otherwise, the browser assumes the source code of any program to be written in the local charset, which varies by country and might cause unexpected issues. For this reason, it's important to set the charset of any JavaScript document.

How do you specify another encoding, in particular UTF-8, the most common file encoding on the web?

If the file contains a BOM character, that takes priority in determining the encoding. You can read many different opinions online: some say a BOM in UTF-8 is discouraged, and some editors won't even add it.

This is what the Unicode standard says:

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.

This is what the W3C says:

In HTML5 browsers are required to recognize the UTF-8 BOM and use it to detect the encoding of the page, and recent versions of major browsers handle the BOM as expected when used for UTF-8 encoded pages. - https://www.w3.org/International/questions/qa-byte-order-mark
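
To make the BOM concrete: in UTF-8 it is the three bytes EF BB BF at the very start of the file. A minimal sketch of a check follows; `hasUtf8Bom` is a hypothetical helper name for this example, not part of any standard API.

```javascript
// Hypothetical helper: reports whether a byte sequence starts with the UTF-8 BOM.
// Accepts anything indexable by number (Uint8Array, Buffer, plain array).
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
    bytes[0] === 0xef &&
    bytes[1] === 0xbb &&
    bytes[2] === 0xbf
}

const withBom = new Uint8Array([0xef, 0xbb, 0xbf, 0x61]) // BOM followed by "a"
const withoutBom = new Uint8Array([0x61])                // just "a"

console.log(hasUtf8Bom(withBom))    // true
console.log(hasUtf8Bom(withoutBom)) // false
```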

If the file is fetched using HTTP (or HTTPS), the Content-Type header can specify the encoding:

Content-Type: application/javascript; charset=utf-8

If this is not set, the fallback is to check the charset attribute of the script tag:

<script src="./app.js" charset="utf-8"></script>

If this is not set, the document charset meta tag is used:

  <meta charset="utf-8" />

The charset attribute in both cases is case insensitive (see the spec)

All this is defined in RFC 4329 “Scripting Media Types”.
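
The fallback order described above can be sketched as a small function. This is only an illustration of the precedence, not how any browser is implemented; `resolveScriptCharset` and the field names of its argument are made up for this example.

```javascript
// Hypothetical sketch of the precedence described in this section:
// BOM, then HTTP Content-Type, then <script charset>, then <meta charset>.
function resolveScriptCharset({ hasBom, contentType, scriptCharset, metaCharset }) {
  if (hasBom) return 'utf-8' // assumes the BOM that was found is a UTF-8 one

  // Pull "charset=..." out of a header like "application/javascript; charset=utf-8"
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '')
  if (match) return match[1].toLowerCase()

  if (scriptCharset) return scriptCharset.toLowerCase()
  if (metaCharset) return metaCharset.toLowerCase()
  return undefined // nothing declared: the browser falls back to its default
}

console.log(resolveScriptCharset({ contentType: 'application/javascript; charset=UTF-8' })) // 'utf-8'
console.log(resolveScriptCharset({ scriptCharset: 'utf-8', metaCharset: 'iso-8859-1' }))    // 'utf-8'
```

Lowercasing the result reflects the point above that the charset value is matched case insensitively.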

Public libraries should generally avoid using characters outside the ASCII set in their code, so that users who load them with an encoding different from the original one don't run into issues.

How JavaScript uses Unicode internally

While a JavaScript source file can have any kind of encoding, JavaScript will then convert it internally to UTF-16 before executing it.

JavaScript strings are all UTF-16 sequences, as the ECMAScript standard says:

When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.
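
As a quick illustration of what "code unit" means here, the following sketch lists the 16-bit values a string is made of. `codeUnits` is a hypothetical helper written for this example; `charCodeAt` is the standard method that returns the code unit at a given index.

```javascript
// Hypothetical helper: returns the UTF-16 code units of a string as 4-digit hex strings.
function codeUnits(str) {
  const units = []
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).padStart(4, '0'))
  }
  return units
}

console.log(codeUnits('a'))      // [ '0061' ]
console.log(codeUnits('\u00E9')) // [ '00e9' ]
```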

Using Unicode in a string

A unicode sequence can be added inside any string using the format \uXXXX:

const s1 = '\u00E9' //é

A character can also be built by combining two Unicode sequences, here a plain e followed by a combining acute accent:

const s2 = '\u0065\u0301' //é

Notice that while both generate an accented e, they are two different strings, and s2 is considered to be 2 characters long:

s1.length //1
s2.length //2

And when you try to select that character in a text editor, you need to step through it twice: the first time you press the arrow key, it selects only half of the element.

You can write a string combining a unicode character with a plain char, as internally it’s actually the same thing:

const s3 = 'e\u0301' //é
s3.length === 2 //true
s2 === s3 //true
s1 !== s3 //true


Normalization

Unicode normalization is the process of removing ambiguities in how a character can be represented, to aid in comparing strings, for example.

Like in the example above:

const s1 = '\u00E9' //é
const s3 = 'e\u0301' //é
s1 !== s3 //true

ES6/ES2015 introduced the normalize() method on the String prototype, so we can do:

s1.normalize() === s3.normalize() //true
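
normalize() takes an optional form argument. The default is 'NFC' (composed), and 'NFD' is its decomposed counterpart. A short sketch of what each form does to the two representations of é used above (the variable names are only for this example):

```javascript
const composed = '\u00E9'    // é as a single code point
const decomposed = 'e\u0301' // e followed by a combining acute accent

// NFC (the default) composes: both become the single code point U+00E9
console.log(decomposed.normalize('NFC') === composed) // true
console.log(decomposed.normalize('NFC').length)       // 1

// NFD decomposes: both become e + U+0301
console.log(composed.normalize('NFD') === decomposed) // true
console.log(composed.normalize('NFD').length)         // 2
```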


Emojis

Emojis are fun, and they are Unicode characters, and as such they are perfectly valid to be used in strings:

const s4 = '🐶'

Emojis live in the astral planes, outside the Basic Multilingual Plane (BMP). Since code points outside the BMP cannot be represented in 16 bits, JavaScript uses a combination of 2 code units to represent them.

The 🐶 symbol, which is U+1F436, is traditionally encoded as \uD83D\uDC36 (called a surrogate pair). There is a formula to calculate this, but it’s a rather advanced topic.
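
For the curious, the calculation is short: subtract 0x10000 from the code point, split the remaining 20 bits into two 10-bit halves, and add them to 0xD800 and 0xDC00. A sketch, where `toSurrogatePair` is a hypothetical helper name:

```javascript
// Hypothetical helper: converts an astral code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError('only code points outside the BMP have a surrogate pair')
  }
  const offset = codePoint - 0x10000
  const high = 0xd800 + (offset >> 10)   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff)  // bottom 10 bits
  return [high, low]
}

const [high, low] = toSurrogatePair(0x1f436)
console.log(high.toString(16), low.toString(16))     // d83d dc36
console.log(String.fromCharCode(high, low) === '🐶') // true
```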

Some emojis are also created by combining together other emojis. You can find those by looking at this list https://unicode.org/emoji/charts/full-emoji-list.html and notice the ones that have more than one item in the unicode symbol column.

👩‍❤️‍👩 is created combining 👩 (\uD83D\uDC69), ❤️‍ (\u200D\u2764\uFE0F\u200D) and another 👩 (\uD83D\uDC69) in a single string: \uD83D\uDC69\u200D\u2764\uFE0F\u200D\uD83D\uDC69

Neither length nor iterating by code point will count this emoji as 1 character.

Get the proper length of a string

If you try to perform

'👩‍❤️‍👩'.length

You’ll get 8 in return, as length counts UTF-16 code units, not the characters you see.

Also, iterating over it is kind of funny:

[Image: iterating an emoji]
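
A sketch of what iterating over this emoji yields. Index-based iteration produces eight separate code units, several of which are lone surrogates that cannot be printed on their own, so the example prints hex values instead. The variable names are only for this example.

```javascript
// 👩‍❤️‍👩 written with escapes so every code unit is visible
const couple = '\uD83D\uDC69\u200D\u2764\uFE0F\u200D\uD83D\uDC69'

// Index-based iteration walks UTF-16 code units
const units = []
for (let i = 0; i < couple.length; i++) {
  units.push(couple.charCodeAt(i).toString(16))
}
console.log(units)
// [ 'd83d', 'dc69', '200d', '2764', 'fe0f', '200d', 'd83d', 'dc69' ]

// for...of walks code points, so surrogate pairs stay together
const points = []
for (const ch of couple) {
  points.push(ch.codePointAt(0).toString(16))
}
console.log(points)
// [ '1f469', '200d', '2764', 'fe0f', '200d', '1f469' ]
```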

And curiously, when you paste this emoji into a password field, it’s counted as 8 characters, possibly making it a valid password in some systems.

How to get the “real” length of a string containing unicode characters?

One easy way in ES6+ is to use the spread operator:

;[...'🐶'].length //1

You can also use the Punycode library by Mathias Bynens:

require('punycode').ucs2.decode('🐶').length //1

(Punycode is also great to convert Unicode to ASCII)

Note that emojis that are built by combining other emojis will still give a bad count:

require('punycode').ucs2.decode('👩‍❤️‍👩').length //6
[...'👩‍❤️‍👩'].length //6

If the string has combining marks however, this still will not give the right count. Check this Glitch https://glitch.com/edit/#!/node-unicode-ignore-marks-in-length as an example.
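
One option that postdates ES6 is Intl.Segmenter, which splits a string into grapheme clusters (roughly, what a reader perceives as one character). A minimal sketch, assuming a runtime that ships it, such as recent browsers or Node.js 16 and later. `graphemeLength` is a hypothetical helper name, and results for newer emojis depend on the Unicode data bundled with the runtime.

```javascript
// Hypothetical helper: counts user-perceived characters (grapheme clusters).
function graphemeLength(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' })
  return [...segmenter.segment(str)].length
}

console.log(graphemeLength('e\u0301')) // 1
console.log(graphemeLength('\uD83D\uDC69\u200D\u2764\uFE0F\u200D\uD83D\uDC69')) // 1
```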

(you can generate your own weird text with marks here: https://lingojam.com/WeirdTextGenerator)

Length is not the only thing to pay attention to. Reversing a string is also error prone if not handled correctly.
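
A short sketch of how reversal goes wrong. The common split/reverse/join idiom works on code units and breaks surrogate pairs. Spreading the string first keeps code points intact, but combining marks still end up attached to the wrong letter.

```javascript
const text = 'a\uD83D\uDC36' // 'a🐶'

// Naive reversal works on code units and splits the surrogate pair
const naive = text.split('').reverse().join('')
console.log(naive === '\uDC36\uD83Da') // true: the surrogates are swapped, the emoji is broken

// Spreading iterates by code point, so the pair survives
const byCodePoint = [...text].reverse().join('')
console.log(byCodePoint === '\uD83D\uDC36a') // true: '🐶a'

// Combining marks still move to the wrong letter
const accented = [...'e\u0301x'].reverse().join('')
console.log(accented === 'x\u0301e') // true: the accent is now on the x
```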

ES6 Unicode code point escapes

ES6/ES2015 introduced a way to represent Unicode code points in the astral planes (any code point that needs more than 4 hexadecimal digits), by wrapping the code in curly braces:

'\u{XXXXX}'

The dog 🐶 symbol, which is U+1F436, can be represented as \u{1F436} instead of having to write out its two surrogate code units, like we showed before: \uD83D\uDC36.

But length calculation still does not work correctly, because internally it’s converted to the surrogate pair shown above.
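
A quick check of that equivalence: the two notations produce the same string, and the ES6 methods codePointAt and String.fromCodePoint work with the full code point.

```javascript
const dog = '\u{1F436}'

console.log(dog === '\uD83D\uDC36')                // true: same string, different notation
console.log(dog.length)                            // 2
console.log(dog.codePointAt(0).toString(16))       // '1f436'
console.log(String.fromCodePoint(0x1f436) === dog) // true
```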

Encoding ASCII chars

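The first 256 code points, which include the 128 ASCII characters, can be encoded using the special escaping character \x, which takes exactly 2 hexadecimal digits:

'\x61' // a
'\x2A' // *

This will only work from \x00 to \xFF. ASCII itself is \x00 to \x7F; the values from \x80 to \xFF map to the Latin-1 Supplement range of Unicode.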
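
A \x escape is just a shorter spelling of the \u escape for the same code point, which the following sketch checks. Values above \x7F, such as \xE9, fall outside ASCII proper.

```javascript
console.log('\x61' === 'a')       // true
console.log('\x61' === '\u0061')  // true
console.log('\xE9' === '\u00E9')  // true: é
console.log('\xFF'.charCodeAt(0)) // 255
```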
