Welcome toVigges Developer Community-Open, Learning,Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
666 views
in Technique[技术] by (71.8m points)

4 byte unicode character in Java

I am writing unit tests for my custom StringDatatype, and I need to write down 4 byte unicode character. "U" - not working (illegal escape character error) for example: U+1F701 (0xf0 0x9f 0x9c 0x81). How it can be written in a string?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).

Your 4 bytes are (wild guess) its UTF-8 encoding version (edit: I was right).

You need to do this:

final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);

When Java was created, Unicode did not define code points outside the BMP (ie, U+0000 to U+FFFF), which is the reason why a char is only 16 bits long (well, OK, this is only a guess, but I think I'm not far off the mark here); since then, well, it had to adapt... And code points outside the BMP need two chars (a leading surrogate and a trailing surrogate -- Java calls these a high and low surrogate respectively). There is no character literal in Java allowing to enter code points outside the BMP directly.

Given that a char is, in fact, a UTF-16 code unit and that there are string literals for these, you can input this "character" in a String as "uD83DuDF01" -- or directly as the symbol if your computing environment has support for it.

See also the CharsetDecoder and CharsetEncoder classes.

See also String.codePointCount(), and, since Java 8, String.codePoints() (inherited from CharSequence).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to Vigges Developer Community for programmer and developer-Open, Learning and Share
...