A Unicode code point is not 4 bytes; it is an integer (ranging, at the moment, from U+0000 to U+10FFFF).
Your 4 bytes are (wild guess) its UTF-8 encoding version (edit: I was right).
You need to do this:
final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);
When Java was created, Unicode did not define code points outside the BMP (ie, U+0000 to U+FFFF), which is the reason why a char
is only 16 bits long (well, OK, this is only a guess, but I think I'm not far off the mark here); since then, well, it had to adapt... And code points outside the BMP need two chars (a leading surrogate and a trailing surrogate -- Java calls these a high and low surrogate respectively). There is no character literal in Java allowing to enter code points outside the BMP directly.
Given that a char
is, in fact, a UTF-16 code unit and that there are string literals for these, you can input this "character" in a String as "uD83DuDF01"
-- or directly as the symbol if your computing environment has support for it.
See also the CharsetDecoder
and CharsetEncoder
classes.
See also String.codePointCount()
, and, since Java 8, String.codePoints()
(inherited from CharSequence
).
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…