Notes on Unicode

2025-08-04

UTF

  • Unicode Transformation Format (UTF) is a scheme for encoding Unicode code points as sequences of bytes.

UTF-8, UTF-16, and UTF-32

Unicode

  • UTF-8 uses 1 byte for characters in the old ASCII set (U+0000 to U+007F), 2 bytes for characters in several more alphabetic blocks (U+0080 to U+07FF), and 3 bytes for the rest of the BMP (U+0800 to U+FFFF). Supplementary characters (U+10000 to U+10FFFF) use 4 bytes.
  • UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes (a surrogate pair) for supplementary characters.
  • UTF-32 uses 4 bytes everywhere.
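
The byte counts above follow directly from how the bits of a code point are packed. As a minimal sketch (the function names utf8_encode and utf16_encode are made up for these notes, not library functions), the two variable-length encodings can be written as:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out.
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                        /* ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                       /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                      /* rest of the BMP: 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* supplementary: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: encode one code point as UTF-16 into out.
   Returns the number of 16-bit units written (1 or 2), or 0 if cp
   is a surrogate or lies above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp <= 0xFFFF) {                      /* BMP: one unit (2 bytes) */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                    /* surrogate pair (4 bytes) */
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

For U+1F600 (😀), utf8_encode writes the four bytes F0 9F 98 80 and utf16_encode writes the pair D83D DE00. UTF-32 needs no such function, since the code point is stored as a single 32-bit value.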

Run a C program with a UTF-8 encoded string

#include <stdio.h>

int main() {
    // UTF-8 encoded string "A 😀"
    unsigned char text[] = {0x41, 0x20, 0xF0, 0x9F, 0x98, 0x80, 0x00};

    // Printing the string
    printf("%s\n", text);  // Output: A 😀

    return 0;
}
  • Save the above code in a file named test.c.
  • Compile it with cl test.c on Windows (run in a Visual Studio developer command prompt; this produces test.exe) or gcc test.c -o test on Linux.
  • Run the compiled program with test.exe on Windows or ./test on Linux.
  • On Windows, if the program is run in cmd, make sure the console uses UTF-8 by running chcp 65001 before executing it. Afterwards, run chcp 437 to switch back to the default code page (437 is the default on US-English systems; use whatever chcp reported beforehand).
  • The output will be A 😀, demonstrating that the UTF-8 encoded string is correctly interpreted.
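
Instead of running chcp by hand, the program can switch the console code page itself. The following is only a sketch: print_utf8_line is a name invented for these notes, and the Windows branch assumes the Win32 console functions GetConsoleOutputCP and SetConsoleOutputCP from windows.h. On other systems it assumes the terminal already expects UTF-8.

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: print a NUL-terminated UTF-8 string and a newline.
   On Windows the console output code page is set to UTF-8 for the
   duration of the call and restored afterwards.
   Returns the value returned by printf. */
int print_utf8_line(const unsigned char *text) {
#ifdef _WIN32
    UINT previous = GetConsoleOutputCP();  /* remember the current code page */
    SetConsoleOutputCP(CP_UTF8);           /* CP_UTF8 is code page 65001 */
#endif
    int written = printf("%s\n", (const char *)text);
    fflush(stdout);                        /* flush before restoring the code page */
#ifdef _WIN32
    SetConsoleOutputCP(previous);
#endif
    return written;
}
```

Calling print_utf8_line(text) with the byte array from test.c should then print the same line without a prior chcp 65001, provided the console font has a glyph for the emoji.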

Reference

  • UnicodeTutorial
  • UniView