Friday, October 24, 2025

UTF-8 Validation in Java

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

  1. For a 1-byte character, the first bit is 0, followed by the 7 bits of its Unicode code point.
  2. For an n-byte character (n = 2, 3, or 4), the first n bits of the first byte are all ones and bit n+1 is 0; it is followed by n-1 bytes whose two most significant bits are 10.

This is how the UTF-8 encoding works:

  Char. number range   |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
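
The table can also be read as an encoding recipe: pick the row by range, then spread the code point's bits over the x positions. The following is a minimal sketch under the assumption that the input is already a valid code point (it performs no range or surrogate checks); the names Utf8TableDemo and encode are illustrative and not part of the original solution.

```java
import java.util.Arrays;

class Utf8TableDemo {

    // Encodes one code point by following the four rows of the table.
    // Assumption: cp is a valid Unicode code point (not checked here).
    static int[] encode(int cp) {
        if (cp <= 0x7F) {
            // 0xxxxxxx
            return new int[] { cp };
        }
        if (cp <= 0x7FF) {
            // 110xxxxx 10xxxxxx
            return new int[] { 0xC0 | (cp >> 6), 0x80 | (cp & 0x3F) };
        }
        if (cp <= 0xFFFF) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new int[] { 0xE0 | (cp >> 12),
                               0x80 | ((cp >> 6) & 0x3F),
                               0x80 | (cp & 0x3F) };
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new int[] { 0xF0 | (cp >> 18),
                           0x80 | ((cp >> 12) & 0x3F),
                           0x80 | ((cp >> 6) & 0x3F),
                           0x80 | (cp & 0x3F) };
    }

    public static void main(String[] args) {
        // U+0142 falls in the second row of the table
        System.out.println(Arrays.toString(encode(0x142))); // prints [197, 130]
    }
}
```
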

Given an array of integers representing the data, return whether it is a valid UTF-8 encoding.

The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data, so each integer represents exactly 1 byte of data.

Example:

data = [235, 140, 4], which represents the octet sequence: 11101011 10001100 00000100.

Return false.

The first 3 bits are all ones and the 4th bit is 0, which means the first byte starts a 3-byte character.

The next byte is a continuation byte that starts with 10, which is correct.

But the second continuation byte does not start with 10, so the sequence is invalid.
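
To reproduce the octet sequence quoted in this example, each integer can be printed as an 8-bit binary string. This is only a small sketch; BitsDemo and toBits are hypothetical helper names introduced here for illustration.

```java
class BitsDemo {

    // Formats the least significant 8 bits of value as a
    // zero-padded binary string.
    static String toBits(int value) {
        String bits = Integer.toBinaryString(value & 0xFF);
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        int[] data = { 235, 140, 4 };
        StringBuilder sb = new StringBuilder();
        for (int b : data) {
            sb.append(toBits(b)).append(' ');
        }
        // prints 11101011 10001100 00000100
        System.out.println(sb.toString().trim());
    }
}
```
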

----------------------------------------

data = [197, 130, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true.

It is a valid UTF-8 encoding of a 2-byte character followed by a 1-byte character.
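
As a cross-check on both examples, the JDK's own UTF-8 decoder can be asked to report malformed input instead of replacing it. This is a different technique from the bit-pattern checks discussed in this article, and it is stricter: the JDK decoder also rejects overlong forms and other sequences that pass a purely structural check. The class and method names below are illustrative assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {

    // Returns true if the low 8 bits of each int form a byte
    // sequence that the JDK's UTF-8 decoder accepts.
    static boolean decodesAsUtf8(int[] data) {
        byte[] bytes = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            bytes[i] = (byte) data[i]; // keeps the low 8 bits
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodesAsUtf8(new int[] { 235, 140, 4 })); // prints false
        System.out.println(decodesAsUtf8(new int[] { 197, 130, 1 })); // prints true
    }
}
```
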

Approach 1: The array is a valid UTF-8 encoding if every byte is of the right type for its position.

  • Start from index 0, determine each byte’s type, and check its validity.
  • There are five valid byte types: 0**, 10**, 110**, 1110**, and 11110**.
  • Number them 0, 1, 2, 3, and 4; the type number is the index of the first 0 bit from the left.
  • If a byte is type 0, continue with the next byte.
  • If a byte is type 2, 3, or 4, check that the following 1, 2, or 3 byte(s), respectively, are all of type 1; if any of them is missing or is not type 1, return false.
  • If a byte is type 1 (a continuation byte with no lead byte) or has no valid type, return false.
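
Since the type number is the index of the first 0 bit, it equals the number of leading 1 bits in the byte. One possible way to compute it without a mask table is sketched below; ByteTypeDemo and byteType are hypothetical names, and unlike the type numbering above this variant returns 5 or more (rather than a dedicated error value) for bytes that match no valid type.

```java
class ByteTypeDemo {

    // Counts the leading 1 bits in the least significant 8 bits of b.
    // 0 to 4 correspond to the five valid byte types; 5 or more
    // means the byte has no valid type.
    static int byteType(int b) {
        // Invert the byte so that leading 1s become leading 0s
        int inverted = ~b & 0xFF;
        // numberOfLeadingZeros counts over 32 bits, and the byte
        // sits in the lowest 8, so subtract the 24 bits above it
        return Integer.numberOfLeadingZeros(inverted) - 24;
    }

    public static void main(String[] args) {
        System.out.println(byteType(197)); // 11000101, prints 2
        System.out.println(byteType(130)); // 10000010, prints 1
        System.out.println(byteType(1));   // 00000001, prints 0
    }
}
```
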

Java




// Java program to check whether the data
// is a valid UTF-8 encoding
 
import java.io.*;
import java.util.*;
 
class Sol {
 
    private int[] masks = { 128, 64, 32, 16, 8 };
 
    public boolean validUtf8(int[] data)
    {
        int len = data.length;
 
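        // Classify each byte by its leading bits
        for (int i = 0; i < len; i++) {
            int curr = data[i];
 
            // getType returns the index of the first 0 bit from
            // the left (0 to 4), or -1 if the top five bits
            // are all 1
            int type = getType(curr);
 
            // Type 0 is a complete 1-byte character
            if (type == 0) {
                continue;
            }
 
            // Types 2, 3 and 4 start a character of that many
            // bytes; all of them must fit in the array
            else if (type > 1 && i + type <= len)
            {
                // The remaining type - 1 bytes must be
                // continuation bytes (type 1)
                while (type-- > 1)
                {
                    if (getType(data[++i]) != 1)
                    {
                        return false;
                    }
                }
            }
            // A stray continuation byte (type 1), an invalid
            // byte (-1) or a truncated character
            else {
                return false;
            }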
        }
        return true;
    }
 
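    // Returns the index of the first 0 bit among the five most
    // significant bits of the byte, or -1 if all five are 1
    public int getType(int num)
    {
        for (int i = 0; i < 5; i++) {
 
            // masks[i] selects bit i, counting from the
            // most significant bit of the byte
            if ((masks[i] & num) == 0) {
                return i;
            }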
        }
 
        return -1;
    }
}
 
class GFG {
    public static void main(String[] args)
    {
        Sol st = new Sol();
        int[] arr = { 197, 130, 1 };
 
        boolean res = st.validUtf8(arr);
        System.out.println(res);
    }
}


Output

true

Time Complexity: O(n)

Auxiliary Space: O(1)

Approach 2: Scan the array once while keeping a count of the continuation bytes that are still expected.

  1. Start with count = 0.
  2. For i ranging from 0 to the size of the data array minus 1:
    • Take the value from the data array and store it in x = data[i].
    • If count is 0, x must be the first byte of a character:
    • If x/32 = 110 (in binary), set count to 1. (x/32 is the same as x >> 5, since 2^5 = 32.)
    • Else if x/16 = 1110, set count to 2. (x/16 is the same as x >> 4, since 2^4 = 16.)
    • Else if x/8 = 11110, set count to 3. (x/8 is the same as x >> 3, since 2^3 = 8.)
    • Else if x/128 is not 0, return false. (x/128 is the same as x >> 7, since 2^7 = 128.)
    • Otherwise (count is not 0), x must be a continuation byte: if x/64 is not 10, return false; else decrease count by 1.
  3. After the loop, return true if count is 0 and false otherwise.
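
The prefix tests in step 2 compare the high bits of x, written in binary, against a fixed pattern. A quick way to see this is to print the shifted values for a few sample bytes. The sketch below assumes non-negative inputs (for which integer division by a power of two and a right shift agree); ShiftDemo and prefix are illustrative names.

```java
class ShiftDemo {

    // Binary string of the byte x after dropping its `shift` lowest bits.
    static String prefix(int x, int shift) {
        return Integer.toBinaryString((x & 0xFF) >> shift);
    }

    public static void main(String[] args) {
        System.out.println(prefix(197, 5)); // prints 110   (lead byte, count = 1)
        System.out.println(prefix(235, 4)); // prints 1110  (lead byte, count = 2)
        System.out.println(prefix(240, 3)); // prints 11110 (lead byte, count = 3)
        System.out.println(prefix(130, 6)); // prints 10    (continuation byte)
        System.out.println(197 / 32 == 197 >> 5); // prints true
    }
}
```
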

Java




// Java program to check whether the data
// is a valid UTF-8 encoding
 
import java.io.*;
import java.util.*;
 
class Sol {
 
    public boolean validUtf8(int[] data)
    {
        int count = 0;
        for (int i = 0; i < data.length; i++) {
 
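            // Only the least significant 8 bits carry data
            int x = data[i] & 0xFF;
 
            if (count == 0) {
                // Expecting the first byte of a character
                if ((x >> 5) == 0b110)
                    count = 1;
 
                else if ((x >> 4) == 0b1110)
                    count = 2;
 
                else if ((x >> 3) == 0b11110)
                    count = 3;
 
                // Anything else must be a 1-byte character
                // (0xxxxxxx)
                else if ((x >> 7) != 0)
                    return false;
            }
            else {
                // Expecting a continuation byte (10xxxxxx)
                if ((x >> 6) != 0b10)
                    return false;
                count--;
            }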
        }
        return (count == 0);
    }
}
 
class GFG {
    public static void main(String[] args)
    {
        Sol st = new Sol();
        int[] arr = { 197, 130, 1 };
 
        boolean res = st.validUtf8(arr);
        System.out.println(res);
    }
}


Output

true

Time Complexity: O(n)

Auxiliary Space: O(1)
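
In practice, data in Java usually arrives as a byte[] rather than an int[], and Java bytes are signed, so each one has to be masked with 0xFF before its high bits are compared. The following sketch adapts the counting approach to that input type; ByteArrayUtf8 and the variable pending are names chosen here for illustration, and like the code above it checks only the byte structure, not overlong forms or code point ranges.

```java
import java.nio.charset.StandardCharsets;

class ByteArrayUtf8 {

    // Same counting scheme as Approach 2, applied to a byte array.
    static boolean validUtf8(byte[] data) {
        int pending = 0; // continuation bytes still expected
        for (byte b : data) {
            int x = b & 0xFF; // convert the signed byte to 0..255
            if (pending == 0) {
                if ((x >> 5) == 0b110) pending = 1;
                else if ((x >> 4) == 0b1110) pending = 2;
                else if ((x >> 3) == 0b11110) pending = 3;
                else if ((x >> 7) != 0) return false;
            } else {
                if ((x >> 6) != 0b10) return false;
                pending--;
            }
        }
        return pending == 0;
    }

    public static void main(String[] args) {
        // U+0142 followed by U+0001 encodes to the bytes 197, 130, 1
        byte[] ok = "\u0142\u0001".getBytes(StandardCharsets.UTF_8);
        System.out.println(validUtf8(ok)); // prints true

        byte[] bad = { (byte) 235, (byte) 140, 4 };
        System.out.println(validUtf8(bad)); // prints false
    }
}
```
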
