Thursday, December 26, 2024

Python – Remove Non-English characters Strings from List

Given a list of strings, remove every string that contains non-English characters.

Input : test_list = ['Good| ????', '??Geeks???']
Output : []
Explanation : Both contain non-English characters.

Input : test_list = ["Gfg", "Best"]
Output : ["Gfg", "Best"]
Explanation : Both are valid English words.

Method #1 : Using regex + findall() + list comprehension

In this, we build a regex character class from the allowed Unicode ranges and use findall() to look for characters outside those ranges in each string. A string is kept only if findall() returns no match.

  • Initializes a list called “test_list” with some sample strings containing non-English characters.
  • Prints the original list using the print() function along with a message.
  • Next, uses the findall() function of the re module to check for the presence of disallowed characters in each string of “test_list”.
  • The regular expression “[^\u0000-\u05C0\u2100-\u214F]+” matches any run of characters outside the Unicode ranges U+0000-U+05C0 and U+2100-U+214F. The first range covers Basic Latin (ASCII), Latin-1 and the Latin Extended blocks, and also Greek, Cyrillic, Armenian and the start of the Hebrew block; the second is the Letterlike Symbols block. The filter is therefore much broader than English: it rejects scripts such as Arabic, Devanagari or CJK, but accepts most European text.
  • The list comprehension creates a new list called “res” that contains only those strings from the original list which do not contain any non-English characters.
  • Finally, prints the extracted list using the print() function along with a message.

Below is the implementation of the above approach:

Python3




# Python3 code to demonstrate working of
# Remove Non-English characters Strings from List
# Using regex + findall() + list comprehension
import re
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
 
# keep only the strings in which findall() finds no character outside the allowed ranges
res = [idx for idx in test_list if not re.findall(r"[^\u0000-\u05C0\u2100-\u214F]+", idx)]
 
# printing result
print("The extracted list : " + str(res))


Output

The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good| ????', 'for', '??Geeks???']

Time complexity: O(n*k), where n is the length of the input list and k is the average length of the strings in the list.
Auxiliary space: O(m), where m is the length of the output list.
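
To see which strings the pattern actually rejects, it helps to try it on input with explicit non-ASCII characters. The following is a minimal sketch rather than part of the listing above: the samples list and the names wide and ascii_only are illustrative assumptions, and the ASCII-only pattern is a stricter alternative swapped in for comparison.

```python
import re

# Hypothetical sample data: ASCII, ASCII + Chinese, and Cyrillic text
samples = ["Gfg", "Good| \u4f60\u597d", "\u041f\u0440\u0438\u0432\u0435\u0442"]

# Pattern from Method #1: flags characters outside the two allowed ranges
wide = re.compile(r"[^\u0000-\u05C0\u2100-\u214F]")

# Stricter alternative (an assumption, not the pattern above): flags anything non-ASCII
ascii_only = re.compile(r"[^\x00-\x7F]")

kept_wide = [s for s in samples if not wide.search(s)]
kept_ascii = [s for s in samples if not ascii_only.search(s)]

print(kept_wide)   # the Cyrillic string survives: U+0400-U+04FF lies inside the first range
print(kept_ascii)  # only 'Gfg' survives
```

If the goal is strictly English text, the narrower ASCII pattern is probably closer to the intent; the wider pattern suits cases where other European scripts should pass.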

Method #2 : Using regex + search() + filter() + lambda

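In this, we use search() to look for a run of English letters or whitespace in each string and keep the strings where one is found. filter() with a lambda applies that test to every element of the list. Note that search() succeeds as soon as any part of the string matches, so a string that mixes English letters with other characters is still kept.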

Python3




# Python3 code to demonstrate working of
# Remove Non-English characters Strings from List
# Using regex + search() + filter() + lambda
import re
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
 
# using search() to keep strings that contain at least one English letter or whitespace character
res = list(filter(lambda ele: re.search(r"[a-zA-Z\s]+", ele) is not None, test_list))
 
# printing result
print("The extracted list : " + str(res))


Output

The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good| ????', 'for', '??Geeks???']

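Time Complexity: O(n*k), where n is the number of strings in the list and k is the average string length, since search() may have to scan each string.
Auxiliary Space: O(n) for the filtered list.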
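
Because search() accepts a string as soon as one letter is found, it may be useful to compare it with re.fullmatch(), which requires the whole string to consist of the allowed characters. This is a hedged sketch of that alternative, not the method above; the samples list and the names loose and strict are assumptions made for illustration.

```python
import re

# Hypothetical sample data, including strings that mix English and Chinese text
samples = ["Gfg", "Best of all", "Good| \u4f60\u597d", "\u4f60\u597dGeeks"]

pattern = r"[a-zA-Z\s]+"

# search(): true if ANY part of the string is English letters or whitespace
loose = list(filter(lambda s: re.search(pattern, s) is not None, samples))

# fullmatch(): true only if the WHOLE string is English letters or whitespace
strict = list(filter(lambda s: re.fullmatch(pattern, s) is not None, samples))

print(loose)   # all four strings, since each contains at least one letter
print(strict)  # ['Gfg', 'Best of all']
```

The strict form also rejects punctuation such as "|", so the character class would need extending if punctuation should be allowed.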

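Method #3: Using for loop

Unlike the two regex methods, this method and the ones after it keep every string in the list and instead strip out each character that is not an English letter. Here, every character is checked against a string holding the 52 English letters, and only the characters found in it are kept.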

Python3




# Python3 code to demonstrate working of
# Remove Non-English characters Strings from List
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
loweralphabets="abcdefghijklmnopqrstuvwxyz"
upperalphabets="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
x=loweralphabets+upperalphabets
res=[]
for i in test_list:
    a=""
    for j in i:
        if j in x:
            a+=j
    res.append(a)
             
# printing result
print("The extracted list : " + str(res))


Output

The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good', 'for', 'Geeks']

Time complexity: O(n*m), where n is the length of the input list and m is the maximum length of a string in the list.
Auxiliary space: O(n*m), as we are creating a new list to store the filtered strings.
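
The same loop can be written more compactly with string.ascii_letters and str.join(). This is a minimal sketch under the assumption that only the 52 ASCII letters count as English; the helper name keep_english_letters and the sample data are hypothetical.

```python
import string

# The 52 ASCII letters, stored in a set for constant-time membership tests
ENGLISH_LETTERS = frozenset(string.ascii_letters)

def keep_english_letters(text):
    """Return text with every character that is not an ASCII letter removed."""
    return "".join(ch for ch in text if ch in ENGLISH_LETTERS)

# Hypothetical sample data with Chinese characters mixed in
samples = ["Gfg", "Good| \u4f60\u597d", "for", "\u4f60Geeks\u597d"]
cleaned = [keep_english_letters(s) for s in samples]

print(cleaned)  # ['Gfg', 'Good', 'for', 'Geeks']
```

Using join() avoids building the result with repeated string concatenation, which tends to matter for long strings.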

Method #4: Using the unicodedata library

Step-by-step approach:

  • Import the unicodedata library.
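  • Define a function is_english that takes a character as input and returns True if the character is a letter whose Unicode name starts with LATIN (or COMMON), otherwise False. This accepts accented Latin letters such as é as well as the 26 English letters.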
  • Define a function remove_non_english that takes a list of strings as input, and returns a new list with only the English alphabets from the original strings.
  • In the remove_non_english function, iterate through each string in the input list using a for loop.
  • For each string, convert it to a list of characters using the list function.
  • Use the filter function with the is_english function as the filter condition to keep only the English alphabets in the list.
  • Use the join function to convert the filtered list of characters back into a string.
  • Append the filtered string to the output list.
  • Return the output list.

Below is the implementation of the above approach:
 

Python3




import unicodedata
 
def is_english(c):
    return c.isalpha() and unicodedata.name(c, '').startswith(('LATIN', 'COMMON'))
 
def remove_non_english(lst):
    output = []
    for s in lst:
        filtered = filter(is_english, list(s))
        english_str = ''.join(filtered)
        output.append(english_str)
    return output
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
 
# printing result
print("The extracted list : " + str(remove_non_english(test_list)))


Output

The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good', 'for', 'Geeks']

Time complexity: O(nk) where n is the length of the input list and k is the length of the longest string in the input list. 
Auxiliary space: O(nk) since we are storing the filtered strings in the output list.
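
The name-based test treats every Latin-script letter as English, which is worth seeing on an accented word. The sketch below contrasts it with an ASCII-only test; the function names is_latin_letter and is_ascii_letter and the sample strings are assumptions made for illustration, and str.isascii() requires Python 3.7 or later.

```python
import unicodedata

def is_latin_letter(c):
    # Letter whose Unicode name starts with LATIN, e.g. 'LATIN SMALL LETTER E WITH ACUTE'
    return c.isalpha() and unicodedata.name(c, "").startswith("LATIN")

def is_ascii_letter(c):
    # Stricter test: only the 52 unaccented English letters
    return c.isascii() and c.isalpha()

word = "caf\u00e9"                        # "café"
mixed = "\u041f\u0440\u0438Geeks"         # Cyrillic letters followed by "Geeks"

latin_word = "".join(filter(is_latin_letter, word))
ascii_word = "".join(filter(is_ascii_letter, word))
latin_mixed = "".join(filter(is_latin_letter, mixed))

print(latin_word)   # the accented letter is kept
print(ascii_word)   # the accented letter is dropped
print(latin_mixed)  # the Cyrillic letters are dropped
```

Which test is appropriate depends on whether loanwords with diacritics should count as English for the task at hand.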

Method #5: Using the ord() function

Use the ord() function to determine whether a character is an English letter. English letters have ASCII values from 65 to 90 for uppercase and from 97 to 122 for lowercase.

Here’s the step-by-step approach:

  1. Define a function is_english(c) that takes a character as input and returns True if the character is an English alphabet and False otherwise. We can use the ord() function to get the ASCII value of the character and compare it with the ASCII values of English alphabets.
  2. Define a function remove_non_english(lst) that takes a list of strings as input and returns a list of strings with non-English characters removed. We can iterate through each string in the input list and iterate through each character in the string. If a character is English, we add it to a new string. If not, we skip it. We append the new string to an output list.
  3. Initialize a list test_list with some sample input strings.
  4. Call the remove_non_english() function with the test_list as input.
  5. Print the original and extracted lists.

Python3




def is_english(c):
    ascii_value = ord(c)
    return 65 <= ascii_value <= 90 or 97 <= ascii_value <= 122
 
def remove_non_english(lst):
    output = []
    for s in lst:
        english_str = ""
        for c in s:
            if is_english(c):
                english_str += c
        output.append(english_str)
    return output
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
 
# printing result
print("The extracted list : " + str(remove_non_english(test_list)))


Output

The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good', 'for', 'Geeks']

Time Complexity: O(n*m), where n is the number of strings in the input list and m is the length of the longest string in the list.
Auxiliary Space: O(n*m), where n is the number of strings in the input list and m is the length of the longest string in the list. 
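
The same ord() test can also be used to drop whole strings, which matches the problem statement at the top more closely than stripping characters. This is a hedged sketch; drop_non_english_strings and the sample data are hypothetical, and it treats any character other than an English letter (including spaces and punctuation) as disqualifying.

```python
def is_english(c):
    # ASCII codes 65-90 are 'A'-'Z', 97-122 are 'a'-'z'
    code = ord(c)
    return 65 <= code <= 90 or 97 <= code <= 122

def drop_non_english_strings(strings):
    # Keep a string only if it is non-empty and every character passes the test
    return [s for s in strings if s and all(is_english(c) for c in s)]

# Hypothetical sample data with Chinese characters mixed in
samples = ["Gfg", "Good| \u4f60\u597d", "for", "\u4f60Geeks"]
kept = drop_non_english_strings(samples)

print(kept)  # ['Gfg', 'for']
```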

Method #6: Using the translate() method

Step-by-step approach:

  • Initialize a translation table that will be used to remove unwanted characters from the strings. This is done using the str.maketrans() method with three arguments: the first two are empty strings, meaning no character is replaced by another, and the third is a string of all the characters (digits, punctuation and accented letters) that should be mapped to None and therefore deleted.
  • Initialize a list called result to store the modified strings.
  • Iterate over each string in the test_list using a for loop.
  • Apply the translation table to the current string using the translate() method and passing the translation table as an argument.
  • Append the modified string to the result list.
  • Print the resulting list using the print() function and passing the string representation of result.

Below is the implementation of the above approach:

Python3




# Python3 code to demonstrate working of
# Remove Non-English characters Strings from List
 
# initializing list
test_list = ['Gfg', 'Good| ????', "for", '??Geeks???']
 
# printing original list
print("The original list is : " + str(test_list))
 
# create a translation table to remove non-English characters
non_english = str.maketrans("", "", "0123456789!@#$%^&*()_+-=[]{}\\|;:'\",./<>?`~¡¢£¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ")
 
# initialize an empty list to store modified strings
result = []
 
# iterate over each string in the test_list
for string in test_list:
    # apply the translation table to remove non-English characters
    modified_string = string.translate(non_english)
    # append the modified string to the result list
    result.append(modified_string)
 
# print the resulting list
print("The extracted list : " + str(result))


Output

The original list is : ['Gfg', 'Good| ????', 'for', '??Geeks???']
The extracted list : ['Gfg', 'Good ', 'for', 'Geeks']

Time complexity: O(n*m), where n is the length of the test_list and m is the maximum length of a string in the list. 
Auxiliary space: O(n*m), where n is the length of the test_list and m is the maximum length of a string in the list. 
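
A smaller example may make the three-argument form of str.maketrans() easier to follow. The sketch below uses a deliberately short deletion string and hypothetical inputs; it also shows that translate() removes only the characters that were listed, so anything outside the list passes through unchanged.

```python
# Third argument: characters to delete (a short, illustrative list)
table = str.maketrans("", "", "|?!")

# Listed characters are removed; the space is not listed, so it stays
trimmed = "Good| ??".translate(table)

# Chinese characters are not in the deletion list, so they are kept
unchanged_tail = "Geeks\u4f60\u597d!".translate(table)

print(repr(trimmed))         # 'Good '
print(repr(unchanged_tail))  # only the '!' is removed
```

Because this approach is a deny list, it only works when every unwanted character can be enumerated in advance; an allow list such as the letter checks in the earlier methods is usually safer for arbitrary input.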
