When working with a large corpus of text, we sometimes need to find out which character is acting as the delimiter. Detecting it automatically is a useful utility when handling large amounts of data whose format is not known in advance. This article discusses one way to solve the problem, using the Python library detect_delimiter.
Installation
To install this module, type the command below in the terminal.
pip install detect_delimiter
The first step is to check the input text for the presence of each whitelist character. If any are found, their frequencies are counted and the most frequent one is returned, ignoring any characters in the blacklist if one is provided. If no whitelist character is present, the most frequent character that is not in the blacklist is returned as the delimiter. If no delimiter is found this way, default is returned if it was provided; otherwise None is returned.
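The procedure described above can be sketched in plain Python. This is only an illustration of the description, not the library's actual source: the function name sketch_detect and the constants are hypothetical, and the default blacklist (ASCII letters, digits and the full stop) is an assumption taken from the parameter notes in this article.

```python
from collections import Counter
import string

# Assumed defaults, mirroring the article's description (not the library source).
DEFAULT_WHITELIST = [",", ";", ":", "|", "\t"]
DEFAULT_BLACKLIST = set(string.ascii_letters + string.digits + ".")


def sketch_detect(text, default=None, whitelist=None, blacklist=None):
    """Hypothetical re-implementation of the described procedure."""
    whitelist = DEFAULT_WHITELIST if whitelist is None else whitelist
    blacklist = DEFAULT_BLACKLIST if blacklist is None else set(blacklist)

    # Count every character that is not blacklisted.
    counts = Counter(ch for ch in text if ch not in blacklist)

    # Step 1: prefer whitelisted characters that occur in the text.
    preferred = {ch: counts[ch] for ch in whitelist if counts[ch] > 0}
    if preferred:
        return max(preferred, key=preferred.get)

    # Step 2: otherwise take the most frequent non-blacklisted character.
    if counts:
        return counts.most_common(1)[0][0]

    # Step 3: nothing usable was found, so fall back to the default.
    return default


print(sketch_detect("a-b-c-d"))             # most frequent allowed character
print(sketch_detect("a.b.c"))               # only blacklisted characters
print(sketch_detect("a.b.c", default="@"))  # fallback to the default
```

Ties are resolved here by order of appearance, which is a choice made for this sketch; the library may behave differently.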
Syntax: detect(text:str, default=None, whitelist=[',', ';', ':', '|', '\t'], blacklist=None)
text : The input string to test for delimiter.
default : The default value to output in case no valid delimiter is found.
whitelist : The first set of characters to be checked as delimiter candidates; if any of them is found, it is treated as the delimiter. Useful when the set of possible delimiters is known in advance. Defaults to [',', ';', ':', '|', '\t'].
blacklist : Characters that must never be reported as delimiters. By default, all digits, letters and the full stop are excluded. Pass a custom list when other characters need to be skipped during the check.
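The code comments in Example 2 note that passing blacklist overrides the default one, so letters and digits are no longer excluded unless they are listed again. A minimal sketch of building an extended blacklist follows, assuming the default exclusions are exactly the ASCII letters, the digits and the full stop as the parameter description suggests. The variable names are hypothetical, and the detect() call is left as a comment because it needs the installed package.

```python
import string

# Assumed default exclusions: letters, digits and the full stop.
default_blacklist = list(string.ascii_letters + string.digits + ".")

# Keep the assumed defaults and additionally exclude '-' and '@'.
extended_blacklist = default_blacklist + ["-", "@"]

# With the package installed, the list would be passed like this:
# from detect_delimiter import detect
# detect("GeeksforLazyroar-is-best-for!Lazyroar", blacklist=extended_blacklist)
```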
Example 1: Working with detect() and default
This example demonstrates detecting the delimiter in a few strings, along with the use of default.
Python3
from detect_delimiter import detect

# simple example
print("The found delimiter [base example] : ")
print(detect("GeeksforLazyroar-is-best-for-Lazyroar"))

# simple example without default and no delimiter
# . is not considered as delim
print("The found delimiter [no default] : ")
print(detect("GeeksforLazyroar.is.best.for.Lazyroar"))

# simple example with default
# . is not considered as delim
# No delim is found, hence, default is printed
print("The found delimiter [with default] : ")
print(detect("GeeksforLazyroar.is.best.for.Lazyroar", default='@'))
Example 2: Using blacklist and whitelist parameters
Providing the whitelist parameter prioritizes particular delimiters even if they occur less frequently than a non-whitelisted character. The blacklist parameter makes detect() ignore the listed characters.
Python3
from detect_delimiter import detect
from string import ascii_letters

# simple example
# check for , as whitelist picked from default
# - [',', ';', ':', '|', '\t']
print("The found delimiter [default whitelist] : ")
print(detect("GeeksforLazyroar$is-best,for-Lazyroar"))

# simple example with whitelist
# ! prioritized
print("The found delimiter [provided whitelist] : ")
print(detect("GeeksforLazyroar-is-best-for!Lazyroar", whitelist=['@', "!"]))

# simple example with blacklist
# default blacklist overridden
print("The found delimiter [provided blacklist] : ")
print(detect("GeeksforLazyroar-is-best-for!Lazyroar", blacklist=['@', "-", 'e']))