Regular expressions (also called “regex”) are strings that describe a pattern in text. A regex can be matched against another string to see whether that string fits the pattern. In general, a regex consists of normal characters, character classes, wildcard characters, and quantifiers. We will talk specifically about character classes here.
At times we need to match any sequence of one or more characters, in any order, drawn from a set of characters. For example, to match whole words, we want to match any sequence of letters of the alphabet. Character classes come in handy for such use cases. A character class is a set of characters placed between the square brackets “[” and “]”, and it matches any single character from that set. For example, the class [abc] matches one character that is a, b, or c. A range of characters can also be specified using a hyphen. For example, the [a-z] class matches any one lowercase letter, so it can serve as the building block for matching whole words of lowercase letters.
Note that a character class has no relation to the class construct or class files in Java. Also, the word “match” means that the pattern exists somewhere in a string; it does not mean that the whole string matches the pattern. A regex pattern lets us use two types of character classes:
- Predefined character classes and
- Custom character classes
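The difference between "the pattern exists in the string" and "the whole string fits the pattern" can be sketched with the standard java.util.regex API. The class and method names below (FindVersusMatches, containsMatch, wholeStringMatches) are hypothetical helpers chosen for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Pattern;

final class FindVersusMatches {
    // True if the pattern occurs anywhere inside the input (Matcher.find).
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True only if the entire input fits the pattern (Matcher.matches).
    static boolean wholeStringMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("[abc]", "xyzb"));      // true: "b" is in the class
        System.out.println(wholeStringMatches("[abc]", "xyzb")); // false: the class covers one character only
        System.out.println(wholeStringMatches("[abc]", "b"));    // true
    }
}
```

The examples in this article use “match” in the first sense, which corresponds to find() here.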
Predefined Character Classes
Some frequently used character classes are predefined in Java; they are listed below. These classes are usually preceded by a backslash “\” and need not be placed inside the brackets “[” and “]”.
Predefined character class | Meaning of the predefined character class |
---|---|
. (dot) | The dot (.) matches any single character. One dot matches exactly one character, two dots match two characters, and so on. By default, the dot does not match line terminators; it does when the pattern is compiled with the Pattern.DOTALL flag. |
\d | This matches any digit character. This works the same as the character class [0-9]. |
\D | This matches any character except for digits. This works the same as the character class [^0-9]. |
\s | This matches any whitespace character. This includes a space ‘ ’, a tab ‘\t’, a new line ‘\n’, a vertical tab ‘\x0B’, a form feed ‘\f’, and a carriage return ‘\r’. This works the same as the character class [ \t\n\x0B\f\r]. |
\S | This matches any character except for the whitespace characters listed above. |
\w | This matches any word character: uppercase and lowercase letters, digit characters, and the underscore character ‘_’. This works the same as the class [a-zA-Z_0-9]. |
\W | This matches any character except for word characters. This works the same as the class [^a-zA-Z_0-9]. |
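As a minimal sketch of how the dot treats line terminators, and of how a shorthand class lines up with its bracket form, the snippet below uses Pattern.compile with and without the Pattern.DOTALL flag. The class name DotAndShorthandDemo and its helper methods are hypothetical, introduced only for this illustration.

```java
import java.util.regex.Pattern;

final class DotAndShorthandDemo {
    // Whole-string match of "a.b" against "a\nb" under the given Pattern flags.
    static boolean dotMatchesNewline(int flags) {
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    // True if the shorthand and the bracket class give the same answer for one character.
    static boolean agree(String shorthand, String bracketClass, String ch) {
        return Pattern.matches(shorthand, ch) == Pattern.matches(bracketClass, ch);
    }

    public static void main(String[] args) {
        System.out.println(dotMatchesNewline(0));              // false: default mode
        System.out.println(dotMatchesNewline(Pattern.DOTALL)); // true: dot also matches '\n'

        // Backslashes are doubled because "\\d" in a Java string literal is the regex \d.
        System.out.println(agree("\\d", "[0-9]", "7"));         // true
        System.out.println(agree("\\w", "[a-zA-Z_0-9]", "_"));  // true
        System.out.println(agree("\\W", "[^a-zA-Z_0-9]", "+")); // true
    }
}
```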
A few example regex patterns using predefined character classes:
Regex pattern using predefined character classes | Input String – Result | Input String – Result | Input String – Result |
---|---|---|---|
b.r | bar – Match | ab1r – Match | ba1r – Does not match |
“b.r” regex means there must be exactly one character between “b” and “r”. The pattern is found in “bar” and “ab1r”, but not in “ba1r”, because one dot matches only one character and here there is more than one character between “b” and “r”. | |||
\d\d-\d\d-\d\d\d\d | 01-01-2022 – Match | 12-31-2050 – Match | 2022-02-02 – Does not match |
“\d\d-\d\d-\d\d\d\d” regex is a naive regex for a date in the format “DD-MM-YYYY”, where all characters are digits. The regex is “naive” because it also matches dates in the format “MM-DD-YYYY”, and it does not reject days greater than 31 or months greater than 12. | |||
\d\d-\D\D\D-\d\d\d\d | 01-JAN-2022 – Match | 31-12-2050 – Does not match | 22-a1B-1234 – Does not match |
“\d\d-\D\D\D-\d\d\d\d” regex is another naive regex for a date in the format “DD-MMM-YYYY”, where the day and year characters are digits and the month characters are anything other than digits. | |||
...\s... | abc xyz – Match | abc_xyz – Does not match | abc <tab_space> xyz – Match |
“...\s...” regex means two groups of any 3 characters separated by any whitespace character. As “_” is not a whitespace character, “abc_xyz” does not match. | |||
...\S... | 123 456 – Does not match | 123+456 – Match | abc_xyz – Match |
“...\S...” regex means two groups of any 3 characters separated by any character other than a whitespace character. As “ ” (space) is a whitespace character, “123 456” does not match. | |||
\w\w\w\W\w\w\w | abc xyz – Match | LMN_opq – Does not match | 123+456 – Match |
“\w\w\w\W\w\w\w” regex means two groups of 3 word characters separated by any non-word character. As “_” is a word character, “LMN_opq” does not match. |
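Several rows of the table above can be checked with a short program. This is only a sketch: PredefinedClassExamples and its found helper are hypothetical names, and Matcher.find() is used because the table treats a “match” as the pattern occurring somewhere in the input.

```java
import java.util.regex.Pattern;

final class PredefinedClassExamples {
    // True if the regex occurs anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // In Java source, every regex backslash is written twice inside a string literal.
        String numericDate = "\\d\\d-\\d\\d-\\d\\d\\d\\d";
        String monthNameDate = "\\d\\d-\\D\\D\\D-\\d\\d\\d\\d";
        String twoWords = "\\w\\w\\w\\W\\w\\w\\w";

        System.out.println(found("b.r", "ab1r"));                // true
        System.out.println(found("b.r", "ba1r"));                // false
        System.out.println(found(numericDate, "12-31-2050"));    // true
        System.out.println(found(numericDate, "2022-02-02"));    // false
        System.out.println(found(monthNameDate, "01-JAN-2022")); // true
        System.out.println(found(monthNameDate, "22-a1B-1234")); // false: '1' is a digit
        System.out.println(found(twoWords, "123+456"));          // true
        System.out.println(found(twoWords, "LMN_opq"));          // false: '_' is a word character
    }
}
```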
Custom Character Classes
Java allows us to define character classes of our own using […]. A few examples of custom character classes are as follows:
Example of custom character class | Meaning of the custom character class |
---|---|
b[aeiou]t | This regex means pattern must start with “b” followed by any of the vowels “a”, “e”, “i”, “o”, “u” followed by “t”. Strings “bat”, “bet”, “bit”, “bot”, “but” would match this regex, but “bct”, “bkt”, etc. would not match. |
[bB][aAeEiIoOuU][tT] | Such a regex can be used to allow uppercase letters too in the previous regex. So the strings “bAT”, “BAT”, etc. would match the pattern. |
b[^aeiou]t | “^” at the beginning of a character class works as negation or complement, so this regex means any character other than a lowercase vowel is allowed between “b” and “t”. Strings “bct”, “bkt”, “b+t”, etc. would match the pattern. The ‘^’ has this special meaning only at the beginning of a character class; anywhere else in the class it acts like any other normal character. |
[a-z][0-3] | A range of letters or digits can be specified in a character class using the hyphen “-”. Strings “a1”, “z3”, etc. match the pattern. Strings “k7”, “n9”, etc. do not match. |
[a-zA-Z][0-9] | More than one range can be used in a class. Strings “A1”, “b2”, etc. match the pattern. |
[A-F[G-Z]] | Nesting character classes forms their union, so this class is the same as the [A-Z] class. |
[a-p&&[l-z]] | Intersection of ranges also works in character classes. This regex means characters “l”, “m”, “n”, “o”, “p” would match in a string. |
[a-z&&[^aeiou]] | Subtraction of ranges also works in character classes. This regex means vowels are subtracted from the range “a-z”. |
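The custom classes in the table can be tried out as follows. This is a sketch under the assumption that whole-string matching is what we want to test, so it uses the static Pattern.matches method; CustomClassExamples and matchesWhole are hypothetical names.

```java
import java.util.regex.Pattern;

final class CustomClassExamples {
    // True only if the entire input fits the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("b[aeiou]t", "bit"));           // true
        System.out.println(matchesWhole("b[aeiou]t", "bkt"));           // false
        System.out.println(matchesWhole("[bB][aAeEiIoOuU][tT]", "BAT")); // true
        System.out.println(matchesWhole("b[^aeiou]t", "b+t"));          // true: negated class
        System.out.println(matchesWhole("[a-z][0-3]", "k7"));           // false: 7 is outside 0-3
        System.out.println(matchesWhole("[A-F[G-Z]]", "Q"));            // true: union
        System.out.println(matchesWhole("[a-p&&[l-z]]", "m"));          // true: intersection
        System.out.println(matchesWhole("[a-p&&[l-z]]", "c"));          // false
        System.out.println(matchesWhole("[a-z&&[^aeiou]]", "k"));       // true: subtraction
        System.out.println(matchesWhole("[a-z&&[^aeiou]]", "e"));       // false
    }
}
```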
Regex patterns discussed so far require that each position in the input string match a specific character class. For example, the “[a-z]\s\d” pattern requires a lowercase letter at the first position, a whitespace character at the second position, and a digit at the third position. Such patterns are inflexible and restrictive, and they take more effort to maintain. To solve this issue, quantifiers can be placed after a character or character class. A quantifier specifies how many times the preceding element may repeat in the matched sequence.
Quantifier | Meaning of the quantifier |
---|---|
* | Zero or more times |
Placing an asterisk “*” after a character class means “allow any number of occurrences of that character class”. For example, “0*\d” regex matches any number of leading zeroes followed by a digit. | |
+ | One or more times |
“+” plus sign has the same effect as XX*, meaning a pattern followed by the same pattern with an asterisk. For example, “0+\d” regex matches one or more leading zeroes followed by a digit. | |
? | Zero or one time |
“?” question mark sign allows either zero or one occurrence. For example, “\w\w-?\d\d” regex matches two word characters followed by an optional hyphen and then two digit characters. | |
{m} | Exactly “m” times |
{m,} | At least “m” times |
{m,n} | At least “m” times and at most “n” times |
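The quantifiers above can be exercised with a small sketch like the one below. QuantifierExamples and matchesWhole are hypothetical names; the counted forms are written without spaces inside the braces, which is the form the Java Pattern documentation specifies.

```java
import java.util.regex.Pattern;

final class QuantifierExamples {
    // True only if the entire input fits the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("0*\\d", "0007"));           // true: zero or more zeroes, then a digit
        System.out.println(matchesWhole("0*\\d", "7"));              // true: zero leading zeroes is allowed
        System.out.println(matchesWhole("0+\\d", "7"));              // false: at least one zero is required
        System.out.println(matchesWhole("0+\\d", "07"));             // true
        System.out.println(matchesWhole("\\w\\w-?\\d\\d", "ab-12")); // true: hyphen present
        System.out.println(matchesWhole("\\w\\w-?\\d\\d", "ab12"));  // true: hyphen absent
        System.out.println(matchesWhole("\\w\\w-?\\d\\d", "ab--12")); // false: two hyphens
        System.out.println(matchesWhole("\\d{4}", "2022"));          // true: exactly 4 digits
        System.out.println(matchesWhole("\\d{4}", "202"));           // false
        System.out.println(matchesWhole("\\d{2,}", "12345"));        // true: at least 2 digits
        System.out.println(matchesWhole("\\d{2,}", "1"));            // false
        System.out.println(matchesWhole("\\d{2,4}", "123"));         // true: between 2 and 4 digits
        System.out.println(matchesWhole("\\d{2,4}", "12345"));       // false: 5 digits is too many
    }
}
```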