In this article, we will see how to extract all valid email addresses from a text using Python and regular expressions (regex).
- A regular expression (shortened as regex or regexp, and sometimes called a rational expression) is a sequence of characters that defines a search pattern. Such patterns are usually used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
- It is a technique developed in theoretical computer science and formal language theory.
- The re module provides full support for Perl-like regular expressions in Python. It offers a set of functions that allow us to search a string for a match.
- The re.findall() function, defined in the re module, takes a pattern and a string and returns a list of all the non-overlapping matches found in that string.
Syntax: re.findall( regex , string )
Parameters:
- regex: the regular expression, built from predefined symbols, that describes the pattern we are looking for.
- string: the original string in which we perform the search.
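As a quick illustration of the call described above, here is a minimal sketch. The sample strings and the \d+ pattern are made up for this illustration and are not part of the email example that follows.

```python
import re

# Hypothetical sample string, used only for illustration
sample = "Order 66 shipped 3 boxes on day 12"

# \d+ matches one or more consecutive digits
numbers = re.findall(r"\d+", sample)
print(numbers)  # ['66', '3', '12']

# When nothing matches, findall returns an empty list rather than None
no_hits = re.findall(r"\d+", "no digits here")
print(no_hits)  # []
```

Because the result is always a list, it can be looped over or checked for emptiness without a separate None test.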
After importing the re module, we call its findall() function to find all the substrings that match the regular expression passed as the first argument.
The regular expression can be divided into three parts. In the code, the three raw strings are written next to each other, and Python joins adjacent string literals into a single pattern.
1. r"[A-Za-z0-9._%+-]+"
This part matches a continuous sequence of uppercase letters (A-Z), lowercase letters (a-z), digits (0-9), and the special characters . _ % + and -. The hyphen is placed last inside the brackets so that it is read as a literal character; written as +-. it would instead define a range from + to . that also includes the comma. The trailing + is a quantifier meaning "one or more of the preceding characters".
2. r"@[A-Za-z0-9.-]+"
This part matches the @ sign followed by a continuous sequence of uppercase letters, lowercase letters, digits, and the special characters . and -. Again, the trailing + means "one or more".
3. r"\.[A-Za-z]{2,5}"
This part matches a literal dot followed by a sequence of uppercase or lowercase letters whose length is between 2 and 5, both inclusive.
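The three parts can be checked in isolation before they are used on real text. The sketch below joins them into one compiled pattern and also shows why the position of the hyphen inside a character class matters. The variable names and sample addresses are hypothetical, and the hyphen is written at the end of the first class here so that it is treated as a literal character.

```python
import re

# The three parts, joined into one pattern (names are illustrative only)
local_part = r"[A-Za-z0-9._%+-]+"   # characters before the "@"
domain = r"@[A-Za-z0-9.-]+"         # "@" followed by the domain name
tld = r"\.[A-Za-z]{2,5}"            # a dot and 2 to 5 letters
email_pattern = re.compile(local_part + domain + tld)

# Range pitfall: inside [...], "+-." is a range from "+" to ".",
# and the comma lies inside that range
print(re.findall(r"[+-.]", "a,b"))  # [',']
# With the hyphen last, it is a literal and the comma no longer matches
print(re.findall(r"[+.-]", "a,b"))  # []

# Hypothetical input: one well-formed address and one without a dot part
matches = email_pattern.findall("write to first.last@example.org or x@y")
print(matches)  # ['first.last@example.org']
```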
Example 1: Extract valid emails from a string
Python3
    # Import the regex module
    import re

    # Raw text
    text = (
        "Duis info@neveropen.com convallis. Parturient montes nascetur ridiculus mus "
        "neveropen@rocks.xyz mauris. Odio eu feugiat pre@rsos_tium.index nibh ipsum consequat love@gfg.in "
        "pretium aenean pharetra magna ac placerat. Vitae justo eget magna fermentum iaculis eu non."
    )

    # Find all valid emails using regex; adjacent raw strings are joined into one pattern
    reg = re.findall(
        r"[A-Za-z0-9._%+-]+"
        r"@[A-Za-z0-9.-]+"
        r"\.[A-Za-z]{2,5}",
        text,
    )

    # Print all the valid emails found
    print(reg)
Output:
['info@neveropen.com', 'neveropen@rocks.xyz', 'love@gfg.in']
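One behaviour of re.findall() is worth knowing if the pattern is later extended: when the pattern contains capturing groups, the function returns the groups instead of the whole match. The sketch below uses hypothetical sample addresses and adds two groups to split each address at the @ sign. It is one possible variation, not part of the original example.

```python
import re

# Hypothetical sample text
text = "Contact alice@example.com or bob.smith@mail.example.org today."

# Two capturing groups:
# group 1 = the part before "@", group 2 = the domain including the final dot part
pairs = re.findall(
    r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,5})",
    text,
)
print(pairs)
# [('alice', 'example.com'), ('bob.smith', 'mail.example.org')]
```

With more than one group, each list element is a tuple of the groups. To keep whole addresses while still grouping, non-capturing groups written as (?:...) can be used.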
Example 2: Extract valid emails from a text file
Using the open() function, we open the required file in "r" (read-only) mode. We then strip each line to remove leading and trailing whitespace and process it in the same way as in the first example, so one list is printed for each line of the file.
Python3
    # Import the regex module
    import re

    with open("sample.txt", "r") as file:
        for line in file:
            line = line.strip()
            # Find all valid emails in the current line
            reg = re.findall(
                r"[A-Za-z0-9._%+-]+"
                r"@[A-Za-z0-9.-]+"
                r"\.[A-Za-z]{2,5}",
                line,
            )
            # Print the valid emails found in this line
            print(reg)
Output:
['info@neveropen.com', 'neveropen@rocks.xyz', 'love@gfg.in']
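If a single combined list is preferred over one list per line, the matches can be accumulated while reading. The following is a minimal sketch under stated assumptions: the function name extract_emails_from_file, the temporary file, and its contents are all hypothetical and exist only to make the example self-contained.

```python
import re
import tempfile
from pathlib import Path

# Compile once so the same pattern object is reused for every line
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,5}")


def extract_emails_from_file(path):
    """Return every match in the file as one list, in order of appearance."""
    emails = []
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            # extend() adds each match individually instead of nesting lists
            emails.extend(EMAIL_PATTERN.findall(line))
    return emails


# Demo with a throwaway file (contents are made up for illustration)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text(
        "first line: a@example.com\n"
        "no address here\n"
        "b@example.org and c@example.net\n",
        encoding="utf-8",
    )
    found = extract_emails_from_file(sample_path)

print(found)  # ['a@example.com', 'b@example.org', 'c@example.net']
```

Reading line by line keeps memory use low for large files, while the accumulated list gives a single result to work with afterwards.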