Prerequisite: Pattern matching with Regular Expression
In this article, we accept a string and check whether it contains any URLs. If it does, we print the list of URLs found in the string. We will mainly use Python's regular expression module, re, to solve the problem, and then look at a few alternatives based on string methods and urllib.parse.
Examples:
Input : string = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'
Output : URLs : ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']

Input : string = 'I am a blogger at https://geeksforgeeks.org'
Output : URL : ['https://geeksforgeeks.org']
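Method #1: Using findall() from the re module
To find the URLs in a given string, we use the findall() function from Python's regular expression module, re. It returns all non-overlapping matches of the pattern in the string as a list. The string is scanned left to right, and matches are returned in the order found. Because the pattern used here contains several capturing groups, findall() returns one tuple per match, and the first element of each tuple is the full URL.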
Python3
    # Python code to find the URLs in an input string
    # using a regular expression
    import re


    def Find(string):
        # The pattern contains capturing groups, so findall() returns
        # one tuple per match; element 0 of each tuple is the full URL.
        regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
        url = re.findall(regex, string)
        return [x[0] for x in url]


    # Driver Code
    string = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'
    print("Urls:", Find(string))
Urls: ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']
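Time complexity: O(n) for typical input, where n is the length of the input string, as findall() scans the string from left to right to find the URLs. Python's re module uses a backtracking matcher, so a pattern with nested quantifiers like this one can take longer on unusual input.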
Auxiliary space: O(n), where n is the length of the input string, as the function Find stores all the URLs found in the input string into a list which is returned at the end.
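The x[0] indexing in the code above follows from how findall() treats capturing groups. The short sketch below illustrates that behaviour with a much simpler pattern; the sample text and the patterns are made up for illustration and are not a replacement for the full URL regex.

```python
import re

# Hypothetical sample text, used only to illustrate findall() return shapes.
text = "see http://a.example/x and https://b.example/y"

# No capturing groups: findall() returns the full matches as strings.
no_groups = re.findall(r"https?://\S+", text)

# One capturing group: findall() returns only what the group captured.
one_group = re.findall(r"(https?)://\S+", text)

# Two capturing groups: findall() returns one tuple per match.
two_groups = re.findall(r"((https?)://\S+)", text)

print(no_groups)   # ['http://a.example/x', 'https://b.example/y']
print(one_group)   # ['http', 'https']
print(two_groups)  # [('http://a.example/x', 'http'), ('https://b.example/y', 'https')]

# Element 0 of each tuple is the outermost group, i.e. the full URL.
full_urls = [m[0] for m in two_groups]
print(full_urls)   # ['http://a.example/x', 'https://b.example/y']
```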
Method #2: Using startswith() method
Python3
    # Python code to find the URLs in an input string
    def Find(string):
        x = string.split()
        res = []
        for i in x:
            # Keep only the words that begin with an http or https scheme
            if i.startswith("https:") or i.startswith("http:"):
                res.append(i)
        return res


    # Driver Code
    string = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'
    print("Urls:", Find(string))
Urls: ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']
Time Complexity: O(n), where n is the length of the input string. split() scans the whole string once, and each word is then compared against a fixed-length prefix.
Auxiliary Space: O(n), where n is the number of URLs found in the input string. The auxiliary space is used to store the found URLs in the res list.
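Splitting on whitespace keeps any punctuation that is attached to a URL, so a URL at the end of a sentence comes back with its trailing full stop. The sketch below is one possible way to handle that, under the assumption that trailing punctuation belongs to the sentence rather than to the URL; the function name find_urls_by_prefix and the sample sentence are hypothetical.

```python
def find_urls_by_prefix(text):
    """Return whitespace-separated words that start with http:// or https://."""
    urls = []
    for word in text.split():
        # startswith() accepts a tuple of prefixes, so one call covers both schemes.
        if word.startswith(("http://", "https://")):
            # Assumption: trailing punctuation is part of the sentence, not the URL.
            urls.append(word.rstrip(".,;:!?)"))
    return urls


sample = "Docs are at https://www.geeksforgeeks.org/, and also at http://example.com."
print(find_urls_by_prefix(sample))
# ['https://www.geeksforgeeks.org/', 'http://example.com']
```

This trade-off cuts both ways: a URL that legitimately ends in one of the stripped characters, such as a closing parenthesis, would be shortened.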
Method #3: Using find() method
Python3
    # Python code to find the URLs in an input string
    def Find(string):
        x = string.split()
        res = []
        for i in x:
            # find() returns 0 only when the word begins with the prefix
            if i.find("https:") == 0 or i.find("http:") == 0:
                res.append(i)
        return res


    # Driver Code
    string = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'
    print("Urls:", Find(string))
Urls: ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']
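Time Complexity: O(N), where N is the length of the input string.
Auxiliary Space: O(N), for the list of words produced by split() and the list of URLs found.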
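The comparison with 0 matters because find() returns the index of the first occurrence of the substring, or -1 when it is absent. The following sketch shows the three cases; the words used are illustrative only.

```python
word = "https://www.geeksforgeeks.org/"

at_start = word.find("https:")  # 0: the substring begins at the first character
inside = word.find("geeks")     # 12: found, but not at the start
missing = word.find("ftp:")     # -1: not found at all
print(at_start, inside, missing)  # 0 12 -1

# A scheme that appears later in the word is rejected by the == 0 test.
glued = "mirror:http://example.com"
print(glued.find("http:"))  # 7

# For a prefix check, find(...) == 0 and startswith(...) agree.
for w in (word, glued, "plain"):
    assert (w.find("http:") == 0) == w.startswith("http:")
```

Note that find() searches the whole word before reporting a position, whereas startswith() only inspects the beginning, so startswith() is usually the more direct choice for this test.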
Method #4: Using the urlparse() function from urllib.parse
APPROACH:
The urlparse() function from the urllib.parse module in Python splits a URL into its components, such as the scheme, netloc, path, query, and fragment. It does not check that the URL actually exists or is reachable.
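To make the components concrete, here is a small sketch that parses a single URL and prints each part. The URL is a made-up example chosen so that every component is non-empty.

```python
from urllib.parse import urlparse

# Hypothetical URL chosen to exercise every component.
parts = urlparse("https://www.geeksforgeeks.org/python/index.html?page=2#intro")

print(parts.scheme)    # https
print(parts.netloc)    # www.geeksforgeeks.org
print(parts.path)      # /python/index.html
print(parts.query)     # page=2
print(parts.fragment)  # intro
```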
ALGORITHM:
1. Import urlparse() from urllib.parse. split() is a built-in str method and needs no import.
2. Initialize the input string.
3. Split the input string into individual words using the split() method.
4. Initialize an empty list urls to store the extracted URLs.
5. Iterate through each word in the words list.
6. Use the urlparse() function to extract the scheme and netloc components of the word.
7. Check if both scheme and netloc are non-empty, which we treat as a sign that the word is a URL.
8. If the check passes, add the word to the urls list.
9. Print the final list of extracted URLs.
Python3
    from urllib.parse import urlparse

    string = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'

    # Split the string into words
    words = string.split()

    # Extract URLs from the words using urlparse()
    urls = []
    for word in words:
        parsed = urlparse(word)
        if parsed.scheme and parsed.netloc:
            urls.append(word)

    # Print the extracted URLs
    print("URLs:", urls)
URLs: ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']
The time complexity of this approach is O(n), where n is the length of the input string.
The space complexity of this approach is O(n + k), where n is the length of the input string and k is the number of extracted URLs.
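Checking only that scheme and netloc are non-empty is a loose test: any word shaped like scheme://host passes, whatever the scheme. The sketch below shows one possible tightening that accepts only http and https. The helper name looks_like_web_url and the candidate words are hypothetical, and the restriction to these two schemes is an assumption about what counts as a URL here.

```python
from urllib.parse import urlparse


def looks_like_web_url(word):
    """Return True if word parses with an http/https scheme and a non-empty host."""
    try:
        parsed = urlparse(word)
    except ValueError:
        # urlparse() can raise ValueError on malformed hosts,
        # for example an unbalanced '[' in the netloc.
        return False
    # urlparse() lowercases the scheme, so 'HTTPS' is accepted as well.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


candidates = [
    "https://www.geeksforgeeks.org/",  # accepted
    "foo://bar",                       # has scheme and netloc, but not a web scheme
    "Profile:",                        # parses with a scheme but no netloc
    "www.geeksforgeeks.org",           # no scheme, so everything lands in path
]
for c in candidates:
    print(c, looks_like_web_url(c))
```

Only the first candidate is reported as True; the plain scheme-and-netloc check would also have accepted foo://bar.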
Method #5: Using reduce()
Algorithm:
- Define a function merge_url_lists(url_list1, url_list2) that takes two lists of URLs and returns their concatenation.
- Define a function find_urls_in_string(string) that takes a string as input and returns a list of URLs found in the string.
- Define two input strings, string1 and string2, and put them into a list called string_list.
- Call map(find_urls_in_string, string_list) to produce, lazily, one list of URLs for each string.
- Call reduce(merge_url_lists, map(find_urls_in_string, string_list)) to concatenate all the lists of URLs found into a single list.
- Print the resulting list of URLs.
Python3
    from functools import reduce


    def merge_url_lists(url_list1, url_list2):
        return url_list1 + url_list2


    def find_urls_in_string(string):
        x = string.split()
        # find() returns 0 only when the word begins with the prefix
        return [i for i in x if i.find("https:") == 0 or i.find("http:") == 0]


    string1 = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'
    string2 = 'Some more text without URLs'
    string_list = [string1, string2]
    url_list = reduce(merge_url_lists, map(find_urls_in_string, string_list))
    print("Urls:", url_list)
    # This code is contributed by Rayudu.
Urls: ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']
Time complexity: O(n*m), where n is the number of strings in string_list and m is the maximum number of words in a string. This is because find_urls_in_string() splits each string into words and checks each word for a URL prefix.
Space complexity: O(n*m) in the worst case, for the words produced by split() and the URLs collected across all strings.
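One edge case worth knowing: reduce() raises a TypeError when the iterable is empty and no initial value is supplied. The sketch below passes an empty list as the initial value so that an empty list of strings yields an empty result. The wrapper name find_urls_in_strings is hypothetical, and startswith() with a tuple of prefixes is used here in place of the find() test.

```python
from functools import reduce


def merge_url_lists(url_list1, url_list2):
    return url_list1 + url_list2


def find_urls_in_string(string):
    return [w for w in string.split() if w.startswith(("http://", "https://"))]


def find_urls_in_strings(strings):
    # The third argument [] is the initial value, so an empty input yields [].
    return reduce(merge_url_lists, map(find_urls_in_string, strings), [])


print(find_urls_in_strings([]))
# []
print(find_urls_in_strings(["a https://geeksforgeeks.org b", "no links here"]))
# ['https://geeksforgeeks.org']
```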