
Python | Check for URL in a String

Prerequisite: Pattern matching with Regular Expression

In this article, we accept a string and check whether it contains any URLs. If one or more URLs are present, we report that they were found and print each URL contained in the string. We will use Python's regular expressions to solve the problem.

Examples:

Input : string = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'
Output : URLs :  ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']

Input : string = 'I am a blogger at https://geeksforgeeks.org'
Output : URL :  ['https://geeksforgeeks.org']

To find the URLs in a given string we use the findall() function from Python's re module. It returns all non-overlapping matches of the pattern in the string as a list of strings; the string is scanned left to right, and matches are returned in the order found.
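Before looking at the URL pattern itself, a minimal sketch of how findall() behaves on a simple pattern (hypothetical sample text, not from the article):

```python
import re

# findall() returns every non-overlapping match, scanning the
# string left to right, so matches come back in order of appearance.
text = "cat bat rat"
matches = re.findall(r"[cbr]at", text)
print(matches)  # ['cat', 'bat', 'rat']
```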

Python3




# Python code to find the URL from an input string
# Using the regular expression
import re
 
 
def Find(string):
 
    # findall() has been used
    # with valid conditions for urls in string
    regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    url = re.findall(regex, string)
    return [x[0] for x in url]
 
 
# Driver Code
string = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'
print("Urls: ", Find(string))


Output

Urls:  ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']

Time complexity: O(n) where n is the length of the input string, as the findall function of the re library iterates through the entire string once to find the URLs.
Auxiliary space: O(n), where n is the length of the input string, as the function Find stores all the URLs found in the input string into a list which is returned at the end.
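A note on the list comprehension at the end of Find(): when a pattern contains more than one capturing group, findall() returns a tuple per match rather than a plain string, so the code keeps only the first group (the whole URL). A minimal sketch with a hypothetical simpler pattern:

```python
import re

# With multiple groups, findall() yields one tuple per match;
# element 0 of each tuple is the outermost (whole-URL) group.
pattern = r"((\w+)://(\w+))"
tuples = re.findall(pattern, "visit https://example")
print(tuples)                   # [('https://example', 'https', 'example')]
print([t[0] for t in tuples])   # ['https://example']
```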

Method #2: Using startswith() method

Python3




# Python code to find the URL from an input string
 
def Find(string):
    words = string.split()
    res = []
    for word in words:
        if word.startswith("https:") or word.startswith("http:"):
            res.append(word)
    return res


# Driver Code
string = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'
print("Urls: ", Find(string))


Output

Urls:  ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']

Time Complexity: O(n), where n is the number of words in the input string.
Auxiliary Space: O(n), where n is the number of URLs found in the input string. The auxiliary space is used to store the found URLs in the res list.
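As a side note, startswith() also accepts a tuple of prefixes, which collapses the two checks in the loop into one. A sketch (the sample text here is hypothetical):

```python
# startswith() with a tuple tests all prefixes in one call.
def find_urls(text):
    return [w for w in text.split() if w.startswith(("https:", "http:"))]

sample = "read http://a.example and https://b.example today"
print(find_urls(sample))  # ['http://a.example', 'https://b.example']
```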

Method #3 : Using find() method

Python3




# Python code to find the URL from an input string
 
def Find(string):
    words = string.split()
    res = []
    for word in words:
        if word.find("https:") == 0 or word.find("http:") == 0:
            res.append(word)
    return res


# Driver Code
string = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'
print("Urls: ", Find(string))


Output

Urls:  ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']

Time Complexity : O(N)
Auxiliary Space : O(N)
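The reason this works: find() returns the index of the first occurrence of the substring (or -1 if absent), so comparing the result against 0 tests whether the word begins with the prefix. A minimal sketch:

```python
# find() returns the index of the first occurrence, or -1.
# == 0 therefore means "starts with the prefix".
print("https://example.com".find("https:"))  # 0  -> starts with prefix
print("see https://x".find("https:"))        # 4  -> occurs later, not a prefix
print("plain".find("https:"))                # -1 -> not present at all
```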

Method #4: Using the urlparse() function from urllib.parse

APPROACH:

The urlparse() function from the urllib.parse module can be used to split a URL into its components, such as the scheme, netloc, path, query, and fragment.
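A quick sketch of the components urlparse() exposes, using one of the article's URLs with a hypothetical path, query, and fragment attached:

```python
from urllib.parse import urlparse

# urlparse() splits a URL into named components.
parts = urlparse("https://www.geeksforgeeks.org/path?q=1#top")
print(parts.scheme)    # 'https'
print(parts.netloc)    # 'www.geeksforgeeks.org'
print(parts.path)      # '/path'
print(parts.query)     # 'q=1'
print(parts.fragment)  # 'top'
```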

ALGORITHM:

1. Import urlparse() from the urllib.parse module.
2. Initialize the input string.
3. Split the input string into individual words using the split() method.
4. Initialize an empty list urls to store the extracted URLs.
5. Iterate through each word in the words list.
6. Use the urlparse() function to extract the scheme and netloc components of the word.
7. Check whether both scheme and netloc are present, indicating a valid URL.
8. If the word is a valid URL, add it to the urls list.
9. Print the final list of extracted URLs.

Python3




from urllib.parse import urlparse
 
 
# Initialize the input string
string = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'

# Split the string into words
words = string.split()
 
# Extract URLs from the words using urlparse()
urls = []
for word in words:
    parsed = urlparse(word)
    if parsed.scheme and parsed.netloc:
        urls.append(word)
 
# Print the extracted URLs
print("URLs:", urls)


Output

URLs: ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']

The time complexity of this approach is O(n), where n is the length of the input string.

The space complexity is O(n + k), where n is the length of the input string and k is the number of extracted URLs.

Method #5: Using reduce()

Algorithm:

  1. Define a function merge_url_lists(url_list1, url_list2) that takes two lists of URLs and returns their concatenation.
  2. Define a function find_urls_in_string(string) that takes a string as input and returns a list of URLs found in the string.
  3. Define two input strings, string1 and string2, and put them into a list called string_list.
  4. Call map(find_urls_in_string, string_list) to generate a list of lists of URLs found in each string.
  5. Call reduce(merge_url_lists, map(find_urls_in_string, string_list)) to concatenate all the lists of URLs found into a single list.
  6. Print the resulting list of URLs.

Python3




from functools import reduce
 
def merge_url_lists(url_list1, url_list2):
    return url_list1 + url_list2
 
def find_urls_in_string(string):
    words = string.split()
    return [w for w in words if w.find("https:") == 0 or w.find("http:") == 0]

string1 = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of https://www.geeksforgeeks.org/'
string2 = 'Some more text without URLs'

string_list = [string1, string2]
 
url_list = reduce(merge_url_lists, map(find_urls_in_string, string_list))
 
print("Urls:", url_list)
# This code is contributed by Rayudu.


Output

Urls: ['https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles', 'https://www.geeksforgeeks.org/']

Time complexity: O(n*m), where n is the number of strings in string_list and m is the maximum number of words in a string, since find_urls_in_string() must split each string into words and check each word for a URL prefix.

Space complexity: O(n*m), because all the words of all the strings are stored in memory, along with every URL found.
