Given a string and an HTML tag, extract all the strings enclosed by that tag.
Input : '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.' , tag = "b"
Output : ['Gfg', 'Best', 'Reading CS']
Explanation : All strings between "b" tags are extracted.
Input : '<h1>Gfg</h1> is <h1>Best</h1> I love <h1>Reading CS</h1>' , tag = "h1"
Output : ['Gfg', 'Best', 'Reading CS']
Explanation : All strings between "h1" tags are extracted.
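Method 1: Using the re module
This task can be performed with the re module. The findall() function returns every string that matches a regex built from the tag name and the angle-bracket symbols; the non-greedy group (.*?) captures the text between each opening tag and the nearest closing tag.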
Python3
# importing re module
import re

# initializing string
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'

# printing original string
print("The original string is : " + str(test_str))

# initializing tag
tag = "b"

# regex to extract required strings
reg_str = "<" + tag + ">(.*?)</" + tag + ">"
res = re.findall(reg_str, test_str)

# printing result
print("The Strings extracted : " + str(res))
Output:
The original string is : <b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.
The Strings extracted : ['Gfg', 'Best', 'Reading CS']
Time Complexity: O(N), where N is the length of the input string.
Auxiliary Space: O(N)
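The regex approach can be wrapped in a reusable function. The following is a minimal sketch, not part of the original method: the function name extract_between_tags is hypothetical, and the use of re.escape and re.DOTALL is an added assumption to cope with tag names containing regex metacharacters and with content that spans several lines. Like the code above, it only matches tags written without attributes.

```python
import re


def extract_between_tags(text, tag):
    # Hypothetical helper: re.escape keeps any regex metacharacters
    # in the tag name from being interpreted as pattern syntax.
    safe_tag = re.escape(tag)
    pattern = "<" + safe_tag + ">(.*?)</" + safe_tag + ">"
    # re.DOTALL lets "." match newlines, so multi-line content is captured.
    return re.findall(pattern, text, flags=re.DOTALL)


print(extract_between_tags("<h1>Gfg</h1> is <h1>Best</h1>", "h1"))
print(extract_between_tags("<b>line one\nline two</b>", "b"))
```

Without re.DOTALL, the second call would return an empty list, because "." does not match a newline by default.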
Method 2: Using string manipulation
- Initialize a string named “test_str” with some HTML content.
- Initialize a string named “tag” with the name of the tag whose content needs to be extracted.
- Find the index of the first occurrence of the opening tag in the “test_str” using the “find()” method and store it in a variable named “start_idx”.
- Initialize an empty list named “res” to store the extracted strings.
- Use a while loop to extract the strings between the tags. The loop will run until there are no more occurrences of the opening tag.
- Inside the loop, find the index of the closing tag using the “find()” method and store it in a variable named “end_idx”. If the closing tag is not found, exit the loop.
- Extract the string between the tags using string slicing, and append it to the “res” list.
- Find the index of the next occurrence of the opening tag using the “find()” method and update the “start_idx” variable.
- Repeat the previous three steps until there are no more occurrences of the opening tag.
- Print the extracted strings using the “print()” function. The list is converted to a string using the “str()” function before being printed.
Python3
# initializing string
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'

# initializing tag
tag = "b"

# finding the index of the first occurrence of the opening tag
start_idx = test_str.find("<" + tag + ">")

# initializing an empty list to store the extracted strings
res = []

# extracting the strings between the tags
while start_idx != -1:
    end_idx = test_str.find("</" + tag + ">", start_idx)
    if end_idx == -1:
        break
    # skip past the opening tag: "<" + tag + ">" is len(tag) + 2 characters
    res.append(test_str[start_idx + len(tag) + 2:end_idx])
    start_idx = test_str.find("<" + tag + ">", end_idx)

# printing the extracted strings
print("The Strings extracted : " + str(res))
The Strings extracted : ['Gfg', 'Best', 'Reading CS']
Time complexity: O(n), where n is the length of the input string.
Auxiliary space: O(m) list entries, where m is the number of occurrences of the tag in the input string. The extracted substrings themselves hold at most n characters in total.
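The same loop can be packaged as a function so that it works for any string and tag. This is a sketch under stated assumptions: the name extract_with_find and the variables open_tag and close_tag are hypothetical, introduced only for illustration. It returns whatever was collected so far when a closing tag is missing.

```python
def extract_with_find(text, tag):
    # Hypothetical helper wrapping the find()-based loop.
    open_tag = "<" + tag + ">"
    close_tag = "</" + tag + ">"
    res = []
    start = text.find(open_tag)
    while start != -1:
        content_start = start + len(open_tag)
        end = text.find(close_tag, content_start)
        if end == -1:
            # Unclosed tag: stop and keep what has been collected.
            break
        res.append(text[content_start:end])
        # Continue the search just after the closing tag.
        start = text.find(open_tag, end + len(close_tag))
    return res


print(extract_with_find("<b>Gfg</b> is <b>Best</b>.", "b"))
print(extract_with_find("<b>never closed", "b"))
```

Computing len(open_tag) once avoids repeating the len(tag) + 2 arithmetic at each slice.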
Method 3: Using recursion
Algorithm:
- Find the index of the first occurrence of the opening tag.
- If no opening tag is found, return an empty list.
- Extract the string between the opening and closing tags using the start index of the opening tag and the end index of the closing tag.
- Recursively call the function with the remaining string after the current tag.
- Return the list of extracted strings.
Python3
def extract_strings_recursive(test_str, tag):
    # finding the index of the first occurrence of the opening tag
    start_idx = test_str.find("<" + tag + ">")

    # base case: no opening tag left
    if start_idx == -1:
        return []

    # finding the matching closing tag; stop if it is missing
    end_idx = test_str.find("</" + tag + ">", start_idx)
    if end_idx == -1:
        return []

    # extracting the string between the opening and closing tags
    res = [test_str[start_idx + len(tag) + 2:end_idx]]

    # recursive call to extract strings after the current tag
    res += extract_strings_recursive(test_str[end_idx + len(tag) + 3:], tag)
    return res


# example usage
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
tag = "b"

# printing original string
print("The original string is : " + str(test_str))

res = extract_strings_recursive(test_str, tag)
print("The Strings extracted : " + str(res))
# This code is contributed by Jyothi Pinjala.
The original string is : <b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.
The Strings extracted : ['Gfg', 'Best', 'Reading CS']
Time Complexity:
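The time complexity of this algorithm is O(n) when the string contains only a few tags, where n is the length of the input string, because each find() call only scans forward. Each recursive call, however, slices the remaining string, and slicing copies it, so with m tag pairs the worst case is O(n · m).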
Auxiliary Space:
The recursion depth equals the number of tag pairs m in the input string of length n. Each pair needs at least 2 · len(tag) + 5 characters, so m is always much smaller than n. Every pending call keeps its own slice of the remaining string alive, so the auxiliary space is O(n) when there are few tags and O(n · m) in the worst case. In CPython the default recursion limit is 1000, so inputs with roughly a thousand or more tag pairs raise a RecursionError.
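One way to avoid the repeated slicing is to pass a search position instead of a shortened string. The following is a sketch of that variation, not the method shown above: the name extract_recursive_by_index and the parameters pos and found are hypothetical. The recursion depth is still one level per tag pair, so Python's recursion limit still applies.

```python
def extract_recursive_by_index(text, tag, pos=0, found=None):
    # Hypothetical variant: recurse on an index into the same string
    # rather than on a copied slice of the remainder.
    if found is None:
        found = []
    open_tag = "<" + tag + ">"
    close_tag = "</" + tag + ">"

    start = text.find(open_tag, pos)
    if start == -1:
        return found  # base case: no further opening tag

    content_start = start + len(open_tag)
    end = text.find(close_tag, content_start)
    if end == -1:
        return found  # unclosed tag: stop here

    found.append(text[content_start:end])
    # Continue after the closing tag, sharing the same result list.
    return extract_recursive_by_index(text, tag, end + len(close_tag), found)


print(extract_recursive_by_index("<b>Gfg</b> is <b>Best</b>.", "b"))
```

Only the extracted substrings are copied here, so the work done on the input string stays linear in its length.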