Friday, October 24, 2025

Rule-Based Data Extraction in HTML using Python textminer Module

When working with HTML, it is often necessary to extract data from plain HTML tags in an ordered way and store it in Python data containers such as lists, dictionaries, and integers. This article covers textminer, a library that does this using a rule-based approach.

Features of Python textminer:

  • Extracts data from HTML as lists, dictionaries, and text values.
  • Uses a rule-based system, with rules written in YAML format.
  • Supports extraction directly from a URL, in the manner of web scraping.

Installation: 

Use the below command to install Python textminer:

pip install textminer

Function Descriptions:

The following functions are used to extract data from HTML:

Syntax:

extract(html, rule) 

Parameters:

  • html: The HTML to extract data from.
  • rule: Rule in YAML format to apply on HTML to extract data.

 

Syntax:

extract_from_url(url, rule)  

Parameters:

  • url: The URL of the HTML page from which data is to be extracted.
  • rule: Rule in YAML format to apply on the fetched HTML to extract data.

 

Example 1: Extracting data from HTML

This basic rule, written in YAML format, extracts the data found between a prefix and a suffix.

Python3




import textminer
  
# input html
inp_html = '<html><body><div>GFG is best for Geeks</div></body></html>'
  
# yaml rule string
rule = '''
value:
  prefix: <div>
  suffix: </div>
'''
  
# using extract() to get required data
res = textminer.extract(inp_html, rule)
  
print("The data extracted between divs : ")
print(res)


Output : 

Extracted data between divs
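
To make the prefix/suffix idea concrete, here is a minimal plain-Python sketch of what such a rule expresses. It is not textminer's implementation; the helper name extract_between is hypothetical and only illustrates "take the text after the first prefix and before the next suffix".

```python
def extract_between(text, prefix, suffix):
    """Return the text between the first `prefix` and the next `suffix`.

    Hypothetical helper for illustration only; returns None if either
    marker is missing.
    """
    start = text.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = text.find(suffix, start)
    if end == -1:
        return None
    return text[start:end]


inp_html = '<html><body><div>GFG is best for Geeks</div></body></html>'

# Same prefix and suffix as the YAML rule in Example 1
print(extract_between(inp_html, '<div>', '</div>'))
```

Run on the input from Example 1, this sketch prints the text enclosed by the div tags, GFG is best for Geeks.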

Example 2: Extracting a list from HTML

A Python list can be extracted from an HTML list, which is marked up with list tags, by using <li> and </li> as the prefix and suffix of the rule. Additionally, the “list” keyword must be used in the rule to achieve this.

Python3




import textminer
  
# input html
inp_html = """<html>
<body>
<ul>
    <li>Gfg</li>
    <li>is</li>
    <li>best</li>
</ul>
</body>
</html>"""
  
# yaml rule string
# extracting list using <li>
# using "list" keyword
rule = '''
list:
  prefix: <li>
  suffix: </li>
'''
  
# using extract() to get required data
res = textminer.extract(inp_html, rule)
  
print("The data extracted between list tags : ")
print(res)


Output : 

Extracted list
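
Conceptually, a "list" rule repeats the prefix/suffix match across the whole document and collects every hit. The following plain-Python sketch shows that idea under the assumption that matches are taken left to right and do not overlap; extract_all_between is a hypothetical helper, not part of textminer.

```python
def extract_all_between(text, prefix, suffix):
    """Collect every substring enclosed by `prefix` ... `suffix`, in order.

    Hypothetical helper for illustration only.
    """
    items = []
    pos = 0
    while True:
        start = text.find(prefix, pos)
        if start == -1:
            break
        start += len(prefix)
        end = text.find(suffix, start)
        if end == -1:
            break
        items.append(text[start:end])
        # Continue searching after the suffix that was just consumed
        pos = end + len(suffix)
    return items


inp_html = """<html>
<body>
<ul>
    <li>Gfg</li>
    <li>is</li>
    <li>best</li>
</ul>
</body>
</html>"""

print(extract_all_between(inp_html, '<li>', '</li>'))
```

For the HTML above, the sketch yields the three list items as ['Gfg', 'is', 'best'].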

Example 3: Extracting a dictionary from HTML using defined data types

Similar to the above example, a dictionary can be extracted using the “dict” keyword. Each entry names the “key” that the extracted data is mapped to, and the value is extracted by defining prefix and suffix tags with a specific id. The data type can be specified using the “type” keyword.

Python3




import textminer
  
# input html
inp_html = """<html>
<body>
<div id="Gfg">Best</div>
<div id="4">Geeks</div>
</body>
</html>"""
  
# yaml rule string
# extracting a dictionary using the "dict" keyword
# "type: int" requests an integer data type for that entry
rule = '''
dict:
- key: gfg
  prefix: <div id="Gfg">
  suffix: </div>
- key: 4
  prefix: <div id="4">
  suffix: </div>
  type: int
'''
  
# using extract() to get required data
res = textminer.extract(inp_html, rule)
  
print("The data extracted between dictionary tags : ")
print(res)


Output : 

Extracted Dictionary
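
The dictionary rule can be pictured as a list of entries, each with a key, a prefix, a suffix, and an optional type. The plain-Python sketch below is one possible reading of that structure, not textminer's implementation. It assumes the type conversion is applied to the extracted value, which the article does not confirm; the helper extract_dict, the converters table, and the sample HTML are all hypothetical.

```python
def extract_dict(text, entries):
    """Build a dict from rule entries of the form
    {"key": ..., "prefix": ..., "suffix": ..., "type": ...}.

    Hypothetical helper for illustration only. Assumption: "type" converts
    the extracted value; entries whose markers are not found are skipped.
    """
    converters = {"str": str, "int": int, "float": float}
    result = {}
    for entry in entries:
        start = text.find(entry["prefix"])
        if start == -1:
            continue
        start += len(entry["prefix"])
        end = text.find(entry["suffix"], start)
        if end == -1:
            continue
        convert = converters[entry.get("type", "str")]
        result[entry["key"]] = convert(text[start:end])
    return result


# Sample input made up for this sketch (the second value is numeric)
page = '<div id="name">Gfg</div><div id="rank">4</div>'

entries = [
    {"key": "name", "prefix": '<div id="name">', "suffix": "</div>"},
    {"key": "rank", "prefix": '<div id="rank">', "suffix": "</div>", "type": "int"},
]

print(extract_dict(page, entries))
```

With this sample input the sketch returns {'name': 'Gfg', 'rank': 4}, where the second value is an integer rather than a string.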

Example 4: Extracting data from a URL

Apart from passing HTML as a string, the HTML can also be fetched from a URL by using extract_from_url().

Python3




import textminer
  
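# required url
# placeholder: the original URL is missing from the source; substitute the page to scrape
target_url = "https://example.com"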
# extracting title from url
rule = '''
value:
  prefix: <title>
  suffix: </title>
'''
  
# using extract_from_url() to get required data
res = textminer.extract_from_url(target_url, rule)
  
print("The data extracted between title tags from url : ")
print(res)


Output :

Extraction from URL.
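
Extraction from a URL amounts to two steps: download the page, then apply the rule to the downloaded HTML. The sketch below shows those steps with only the Python standard library. It is not how textminer is implemented; fetch_html and title_from_url are hypothetical helpers, and the demo uses a data: URL (which urllib can open without network access) instead of a real website.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and decode it to text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the response declares no charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def title_from_url(url):
    """Fetch `url` and return the text between <title> and </title>, or None."""
    html = fetch_html(url)
    start = html.find("<title>")
    if start == -1:
        return None
    start += len("<title>")
    end = html.find("</title>", start)
    if end == -1:
        return None
    return html[start:end]


# Offline demo: the page content is embedded in a data: URL
demo_url = "data:text/html," + quote(
    "<html><head><title>Demo page</title></head><body></body></html>"
)
print(title_from_url(demo_url))
```

Replacing demo_url with an http or https address makes the same code fetch a live page, subject to network availability and the site's terms of use.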

Dominic