Java Program to Extract Content from an HTML Document

7 August 2024

HTML is the core of the web: every page you see on the internet is HTML, whether it is generated dynamically by JavaScript, JSP, PHP, ASP, or any other web technology. Your browser parses the HTML and renders it for you. Sometimes, however, a program needs to parse an HTML document itself to find particular elements, tags, or attributes, or to check whether a particular element exists. In Java, we can extract the content of an HTML document so that it can then be parsed.

Approaches:

Using FileReader
Using URL.openStream()

Approach 1: The FileReader class provides a way to read any text file, irrespective of its extension. The lines of the HTML file are appended to a StringBuilder as follows:

Use a FileReader, wrapped in a BufferedReader, to read the file from the source folder.
Append each line to the StringBuilder.
When no content is left in the HTML document, close the open file with br.close().
Print the resulting String.

Implementation:

Java

// Java Program to Extract Content from an HTML document

// Importing input/output classes
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class GFG {

    // Main driver method
    public static void main(String[] args)
    {

        // StringBuilder that accumulates the HTML text
        StringBuilder html = new StringBuilder();

        // Reading the html file from a local directory.
        // try-with-resources calls br.close() automatically,
        // even if an exception is thrown while reading
        try (BufferedReader br = new BufferedReader(new FileReader(
                 "C:\\Users\\rohit\\OneDrive\\Desktop\\article.html"))) {

            String val;

            // readLine() returns null once the end of the
            // file is reached, which ends the loop
            while ((val = br.readLine()) != null) {

                // readLine() strips the line terminator, so
                // add it back to keep the lines separate
                html.append(val).append(System.lineSeparator());
            }

            // At last, converting the content into a String
            String result = html.toString();
            System.out.println(result);
        }

        // Catch block to handle a missing file or a read error
        catch (IOException ex) {
            System.out.println(ex.getMessage());
        }
    }
}

Output:

The program prints the content of article.html to the console.

Approach 2: Using URL.openStream()

Calling url.openStream() opens a connection to the resource that the URL points to. For an http or https URL, this starts a new TCP connection to the server; for a file: URL, as in the implementation below, the file is read directly and no network connection is made.
For an HTTP URL, an HTTP GET request is then sent over the connection, and the server sends back an HTTP response containing the requested content.
The content arrives as bytes. openStream() returns an InputStream over those bytes, and an InputStreamReader decodes them into characters for the program.

BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));

First, we open the URL using openStream() to fetch the content. For an HTTP URL, the content is available when the request succeeds (the server answers with status 200); it is delivered as a stream of bytes.
Then the bytes are decoded into characters using an InputStreamReader, which is wrapped in a BufferedReader so the content can be read line by line.
Finally, a loop prints each line, since the goal here is to show the content on the console.

while ((val = br.readLine()) != null) {   // condition: another line is available
    System.out.println(val);              // executed while the condition is true
}
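The flow described above (open the URL, check that the request succeeded, decode the bytes, read line by line) can be sketched as a small helper. This is only an illustration under stated assumptions: the class name UrlContentReader and the methods readAll and fetch are hypothetical, the content is assumed to be UTF-8, and the demo reads a temporary file through a file: URL so that it runs without network access. url.openStream() is shorthand for url.openConnection().getInputStream(), which is why the sketch can inspect the connection before reading.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class UrlContentReader {

    // Decodes the bytes of the stream as UTF-8 (an assumption about the
    // content) and returns the text with '\n' after every line
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sb.toString();
    }

    // Opens the URL and returns its content; for http/https URLs the
    // status code must be 200 before the body is read
    static String fetch(URL url) {
        try {
            URLConnection conn = url.openConnection();
            if (conn instanceof HttpURLConnection) {
                int status = ((HttpURLConnection) conn).getResponseCode();
                if (status != HttpURLConnection.HTTP_OK) {
                    throw new IOException("Unexpected HTTP status: " + status);
                }
            }
            return readAll(conn.getInputStream());
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo file, created so the example runs anywhere
        Path file = Files.createTempFile("article", ".html");
        Files.writeString(file, "<html>\n<body>Hello</body>\n</html>\n");

        // Path.toUri().toURL() builds a valid file: URL on any platform
        System.out.print(fetch(file.toUri().toURL()));

        Files.delete(file);
    }
}
```

Passing the charset explicitly avoids depending on the platform default, which differs between Java versions and operating systems. Wrapping IOException in UncheckedIOException is a design choice of this sketch that keeps the calling code short; a real program may prefer to let the checked exception propagate.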

Implementation:

Java

// Java Program to Extract Content from an HTML document

// Importing input/output classes
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
// Importing java URL class
import java.net.URL;

public class GFG {

    // Main driver method
    public static void main(String[] args)
    {

        // Try block to check exceptions
        try {
            String val;

            // Constructing the URL object from its string form
            // (this constructor is deprecated since Java 20, where
            // URI.create(...).toURL() is the suggested replacement)
            URL url = new URL(
                "file:///C:/Users/rohit/OneDrive/Desktop/article.html");

            // Reading the HTML content from the .html file;
            // try-with-resources calls br.close() automatically
            try (BufferedReader br = new BufferedReader(
                     new InputStreamReader(url.openStream()))) {

                // readLine() returns null at the end of the
                // stream, which ends the loop
                while ((val = br.readLine()) != null) {
                    System.out.println(val);
                }
            }
        }

        // Catch block to handle exceptions
        catch (IOException ex) {

            // Malformed URL, file not found, or read error
            System.out.println(ex.getMessage());
        }
    }
}

Output:

The program prints the content of article.html to the console, line by line.
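Once either approach has produced the document as a String, the checks mentioned at the start (whether an element exists, what the page title is) can be roughly sketched with regular expressions. Treat this as an illustration only: the class HtmlQuickCheck and its methods hasTag and title are hypothetical names, and regular expressions are not a real HTML parser, so comments, scripts, or unusual markup can fool them. For real documents a dedicated HTML parser library is the safer choice.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlQuickCheck {

    // Returns true if the text contains an opening tag with this name,
    // e.g. <p>, <p class="x"> or <br/>; matching ignores case
    static boolean hasTag(String html, String tagName) {
        Pattern p = Pattern.compile(
            "<" + Pattern.quote(tagName) + "(\\s[^>]*)?/?>",
            Pattern.CASE_INSENSITIVE);
        return p.matcher(html).find();
    }

    // Returns the trimmed text between <title> and </title>, if present
    static Optional<String> title(String html) {
        Pattern p = Pattern.compile(
            "<title(?:\\s[^>]*)?>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim())
                        : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the extracted content
        String html = "<html><head><title>Sample Page</title></head>"
                    + "<body><p class=\"intro\">Hello</p></body></html>";

        System.out.println(hasTag(html, "p"));                 // true
        System.out.println(hasTag(html, "table"));             // false
        System.out.println(title(html).orElse("(no title)"));  // Sample Page
    }
}
```

The pattern in hasTag requires whitespace, "/" or ">" right after the tag name, so a search for "b" does not accidentally match <body> or <br>.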
