Saturday, December 28, 2024

Java Program to Extract Content From a XML Document

An XML file stores its data as text nested inside tags, so reading it takes more work than reading a plain-text file such as txt: the tags have to be parsed before the values can be used. There are two types of parsers that parse an XML file:

  • Object-Based (e.g. DOM)
  • Event-Based (e.g. SAX, StAX)

Types of XML parsers

In this article, we will discuss how to parse XML using the Java DOM parser and the Java SAX parser.

Method 1: Java DOM Parser

DOM stands for Document Object Model. The DOM API provides classes to read and write an XML file. A DOM parser reads the entire document and builds a tree of nodes in memory, so it is best suited to small and medium-sized XML files. Compared with SAX, it is a little slower and occupies more memory. Because the whole tree is available, we can also insert and delete nodes using the DOM API.
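As a rough illustration of inserting and deleting nodes through the DOM API, here is a minimal sketch. It parses a small in-memory string rather than a file, and the class name DomEditSketch and its helper methods are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class DomEditSketch {

    // Parses a small XML string into an in-memory DOM tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilder db =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new ByteArrayInputStream(
            xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Inserts a new <geek> element with a <username> child under the root.
    static void addGeek(Document doc, String username) {
        Element geek = doc.createElement("geek");
        Element name = doc.createElement("username");
        name.setTextContent(username);
        geek.appendChild(name);
        doc.getDocumentElement().appendChild(geek);
    }

    // Deletes the first <geek> element, if there is one.
    static void removeFirstGeek(Document doc) {
        Node first = doc.getElementsByTagName("geek").item(0);
        if (first != null) {
            first.getParentNode().removeChild(first);
        }
    }

    // Counts the <geek> elements currently in the tree.
    static int countGeeks(Document doc) {
        return doc.getElementsByTagName("geek").getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<class><geek><username>geek1</username></geek></class>");
        addGeek(doc, "geek2");
        System.out.println("After insert: " + countGeeks(doc)); // 2
        removeFirstGeek(doc);
        System.out.println("After delete: " + countGeeks(doc)); // 1
    }
}
```

These changes affect only the tree in memory; writing them back to a file would need a separate serialization step, which this sketch leaves out.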

Follow these steps to extract data from an XML file in Java:

  • Instantiate XML file: Create a DocumentBuilder from a DocumentBuilderFactory and call parse() on the file to get a Document.
  • Get root node: Use getDocumentElement() to get the root element of the XML file.
  • Get all nodes: getElementsByTagName() returns a NodeList of all the elements in the document with a given tag name, in document order.
  • Get node by text value: The DOM API has no built-in method for this. Iterate over the NodeList and compare each node's getTextContent() with the value you are searching for.
  • Get node by attribute value: Use the getElementsByTagName() method along with the getAttribute() method and compare the attribute's value.
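The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the article's main example: the XML is an inline string, the level attribute does not exist in the sample file and is invented here to show attribute lookup, and DomLookupSketch, findByText and findByAttribute are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomLookupSketch {

    // Hypothetical data; the "level" attribute is an assumption for illustration.
    static final String XML =
        "<class>"
        + "<geek level=\"beginner\"><username>geek1</username></geek>"
        + "<geek level=\"advanced\"><username>geek2</username></geek>"
        + "</class>";

    // Step 1: build a Document from the XML source.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns every element named tag whose text content equals value.
    static List<Element> findByText(Document doc, String tag, String value) {
        List<Element> matches = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            if (value.equals(e.getTextContent())) {
                matches.add(e);
            }
        }
        return matches;
    }

    // Returns every element named tag whose attribute attr equals value.
    static List<Element> findByAttribute(Document doc, String tag,
                                         String attr, String value) {
        List<Element> matches = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            // getAttribute returns an empty string when the attribute is absent.
            if (value.equals(e.getAttribute(attr))) {
                matches.add(e);
            }
        }
        return matches;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        // Step 2: root element.
        System.out.println("Root: " + doc.getDocumentElement().getNodeName());
        // Step 3: all elements with a given tag name.
        System.out.println("geek elements: "
            + doc.getElementsByTagName("geek").getLength());
        // Steps 4 and 5: search by text value and by attribute value.
        System.out.println("username == geek2: "
            + findByText(doc, "username", "geek2").size());
        System.out.println("level == beginner: "
            + findByAttribute(doc, "geek", "level", "beginner").size());
    }
}
```

For larger documents or more complex queries, XPath (javax.xml.xpath) may be a better fit than hand-written loops, but that is beyond this sketch.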

Let’s now see an example of extracting data from XML using the Java DOM parser.

Create an .xml file. In this case, we have created gfg.xml

XML




<?xml version="1.0"?>  
<class>  
    <geek>  
        <id>1</id>  
        <username>geek1</username>   
        <EnrolledCourse>D.S.A</EnrolledCourse>
        <mode>online self paced</mode>
        <duration>Lifetime</duration>  
    </geek>  
        
    <geek>  
        <id>2</id>  
        <username>geek2</username>  
        <EnrolledCourse>System Design</EnrolledCourse>  
        <mode>online live course</mode>
        <duration>10 Lectures</duration>  
    </geek>  
    
    <geek>  
        <id>3</id>  
        <username>geek3</username>  
        <EnrolledCourse>Competitive Programming</EnrolledCourse>
        <mode>online live course</mode>
        <duration>8 weeks</duration>  
    </geek>  
    
    <geek>  
        <id>4</id>  
        <username>geek4</username>  
        <EnrolledCourse>Complete Interview Preparation</EnrolledCourse>
        <mode>online self paced</mode>
        <duration>Lifetime</duration>  
    </geek>  
    
</class>


Now create a Java file for the DOM parser. In this case, it is GfgXmlExtractor.java

Java




import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
import java.io.File;
public class GfgXmlExtractor {
    public static void main(String[] args)
    {
        try {
            // Create a File object that points to the
            // XML file to be parsed.
            File file = new File(
                "F:\\neveropen_contributions\\gfg.xml");
            
            // Defines a factory API that enables
            // applications to obtain a parser that produces
            // DOM object trees from XML documents.
            DocumentBuilderFactory dbf
                = DocumentBuilderFactory.newInstance();
            
            // we are creating an object of builder to parse
            // the  xml file.
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(file);
  
            /* normalize() puts all Text nodes in the full
            depth of the sub-tree underneath this Node,
            including attribute nodes, into a "normal"
            form where only structure separates Text
            nodes, i.e., there are neither adjacent Text
            nodes nor empty Text nodes. */
            doc.getDocumentElement().normalize();
            System.out.println(
                "Root element: "
                + doc.getDocumentElement().getNodeName());
            
            // Here nodeList contains all the nodes with
            // name geek.
            NodeList nodeList
                = doc.getElementsByTagName("geek");
            
            // Iterate through all the nodes in NodeList
            // using for loop.
            for (int i = 0; i < nodeList.getLength(); ++i) {
                Node node = nodeList.item(i);
                System.out.println("\nNode Name :"
                                   + node.getNodeName());
                if (node.getNodeType()
                    == Node.ELEMENT_NODE) {
                    Element tElement = (Element)node;
                    System.out.println(
                        "User id: "
                        + tElement
                              .getElementsByTagName("id")
                              .item(0)
                              .getTextContent());
                    System.out.println(
                        "User Name: "
                        + tElement
                              .getElementsByTagName(
                                  "username")
                              .item(0)
                              .getTextContent());
                    System.out.println(
                        "Enrolled Course: "
                        + tElement
                              .getElementsByTagName(
                                  "EnrolledCourse")
                              .item(0)
                              .getTextContent());
                    System.out.println(
                        "Mode: "
                        + tElement
                              .getElementsByTagName("mode")
                              .item(0)
                              .getTextContent());
                    System.out.println(
                        "Duration: "
                        + tElement
                              .getElementsByTagName(
                                  "duration")
                              .item(0)
                              .getTextContent());
                }
            }
        }
        
        // This catch block handles any exception raised,
        // for example a NullPointerException when we look
        // up a tag name that is not present in the XML
        // and item(0) returns null.
        catch (Exception e) {
            System.out.println(e);
        }
    }
}


Output

Root element: class

Node Name :geek
User id: 1
User Name: geek1
Enrolled Course: D.S.A
Mode: online self paced
Duration: Lifetime

Node Name :geek
User id: 2
User Name: geek2
Enrolled Course: System Design
Mode: online live course
Duration: 10 Lectures

Node Name :geek
User id: 3
User Name: geek3
Enrolled Course: Competitive Programming
Mode: online live course
Duration: 8 weeks

Node Name :geek
User id: 4
User Name: geek4
Enrolled Course: Complete Interview Preparation
Mode: online self paced
Duration: Lifetime
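The example above assumes every geek element contains all five child tags. When a tag is missing, getElementsByTagName(...).item(0) returns null and the chained getTextContent() call fails with a NullPointerException. One way to guard against that is a small helper, sketched below; the names SafeTextSketch and textOf are hypothetical, and the fallback value is an arbitrary choice.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class SafeTextSketch {

    // Returns the trimmed text of the first <tag> descendant of parent,
    // or fallback when no such element exists.
    static String textOf(Element parent, String tag, String fallback) {
        Node node = parent.getElementsByTagName(tag).item(0);
        return node == null ? fallback : node.getTextContent().trim();
    }

    // Parses an XML string and returns its root element.
    static Element rootOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        // This <geek> has an <id> but no <duration>.
        Element geek = rootOf("<geek><id>1</id></geek>");
        System.out.println("User id: " + textOf(geek, "id", "n/a"));
        System.out.println("Duration: " + textOf(geek, "duration", "n/a"));
    }
}
```

With this input the first line prints 1 and the second prints the fallback n/a instead of throwing.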




Method 2: Java SAX Parser

The SAX parser in Java provides an event-based API for parsing XML documents. Unlike the DOM parser, a SAX parser does not load the complete XML document into memory. It reads the document sequentially and reports each start tag, end tag, and run of character data as an event. In SAX, these events are delivered to the ContentHandler interface, and the DefaultHandler class provides a default implementation of that interface whose methods we override.

Let’s now see an example of extracting data from XML using the Java SAX parser.

Create a Java file for the SAX parser. In this case, we have created GfgSaxXmlParser.java

Java




import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class GfgSaxXmlParser {
    public static void main(String args[])
    {
        try {
            /*SAXParserFactory is  a factory API that
            enables applications to configure and obtain a
            SAX based parser to parse XML documents. */
            SAXParserFactory factory
                = SAXParserFactory.newInstance();
            
            // Creating a new instance of a SAXParser using
            // the currently configured factory parameters.
            SAXParser saxParser = factory.newSAXParser();
            
            // DefaultHandler is Default base class for SAX2
            // event handlers.
            DefaultHandler handler = new DefaultHandler() {
                boolean id = false;
                boolean username = false;
                boolean EnrolledCourse = false;
                boolean mode = false;
                boolean duration = false;
                
                // Receive notification of the start of an
                // element. Called each time the parser
                // reaches an opening tag in the document.
                public void startElement(
                    String uri, String localName,
                    String qName, Attributes attributes)
                    throws SAXException
                {
  
                    if (qName.equalsIgnoreCase("Id")) {
                        id = true;
                    }
                    if (qName.equalsIgnoreCase(
                            "username")) {
                        username = true;
                    }
                    if (qName.equalsIgnoreCase(
                            "EnrolledCourse")) {
                        EnrolledCourse = true;
                    }
                    if (qName.equalsIgnoreCase("mode")) {
                        mode = true;
                    }
                    if (qName.equalsIgnoreCase(
                            "duration")) {
                        duration = true;
                    }
                }
                
                // Receive notification of character data
                // inside an element, reads the text value of
                // the currently parsed element
                public void characters(char ch[], int start,
                                       int length)
                    throws SAXException
                {
                    if (id) {
                        System.out.println(
                            "ID : "
                            + new String(ch, start,
                                         length));
                        id = false;
                    }
                    if (username) {
                        System.out.println(
                            "User Name: "
                            + new String(ch, start,
                                         length));
                        username = false;
                    }
                    if (EnrolledCourse) {
                        System.out.println(
                            "Enrolled Course: "
                            + new String(ch, start,
                                         length));
                        EnrolledCourse = false;
                    }
                    if (mode) {
                        System.out.println(
                            "mode: "
                            + new String(ch, start,
                                         length));
                        mode = false;
                    }
                    if (duration) {
                        System.out.println(
                            "duration : "
                            + new String(ch, start,
                                         length));
                        duration = false;
                    }
                }
            };
            
            /* Parse the content described by the given
             Uniform Resource Identifier (URI) as XML
             using the specified DefaultHandler. */
            saxParser.parse(
                "F:\\neveropen_contributions\\gfg.xml",
                handler);
        }
        catch (Exception e) {
            System.out.println(e);
        }
    }
}


Output

ID : 1
User Name: geek1
Enrolled Course: D.S.A
mode: online self paced
duration : Lifetime
ID : 2
User Name: geek2
Enrolled Course: System Design
mode: online live course
duration : 10 Lectures
ID : 3
User Name: geek3
Enrolled Course: Competitive Programming
mode: online live course
duration : 8 weeks
ID : 4
User Name: geek4
Enrolled Course: Complete Interview Preparation
mode: online self paced
duration : Lifetime
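The handler above prints the text from the first characters() call after each start tag. The SAX API allows a parser to deliver the text of a single element in several chunks, so a more defensive variant collects the text in a buffer and reads it in endElement(). The following is a minimal sketch of that variant, assuming an in-memory XML string; SaxBufferSketch and extract are hypothetical names used only here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxBufferSketch {

    // Returns the trimmed text of every element named tag, in document order.
    static List<String> extract(String xml, String tag) throws Exception {
        List<String> values = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inside = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (qName.equals(tag)) {
                    inside = true;
                    text.setLength(0); // start a fresh buffer
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called more than once per element, so append.
                if (inside) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName,
                                   String qName) {
                if (qName.equals(tag)) {
                    values.add(text.toString().trim());
                    inside = false;
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            handler);
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<class>"
            + "<geek><username>geek1</username></geek>"
            + "<geek><username>geek2</username></geek>"
            + "</class>";
        System.out.println(extract(xml, "username")); // [geek1, geek2]
    }
}
```

Returning the values as a list instead of printing them inside the handler also makes the extraction easier to reuse and test.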
