Parsing XML with Java

Here are some notes on parsing an XML file with Java to extract information.

Simple Example

Assume we have an XML file and we want to extract specific attributes. A sample file could be:

<xml>
  <document id="001" date="20.12.2018">..content1..</document>
  <document id="002" date="25.12.2018">..content2..</document>
  <!-- etc .. -->
</xml>

With XMLStreamReader

Here is the code to quickly scan the XML file and get the values for the id and date attributes.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// returns one "id - date" entry for each <document> element in the file
public static List<String> parseDate(Path xmlPath){
  List<String> result = new ArrayList<>();
  XMLInputFactory factory = XMLInputFactory.newInstance();
  try(InputStream in = Files.newInputStream(xmlPath)){
    XMLStreamReader xstr = factory.createXMLStreamReader(in);
    while(xstr.hasNext()){
      // scan all the tokens
      if(xstr.isStartElement()){
        if(xstr.getLocalName().equals("document")){
          String id = attributeValue(xstr, "id");
          String date = attributeValue(xstr, "date");
          result.add( id + " - " + date);
        }
      }
      xstr.next();
    }
    xstr.close();
  }
  catch (IOException | XMLStreamException e){
    throw new IllegalStateException("Error while parsing " + xmlPath, e);
  }
  return result;
}

// returns the value of the named attribute of the current start element
private static String attributeValue(final XMLStreamReader xstr, String attributeName){
  int n = xstr.getAttributeCount();
  for(int i=0; i<n; i++){
    if(xstr.getAttributeLocalName(i).equals(attributeName)){
      return xstr.getAttributeValue(i);
    }
  }
  throw new IllegalArgumentException("Attribute not present " + attributeName);
}
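
The attribute lookup can also be left to the reader itself: XMLStreamReader has a getAttributeValue(namespaceURI, localName) overload, and passing null as the namespace matches on the local name only. The following is a minimal sketch, not a replacement for the code above. The class name AttributeLookupSketch and the method scan are made up for illustration, and the XML is read from a String so the example is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, used only for this sketch.
class AttributeLookupSketch {

  // Collects "id - date" for every <document> element in the given XML text.
  static List<String> scan(String xml) {
    List<String> result = new ArrayList<>();
    try {
      XMLStreamReader xstr = XMLInputFactory.newInstance()
          .createXMLStreamReader(new StringReader(xml));
      while (xstr.hasNext()) {
        if (xstr.next() == XMLStreamReader.START_ELEMENT
            && "document".equals(xstr.getLocalName())) {
          // null namespace: the attribute is matched by local name only;
          // the call returns null when the attribute is absent
          String id = xstr.getAttributeValue(null, "id");
          String date = xstr.getAttributeValue(null, "date");
          result.add(id + " - " + date);
        }
      }
      xstr.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Error while parsing", e);
    }
    return result;
  }

  public static void main(String[] args) {
    String xml = "<xml>"
        + "<document id=\"001\" date=\"20.12.2018\">..content1..</document>"
        + "<document id=\"002\" date=\"25.12.2018\">..content2..</document>"
        + "</xml>";
    scan(xml).forEach(System.out::println);
  }
}
```

Unlike the attributeValue helper, this variant does not throw when an attribute is missing. The value is simply null, so the caller decides how to handle it.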

Parsing of connected XML files

Suppose we want to collect information about our XML files, and that the files are connected to one another by an include relation.

Now we have two XML files, each referencing another file:

<xml>
  <document id="001" date="20.12.2018">..content1..</document>
  <document id="002" date="25.12.2018">..content2..</document>
  <reference>file2.xml</reference>
</xml>

<xml>
  <document id="001" date="20.12.2018">..content1..</document>
  <reference>file3.xml</reference>
</xml>

In this case, if the relative URLs are correct, we can build the hierarchy of files and parse all of them:

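import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public static void parseHierarchy(File mainFile){
  if(!mainFile.exists())
    throw new IllegalArgumentException("File does not exist: " + mainFile);
  // recursively walk all the hierarchy;
  // an absolute path guarantees that getParent() is never null
  parseRecursive(mainFile.toPath().toAbsolutePath().normalize(), new HashMap<Path, Info>());
}

public static void parseRecursive(Path path, Map<Path, Info> map){
  Info info = parseInfo(path);
  // the map doubles as a visited set, so circular references terminate
  if(!map.containsKey(info.path)){
    map.put(info.path, info);
    for(String href : info.hrefs){
      Path newPath = path.getParent().resolve(href).normalize();
      if(!map.containsKey(newPath)){
        parseRecursive(newPath, map);
      }
    }
  }
}

private static Info parseInfo(Path path){
  Info info = new Info();
  info.path = path;
  XMLInputFactory factory = XMLInputFactory.newInstance();
  try(InputStream in = Files.newInputStream(path)){
    XMLStreamReader xstr = factory.createXMLStreamReader(in);
    while(xstr.hasNext()){
      // scan all the tokens
      if(xstr.isStartElement()){
        // links are read from <a href="..."> elements, as in XHTML
        if(xstr.getLocalName().equals("a")){
          // attributeValue is the helper from the first example
          String href = attributeValue(xstr, "href");
          info.hrefs.add(href);
        }
      }
      xstr.next();
    }
    xstr.close();
  }
  catch (IOException | XMLStreamException e){
    throw new IllegalStateException("Error while parsing " + path, e);
  }
  return info;
}

class Info{
  Path path;
  List<String> hrefs = new ArrayList<String>();
}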
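
Note that the sample files above name their includes in a reference element, while the code reads href attributes of a elements. If the include is stored as element text, as in the samples, XMLStreamReader.getElementText() can read it. Here is a minimal sketch under that assumption. The class name ReferenceSketch and the method references are made up for illustration, and the XML is read from a String so the example is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, used only for this sketch.
class ReferenceSketch {

  // Returns the trimmed text of every <reference> element in the given XML.
  static List<String> references(String xml) {
    List<String> refs = new ArrayList<>();
    try {
      XMLStreamReader xstr = XMLInputFactory.newInstance()
          .createXMLStreamReader(new StringReader(xml));
      while (xstr.hasNext()) {
        if (xstr.next() == XMLStreamReader.START_ELEMENT
            && "reference".equals(xstr.getLocalName())) {
          // getElementText() reads the text content and leaves the reader
          // on the matching end tag; it throws if a child element is found
          refs.add(xstr.getElementText().trim());
        }
      }
      xstr.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Error while parsing", e);
    }
    return refs;
  }

  public static void main(String[] args) {
    String xml = "<xml>"
        + "<document id=\"001\" date=\"20.12.2018\">..content1..</document>"
        + "<reference>file2.xml</reference>"
        + "</xml>";
    System.out.println(references(xml));
  }
}
```

The returned names could then be added to info.hrefs in place of the href values, leaving the recursive walk unchanged.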

Applications

A possible application is to start from one XHTML document and get a hierarchy of all the connected documents.
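
For XHTML, not every link points to another local document: some are absolute URLs and some are fragments within the same page. The sketch below is one possible way to keep only the relative links before resolving them, using java.net.URI. The class name XhtmlLinksSketch and the method localLinks are made up for illustration, and the input is assumed to be well-formed XHTML without a DOCTYPE declaration.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, used only for this sketch.
class XhtmlLinksSketch {

  // Returns the path part of every relative <a href="..."> in the document.
  static List<String> localLinks(String xhtml) {
    List<String> links = new ArrayList<>();
    try {
      XMLStreamReader xstr = XMLInputFactory.newInstance()
          .createXMLStreamReader(new StringReader(xhtml));
      while (xstr.hasNext()) {
        if (xstr.next() != XMLStreamReader.START_ELEMENT
            || !"a".equals(xstr.getLocalName())) {
          continue;
        }
        String href = xstr.getAttributeValue(null, "href");
        if (href == null) {
          continue; // anchor without a href attribute
        }
        // URI.create throws IllegalArgumentException on malformed hrefs
        URI uri = URI.create(href);
        String path = uri.getPath();
        // skip absolute URIs (http:, mailto:, ...) and pure fragments (#top)
        if (!uri.isAbsolute() && path != null && !path.isEmpty()) {
          links.add(path);
        }
      }
      xstr.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Error while parsing", e);
    }
    return links;
  }

  public static void main(String[] args) {
    String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<a href=\"page2.xhtml#intro\">next</a>"
        + "<a href=\"http://example.org/\">external</a>"
        + "<a href=\"#top\">top</a>"
        + "<a href=\"sub/page3.xhtml\">more</a>"
        + "</body></html>";
    System.out.println(localLinks(xhtml));
  }
}
```

Each returned path can be resolved against the parent directory of the current file, in the same way parseRecursive resolves its hrefs.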

Published by psuzzi on June 26, 2019
