Here are some notes on parsing XML files with Java to extract information.

Simple Example

Assume we have an XML file and we want to extract specific attributes. A sample file could be:

<xml>
  <document id="001" date="20.12.2018">..content1..</document>
  <document id="002" date="25.12.2018">..content2..</document>
  <!-- etc .. -->
</xml>

With XMLStreamReader

Here is the code to quickly scan the XML file and get the values for the id and date attributes.

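import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Returns one "id - date" entry for each <document> element in the file.
public static List<String> parseDate(Path xmlPath){
  List<String> result = new ArrayList<>();
  XMLInputFactory factory = XMLInputFactory.newInstance();
  try(InputStream in = Files.newInputStream(xmlPath)){
    XMLStreamReader xstr = factory.createXMLStreamReader(in);
    while(xstr.hasNext()){
      // scan all the tokens
      if(xstr.isStartElement()){
        if(xstr.getLocalName().equals("document")){
          String id = attributeValue(xstr, "id");
          String date = attributeValue(xstr, "date");
          result.add(id + " - " + date);
        }
      }
      xstr.next();
    }
    xstr.close();
  }
  catch (IOException | XMLStreamException e){
    throw new IllegalStateException("Error while parsing " + xmlPath, e);
  }
  return result;
}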

// Returns the value of the named attribute of the current start element.
private static String attributeValue(final XMLStreamReader xstr, String attributeName){
  int n = xstr.getAttributeCount();
  for(int i=0; i<n; i++){
    if(xstr.getAttributeLocalName(i).equals(attributeName)){
      return xstr.getAttributeValue(i);
    }
  }
  throw new IllegalArgumentException("Attribute not present " + attributeName);
}
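
As an alternative to looping over the attribute indexes, XMLStreamReader also offers getAttributeValue(namespaceURI, localName), which returns null when the attribute is absent; passing null as the namespace matches on the local name only. The following is a minimal sketch of that variant, run against a sample held in a string rather than a file. The class name AttributeScan and the method scan are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original notes.
class AttributeScan {

  // Returns "id - date" for every <document> element in the given XML text.
  static List<String> scan(String xml) {
    List<String> result = new ArrayList<>();
    XMLInputFactory factory = XMLInputFactory.newInstance();
    try {
      XMLStreamReader xstr = factory.createXMLStreamReader(new StringReader(xml));
      while (xstr.hasNext()) {
        // next() advances the cursor and returns the type of the new event
        if (xstr.next() == XMLStreamConstants.START_ELEMENT
            && "document".equals(xstr.getLocalName())) {
          // null namespace: match on the local name only; null if the attribute is absent
          String id = xstr.getAttributeValue(null, "id");
          String date = xstr.getAttributeValue(null, "date");
          result.add(id + " - " + date);
        }
      }
      xstr.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Error while parsing", e);
    }
    return result;
  }

  public static void main(String[] args) {
    String xml = "<xml>"
        + "<document id=\"001\" date=\"20.12.2018\">..content1..</document>"
        + "<document id=\"002\" date=\"25.12.2018\">..content2..</document>"
        + "</xml>";
    scan(xml).forEach(System.out::println);
  }
}
```

Unlike the attributeValue helper, this variant does not throw on a missing attribute, so the caller has to decide what a null value means.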

Parsing of connected XML files

Suppose we want to collect information about a set of XML files that are connected to each other through an include relation.

Now we have two XML files, each of which references another file:

<xml>
  <document id="001" date="20.12.2018">..content1..</document>
  <document id="002" date="25.12.2018">..content2..</document>
  <reference>file2.xml</reference>
</xml>

<xml>
  <document id="001" date="20.12.2018">..content1..</document>
  <reference>file3.xml</reference>
</xml>

In this case, if the relative URLs are correct, we can construct a hierarchy of files and parse all of them:

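import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Parses mainFile and every file reachable from it; returns the collected info by path.
public static Map<Path, Info> parseHierarchy(File mainFile){
  if(!mainFile.exists())
    throw new IllegalArgumentException("File does not exist: " + mainFile);
  // recursively walk all the hierarchy
  Map<Path, Info> map = new HashMap<>();
  parseRecursive(mainFile.toPath().toAbsolutePath().normalize(), map);
  return map;
}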

// Parses path, then follows each href that has not been visited yet.
public static void parseRecursive(Path path, Map<Path, Info> map){
  Info info = parseInfo(path);
  if(!map.containsKey(info.path)){
    map.put(info.path, info);
    for(String href : info.hrefs){
      // hrefs are resolved relative to the directory of the current file
      Path newPath = path.getParent().resolve(href).normalize();
      if(!map.containsKey(newPath)){
        parseRecursive(newPath, map);
      }
    }
  }
}

// Collects the href of every <a> element (as in XHTML) found in the file.
// The <reference> elements of the sample above would need their text content read instead.
private static Info parseInfo(Path path){
  Info info = new Info();
  info.path = path;
  XMLInputFactory factory = XMLInputFactory.newInstance();
  try(InputStream in = Files.newInputStream(path)){
    XMLStreamReader xstr = factory.createXMLStreamReader(in);
    while(xstr.hasNext()){
      // scan all the tokens
      if(xstr.isStartElement()){
        if(xstr.getLocalName().equals("a")){
          String href = attributeValue(xstr, "href");
          info.hrefs.add(href);
        }
      }
      xstr.next();
    }
    xstr.close();
  }
  catch (IOException | XMLStreamException e){
    throw new IllegalStateException("Error while parsing " + path, e);
  }
  return info;
}
class Info{
  // the parsed file
  Path path;
  // targets of the links found in the file
  List<String> hrefs = new ArrayList<String>();
}
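
The sample files above keep the target file name in the text of a reference element, not in an href attribute. The following is a minimal sketch of how such targets could be collected with getElementText(), assuming every reference element has a closing tag and contains only text. The class name ReferenceScan and the method references are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original notes.
class ReferenceScan {

  // Returns the trimmed text of every <reference> element in the given XML text.
  static List<String> references(String xml) {
    List<String> result = new ArrayList<>();
    try {
      XMLStreamReader xstr =
          XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      while (xstr.hasNext()) {
        if (xstr.next() == XMLStreamConstants.START_ELEMENT
            && "reference".equals(xstr.getLocalName())) {
          // getElementText() consumes the text up to the matching end tag
          result.add(xstr.getElementText().trim());
        }
      }
      xstr.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Error while parsing", e);
    }
    return result;
  }
}
```

The returned names can then be resolved against the directory of the current file, in the same way parseRecursive resolves hrefs.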

Applications

A possible application is to start from one XHTML document and get a hierarchy of all the connected documents.
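
As a rough illustration of this application, the sketch below walks a set of linked XHTML files iteratively, using a set of visited paths so that files linking back to each other are parsed only once. It assumes that every href is a relative file path (no http URLs or fragments), that the files are well-formed XML without a DOCTYPE, and it silently skips link targets that do not exist. The class name LinkWalker and the method walk are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original notes.
class LinkWalker {

  // Returns every existing file reachable from start through <a href="..."> links.
  static Set<Path> walk(Path start) {
    Set<Path> visited = new LinkedHashSet<>();
    Deque<Path> pending = new ArrayDeque<>();
    pending.push(start.toAbsolutePath().normalize());
    XMLInputFactory factory = XMLInputFactory.newInstance();
    while (!pending.isEmpty()) {
      Path current = pending.pop();
      // add() returns false for files already seen, which also breaks cycles
      if (!Files.isRegularFile(current) || !visited.add(current)) {
        continue;
      }
      try (InputStream in = Files.newInputStream(current)) {
        XMLStreamReader xstr = factory.createXMLStreamReader(in);
        while (xstr.hasNext()) {
          if (xstr.next() == XMLStreamConstants.START_ELEMENT
              && "a".equals(xstr.getLocalName())) {
            String href = xstr.getAttributeValue(null, "href");
            if (href != null) {
              // assumption: href is a relative file path, resolved against the current directory
              pending.push(current.getParent().resolve(href).normalize());
            }
          }
        }
        xstr.close();
      } catch (IOException | XMLStreamException e) {
        throw new IllegalStateException("Error while parsing " + current, e);
      }
    }
    return visited;
  }
}
```

Calling LinkWalker.walk(Paths.get("index.xhtml")) on a hypothetical start page would return the start page together with every linked file that could be found on disk.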

