Here are some notes on parsing XML files with Java to extract information.
Simple Example
Assume we have an XML file and we want to extract specific attributes. A sample file could be:
<xml>
<document id="001" date="20.12.2018">..content1..</document>
<document id="002" date="25.12.2018">..content2..</document>
<!-- etc .. -->
</xml>
With XMLStreamReader
Here is the code to quickly scan the XML file and get the values of the id and date attributes.
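import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Returns one "id - date" entry per <document> element in the file.
public static List<String> parseDate(Path xmlPath){
    List<String> result = new ArrayList<>();
    XMLInputFactory factory = XMLInputFactory.newInstance();
    try(InputStream in = Files.newInputStream(xmlPath)){
        XMLStreamReader xstr = factory.createXMLStreamReader(in);
        while(xstr.hasNext()){
            // scan all the tokens
            if(xstr.isStartElement()){
                if(xstr.getLocalName().equals("document")){
                    String id = attributeValue(xstr, "id");
                    String date = attributeValue(xstr, "date");
                    result.add(id + " - " + date);
                }
            }
            xstr.next();
        }
        xstr.close();
    }
    catch (IOException | XMLStreamException e){
        throw new IllegalStateException("Error while parsing " + xmlPath, e);
    }
    return result;
}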
// Looks up an attribute of the current start element by its local name.
private static String attributeValue(final XMLStreamReader xstr, String attributeName){
    int n = xstr.getAttributeCount();
    for(int i=0; i<n; i++){
        if(xstr.getAttributeLocalName(i).equals(attributeName))
            return xstr.getAttributeValue(i);
    }
    throw new IllegalArgumentException("Attribute not present " + attributeName);
}
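To try the scan without a file on disk, the same loop can be run against the sample document held in a string. This is only a sketch under a few assumptions: the class AttributeScanDemo and its scan method are made-up names, and instead of the attributeValue helper it uses the reader's own getAttributeValue(namespaceURI, localName) lookup, which returns null rather than throwing when the attribute is missing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original notes.
class AttributeScanDemo {

    // Collects "id - date" for every <document> element in the given XML text.
    static List<String> scan(String xml) {
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader xstr = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (xstr.hasNext()) {
                // next() advances to the following event and returns its type
                if (xstr.next() == XMLStreamConstants.START_ELEMENT
                        && xstr.getLocalName().equals("document")) {
                    // a null namespace URI means "do not check the namespace"
                    String id = xstr.getAttributeValue(null, "id");
                    String date = xstr.getAttributeValue(null, "date");
                    result.add(id + " - " + date);
                }
            }
            xstr.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Error while parsing sample", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<xml>"
                + "<document id=\"001\" date=\"20.12.2018\">..content1..</document>"
                + "<document id=\"002\" date=\"25.12.2018\">..content2..</document>"
                + "</xml>";
        // prints "001 - 20.12.2018" and "002 - 25.12.2018", one per line
        scan(xml).forEach(System.out::println);
    }
}
```

Whether a missing attribute should throw, as in attributeValue, or yield null, as here, depends on how strict the scan needs to be.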
Parsing of connected XML files
Suppose we want to collect information about our XML files, and that the files are connected to each other by an include relation.
For example, here are two XML files, each of which references another file:
<xml>
<document id="001" date="20.12.2018">..content1..</document>
<document id="002" date="25.12.2018">..content2..</document>
<reference>file2.xml</reference>
</xml>
<xml>
<document id="001" date="20.12.2018">..content1..</document>
<reference>file3.xml</reference>
</xml>
In this case, if the relative URLs are correct, we can construct a hierarchy of files and parse all of them:
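import java.io.File;
import java.util.HashMap;
import java.util.Map;
// Parses mainFile and every file reachable from it; returns the collected info keyed by path.
public static Map<Path, Info> parseHierarchy(File mainFile){
    if(!mainFile.exists())
        throw new IllegalArgumentException("File does not exist: " + mainFile);
    Map<Path, Info> map = new HashMap<>();
    // recursively walk all the hierarchy, starting from an absolute path so getParent() is never null
    parseRecursive(mainFile.toPath().toAbsolutePath().normalize(), map);
    return map;
}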
public static void parseRecursive(Path path, Map<Path, Info> map){
    // skip files that were already visited, so cyclic references terminate
    if(map.containsKey(path))
        return;
    Info info = parseInfo(path);
    map.put(path, info);
    for(String href : info.hrefs){
        // hrefs are relative to the directory of the current file
        Path newPath = path.getParent().resolve(href).normalize();
        parseRecursive(newPath, map);
    }
}
// Collects the link targets of one file.
// Links are read from <a href="..."> elements (XHTML style); for the <reference> elements
// of the sample files above, match "reference" and read xstr.getElementText() instead.
private static Info parseInfo(Path path){
    Info info = new Info();
    info.path = path;
    XMLInputFactory factory = XMLInputFactory.newInstance();
    try(InputStream in = Files.newInputStream(path)){
        XMLStreamReader xstr = factory.createXMLStreamReader(in);
        while(xstr.hasNext()){
            // scan all the tokens
            if(xstr.isStartElement()){
                if(xstr.getLocalName().equals("a")){
                    String href = attributeValue(xstr, "href");
                    info.hrefs.add(href);
                }
            }
            xstr.next();
        }
        xstr.close();
    }
    catch (IOException | XMLStreamException e){
        throw new IllegalStateException("Error while parsing " + path, e);
    }
    return info;
}
class Info{
    Path path;
    List<String> hrefs = new ArrayList<>();
}
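The step that turns an href into the next file to visit is the path arithmetic in parseRecursive. Here is a small sketch of just that step. The paths are made up, and resolveHref is a hypothetical helper name introduced only for this illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class, not part of the original notes.
class HrefResolveDemo {

    // Resolves href against the directory that contains the referencing file,
    // then removes "." and ".." segments so equal targets compare equal.
    static Path resolveHref(Path referencingFile, String href) {
        return referencingFile.getParent().resolve(href).normalize();
    }

    public static void main(String[] args) {
        Path base = Paths.get("/data/docs/file1.xml"); // made-up location
        System.out.println(resolveHref(base, "file2.xml"));
        System.out.println(resolveHref(base, "../other/file3.xml"));

        // A bare relative file name has no parent, so getParent() returns null
        System.out.println(Paths.get("file1.xml").getParent());
    }
}
```

The normalize() call matters because the paths are used as map keys: without it, docs/../docs/file2.xml and docs/file2.xml would count as two different files. The null parent in the last line is one reason to convert the starting file to an absolute path before walking the hierarchy.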
Applications
A possible application is to start from one XHTML document and get a hierarchy of all the connected documents.
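As a rough illustration of that application, the sketch below writes three tiny XHTML pages into a temporary directory and follows their links starting from index.xhtml. Everything specific in it is an assumption made for the demo: the class XhtmlLinkWalkDemo, the helper names links, walk, write and demo, the file names, and the page contents. Anchors without an href attribute are skipped rather than treated as errors.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original notes.
class XhtmlLinkWalkDemo {

    // Returns the href values of all <a> elements in one file.
    static List<String> links(Path file) {
        List<String> hrefs = new ArrayList<>();
        try (InputStream in = Files.newInputStream(file)) {
            XMLStreamReader xstr = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (xstr.hasNext()) {
                if (xstr.next() == XMLStreamConstants.START_ELEMENT
                        && xstr.getLocalName().equals("a")) {
                    String href = xstr.getAttributeValue(null, "href");
                    if (href != null) hrefs.add(href); // anchors without href are skipped
                }
            }
            xstr.close();
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException("Error while parsing " + file, e);
        }
        return hrefs;
    }

    // Depth-first walk; the map doubles as the set of already visited files.
    static void walk(Path file, Map<Path, List<String>> seen) {
        if (seen.containsKey(file)) return;
        List<String> hrefs = links(file);
        seen.put(file, hrefs);
        for (String href : hrefs) {
            walk(file.getParent().resolve(href).normalize(), seen);
        }
    }

    // Writes a minimal XHTML page whose body contains the given markup.
    static void write(Path file, String body) throws IOException {
        String doc = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>"
                + body + "</p></body></html>";
        Files.write(file, doc.getBytes(StandardCharsets.UTF_8));
    }

    // Builds three small pages in a temp directory and walks them from index.xhtml.
    static Map<Path, List<String>> demo() {
        try {
            Path dir = Files.createTempDirectory("xhtml-demo").toAbsolutePath();
            write(dir.resolve("index.xhtml"),
                    "<a href=\"a.xhtml\">A</a> <a href=\"b.xhtml\">B</a>");
            write(dir.resolve("a.xhtml"), "<a href=\"index.xhtml\">back</a>");
            write(dir.resolve("b.xhtml"), "no links here");
            Map<Path, List<String>> seen = new LinkedHashMap<>();
            walk(dir.resolve("index.xhtml"), seen);
            return seen;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach((p, hrefs) -> System.out.println(p.getFileName() + " -> " + hrefs));
    }
}
```

The link from a.xhtml back to index.xhtml forms a cycle, and the visited check at the top of walk is what stops the recursion there. A LinkedHashMap is used so the pages are listed in the order they were first reached.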