Friday, February 4, 2011

Validating a HUGE XML file

I'm trying to find a way to validate a large XML file against an XSD. I saw the question ...best way to validate an XML... but the answers all pointed to using the Xerces library for validation. The only problem is that when I use that library to validate a 180 MB file, I get an OutOfMemoryError.

Are there any other tools, libraries, or strategies for validating a larger-than-normal XML file?

EDIT: The SAX solution worked for Java validation, and the other two suggestions for the libxml tool were very helpful for validation outside of Java.

  • Instead of using a DOMParser, use a SAXParser. This reads from an input stream or reader so you can keep the XML on disk instead of loading it all into memory.

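    // Imports: javax.xml.parsers.SAXParser, javax.xml.parsers.SAXParserFactory,
    //          org.xml.sax.InputSource, org.xml.sax.XMLReader
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setValidating(true);
    factory.setNamespaceAware(true);

    SAXParser parser = factory.newSAXParser();
    // Without this JAXP property, setValidating(true) means DTD validation, not XSD.
    parser.setProperty("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
                       "http://www.w3.org/2001/XMLSchema");

    XMLReader reader = parser.getXMLReader();
    // SimpleErrorHandler is your own org.xml.sax.ErrorHandler implementation.
    reader.setErrorHandler(new SimpleErrorHandler());
    // A file URI lets the parser honor the encoding declared in the XML itself.
    reader.parse(new InputSource(new java.io.File("document.xml").toURI().toString()));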
    
    From jodonnell
  • Use libxml, which performs validation and has a streaming mode.

  • Personally, I like to use XMLStarlet, which has a command-line interface and works on streams. It is a set of tools built on libxml2.

    From dlamblin
  • SAX and libxml will help, as already mentioned. You could also try increasing the maximum heap size for the JVM using the -Xmx option. For example, to set the maximum heap size to 512 MB: java -Xmx512m com.foo.MyClass

    From GaZ
  • XML ValidatorBuddy from http://www.xml-tools has a dedicated command for validating huge XML files (multiple GB). It uses the Xerces-C SAX parser for this purpose.

    The tool also lets you specify the XSD to validate against, so you don't need to edit the large XML file to add a schema reference.
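
For the Java route, here is a minimal sketch of streaming XSD validation with the standard javax.xml.validation API, as an alternative to configuring the SAX parser by hand. The class name StreamingXsdValidator and the file arguments are made up for illustration. The assumption is that the JDK's built-in implementation processes a StreamSource without building a DOM tree, so memory use should stay roughly flat as the file grows; schemas with large identity constraints (xs:key, xs:unique) may still need more heap.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, not part of any library.
public class StreamingXsdValidator {

    /**
     * Validates xmlFile against xsdFile and returns the problems found.
     * An empty list means the document is valid. A document that is not
     * well-formed aborts validation with a SAXParseException.
     */
    public static List<String> validate(File xsdFile, File xmlFile)
            throws SAXException, IOException {
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(xsdFile);
        Validator validator = schema.newValidator();

        final List<String> problems = new ArrayList<>();
        validator.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                problems.add(describe("warning", e));
            }

            @Override
            public void error(SAXParseException e) {
                // Record the schema violation and keep going.
                problems.add(describe("error", e));
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                // Not well-formed: the parser cannot continue.
                throw e;
            }
        });

        // StreamSource reads the file from disk while validating.
        validator.validate(new StreamSource(xmlFile));
        return problems;
    }

    private static String describe(String level, SAXParseException e) {
        return level + " at line " + e.getLineNumber() + ": " + e.getMessage();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: StreamingXsdValidator <schema.xsd> <document.xml>");
            return;
        }
        List<String> problems = validate(new File(args[0]), new File(args[1]));
        for (String problem : problems) {
            System.out.println(problem);
        }
        System.out.println(problems.isEmpty() ? "valid" : "invalid");
    }
}
```

Because the schema is passed in explicitly, the XML file does not need a schema reference of its own. Collecting errors in a list rather than throwing on the first one lets a single pass over a very large file report every violation.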
