Comparing Large Files

Introduction

It is possible to compare very large datasets using DeltaCompare.

Apart from CPU type and speed and the amount of physical memory, both of which must be adequate for the job, many factors affect performance with large files. The more important of these are discussed in the following sections.

White space

White space nodes are generally significant in XML files. Each run of white space, e.g. a newline or space, can be treated as a node in the XML file. This increases the memory image size and slows the comparison process. It can also result in differences being reported that are not significant.

In many situations white space nodes are not important and can be ignored. If a file has a DTD, an XML parser can use it as the file is read in to determine whether white space is ignorable. If a white space node is ignorable, for example because it appears between two markup tags, DeltaXML will ignore it in the comparison process.

If there is no DTD, white space nodes should be removed, either with an editor or by processing with an XSL filter such as normalize-space.xsl, though XSL processing can be time-consuming for large files. The delta files generated by DeltaXML have no white space added to them: if you look at one in an editor you will see that new lines are added only inside tags. This may look strange at first, but it is an effective way to keep lines short without adding white space nodes to the XML file. White space inside tags is ignored by any XML parser.

Remember also that indentation of PCDATA within a file has an effect: white space in PCDATA and attribute values should often be normalized before comparison. Otherwise, again, many differences will be reported that are not important.
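
As an illustration, here is a minimal sketch, using only the standard Java DOM and JAXP APIs rather than anything DeltaXML-specific, that removes white-space-only text nodes and normalizes the white space in the remaining text, as an alternative to an editor or an XSL filter:

JAVA
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Removes white-space-only text nodes and trims/collapses runs of white
// space in the remaining text nodes, writing the cleaned document out.
// A pre-comparison clean-up sketch: a validating parser with a DTD, or
// an XSL filter such as normalize-space.xsl, achieves a similar effect.
public class WhiteSpaceCleaner {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(args[0]));
        clean(doc.getDocumentElement());
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "no");
        t.transform(new DOMSource(doc), new StreamResult(new File(args[1])));
    }

    private static void clean(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getTextContent();
                if (text.trim().isEmpty()) {
                    node.removeChild(child);   // white-space-only node
                } else {
                    child.setTextContent(text.trim().replaceAll("\\s+", " "));
                }
            } else {
                clean(child);                  // recurse into child elements
            }
            child = next;
        }
    }
}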

XML file structure

There is a performance difference between comparing 'flat' XML files, i.e. a large number of records at one level, and more nested files; the latter tend to require less processing because there are fewer nodes at each level. Comparison with keying is generally faster. Word-by-word comparison increases the node count considerably, because the text is split into words and each word is counted as a node in the XML tree.
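
To get a feel for these node counts on your own data, the following sketch (illustrative only; it uses the standard Java DOM API, not DeltaXML) counts the element and text nodes in a file, together with the number of word nodes a word-by-word comparison would introduce:

JAVA
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Counts element and text nodes in an XML file, plus the number of
// words the text would contribute if a word-by-word comparison split
// each text node into per-word nodes.
public class NodeCounter {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(args[0]));
        long[] counts = new long[3];   // elements, text nodes, words
        count(doc.getDocumentElement(), counts);
        System.out.printf("elements=%d textNodes=%d words=%d%n",
                counts[0], counts[1], counts[2]);
    }

    private static void count(Node node, long[] counts) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            counts[0]++;
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            counts[1]++;
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) counts[2] += text.split("\\s+").length;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            count(c, counts);
        }
    }
}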

Number of differences

Performance is affected by the number of differences: it is quickest when there are no differences! The more differences there are, the slower the comparison process, because the software is trying to find a best match between the files. The LCS (longest common subsequence) algorithm used in DeltaXML for matching ordered sequences has optimal performance for small numbers of differences and slows significantly for large numbers of differences.
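
To see why cost grows with the number of differences, here is a sketch of Myers' greedy diff algorithm, a standard LCS-family technique shown purely for illustration (DeltaXML's own implementation may differ). Its running time is proportional to (N+M)·D, where D is the number of differences:

JAVA
// Myers' greedy diff: returns D, the minimum number of insertions and
// deletions between two token sequences. The outer loop runs D+1 times,
// so similar inputs compare quickly and heavily changed inputs take
// proportionally longer.
public class EditDistance {
    public static int shortestEdit(String[] a, String[] b) {
        int n = a.length, m = b.length, max = n + m;
        int[] v = new int[2 * max + 2];  // furthest-reaching x per diagonal k
        for (int d = 0; d <= max; d++) {
            for (int k = -d; k <= d; k += 2) {
                int x = (k == -d || (k != d && v[max + k - 1] < v[max + k + 1]))
                        ? v[max + k + 1]       // step down: an insertion
                        : v[max + k - 1] + 1;  // step right: a deletion
                int y = x - k;
                while (x < n && y < m && a[x].equals(b[y])) { x++; y++; }  // follow matches
                v[max + k] = x;
                if (x >= n && y >= m) return d;  // both sequences consumed
            }
        }
        return max;
    }

    public static void main(String[] args) {
        String[] a = "the quick brown fox".split(" ");
        String[] b = "the quick red fox".split(" ");
        System.out.println(shortestEdit(a, b));  // prints 2: one delete, one insert
    }
}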

PCDATA

DeltaXML shares text strings, so a file containing many different text strings will result in a larger memory image and may cause the program to hit memory size limitations sooner. On the other hand, files with many identical strings will be stored very efficiently.
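
The idea is essentially string interning. A minimal sketch of the technique (illustrative; not DeltaXML's internal code):

JAVA
import java.util.HashMap;
import java.util.Map;

// A minimal string-sharing pool: identical PCDATA strings are stored
// once and shared by reference, so each repeated string costs one
// String object plus a reference per occurrence.
final class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    String share(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}

Sharing only helps when strings repeat: a file full of distinct strings still pays the full cost of every string, which is why highly varied PCDATA increases the memory image.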

Changes only or Full delta

The DeltaXML API can generate either a 'changes-only' delta file or a 'full delta' that includes unchanged data as well.

The time and memory required for the comparison are independent of the type of delta being produced. However, the full-context delta output is typically larger and will require more disk I/O and CPU time to write to disk.

Java Heap Size

The size of the JVM heap is one of the main factors determining the size of the datasets that DeltaXML can process. The heap size, the amount of available RAM, and other JVM configuration options affect both capacity and performance: too small a heap results in excessive garbage collection, and too little RAM causes performance degradation. The following guidelines are suggested:

  • Use the java -Xmx option to increase the fairly small default JVM heap size. For example, the command line argument java -Xmx2g allocates two gigabytes of RAM to the JVM heap.

  • Performance is poor if there is not enough RAM available to support the requested JVM heap size; in our testing, relying on disk-based swapping to support the heap caused significant slowdowns. We suggest ensuring that the heap size specified with the java -Xmx argument is available as free RAM.

For example, to run a comparison with the JVM heap size increased to 4 gigabytes, use the command line as follows, replacing x.y.z with the major.minor.patch version number of your release, e.g. command-10.0.0.jar:

BASH
$ java -Xmx4g -jar command-x.y.z.jar compare delta file1.xml file2.xml file1-2.xml
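
To confirm that an -Xmx setting is actually being picked up, a small check using the standard Runtime API can be run with the same JVM options (the class name HeapCheck is just illustrative):

JAVA
// Prints the maximum heap this JVM will attempt to use. Run it with the
// same -Xmx value you plan to use for the comparison.
public class HeapCheck {
    public static void main(String[] args) {
        double gib = Runtime.getRuntime().maxMemory() / (1024.0 * 1024 * 1024);
        System.out.printf("Max JVM heap: %.1f GiB%n", gib);
    }
}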

SAX input sources

Reading from disk-based files, for example when using the command-line interpreter, is typically slower than processing SAX events produced from an existing in-memory data representation. As well as the reduced disk I/O, a more significant speedup arises from avoiding the lexical analysis/tokenization that a SAX parser otherwise performs. If you need to read XML files from disk, we also recommend testing different SAX parsers and comparing their performance on your data.
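
As a sketch of the idea (the class and element names here are illustrative, and how the ContentHandler is connected to the comparator depends on the API you are using), an in-memory structure can emit SAX events directly, with no file reading or tokenization step at all:

JAVA
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Fires SAX events for a simple in-memory list of records directly at
// a ContentHandler, bypassing parsing entirely.
public class InMemorySaxSource {
    private final String[] records;

    public InMemorySaxSource(String[] records) {
        this.records = records;
    }

    public void writeTo(ContentHandler handler) throws SAXException {
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "records", "records", noAttrs);
        for (String record : records) {
            handler.startElement("", "record", "record", noAttrs);
            char[] text = record.toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement("", "record", "record");
        }
        handler.endElement("", "records", "records");
        handler.endDocument();
    }

    // Demonstration: serialize the events back to XML via an identity
    // transformer, standing in for a comparator's input handler.
    public static void main(String[] args) throws Exception {
        SAXTransformerFactory f = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler h = f.newTransformerHandler();
        h.setResult(new StreamResult(System.out));
        new InMemorySaxSource(new String[] {"alpha", "beta"}).writeTo(h);
    }
}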

Conclusion

Be sure to remove white space from large input files. Performance depends on file structure and text content, so it needs to be evaluated on your own data.

However, it is clear from the above that DeltaCompare can be used successfully with very large XML datasets.
