DTCP User Guide

Introduction

The Data Comparator Pipeline

XML Compare's Data Comparator specializes in the comparison of XML documents with purely data-centric XML content.

What is DTCP?

The DTCP (Data Comparator Pipelines) format is an XML language used for configuring the Data Comparator component and its built-in pipeline. It uses a structure similar to the DCP format, which is used for configuring the Document Comparator.

With DTCP you define chains of XML processing filters that are inserted at specified extension points in the Data Comparator's built-in pipeline. You can set properties of the DataComparator and low-level built-in components; DTCP does not require any knowledge of Java programming.

REST

Our REST API is powered by DCP, DTCP and DXP pipelines. It exposes information about the available pipelines, including parameters and their default values.

The DTCPConfiguration Class

You can also embed DTCP processing in your own applications. The DTCPConfiguration class makes the capabilities of DTCP available to a wide range of Java applications, simplifying configuration and adding flexibility in the use of XML Compare's Data Comparator. Details of this class can be found in the Java API documentation.

Using the DTCPConfiguration API, DTCP capabilities can be integrated directly into a GUI or command-line interface. Examples for each of these are included in the XML Compare distribution.

Command-line

When the command-line app is invoked it shows a list of DTCP files and their descriptions. DTCP files are then selected by an end-user by specifying the 'configuration-id' which corresponds to the 'id' attribute on the dataComparator root element in the DTCP file. Command-line parameters are used to control comparison settings for a specific configuration.

When to use DTCP

DTCP allows a Data Comparator pipeline to be specified in declarative XML. Comparisons based on this pipeline can be initialized through the command-line, or through the DTCPConfiguration class's simple high-level Java API. As such, DTCP can be used in most cases where you would use Java. Because of its declarative nature, a DTCP file should also be easier to maintain than the equivalent Java.

XPath expressions embedded within DTCP allow for relatively sophisticated conditional processing. However, in more complex cases where the processing pipeline is dependent on many external factors, projects may benefit from the flexibility and extra diagnostics and testing that low-level coding in Java brings.

Editing DTCP with the DTCP Schema

The XML vocabulary used for DTCP is defined in the DTCP XML Schema and is summarized in the DTCP Schema Guide. XML Schema (XSD) 1.0 and 1.1 versions of the DTCP schema are included with the XML Compare distribution. To get auto-completion and context assistance when editing a DTCP file in your XML editor, associate the DTCP file with the schema. Many XML editors observe the 'xsi:noNamespaceSchemaLocation' attribute, which provides a 'hint' to the XSD file location. The XSD 1.1 version of the schema is preferred because it provides additional checking, for example type-checking on values referenced using 'parameterRef' attributes.

An example showing the XSD schema file associated with a DTCP XML file:

XML
<dataComparator xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="data-comparator-1_0.xsd" 
  version="1.0" 
  id="example" 
  description="Example of a DTCP definition">...

Summary of DTCP

Here is a quick summary of DTCP:

  • It is a tool customization/extension language, not a general purpose XML pipelining language.

  • DTCP is a data-driven way of configuring and extending a DataComparator object which can then be used by a Java program.

  • DTCP defines extensions to the Data Comparator pipeline using filters at specified extensionPoints.

  • Using DTCP is generally much simpler than Java programming.

  • All Data Comparator features - barring low-level components such as progress listeners - are accessible via DTCP.

  • Parameters with default values can be defined in DTCP; such values can then be overridden externally.

  • All significant values within DTCP can reference declared parameters instead of literal values.

  • XPath 2.0 expressions that reference declared parameters can be used instead of literal values.

The Data Comparator Pipeline Model

Underlying the DTCP is a model. The example below illustrates how key parts of the model are used to produce a solution for comparing documents of two custom types: 'major' and 'minor'.

An example

In this particular example, we will:

  • Add XML attributes on the input pipeline so whitespace-normalization is optimized for the type of XML.

  • Add XML attributes on the input pipeline to mark formatting-only elements.

  • Optionally convert the XML delta format of the comparison output to a folding-html rendering.

  • Enable all lexical preservation - and keep information on changes.

  • Define parameters so different behaviours can be achieved using the same DTCP but with different parameter overrides.

Details of the features and concepts used in this example are described after the example DTCP.

DTCP for example pipeline

XML
<?xml version="1.0" encoding="UTF-8"?>
<dataComparator xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    version="1.0"
    xsi:noNamespaceSchemaLocation="data-comparator-1_0.xsd"
    id="example"
    description="Example of a DTCP Configuration" >
  
  <pipelineParameters>
    <booleanParameter name="word-by-word" defaultValue="false">
      <description>Allow word by word comparison</description>
    </booleanParameter>
    <booleanParameter name="attribute-splitting" defaultValue="false">
      <description>Allow attribute splitting</description>
    </booleanParameter>
    <stringParameter name="formatting-element-list" defaultValue="b,i,u,em,strong,emphasis">
      <description>
        Comma-separated list of formatting elements defined in the input XML grammar.
      </description>
    </stringParameter>
    <stringParameter name="orphan-length" defaultValue="2"/>
    <stringParameter name="orphan-threshold" defaultValue="20"/>
  </pipelineParameters>
  
  <advancedConfig>
    <outputProperties>
      <property name="indent" literalValue="yes"/>
    </outputProperties>
    <transformerConfigurationProperties>
      <stringProperty name="http://saxon.sf.net/feature/xsltVersion" literalValue="3.0"/>
    </transformerConfigurationProperties>
  </advancedConfig>
  
  <standardConfig>
    <comparisonReport>
      <generateReport literalValue="true"/>
      <reportDirectory literalValue="target/reports"/>
    </comparisonReport>
    
    <namespaceConfiguration>
      <defaultNamespace uri="http://my-default.com"/>
      <userNamespaces>
        <userNamespace prefix="people" uri="http://people.com"/>
      </userNamespaces>
    </namespaceConfiguration>
    
    <ignoreChangesConfig>
      <locations>
        <location ignoreXpath="/addressBook/people:person/log/lastLoggedIn" resultRule="B"/>
      </locations>
    </ignoreChangesConfig>
    
    <moveDetectionConfig>
      <isEnabled literalValue="true"/>
      <showMoveSource literalValue="true"/>
      <moveDetectionType literalValue="unrestricted"/>
      <moveCandidates>
        <moveCandidate elemXpath="*"/>
      </moveCandidates>
    </moveDetectionConfig>
    
    <subtreeProcessingMode>
        <subtrees>
          <subtree elemXpath="/document/div" ordered="false"/>
        </subtrees>
    </subtreeProcessingMode>
  </standardConfig>
  
  <extensionPoints>
    <inputExtensionPoints>
      <inputPreFlatteningPoint>
        <filter><file path="preprocess.xsl"/></filter>
      </inputPreFlatteningPoint>
      <inputPoint>
        <filter><file path="modify-inputs.xsl"/></filter>
      </inputPoint>
    </inputExtensionPoints>
    <outputExtensionPoints>
      <finalPoint>
        <filter><file path="post-process.xsl"/></filter>
      </finalPoint>
    </outputExtensionPoints>
  </extensionPoints>

</dataComparator>

Data Comparator

The root element for the DTCP is dataComparator. The description and id attributes here can be used by applications to summarize a DTCP and help select it from a set of other DTCP files.

A fullDescription child element could also be used here to provide a longer description of the DTCP for use by external systems.

XML
<dataComparator 
    version="1.0" 
    id="example" 
    description="Example of a DTCP definition" >...

Pipeline Parameters

By using parameters we allow a DTCP-defined pipeline to be reconfigured by an external system or even an end-user, avoiding the need to construct several similar pipeline definitions. There are potential performance benefits also, because only a single set of XSLT filters needs to be compiled ready for running different types of comparison.

Extract from the example showing DTCP pipeline parameter declarations

XML
  <pipelineParameters>
    ...
    <booleanParameter name="word-by-word" defaultValue="false">
      <description>Allow word by word comparison</description>
    </booleanParameter>
    ...
    <stringParameter name="formatting-element-list" defaultValue="b,i,u,em,strong,emphasis"/>
    <stringParameter name="orphan-length" defaultValue="2"/>
    <stringParameter name="orphan-threshold" defaultValue="20"/>
    ...
  </pipelineParameters>

The pipelineParameters element contains a set of named string and boolean parameters (elements stringParameter and booleanParameter). These parameters set the default behaviour of this example. In this case, some parameters are referenced directly using attributes, while others are referenced as XPath variables within attributes containing XPath expressions.

Advanced Use

XPath expressions in the form of XSLT attribute value templates can be embedded in the defaultValue attribute of stringParameter elements. Here, XPath variables may reference previously defined parameters.

When using the Java API, the DTCPConfiguration object can be initialized with two maps supplied as arguments, one map for string parameters and the other for boolean parameters. A setParams method can be called on this object to supply a new set of parameter value overrides.
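As a minimal sketch of preparing such overrides, the two maps could be built as shown below. The map types (parameter name to String, parameter name to Boolean) and the class name ParameterOverrides are assumptions made for illustration; the actual constructor and setParams signatures are given in the Java API documentation and are deliberately not called here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds override maps for parameters declared in the example DTCP.
final class ParameterOverrides {

    /** Overrides for parameters declared with stringParameter. */
    static Map<String, String> stringOverrides() {
        Map<String, String> overrides = new LinkedHashMap<>();
        overrides.put("orphan-length", "3");
        overrides.put("orphan-threshold", "25");
        return overrides;
    }

    /** Overrides for parameters declared with booleanParameter. */
    static Map<String, Boolean> booleanOverrides() {
        Map<String, Boolean> overrides = new LinkedHashMap<>();
        overrides.put("word-by-word", Boolean.TRUE);
        return overrides;
    }

    public static void main(String[] args) {
        // The maps would then be supplied to DTCPConfiguration as described
        // in the Java API documentation (call not shown, signature not assumed).
        System.out.println(stringOverrides());
        System.out.println(booleanOverrides());
    }
}
```

Parameters that are not present in a map would be expected to keep the defaultValue declared in the DTCP file.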

Setting DTCP Property Values

Throughout the DTCP file, one of three possible attributes must be used to set DTCP properties or filter parameter values on an element. In the example XML snippet below, the resultReadabilityOptions child properties have values set using all three of these in turn:

XML
1      <modifiedWhitespaceBehaviour xpath="if ($normalize-whitespace) 
                                           then 'normalize' else 'show'"/>
2      <orphanedWordDetectionEnabled parameterRef="orphaned-words"/>
3      <orphanedWordLengthLimit literalValue="2"/>

The attributes:

  1. xpath contains an XPath expression that references the 'normalize-whitespace' boolean parameter as an XPath variable to conditionally set the value to 'normalize' or 'show'.

  2. parameterRef contains the name of the boolean parameter 'orphaned-words' that is used to set this value.

  3. literalValue contains '2', the actual value for the property.

One of these three attributes must always be used when setting a DTCP property or filter parameter. They are mutually exclusive, so validation of the DTCP will fail if you attempt to use more than one of these attributes on the same element.
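The "exactly one of three" rule can be illustrated with a small stand-alone check written against the JDK's DOM API. This is only a sketch of the rule itself: real validation is performed against the DTCP schema, and the class name ValueAttributeCheck is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: counts the value-setting attributes on a single element.
final class ValueAttributeCheck {
    private static final String[] VALUE_ATTRIBUTES = {"xpath", "parameterRef", "literalValue"};

    /** Returns how many of the three value-setting attributes are present. */
    static int countValueAttributes(Element element) {
        int count = 0;
        for (String name : VALUE_ATTRIBUTES) {
            if (element.hasAttribute(name)) {
                count++;
            }
        }
        return count;
    }

    /** Parses a one-element XML fragment and applies the rule to its root element. */
    static boolean hasExactlyOne(String xmlFragment) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmlFragment)))
                    .getDocumentElement();
            return countValueAttributes(root) == 1;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasExactlyOne("<orphanedWordLengthLimit literalValue=\"2\"/>"));
    }
}
```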

Attribute Value Templates

For attributes other than 'xpath' and 'when', the '{' and '}' characters have special significance: they are treated as delimiters of XSLT 'attribute value templates' (AVTs). So, if you need to use these characters literally, they should be escaped as '{{' and '}}' respectively.
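If DTCP files are generated programmatically, the escaping rule can be applied with a helper along these lines. This is a sketch using only the JDK; the class name AvtEscaper is hypothetical and is not part of the XML Compare API.

```java
// Hypothetical helper: escapes literal curly braces for use in AVT-aware attributes.
final class AvtEscaper {

    /** Doubles every '{' and '}' so that they are read literally, not as AVT delimiters. */
    static String escape(String literal) {
        StringBuilder out = new StringBuilder(literal.length() + 8);
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            out.append(c);
            if (c == '{' || c == '}') {
                out.append(c); // emit the brace a second time
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("report-{draft}.xml")); // prints report-{{draft}}.xml
    }
}
```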

Filter names, paths, URLs or classes can potentially be set using AVTs. However, these are only evaluated with the initialising set of parameters in the evaluation context, because all filters are loaded only once. Filter parameters though are re-evaluated each time the parameter set changes.

Advanced Configuration

The advancedConfig element is used to set properties and features of low-level components used by the Data Comparator. In this example, to prevent indentation of the XML output, the child outputProperties element is used to set the 'indent' property of the built-in Saxon Serializer instance to 'no'. Also, to prevent issues when DTDs are not available, we can prevent the parser from attempting to load the DTD; here, the relevant Apache parser feature is set to 'false' in the parserFeatures element.

XML
  <advancedConfig>
    <outputProperties>
      <property name="indent" literalValue="no"/>
    </outputProperties>
    <parserFeatures>
      <feature
        name="http://apache.org/xml/features/nonvalidating/load-external-dtd"
        literalValue="false"/>
    </parserFeatures>
  </advancedConfig>

Child elements of the advancedConfig element determine factors such as how DTDs or schemas are loaded and used, what collations are used for sorting and how XML is serialized. Full details can be found in the referenced documentation in the table below:

Properties and features managed via the advancedConfig element

Standard Configuration

The standardConfig element is used for setting properties that would otherwise be set via the DataComparator API. In this example, the child resultReadabilityOptions is used to configure corresponding properties available in the DataComparator class.

XML
  <standardConfig>
    <resultReadabilityOptions>
      ...
      <orphanedWordMaxPercentage parameterRef="orphan-threshold"/>
    </resultReadabilityOptions>
        
  </standardConfig>

To help illustrate the relationship between DTCP and the DataComparator API, here is the equivalent Java code for setting the 'orphan-threshold' value:

JAVA
DataComparator dc = new DataComparator();
int orphanThreshold = 20;
dc.getResultReadabilityOptions().setOrphanedWordMaxPercentage(orphanThreshold);

Extension Points

The extensionPoints element contains elements defining all filters to be inserted into the XML processing pipeline. The parent of each filter element determines the extension point at which that filter is inserted. With the exception of the 'inputPreFlatteningPoint' element, each extension point element needs a further parent element to specify the general extension point group within the pipeline, that is, the input or output group (for example, outputExtensionPoints for finalPoint).

XML
  <extensionPoints>
    <inputPreFlatteningPoint>
      <filter ...
    </inputPreFlatteningPoint>

    <outputExtensionPoints>
      <finalPoint>
        <filter ...
      </finalPoint>
    </outputExtensionPoints>
  </extensionPoints>

The diagram below shows the basic DTCP pipeline model, with two input pipes (A and B), a comparator in the middle and a single output pipe. The location of named extension points is also shown.

[Diagram: Data Comparator extension points]

Filters are added to the Data Comparator pipeline at the extension points labelled in the diagram above.

Both of the XML inputs to a comparison are passed through chains of input filters. These filters can add, remove or change information as data passes through them. Each filter operates by modifying a stream of SAX events (or callbacks to a SAX ContentHandler).

The operation of these filters can be defined using Java or XSLT. In contrast to DCP, the input filters for DTCP are all symmetrical (the same filters apply to both A and B inputs).

In our example above, we only use the extension points: inputPreFlatteningPoint and finalPoint.

Filters

A DTCP filter is represented by a filter element that must be contained within an element representing an extension point in the pipeline to which the filter should be added.

More generally, a filter is a component in a pipeline which processes XML data in some way.

Input and output filters can be implemented using XSLT or Java. The use of Java for output filtering is facilitated by the use of the XMLOutputFilter class and associated adapters provided in the XML Compare API. These supplant the JAXP mechanism and are described in more detail in Powering Pipelines with JAXP.

Java filters

A Java filter is one which implements the org.xml.sax.XMLFilter interface, typically by extending the XMLFilterImpl class. It is used in compiled form, so the associated class file must be available to the classloader of the application. To use a Java filter, its fully qualified class name is specified in a class element added as a child of the filter element, as in the following example. This example demonstrates the use of one of the filters included in the deltaxml-x.y.z.jar file included in the release, where x.y.z is the major.minor.patch version number of your release, e.g. deltaxml-17.0.0.jar.

Using a Java filter

XML
<filter>
  <class name="com.deltaxml.pipe.filters.WordByWordInfilter"/>
</filter>
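To give a feel for what such a filter class looks like, here is a minimal sketch built only on JDK classes. DropBlankTextFilter and FilterDemo are hypothetical names, not filters shipped with XML Compare, and the demo drives the filter with a plain JAXP identity transform rather than the Data Comparator pipeline.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: drops whitespace-only text events and passes all other events through. */
class DropBlankTextFilter extends XMLFilterImpl {
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!new String(ch, start, length).trim().isEmpty()) {
            super.characters(ch, start, length); // forward non-blank text unchanged
        }
    }
}

/** Stand-alone driver, used here only to exercise the filter outside XML Compare. */
final class FilterDemo {
    static String run(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            DropBlankTextFilter filter = new DropBlankTextFilter();
            filter.setParent(reader); // the filter sits between the parser and the consumer

            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new SAXSource(filter, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Filtering failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<a>\n  <b>text</b>\n</a>"));
    }
}
```

Once compiled and placed on the application's classpath, a class like this would be referenced from a DTCP file by its fully qualified name in a class element.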

XSLT filters

There are a number of ways to locate an XSLT filter, including:

  • Specify a URL in a http element

  • Specify a file path in a file element

  • Include the filter in a Jar file and use a resource element

HTTP URL support is based on the java.net.URL class. The following example shows how a filter can be addressed using a URL.

Referring to an XSLT filter by HTTP URL

XML
<filter>
  <http url="http://www.example.com/samples/filter.xsl"/>
</filter>

Files can also be used to specify XSLT filter locations. The underlying support for this type of filter specification is based on the java.io.File class, and any file specifications should be compatible with the pathnames used with this Java class. See the following example.

Referring to an XSLT filter by File location

XML
<filter if="normalize-whitespace">
  <file path="mark-ws-preserved.xsl" relBase="dxp"/>
</filter>

The above example uses a relative path to specify the location of the file. For such relative paths, the relBase attribute is used to specify how the path is resolved. This attribute uses one of these 3 values:

  • current - resolve using the current working directory, obtained from the Java user.dir system property

  • home - resolve using the user's home directory, corresponding to the Java property user.home

  • dxp - resolve using the directory containing the DTCP file, when it is loaded from a File (note: 'dxp' is used here to maintain compatibility with the filter element structure used in DXP.).
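The three relBase values can be pictured with the following java.io.File sketch. It mirrors the descriptions above and is not the product's implementation: the class name RelBaseResolver is hypothetical, and returning absolute paths unchanged is an assumption.

```java
import java.io.File;

// Hypothetical helper mirroring the three relBase values described above.
final class RelBaseResolver {

    /**
     * Resolves a filter path against the base selected by relBase.
     * dtcpFile is only needed for the "dxp" case.
     */
    static File resolve(String path, String relBase, File dtcpFile) {
        File candidate = new File(path);
        if (candidate.isAbsolute()) {
            return candidate; // assumption: absolute paths need no base
        }
        switch (relBase) {
            case "current":
                return new File(System.getProperty("user.dir"), path);
            case "home":
                return new File(System.getProperty("user.home"), path);
            case "dxp":
                // directory that contains the DTCP file itself
                return new File(dtcpFile.getAbsoluteFile().getParentFile(), path);
            default:
                throw new IllegalArgumentException("Unknown relBase: " + relBase);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("mark-ws-preserved.xsl", "current", null));
    }
}
```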

The final way of locating XSLT scripts is the resource mechanism. This allows XSLT files to be located on the classpath, and in particular in .jar files. The path used is the location of the XSLT script within the jar file, and more precisely is the path used as an argument to the ClassLoader.getResource(String) method.

This mechanism is provided so that you can deliver, to an end-user, a single jar file containing both code and data for one or more DTCP pipelines. The following example XML snippet shows how a reference to a filter located in a jar file is added.

Referring to an XSLT filter inside a Jar File

XML
<filter if="render-as-folding-html">
    <resource name="xsl/dx2-deltaxml-folding-html" />
    ...
</filter>
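When a resource reference does not resolve, it can help to check the lookup directly with ClassLoader.getResource(String), which is the method named above. The sketch below uses only the JDK; the class name ResourceLookup is hypothetical, and the class loader that XML Compare itself uses may differ from the one shown.

```java
import java.net.URL;

// Hypothetical diagnostic helper for classpath resource lookups.
final class ResourceLookup {

    /** Returns the URL of a classpath resource, or null if it cannot be found. */
    static URL find(String resourcePath) {
        // Paths given to ClassLoader.getResource are '/'-separated, relative to
        // the classpath root, and have no leading slash.
        ClassLoader loader = ResourceLookup.class.getClassLoader();
        return loader.getResource(resourcePath);
    }

    public static void main(String[] args) {
        String path = args.length > 0 ? args[0] : "xsl/my-filter.xsl"; // illustrative path
        URL url = find(path);
        System.out.println(url == null ? "Not on classpath: " + path : "Found: " + url);
    }
}
```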

Filter Parameters

The operation of a filter may be controlled by parameters passed to the filter. Any number of parameters may be supplied to a filter, but their names must match those defined within the filter. Parameters are listed as child parameter elements within the filter element. An example:

XML
<filter if="render-as-folding-html">
  <resource name="xsl/dx2-deltaxml-folding-html" />
  <parameter name="smart-whitespace-normalization"
             xpath="not($normalize-whitespace)"/>
</filter>

DTCP filter parameter values are set using 'literalValue', 'parameterRef' or 'xpath' attributes as described in the Setting DTCP Property Values section above.

When setting parameter values for XSLT filters, the 'xpath' attribute has special significance because the result of evaluating the expression is passed directly to the XSLT filter as an XPath Data Model value (Saxon XdmValue). This means that parameter values are not restricted to simple strings; they may, for example, evaluate to a sequence of xs:integer values. When using non-string values, the corresponding xsl:param instruction in the XSLT should be declared with an appropriate type, for example:

XML
<parameter name="heading-levels" xpath="(1,2,3)"/>

Should have a corresponding declaration in the XSLT, such as:

XML
<xsl:param name="heading-levels" as="xs:integer*"/>

To supply parameters to Java filters, a parameter-setting (set) method should be provided. This method must conform to certain requirements: its name must be the string set followed by the exact DTCP parameter name string, and it should take a single boolean or String parameter.

Please consult the sample filters and pipelines provided in Bitbucket for examples.
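The naming requirement can be sketched as follows. MarkerFilter and its two parameters ('marker' and 'enabled') are hypothetical, and the reflective lookup in SetterLookup only illustrates how a method name is derived from a parameter name; it is not a description of how XML Compare invokes the method internally.

```java
import java.lang.reflect.Method;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter accepting two parameters named 'marker' and 'enabled'. */
class MarkerFilter extends XMLFilterImpl {
    private String marker = "";
    private boolean enabled;

    // "set" followed by the exact parameter name, taking a single String
    public void setmarker(String value) {
        this.marker = value;
    }

    // "set" followed by the exact parameter name, taking a single boolean
    public void setenabled(boolean value) {
        this.enabled = value;
    }

    String describe() {
        return marker + ":" + enabled;
    }
}

/** Illustration only: derives the method name from the parameter name and calls it. */
final class SetterLookup {
    static void apply(Object filter, String parameterName, Object value) {
        Class<?> argumentType = (value instanceof Boolean) ? boolean.class : String.class;
        try {
            Method setter = filter.getClass().getMethod("set" + parameterName, argumentType);
            setter.invoke(filter, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No set method for parameter " + parameterName, e);
        }
    }

    public static void main(String[] args) {
        MarkerFilter filter = new MarkerFilter();
        apply(filter, "marker", "changed");
        apply(filter, "enabled", Boolean.TRUE);
        System.out.println(filter.describe());
    }
}
```

Because the method name embeds the parameter name verbatim, a parameter intended for a Java filter needs a name that is legal inside a Java identifier, so names containing hyphens would not work with this convention.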

Conditional Filter Processing

While filters are always loaded when a DTCP is first initialized, externally supplied pipeline parameter values can be used to enable or disable these filters for specific comparisons.

Two attributes, 'if' and 'unless' may be added to any filter element. Their values should refer to one boolean formal parameter by name. In the case of the if attribute, when the associated parameter is true then the filter is applied. Conversely, the unless attribute applies the filter when the referenced parameter is false. If both pipeline control parameters are used (and hopefully refer to different parameters!) the application of the pipeline stage is determined by the boolean-and of both conditions.
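The combined effect of 'if' and 'unless' can be written out as a small truth function. This is a sketch of the rule as described, using only the JDK; the class name FilterCondition is hypothetical, and rejecting an undeclared parameter name is an assumption.

```java
import java.util.Map;

// Hypothetical helper expressing the if/unless rule as a boolean function.
final class FilterCondition {

    /**
     * @param ifParam     parameter named in the 'if' attribute, or null when absent
     * @param unlessParam parameter named in the 'unless' attribute, or null when absent
     * @param parameters  current boolean pipeline parameter values
     * @return true when the filter should be applied
     */
    static boolean isApplied(String ifParam, String unlessParam, Map<String, Boolean> parameters) {
        boolean ifSatisfied = ifParam == null || value(parameters, ifParam);
        boolean unlessSatisfied = unlessParam == null || !value(parameters, unlessParam);
        return ifSatisfied && unlessSatisfied; // boolean-and of both conditions
    }

    private static boolean value(Map<String, Boolean> parameters, String name) {
        Boolean value = parameters.get(name);
        if (value == null) {
            throw new IllegalArgumentException("Undeclared boolean parameter: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, Boolean> parameters = new java.util.HashMap<>();
        parameters.put("word-by-word", true);
        parameters.put("attribute-splitting", false);
        System.out.println(isApplied("word-by-word", "attribute-splitting", parameters));
    }
}
```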

The 'when' attribute must be used on its own. Its value should be an XPath expression that evaluates to an xs:boolean. All pipeline parameters are part of the evaluation context, but there is no context item so expressions should be 'context-free'.

The following snippet from the full example shows how filters in the 'inputPreFlatteningPoint' extension point are enabled or disabled according to the values of pipeline parameters 'formatting-elements', 'document-type' and 'normalize-whitespace'.

Conditional Filter Example

XML
<inputPreFlatteningPoint>
  <filter when="$formatting-elements 
          and $document-type eq 'major'">
    <file path="mark-major-formatting.xsl" relBase="dxp"/>
  </filter>
  <filter when="$formatting-elements 
          and $document-type eq 'minor'">
    <file path="mark-minor-formatting.xsl" relBase="dxp"/>
  </filter>
  <filter if="normalize-whitespace">
    <file path="mark-mixed-content.xsl" relBase="dxp"/>
  </filter>
  <filter if="normalize-whitespace">
    <file path="mark-ws-preserved.xsl" relBase="dxp"/>
  </filter>
</inputPreFlatteningPoint>

Processing Instructions

For applications exploiting different DTCP configurations, it may help to be able to read application-specific information from a DTCP file. To assist with this, processing instructions added as immediate children of the root element may be read using the getProcessingInstruction(String instructionName) method of the DTCPConfiguration class.

For example, using this processing instruction as a child of the dataComparator element:

XML
<?deltaxml.outputType xml?>

The following Java code can be used to retrieve the value:

JAVA
DTCPConfiguration dtcp = new DTCPConfiguration(new File("sample.dxp"));
String piValue = dtcp.getProcessingInstruction("deltaxml.outputType");

Differences between DTCP (DataComparator), DCP (DocumentComparator) and DXP (PipelinedComparatorS9)

This section describes the differences between DTCP, covered in this document, DCP, as described in the DCP User Guide, and DXP, as described in the DXP User Guide.

DXP is an XML format used to define the filters and settings for a PipelinedComparator object instead of the DataComparator object configured with DTCP.

The PipelinedComparatorS9 object is the most light-weight XML comparator. DataComparator is the latest comparator; it combines the convenience of DocumentComparator and DCP with a smaller set of options, giving a more light-weight alternative. PipelinedComparatorS9 and DataComparator are both suited to the comparison of XML data documents, while DocumentComparator has additional document-centric functionality.

The main differences between DTCP, DCP and DXP are outlined below:

  • DTCP defines a DataComparator object, DCP defines a DocumentComparator object, and DXP defines a PipelinedComparator object.

  • DCP was introduced to XML Compare after DXP. DTCP is the most recent addition to XML Compare.

  • Both the DCP and DTCP grammars are defined using XML Schema (1.0 or 1.1) instead of the DTD used for DXP.

  • DTCP configures all options for the DataComparator as well as defining the pipeline itself, whereas DXP supports a more limited set of options for the PipelinedComparator.

  • Filters are defined in the same way in DCP/DTCP, but they are applied only at specific extension points in the pipeline.

  • DTCP has fewer pipeline extension points than DCP.

  • DCP/DTCP allows XPath 2.0 evaluation, either in dedicated attributes or attribute value templates. Unlike DXP, DCP and DTCP do not support XQuery expressions.

  • A DocumentComparator defined using DCP has a highly featured built-in pipeline and therefore takes longer to initialize than a PipelinedComparator.

  • A DataComparator defined using DTCP has fewer built-in features than a DocumentComparator defined using DCP, so it takes less time to initialize.

Initiating a DataComparator with DTCP

A DTCP file can be loaded when performing a comparison using one of the following methods:

  1. With XML Compare's command-line processor

  2. Using the Java API, construct a DTCPConfiguration object using the DTCP file
