Attribute Splitting
What is Attribute Splitting?
Critical information about a file can be stored in its attributes. The attribute splitting feature allows such attribute values to be ‘split’ into smaller parts for a more granular comparison. The Attribute Splitting Config therefore controls the granularity and method used for comparing and describing differences inside attribute values.
Attribute Splitting Modes
The attribute splitting modes can be any one of the following:
Narrative Text - this mode enables a word by word comparison of the text within the attribute.
Dataset - this mode enables an unordered comparison of the text within the attribute.
Datalist - this mode enables an ordered comparison of the text within the attribute.
Please note that separators can be provided to tokenise the data into discrete values for comparison in the dataset or datalist modes.
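The difference between the two separator-based modes can be sketched in plain Java. This is only an illustration of the idea, consistent with the Dataset and Datalist results shown in Example 1; the class and method names are hypothetical, it does not use the DeltaXML API, and trimming whitespace around tokens is an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

/** Illustrative helper only; it is not part of the DeltaXML API. */
class SplitModesSketch {

    /** Splits on a comma and trims each token (trimming is an assumption). */
    static List<String> tokenise(String value) {
        List<String> tokens = new ArrayList<>();
        for (String raw : value.split(",")) {
            tokens.add(raw.trim());
        }
        return tokens;
    }

    /** Dataset-style check: same tokens, order ignored (duplicates collapse in the set). */
    static boolean sameAsDataset(String a, String b) {
        return new HashSet<>(tokenise(a)).equals(new HashSet<>(tokenise(b)));
    }

    /** Datalist-style check: same tokens in the same order. */
    static boolean sameAsDatalist(String a, String b) {
        return tokenise(a).equals(tokenise(b));
    }

    public static void main(String[] args) {
        String a = "AZ, Phoenix, 12345";
        String b = "Phoenix, AZ, 12345";
        System.out.println(sameAsDataset(a, b));  // true
        System.out.println(sameAsDatalist(a, b)); // false
    }
}
```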
Example 1. This example illustrates the effects of the various Attribute Splitting Modes.
Input A
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person>
<address value="1 East Beverley Drive, AZ, Phoenix, 12345">true</address>
</person>
</persons>
Input B
<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person>
<address value="2 East Beverley Drive, Phoenix, AZ, 12345">true</address>
</person>
</persons>
Note that critical information - in this case an address - is stored in the value attribute of the address
element. Below is the change representation when specifying the various attribute modes on the value
attribute in the address.
Narrative Text
If the Narrative Text attribute splitting mode is used, the following result would be produced:
<?xml version="1.0" encoding="UTF-8"?>
<persons xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
xmlns:dxx="http://www.deltaxml.com/ns/xml-namespaced-attribute"
xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute" deltaxml:deltaV2="A!=B"
deltaxml:content-type="full-context" deltaxml:version="2.0">
<person deltaxml:deltaV2="A!=B">
<address deltaxml:deltaV2="A!=B">
<deltaxml:attributes deltaxml:deltaV2="A!=B">
<dxa:value deltaxml:deltaV2="A!=B">
<deltaxml:attributeValueWords deltaxml:deltaV2="A!=B">
<deltaxml:textGroup deltaxml:deltaV2="A!=B">
<deltaxml:text deltaxml:deltaV2="A">1</deltaxml:text>
<deltaxml:text deltaxml:deltaV2="B">2</deltaxml:text>
</deltaxml:textGroup>
East Beverley Drive,
<deltaxml:textGroup deltaxml:deltaV2="A!=B">
<deltaxml:text deltaxml:deltaV2="A">AZ, Phoenix</deltaxml:text>
<deltaxml:text deltaxml:deltaV2="B">Phoenix, AZ</deltaxml:text>
</deltaxml:textGroup>
, 12345
</deltaxml:attributeValueWords>
</dxa:value>
</deltaxml:attributes>
true
</address>
</person>
</persons>
Although this is not a normal use case for the narrative text mode, it shows that the mode tokenises the sentence into words using language-specific information on punctuation, number formats and whitespace separators as defined in the ICU4J library.
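To get a feel for boundary-based word tokenisation, the sketch below uses the JDK's own java.text.BreakIterator as a stand-in. It is not ICU4J and not the product's tokeniser, so the exact tokens may differ from the result above; the class name is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: uses the JDK BreakIterator, not ICU4J or the DeltaXML API. */
class WordBoundarySketch {

    /** Returns the non-whitespace pieces found between word boundaries. */
    static List<String> words(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ENGLISH);
        boundaries.setText(text);
        List<String> pieces = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; end = boundaries.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                pieces.add(piece); // keep words and punctuation, drop pure whitespace
            }
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(words("1 East Beverley Drive, AZ, Phoenix, 12345"));
    }
}
```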
Dataset
If the value attribute is instead compared in Dataset mode with a comma separator (setting the separator to a specific value is explained in the Configuration section), the following result would be produced:
<?xml version="1.0" encoding="UTF-8"?>
<persons xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
xmlns:dxx="http://www.deltaxml.com/ns/xml-namespaced-attribute"
xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute" deltaxml:deltaV2="A!=B"
deltaxml:content-type="full-context" deltaxml:version="2.0">
<person deltaxml:deltaV2="A!=B">
<address deltaxml:deltaV2="A!=B">
<deltaxml:attributes deltaxml:deltaV2="A!=B">
<dxa:value deltaxml:deltaV2="A!=B">
<deltaxml:attributeTokenSet deltaxml:separator="," deltaxml:deltaV2="A!=B">
<deltaxml:token deltaxml:deltaV2="B">2 East Beverley Drive</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A">1 East Beverley Drive</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A=B">Phoenix</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A=B">AZ</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A=B">12345</deltaxml:token>
</deltaxml:attributeTokenSet>
</dxa:value>
</deltaxml:attributes>
true
</address>
</person>
</persons>
This shows that although it still tokenises and detects changes between the two attributes, the order is no longer taken into account and ‘Phoenix’ and ‘AZ’ are shown to be unchanged between the two files.
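An unordered comparison of this kind can be approximated with set membership. The following is a minimal sketch under stated assumptions (comma separator, trimmed tokens, hypothetical class and method names); it mirrors the A, B and A=B labels of the result above but is not how the product is implemented.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a set-based view of the Dataset result. */
class DatasetDiffSketch {

    /** Splits on commas and trims each token (trimming is an assumption). */
    static Set<String> tokens(String value) {
        Set<String> out = new LinkedHashSet<>();
        for (String raw : value.split(",")) {
            out.add(raw.trim());
        }
        return out;
    }

    /** Maps each token to "A", "B" or "A=B", ignoring token order. */
    static Map<String, String> classify(String a, String b) {
        Set<String> inA = tokens(a);
        Set<String> inB = tokens(b);
        Map<String, String> result = new LinkedHashMap<>();
        for (String token : inA) {
            result.put(token, inB.contains(token) ? "A=B" : "A");
        }
        for (String token : inB) {
            result.putIfAbsent(token, "B"); // only tokens not already seen in A
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classify(
                "1 East Beverley Drive, AZ, Phoenix, 12345",
                "2 East Beverley Drive, Phoenix, AZ, 12345"));
    }
}
```

With the two address values from Example 1, ‘AZ’, ‘Phoenix’ and ‘12345’ are classified as A=B, while the two street tokens are classified as A and B respectively.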
Datalist
Lastly, marking the value attribute as a datalist with a comma separator will separate the string into tokens on every comma and then perform an ordered comparison. The following result would be produced:
<?xml version="1.0" encoding="UTF-8"?>
<persons xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
xmlns:dxx="http://www.deltaxml.com/ns/xml-namespaced-attribute"
xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute" deltaxml:deltaV2="A!=B"
deltaxml:content-type="full-context" deltaxml:version="2.0">
<person deltaxml:deltaV2="A!=B">
<address deltaxml:deltaV2="A!=B">
<deltaxml:attributes deltaxml:deltaV2="A!=B">
<dxa:value deltaxml:deltaV2="A!=B">
<deltaxml:attributeTokenList deltaxml:separator="," deltaxml:deltaV2="A!=B">
<deltaxml:token deltaxml:deltaV2="A">1 East Beverley Drive</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="B">2 East Beverley Drive</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A">AZ</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A=B">Phoenix</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="B">AZ</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A=B">12345</deltaxml:token>
</deltaxml:attributeTokenList>
</dxa:value>
</deltaxml:attributes>
true
</address>
</person>
</persons>
As can be seen above, this was an ordered comparison: ‘AZ’ has moved between the two files, and this change is displayed in the deltaV2 output.
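An ordered comparison can be illustrated with a classic longest-common-subsequence (LCS) alignment. This is a generic sketch and an assumption about the general technique, not a description of the product's algorithm; the class name and the "A:", "B:", "A=B:" prefixes are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: an LCS-based ordered diff of two token lists. */
class DatalistDiffSketch {

    static List<String> diff(List<String> a, List<String> b) {
        int n = a.size();
        int m = b.size();
        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = a.get(i).equals(b.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        // Walk both lists, keeping matched tokens and labelling the rest
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (a.get(i).equals(b.get(j))) {
                out.add("A=B:" + a.get(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("A:" + a.get(i++));
            } else {
                out.add("B:" + b.get(j++));
            }
        }
        while (i < n) out.add("A:" + a.get(i++));
        while (j < m) out.add("B:" + b.get(j++));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(diff(
                Arrays.asList("1 East Beverley Drive", "AZ", "Phoenix", "12345"),
                Arrays.asList("2 East Beverley Drive", "Phoenix", "AZ", "12345")));
    }
}
```

For the Example 1 tokens this labels ‘Phoenix’ and ‘12345’ as unchanged and reports ‘AZ’ once as A-only and once as B-only, matching the classification in the result above. The order in which neighbouring A-only and B-only tokens are emitted depends on tie-breaking and may differ from the product's output.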
Configuration
When comparing two files, a default attribute splitting configuration can be set for any attributes in the inputs. Additionally, individual attributes located by XPath can be configured differently from the default or excluded from attribute splitting altogether. In order to configure the default settings and the settings for individual attribute locations, the following settings can be used (alongside the attribute splitting mode defined above):
Enabled
This option sets whether attribute splitting is enabled. It can be set as a default and specified separately for individual attribute locations; the value set on an individual location overrides the default.
Attribute XPath
This defines which attribute a location refers to. An absolute or relative XPath can be used.
Separator
For datasets and datalists a separator needs to be defined to split the data into discrete values. This setting defines the character that each attribute value should be split on. Multiple possible separator characters can be set by placing them next to each other in the setting, and the value will be split on any of those characters. This defaults to a comma if not specified.
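The ‘split on any of these characters’ behaviour can be sketched as follows. The helper is hypothetical and only illustrates the description above; trimming whitespace around tokens is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits a value on any one of several separator characters. */
class MultiSeparatorSketch {

    static List<String> split(String value, String separators) {
        if (separators == null || separators.isEmpty()) {
            separators = ","; // a comma is the documented default
        }
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (separators.indexOf(c) >= 0) {
                // Any character listed in 'separators' ends the current token
                tokens.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        tokens.add(current.toString().trim());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("William / Michael; Matthew", "/;")); // [William, Michael, Matthew]
        System.out.println(split("AZ, Phoenix", null));                // [AZ, Phoenix]
    }
}
```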
Regex
This is used to tokenise the attribute value using regular expression syntax. This allows for a lot more customisation in how tokens are separated. Regex will take priority over a separator where both are defined.
Output Separator
This defines the value of the output separator attribute that is written to the output for the location (it appears as deltaxml:outputTokenSeparator in the result of Example 2). It can be used for post-processing: for example, a list can be reconstructed from the deltaV2 output with the changes highlighted in red and green and the specified output separator used to separate the tokens.
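As a rough illustration of such post-processing, the sketch below reads a single token container from a deltaV2 result with the standard JAXP DOM parser and rebuilds the list using the output separator. It assumes the element and attribute names shown in the Example 2 result; the class name and the [-...-] and [+...+] text markers (standing in for red and green highlighting) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative only: rebuilds a token list from one deltaV2 token container. */
class OutputSeparatorSketch {

    static final String DELTA_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1";

    static String rebuild(String tokenContainerXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(tokenContainerXml.getBytes(StandardCharsets.UTF_8)));
            Element container = doc.getDocumentElement();
            String separator = container.getAttributeNS(DELTA_NS, "outputTokenSeparator");
            NodeList tokens = container.getElementsByTagNameNS(DELTA_NS, "token");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                String delta = token.getAttributeNS(DELTA_NS, "deltaV2");
                String text = token.getTextContent();
                if (i > 0) {
                    out.append(separator).append(' ');
                }
                if ("A".equals(delta)) {
                    out.append("[-").append(text).append("-]"); // only in input A
                } else if ("B".equals(delta)) {
                    out.append("[+").append(text).append("+]"); // only in input B
                } else {
                    out.append(text);                           // unchanged
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not read token container", e);
        }
    }
}
```

Applied to the middleNames token set from the Example 2 result, this yields William, Michael, [+Matthew+].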
Example 2. This example illustrates configuring the attribute splitting for a comparison using the attributeSplittingConfig property
Input A - a small person directory as an XML file (documentA.xml in the sample on Bitbucket)
<?xml version="1.0" encoding="UTF-8"?>
<persons xmlns="test-ns">
<person desc="Description of John Michael William Doe">
<name middleNames="Michael / William">John Doe</name>
<email domain="@hotmail.co.uk">jdoe</email>
<address value="1, East Beverley Drive, AZ, Phoenix, 12345"/>
<dateOfBirth content="12th May 1963"/>
</person>
</persons>
Input B - the modified person directory as an XML file (documentB.xml in the sample on Bitbucket)
<?xml version="1.0" encoding="UTF-8"?>
<persons xmlns="test-ns">
<person desc="Description of John Matthew Doe">
<name middleNames="William / Michael / Matthew">John Doe</name>
<email domain="@gmail.com, @hotmail.com">jdoe</email>
<address value="5, East Beverley Drive, Phoenix, AZ, 12345"/>
<dateOfBirth content="13th May 1963"/>
</person>
</persons>
The code below shows how to set the configuration for a comparison using both the Java API and the DCP.
Java API
DocumentComparator dc = new DocumentComparator();
// Default for all attributes: splitting enabled, narrative text mode
final AttributeSplittingConfig attributeSplittingConfig = new AttributeSplittingConfig(true, AttributeSplittingMode.narrativeText);
List<AttributeLocation> attributeLocations = new ArrayList<>();
// @desc: no overrides, so the default (narrative text) applies
attributeLocations.add(new AttributeLocation("@desc"));
// @middleNames: dataset, "/" as the separator and "," as the output separator
attributeLocations.add(new AttributeLocation("@middleNames", AttributeSplittingMode.dataSet, "/", ","));
// @domain: attribute splitting disabled for this attribute only
attributeLocations.add(new AttributeLocation("@domain", false));
// @value: datalist with "," as the separator
attributeLocations.add(new AttributeLocation("@value", AttributeSplittingMode.dataList, ","));
// @content: datalist tokenised by a regex instead of a separator character
AttributeLocation location = new AttributeLocation("@content", AttributeSplittingMode.dataList);
location.setRegex("(?:th)?\\s+");
attributeLocations.add(location);
// Register the locations and apply the configuration to the comparator
attributeSplittingConfig.setAttributeLocations(attributeLocations);
dc.setAttributeSplittingConfig(attributeSplittingConfig);
// Run the comparison and write the result file
dc.compare(input1, input2, new File(outputFileName));
DCP
<?xml version="1.0" encoding="UTF-8"?>
<documentComparator xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="1.0"
xsi:noNamespaceSchemaLocation="../dcp/core-dcp-v1_0-schema1_1.xsd"
id="attribute-splitting-demo"
description="Attribute Splitting Demo" >
<!--
for schema v1.1 validation of this file, change xsi:noNamespaceSchemaLocation above to:
xsi:noNamespaceSchemaLocation="../dcp/core-dcp-v1_0-schema1_1.xsd"
-->
<fullDescription>
Attribute Splitting API Configuration
</fullDescription>
<standardConfig>
<attributeSplittingConfig enabled="true" defaultMode="narrativeText">
<attributeLocations>
<attributeLocation attributeXpath="@desc"/>
<attributeLocation attributeXpath="@middleNames" mode="dataSet" separator="/" outputTokenSeparator=","/>
<attributeLocation attributeXpath="@domain" enabled="false"/>
<attributeLocation attributeXpath="@value" mode="dataList" separator=","/>
<attributeLocation attributeXpath="@content" mode="dataList" regex="(?:th)?\s+"/>
</attributeLocations>
</attributeSplittingConfig>
</standardConfig>
</documentComparator>
This configuration showcases all of the features of attribute splitting:
The default attribute splitting mode for the comparison is narrative text, and attribute splitting is set to be enabled by default.
The attribute desc has been added as an attribute location without any settings applied, so the default attribute splitting mode is used, which in this case is narrative text. Narrative text is the most appropriate mode here as the attribute contains a sentence.
The middleNames attribute has the attribute splitting mode set to dataset with a forward slash as the separator and a comma as the output separator. The dataset mode is useful here for checking whether the names remain unchanged between the two files regardless of their order. Since an unconventional separator is used, the output token separator can be defined to replace it in the post-processing of the comparison output.
The attribute domain has attribute splitting disabled. This applies to this attribute only and does not affect other attributes.
The attribute value has been set as a datalist with a comma separator. A datalist is the best mode here as the attribute contains an address and it is important to show if the order has changed between the two files.
The content attribute is also a datalist but, instead of a separator character, the custom regex (?:th)?\s+ is used to separate the date 12th May 1963 into 12, May, and 1963. It works by splitting on whitespace characters, optionally including the preceding th if present.
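The effect of the (?:th)?\s+ pattern can be checked with plain java.util.regex, independently of the comparison itself. The class name below is hypothetical; only the pattern is taken from the configuration above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative only: shows how the configured regex tokenises a date string. */
class RegexTokeniserSketch {

    // Whitespace, optionally preceded by a literal "th", acts as the delimiter
    private static final Pattern DATE_SPLIT = Pattern.compile("(?:th)?\\s+");

    static List<String> tokenise(String value) {
        return Arrays.asList(DATE_SPLIT.split(value));
    }

    public static void main(String[] args) {
        System.out.println(tokenise("12th May 1963")); // [12, May, 1963]
        System.out.println(tokenise("13th May 1963")); // [13, May, 1963]
    }
}
```

Note that the pattern only strips th; an ordinal suffix such as st, nd or rd would remain part of its token.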
Using the inputs and the configuration described above, the following result would be produced:
Result - the deltaV2 result as an XML file (result.xml in the sample on Bitbucket)
<?xml version="1.0" encoding="UTF-8"?>
<persons xmlns="test-ns" xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1"
xmlns:dxx="http://www.deltaxml.com/ns/xml-namespaced-attribute"
xmlns:dxa="http://www.deltaxml.com/ns/non-namespaced-attribute" deltaxml:deltaV2="A!=B"
deltaxml:content-type="full-context" deltaxml:version="2.0">
<person deltaxml:deltaV2="A!=B">
<deltaxml:attributes deltaxml:deltaV2="A!=B">
<dxa:desc deltaxml:deltaV2="A!=B">
<deltaxml:attributeValueWords deltaxml:deltaV2="A!=B">Description of John
<deltaxml:textGroup deltaxml:deltaV2="A!=B">
<deltaxml:text deltaxml:deltaV2="A">Michael William</deltaxml:text>
<deltaxml:text deltaxml:deltaV2="B">Matthew</deltaxml:text>
</deltaxml:textGroup>
Doe
</deltaxml:attributeValueWords>
</dxa:desc>
</deltaxml:attributes>
<name deltaxml:deltaV2="A!=B">
<deltaxml:attributes deltaxml:deltaV2="A!=B">
<dxa:middleNames deltaxml:deltaV2="A!=B">
<deltaxml:attributeTokenSet deltaxml:outputTokenSeparator="," deltaxml:deltaV2="A!=B">
<deltaxml:token deltaxml:deltaV2="A=B">William</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A=B">Michael</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="B">Matthew</deltaxml:token>
</deltaxml:attributeTokenSet>
</dxa:middleNames>
</deltaxml:attributes>
John Doe
</name>
<email deltaxml:deltaV2="A!=B">
<deltaxml:attributes deltaxml:deltaV2="A!=B">
<dxa:domain deltaxml:deltaV2="A!=B">
<deltaxml:attributeValue deltaxml:deltaV2="A">@hotmail.co.uk</deltaxml:attributeValue>
<deltaxml:attributeValue deltaxml:deltaV2="B">@gmail.com, @hotmail.com</deltaxml:attributeValue>
</dxa:domain>
</deltaxml:attributes>
jdoe
</email>
<address deltaxml:deltaV2="A!=B">
<deltaxml:attributes deltaxml:deltaV2="A!=B">
<dxa:value deltaxml:deltaV2="A!=B">
<deltaxml:attributeTokenList deltaxml:outputTokenSeparator="," deltaxml:deltaV2="A!=B">
<deltaxml:token deltaxml:deltaV2="A">1</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="B">5</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A=B">East Beverley Drive</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A">AZ</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A=B">Phoenix</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="B">AZ</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A=B">12345</deltaxml:token>
</deltaxml:attributeTokenList>
</dxa:value>
</deltaxml:attributes>
</address>
<dateOfBirth deltaxml:deltaV2="A!=B">
<deltaxml:attributes deltaxml:deltaV2="A!=B">
<dxa:content deltaxml:deltaV2="A!=B">
<deltaxml:attributeTokenList deltaxml:outputTokenSeparator="," deltaxml:deltaV2="A!=B">
<deltaxml:token deltaxml:deltaV2="A">12</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="B">13</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A=B">May</deltaxml:token>
<deltaxml:token deltaxml:deltaV2="A=B">1963</deltaxml:token>
</deltaxml:attributeTokenList>
</dxa:content>
</deltaxml:attributes>
</dateOfBirth>
</person>
</persons>
Namespaces
If XML namespaces are being used within the input document then it is convenient to be able to use namespace prefixes within the attribute selection XPath to precisely indicate the attributes to be split. It is possible to define the XML namespace prefix and value pairs that will be available within XPath expressions using both the Java API and the DCP configuration file. Please see Using Namespaces Within XPath Expressions.
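Why prefixes matter can be seen with the standard javax.xml.xpath API. The sketch below is not the DeltaXML mechanism for declaring prefixes (see the linked page for that); it only illustrates general XPath 1.0 behaviour: element names in a default namespace such as test-ns need a bound prefix, whereas unprefixed attributes such as desc are in no namespace. The class name and the prefix t are hypothetical.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustrative only: evaluates an XPath with one prefix-to-namespace binding. */
class NamespacedXPathSketch {

    static String evaluate(String xml, String expression, String prefix, String uri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String p) {
                    return prefix.equals(p) ? uri : "";
                }
                @Override
                public String getPrefix(String u) {
                    return uri.equals(u) ? prefix : null;
                }
                @Override
                public Iterator<String> getPrefixes(String u) {
                    return uri.equals(u)
                            ? Collections.singletonList(prefix).iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate(expression, doc); // string value of the first match
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<persons xmlns=\"test-ns\"><person desc=\"Description of John Doe\"/></persons>";
        System.out.println(evaluate(xml, "/t:persons/t:person/@desc", "t", "test-ns")); // Description of John Doe
        System.out.println(evaluate(xml, "/persons/person/@desc", "t", "test-ns"));     // empty: no match
    }
}
```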