The Markup Worker is an implementation of the Abstract Worker. It can be used to identify constructs within a document or e-mail. When used with e-mails it performs email content segregation. The separated emails are marked up with XML and the worker returns a configurable set of data.
The worker's image is built by worker-markup-container and uses the opensuse-jdk11 base image.
The Markup Worker reads JSON encoded input messages which contain metadata about the document to be processed. The metadata values are passed using `ReferencedData` objects which means they may be passed by value (i.e. directly in the message) or by reference (i.e. where the message contains a location in central storage which contains the value). The messages also contain additional information about the request, such as how the client would like hashes to be generated, and what information they would like to see returned.
When used with e-mails, the Markup Worker will split the email chain and mark up individual emails with `<email>` tags. It then generates `<headers>` tags containing the headers identified in the email text, such as `<To>`, `<From>`, `<Cc>`, etc. The remaining email text is placed within `<body>` tags.
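The resulting structure can be pictured with a deliberately naive sketch. It is not the worker's actual separation algorithm: the `markup_chain` helper, the fixed list of header names, and the rule that a `From:` line starts a new email are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Header names recognised by this sketch (an assumption; the worker's list may differ).
HEADER_RE = re.compile(r"^(From|Sent|To|Cc|Subject):")


def markup_chain(chain: str) -> ET.Element:
    """Naively wrap an email chain in <root>/<email>/<headers>/<body> elements."""
    root = ET.Element("root")
    email = headers = body = None
    in_headers = False
    for line in chain.splitlines():
        match = HEADER_RE.match(line)
        if match and match.group(1) == "From" and not in_headers:
            # A "From:" line outside a header block starts a new email.
            email = ET.SubElement(root, "email")
            headers = ET.SubElement(email, "headers")
            body = ET.SubElement(email, "body")
            body.text = ""
            in_headers = True
        if email is None:
            continue  # ignore any text before the first "From:" line
        if in_headers and match:
            # Keep the full header line as the element text, as in the sample output.
            ET.SubElement(headers, match.group(1)).text = line
        else:
            in_headers = False
            body.text += line + "\n"
    return root
```

Calling `ET.tostring(markup_chain(text), encoding="unicode")` on a short chain gives one `<email>` element per message, each with its `<headers>` and `<body>` children.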
The hash configuration supplied with the worker task message determines the individual hashes to generate; each can be given a name with the `name` property. For each hash, the client supplies a list of tag elements to include in the hash, the method of normalization to perform on each tag element's value (e.g. `REMOVE_WHITESPACE`) and the hash function used to generate the hash (e.g. `XXHASH64`). A `scope` setting determines whether the hash is generated for each email or for the entire email chain. With the `EMAIL_SPECIFIC` scope, the specified fields in each email are included in the hash and `<hash>` tags are added to each email. With the `EMAIL_THREAD` scope, the `<hash>` tags are added under the `<root>` element and include the specified fields for the entire email chain.
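The difference between the two scopes can be sketched as follows. This is a minimal illustration, not the worker's implementation: `normalize`, `digest` and `hash_emails` are hypothetical helpers, stripping all whitespace is an assumed reading of `REMOVE_WHITESPACE`, and BLAKE2 from the Python standard library stands in for XXHASH64, so the digests will not match those produced by the worker.

```python
import hashlib
import re


def normalize(value: str, normalization_type: str) -> str:
    """Apply a simplified normalization; only two types are sketched here."""
    if normalization_type == "NONE":
        return value
    if normalization_type == "REMOVE_WHITESPACE":
        return re.sub(r"\s+", "", value)  # assumed behaviour
    raise ValueError(f"normalization not sketched: {normalization_type}")


def digest(values: list) -> str:
    """Stand-in 64-bit digest; the worker itself uses XXHASH64, not BLAKE2."""
    joined = "".join(values).encode("utf-8")
    return hashlib.blake2b(joined, digest_size=8).hexdigest()


def hash_emails(emails: list, config: dict) -> list:
    """Return one digest per email, or a single digest for the whole thread."""
    per_email = [
        [normalize(email.get(f["name"], ""), f["normalizationType"])
         for f in config["fields"]]
        for email in emails
    ]
    if config["scope"] == "EMAIL_SPECIFIC":
        return [digest(values) for values in per_email]
    if config["scope"] == "EMAIL_THREAD":
        return [digest([v for values in per_email for v in values])]
    raise ValueError(f"unknown scope: {config['scope']}")
```

With `EMAIL_SPECIFIC`, two emails that differ only in whitespace produce the same digest in this sketch, which is the property that makes normalized hashes useful for matching.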
The client also supplies a list of OutputFields in the task message. Each includes the name of a field to be output and an XPath expression used to retrieve its value from the XML document. The worker evaluates the configured XPath expressions and returns a list of name-value pairs consisting of each field name and the result of its XPath expression. Sample XPath expressions are listed in the table of XPath expressions later in this document.
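As a rough illustration of how output fields map to name-value pairs, the sketch below evaluates a small subset of XPath with Python's standard library. It is a simplified stand-in, not the worker's evaluator: `evaluate` and `build_field_list` are hypothetical names, and only the expression shapes used in this document (`.`, a path ending in `/@attribute`, and a path ending in `/text()`) are handled.

```python
import xml.etree.ElementTree as ET


def evaluate(xml_text: str, expression: str):
    """Evaluate the small subset of XPath used in this document's samples."""
    if expression == ".":
        return xml_text  # the whole document
    root = ET.fromstring(xml_text)
    prefix = f"/{root.tag}/"
    if not expression.startswith(prefix):
        raise ValueError(f"unsupported expression: {expression}")
    # ElementTree paths are relative to the root element, so drop "/root/".
    path = expression[len(prefix):]
    if path.endswith("/text()"):
        element = root.find(path[: -len("/text()")])
        return None if element is None else element.text
    if "/@" in path:
        path, attribute = path.rsplit("/@", 1)
        element = root.find(path)
        return None if element is None else element.get(attribute)
    raise ValueError(f"unsupported expression: {expression}")


def build_field_list(xml_text: str, output_fields: list) -> list:
    """Pair each requested field name with the value its expression selects."""
    return [
        {"name": f["field"], "value": evaluate(xml_text, f["xPathExpression"])}
        for f in output_fields
    ]
```

A full XPath engine would be needed for anything beyond these shapes; the sketch only shows how each `field` and `xPathExpression` pair becomes one entry in the returned list.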
The Markup Worker uses the standard `CAF-API` system of `ConfigurationSource`. The worker specific configuration is MarkupWorkerConfiguration which has the options:
- `outputQueue`: the output queue to return results to RabbitMQ.
- `threads`: the number of threads to use in the worker.
There are additional configurations to be supplied by the user on a per-task basis. These are passed to the Worker in the JSON message. A description of the worker's task message is shown below, along with its constituent HashConfiguration and OutputField components.
Component | Description |
---|---|
sourceData | A `Multimap<String, ReferencedData>` containing the document metadata. If the isEmail flag is set then the CONTENT key is expected to contain the e-mail chain. |
hashConfiguration | The configuration used for hashing the XML tags. For more detail see HashConfiguration. |
outputFields | The fields to output and the XPath expression to obtain the field's value from the XML. For more detail see OutputField. |
isEmail | A flag indicating whether the document is an e-mail thread. |

The HashConfiguration component has the following properties:

Component | Description |
---|---|
name | The name of the hash to be included, which is added to the `<hash>` element as an attribute for identification purposes. |
scope | The scope of the email chain over which the hash is performed. `EMAIL_SPECIFIC`: include fields from individual emails in the hash and apply `<hash>` tags at email level. `EMAIL_THREAD`: include the fields of the entire email thread in the hash and apply `<hash>` tags at thread level. Note: also use `EMAIL_THREAD` for non-emails, to generate hash digests for an entire document. |
fields | A list of field objects which represent the tag elements to include in the hash. `name`: the name of the element tag as it appears in the XML. `normalizationType`: the type of normalization to be applied to the contents of this tag element, one of `NONE`, `REMOVE_WHITESPACE`, `REMOVE_WHITESPACE_AND_LINKS`, `NAME_ONLY`, `NORMALIZE_PRIORITY`. |
hashFunctions | A list of hash functions to be performed on the fields above, each either `NONE` or `XXHASH64`. |

The OutputField component has the following properties:

Component | Description |
---|---|
field | The field name to be returned by the worker in the output message. |
xPathExpression | The XPath expression which will be evaluated against the marked up XML to obtain the value for the output field. |

If the desired output is the entire XML document, this can be retrieved by supplying the following output field:
"outputFields": [{
  "field": "XML",
  "xPathExpression": "."
}]
This is a sample task message sent to the input queue of the Markup Worker. In normal use the `"taskData"` would be Base64 encoded, but here it is shown decoded for readability.
{
  "version": 3,
  "taskId": "SampleEmail.txt",
  "taskClassifier": "MarkupWorker",
  "taskApiVersion": 1,
  "taskData": {
    "sourceData": {
      "CONTENT": [{
        "reference": null,
        "data": "From: Za M <zaramckeown@gmail.com>\nSent: 27 September 2016 12:30:24\nTo: McKeown, Zara\nSubject: Re: FW: From a mixture of email clients\n\nThank you!\n\nFrom: Rogan, Adam Pau\nSent: Fri, Oct 7, 2016 at 8:21 AM -0400\nTo: McKeown, Zara <zara.mckeown@hpe.com>\nSubject: RE: From a mixture of email clients\n\nHi back\n\nFrom: McKeown, Zara\nSent: 27 September 2016 12:20\nTo: Rogan, Adam Pau <adam.pau.rogan@hpe.com>\nSubject: From a mixture of email clients\n\nHi"
      }]
    },
    "hashConfiguration": [{
      "name": "Normalized",
      "scope": "EMAIL_SPECIFIC",
      "fields": [{
        "name": "To",
        "normalizationType": "NAME_ONLY"
      }, {
        "name": "From",
        "normalizationType": "NAME_ONLY"
      }, {
        "name": "Body",
        "normalizationType": "REMOVE_WHITESPACE_AND_LINKS"
      }],
      "hashFunctions": ["XXHASH64"]
    }],
    "outputFields": [{
      "field": "SECTION_SORT",
      "xPathExpression": "/root/email[1]/headers/Sent/@dateUtc"
    }, {
      "field": "SECTION_ID",
      "xPathExpression": "/root/email[1]/hash/digest/@value"
    }, {
      "field": "PARENT_ID",
      "xPathExpression": "/root/email[2]/hash/digest/@value"
    }, {
      "field": "ROOT_ID",
      "xPathExpression": "/root/email[last()]/hash/digest/@value"
    }, {
      "field": "MARKUP_WORKER_XML",
      "xPathExpression": "."
    }],
    "isEmail": true
  },
  "taskStatus": "NEW_TASK",
  "context": {
    "context": "integration-test"
  },
  "to": "markupworker-test-input-1",
  "tracking": null,
  "sourceInfo": null
}
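A client might assemble such a message along the following lines. This is a sketch under stated assumptions: `build_task_message` is a hypothetical helper, the envelope simply mirrors the sample above, and `taskData` is assumed to be the Base64 encoding of the UTF-8 JSON serialization of the task. Check the CAF worker framework documentation before relying on any of these details.

```python
import base64
import json


def build_task_message(task_id: str, content: str, queue: str) -> dict:
    """Assemble a Markup Worker task envelope with Base64-encoded taskData."""
    task_data = {
        # The value is passed inline here, mirroring the decoded sample above.
        "sourceData": {"CONTENT": [{"reference": None, "data": content}]},
        "hashConfiguration": [{
            "name": "Normalized",
            "scope": "EMAIL_SPECIFIC",
            "fields": [
                {"name": "Body", "normalizationType": "REMOVE_WHITESPACE_AND_LINKS"},
            ],
            "hashFunctions": ["XXHASH64"],
        }],
        "outputFields": [{"field": "MARKUP_WORKER_XML", "xPathExpression": "."}],
        "isEmail": True,
    }
    # Assumption: taskData is Base64 of the UTF-8 JSON text.
    encoded = base64.b64encode(json.dumps(task_data).encode("utf-8")).decode("ascii")
    return {
        "version": 3,
        "taskId": task_id,
        "taskClassifier": "MarkupWorker",
        "taskApiVersion": 1,
        "taskData": encoded,
        "taskStatus": "NEW_TASK",
        "context": {},  # left empty in this sketch
        "to": queue,
        "tracking": None,
        "sourceInfo": None,
    }
```

The returned dictionary can be serialized with `json.dumps` and published to the worker's input queue by whatever messaging client the deployment uses.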
The result class is MarkupWorkerResult and is shown below.
Component | Description |
---|---|
workerStatus | The worker specific return code depicting the processing result status. Any value other than `COMPLETED` means failure. The possible worker statuses are `COMPLETED` (the worker processed the task successfully) and `WORKER_FAILED` (the worker failed in an unexpected way). |
fieldList | A list of name-value pairs which specify the output fields and their corresponding values. These values were retrieved by the XPath expression in the OutputField. `name`: the name of the field to output. `value`: the value returned from the XPath expression. |
This is a sample output message sent to the output queue from the Markup Worker. In normal use the `"taskData"` would be Base64 encoded, but here it is shown decoded for readability.
{
  "version": 3,
  "taskId": "SampleEmail.txt",
  "taskClassifier": "MarkupWorker",
  "taskApiVersion": 1,
  "taskData": {
    "workerStatus": "COMPLETED",
    "fieldList": [{
      "name": "SECTION_SORT",
      "value": "2016-09-27T12:30:24Z"
    }, {
      "name": "SECTION_ID",
      "value": "ca17a060c9e2ff28"
    }, {
      "name": "PARENT_ID",
      "value": "a1425efb02868dfd"
    }, {
      "name": "ROOT_ID",
      "value": "969b26c06e65c4e9"
    }, {
      "name": "MARKUP_WORKER_XML",
      "value": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<root><email><hash name=\"Normalized\"><config><fields><field><name>To</name><normalizationType>NAME_ONLY</normalizationType></field><field><name>From</name><normalizationType>NAME_ONLY</normalizationType></field><field><name>Body</name><normalizationType>REMOVE_WHITESPACE_AND_LINKS</normalizationType></field></fields></config><digest function=\"XXHASH64\" value=\"ca17a060c9e2ff28\" /></hash><headers><From>From: Za M <zaramckeown@gmail.com>
</From>\r\n<Sent dateUtc=\"2016-09-27T12:30:24Z\">Sent: 27 September 2016 12:30:24
</Sent>\r\n<To>To: McKeown, Zara
</To>\r\n<Subject>Subject: Re: FW: From a mixture of email clients
</Subject>\r\n</headers><body>
\r\nThank you!
\r\n
</body></email><email><hash name=\"Normalized\"><config><fields><field><name>To</name><normalizationType>NAME_ONLY</normalizationType></field><field><name>From</name><normalizationType>NAME_ONLY</normalizationType></field><field><name>Body</name><normalizationType>REMOVE_WHITESPACE_AND_LINKS</normalizationType></field></fields></config><digest function=\"XXHASH64\" value=\"a1425efb02868dfd\" /></hash><headers><From>From: Rogan, Adam Pau
</From>\r\n<Sent dateUtc=\"2016-10-07T12:21:00Z\">Sent: Fri, Oct 7, 2016 at 8:21 AM -0400
</Sent>\r\n<To>To: McKeown, Zara <zara.mckeown@hpe.com>
</To>\r\n<Subject>Subject: RE: From a mixture of email clients
</Subject>\r\n</headers><body>
\r\nHi back
\r\n
</body></email><email><hash name=\"Normalized\"><config><fields><field><name>To</name><normalizationType>NAME_ONLY</normalizationType></field><field><name>From</name><normalizationType>NAME_ONLY</normalizationType></field><field><name>Body</name><normalizationType>REMOVE_WHITESPACE_AND_LINKS</normalizationType></field></fields></config><digest function=\"XXHASH64\" value=\"969b26c06e65c4e9\" /></hash><headers><From>From: McKeown, Zara
</From>\r\n<Sent dateUtc=\"2016-09-27T12:20:00Z\">Sent: 27 September 2016 12:20
</Sent>\r\n<To>To: Rogan, Adam Pau <adam.pau.rogan@hpe.com>
</To>\r\n<Subject>Subject: From a mixture of email clients
</Subject>\r\n</headers><body>
\r\nHi</body></email></root>\r\n"
    }]
  },
  "taskStatus": "RESULT_SUCCESS",
  "context": {
    "context": "integration-test"
  },
  "to": "markupworker-test-output-1",
  "tracking": null,
  "sourceInfo": {
    "name": "MarkupWorker",
    "version": "1.0.0"
  }
}
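A consumer of the output queue could turn such a message into a simple lookup, as in the sketch below. `read_result` is a hypothetical helper; it accepts `taskData` either already decoded (as in the sample above) or as a Base64 string, on the assumption that the encoded form is Base64 of UTF-8 JSON.

```python
import base64
import json


def read_result(message: dict) -> dict:
    """Return the output fields of a result message as a name -> value dict."""
    task_data = message["taskData"]
    if isinstance(task_data, str):
        # Encoded form (assumed to be Base64 of UTF-8 JSON).
        task_data = json.loads(base64.b64decode(task_data))
    if task_data["workerStatus"] != "COMPLETED":
        # Any status other than COMPLETED means the task failed.
        raise RuntimeError(f"Markup Worker failed: {task_data['workerStatus']}")
    return {field["name"]: field["value"] for field in task_data["fieldList"]}
```

For the sample message above, `read_result(message)["SECTION_ID"]` would give `ca17a060c9e2ff28`.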
The following hash configuration is recommended to generate hashes which can be used to identify related e-mails, especially replied-to and forwarded e-mails:
"hashConfiguration": [{
  "name": "Normalized",
  "scope": "EMAIL_SPECIFIC",
  "fields": [
    { "name": "To", "normalizationType": "NAME_ONLY" },
    { "name": "From", "normalizationType": "NAME_ONLY" },
    { "name": "Body", "normalizationType": "REMOVE_WHITESPACE_AND_LINKS" }
  ],
  "hashFunctions": [
    "XXHASH64"
  ]
}]
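One way such hashes might be used downstream is sketched here. It assumes each document was processed with output fields named `SECTION_ID` (hash of the first email in the document) and `PARENT_ID` (hash of the second email), as in the sample task message, and that the same email yields the same normalized digest wherever it appears. `link_replies` is a hypothetical helper, not part of the worker.

```python
def link_replies(results: dict) -> dict:
    """Map each document to the document whose first email it quotes, if any.

    `results` maps a document id to its output fields (SECTION_ID, PARENT_ID).
    """
    # Index documents by the hash of their first (most recent) email.
    by_section = {fields["SECTION_ID"]: doc_id for doc_id, fields in results.items()}
    return {
        doc_id: by_section.get(fields.get("PARENT_ID"))
        for doc_id, fields in results.items()
    }
```

Documents whose `PARENT_ID` matches no known `SECTION_ID` map to `None`, which covers both thread starters and replies to emails that were never processed.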
XPath Expression | Returns |
---|---|
`.` | the entire XML |
`/root/email[1]/hash/digest/@value` | the value attribute of the hash of the first email under `<root>` tags |
`/root/email[2]/hash/digest/@value` | the value attribute of the hash of the second email under `<root>` tags |
`/root/email[last()]/hash/digest/@value` | the value attribute of the hash of the last email under `<root>` tags |
`/root/email[1]/headers/Sent/text()` | the text of the Sent element of the first email |
`/root/CAF_MAIL_MESSAGE_ID/text()` | the text of the CAF_MAIL_MESSAGE_ID element |
This worker provides a basic health check by returning `HEALTHY` if it can communicate with Marathon.
The number of worker threads is configured using the `threads` setting in the Markup Worker Configuration.
Memory usage will vary significantly with the size of the input message.
The main places where this worker can fail are:
- Configuration errors: these will manifest on startup and cause the worker not to start at all. Check the logs for clues, and double check your configuration files.
- `WORKER_FAILED`: tasks coming from the worker with this worker status have failed during processing in some unexpected way. This could be due to a number of reasons:
  - no hash configuration has been specified,
  - a failure to separate emails,
  - a failure to parse the XML into a document,
  - a failure to acquire data from datastore.
Upgrades follow standard CAF Worker upgrade procedures. If the version of `worker-markup-shared` has not changed then an upgrade to `worker-markup` is an in-place upgrade.
If you need to do a rolling upgrade when `worker-markup-shared` has not changed then:
- Spin up containers of the new version of `worker-markup`.
- Replace old versions of producers of `MarkupWorkerTask` with new ones.
- Allow the queue with the old versions of `MarkupWorkerTask` to drain then shut down the old worker containers.
The following people are responsible for maintaining this code:
- Andy Reid (Belfast, UK, andrew.reid@microfocus.com)
- Dermot Hardy (Belfast, UK, dermot.hardy@microfocus.com)
- Anthony Mcgreevy (Belfast, UK, anthony.mcgreevy@microfocus.com)
- Davide Giorgio Picchione (Belfast, UK, davide-giorgio.picchione@microfocus.com)
- Thilagavathi Santhoshkumar (Belfast, UK, thilagavathi.santhoshkumar@microfocus.com)
- Radoslav Straka (Belfast, UK, radoslav.straka@microfocus.com)
- Michael Bryson (Belfast, UK, michael.bryson@microfocus.com)
- Rahul Kulkarni (Chicago, USA, rahul.kulkarni@microfocus.com)
- Kusuma Ghosh Dastidar (Pleasanton, USA, vgkusuma@microfocus.com)
- Om Mariappan (Pleasanton, USA, omkumar.mariappan@microfocus.com)
- Morvin Shah (Pleasanton, USA, morivn.pan.shah@microfocus.com)