Xml parser #482

Merged
merged 9 commits into from
Nov 5, 2021
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Added the `xml_parser`

## 1.2.13 - 2021-10-29

### Added

- Added the `lazy_quotes` parameter to the csv parser [PR472](https://github.com/observIQ/stanza/pull/472)

### Removed
1 change: 1 addition & 0 deletions cmd/stanza/init_common.go
@@ -24,6 +24,7 @@ import (
	_ "github.com/observiq/stanza/operator/builtin/parser/syslog"
	_ "github.com/observiq/stanza/operator/builtin/parser/time"
	_ "github.com/observiq/stanza/operator/builtin/parser/uri"
	_ "github.com/observiq/stanza/operator/builtin/parser/xml"

	_ "github.com/observiq/stanza/operator/builtin/transformer/add"
	_ "github.com/observiq/stanza/operator/builtin/transformer/copy"
1 change: 1 addition & 0 deletions docs/operators/README.md
@@ -21,6 +21,7 @@ Parsers:
- [Syslog](/docs/operators/syslog_parser.md)
- [Severity](/docs/operators/severity_parser.md)
- [Time](/docs/operators/time_parser.md)
- [XML](/docs/operators/xml_parser.md)

Outputs:
- [Google Cloud Logging](/docs/operators/google_cloud_output.md)
153 changes: 153 additions & 0 deletions docs/operators/xml_parser.md
@@ -0,0 +1,153 @@
## `xml_parser` operator

The `xml_parser` operator parses the string-type field selected by `parse_from` as XML.

### Configuration Fields

| Field | Default | Description |
| --- | --- | --- |
| `id` | `xml_parser` | A unique identifier for the operator |
| `output` | Next in pipeline | The connected operator(s) that will receive all outbound entries |
| `parse_from` | $ | A [field](/docs/types/field.md) that indicates the field to be parsed as XML |
| `parse_to` | $ | A [field](/docs/types/field.md) that indicates where to parse structured data to |
| `preserve_to` | | Preserves the unparsed value at the specified [field](/docs/types/field.md) |
| `on_error` | `send` | The behavior of the operator if it encounters an error. See [on_error](/docs/types/on_error.md) |
| `if` | | An [expression](/docs/types/expression.md) that, when set, will be evaluated to determine whether this operator should be used for the given entry. This allows you to do easy conditional parsing without branching logic with routers. |
| `timestamp` | `nil` | An optional [timestamp](/docs/types/timestamp.md) block which will parse a timestamp field before passing the entry to the output operator |
| `severity` | `nil` | An optional [severity](/docs/types/severity.md) block which will parse a severity field before passing the entry to the output operator |


### Example Configurations


#### Parse the field `message` as XML

Configuration:
```yaml
- type: xml_parser
  parse_from: message
```

<table>
<tr><td> Input record </td> <td> Output record </td></tr>
<tr>
<td>

```json
{
  "timestamp": "",
  "record": {
    "message": "<person age='30'>Jon Smith</person>"
  }
}
```

</td>
<td>

```json
{
  "timestamp": "",
  "record": {
    "tag": "person",
    "attributes": {
      "age": "30"
    },
    "content": "Jon Smith"
  }
}
```

</td>
</tr>
</table>
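
As background for the output shape above, here is a minimal sketch of the token stream that Go's standard `encoding/xml` decoder produces for the same input. It assumes only the standard library; `describeTokens` is a hypothetical helper written for illustration and is not part of the operator.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// describeTokens is a hypothetical helper (not part of the operator) that
// lists the tokens encoding/xml produces for the given input.
func describeTokens(input string) ([]string, error) {
	decoder := xml.NewDecoder(strings.NewReader(input))
	var out []string
	for {
		token, err := decoder.Token()
		if err == io.EOF {
			// End of input: return everything collected so far.
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := token.(type) {
		case xml.StartElement:
			// The local name becomes "tag"; attributes become "attributes".
			desc := "start " + t.Name.Local
			for _, attr := range t.Attr {
				desc += " " + attr.Name.Local + "=" + attr.Value
			}
			out = append(out, desc)
		case xml.CharData:
			// Trimmed character data becomes "content".
			out = append(out, "text "+strings.TrimSpace(string(t)))
		case xml.EndElement:
			out = append(out, "end "+t.Name.Local)
		}
	}
}

func main() {
	tokens, err := describeTokens("<person age='30'>Jon Smith</person>")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range tokens {
		fmt.Println(t)
	}
}
```

The start token carries the tag name and attributes, the character data carries the text, and the end token closes the element, which lines up with the `tag`, `attributes`, and `content` keys in the output record.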

#### Parse multiple XML elements

Configuration:
```yaml
- type: xml_parser
  parse_from: message
```

<table>
<tr><td> Input record </td> <td> Output record </td></tr>
<tr>
<td>

```json
{
  "timestamp": "",
  "record": {
    "message": "<person age='30'>Jon Smith</person><person age='28'>Sally Smith</person>"
  }
}
```

</td>
<td>

```json
{
  "timestamp": "",
  "record": [
    {
      "tag": "person",
      "attributes": {
        "age": "30"
      },
      "content": "Jon Smith"
    },
    {
      "tag": "person",
      "attributes": {
        "age": "28"
      },
      "content": "Sally Smith"
    }
  ]
}
```

</td>
</tr>
</table>

#### Parse embedded XML elements

Configuration:
```yaml
- type: xml_parser
  parse_from: message
```

<table>
<tr><td> Input record </td> <td> Output record </td></tr>
<tr>
<td>

```json
{
  "timestamp": "",
  "record": {
    "message": "<worker><person age='30'>Jon Smith</person></worker>"
  }
}
```

</td>
<td>

```json
{
  "timestamp": "",
  "record": {
    "tag": "worker",
    "children": [
      {
        "tag": "person",
        "attributes": {
          "age": "30"
        },
        "content": "Jon Smith"
      }
    ]
  }
}
```

</td>
</tr>
</table>
73 changes: 73 additions & 0 deletions operator/builtin/parser/xml/element.go
@@ -0,0 +1,73 @@
package xml

import (
	"bytes"
	"encoding/xml"
)

// Element represents an XML element
type Element struct {
	Tag        string
	Content    string
	Attributes map[string]string
	Children   []*Element
	Parent     *Element
}

// convertToMap converts an element to a map
func convertToMap(element *Element) map[string]interface{} {
	results := map[string]interface{}{}
	results["tag"] = element.Tag

	if element.Content != "" {
		results["content"] = element.Content
	}

	if len(element.Attributes) > 0 {
		results["attributes"] = element.Attributes
	}

	if len(element.Children) > 0 {
		results["children"] = convertToMaps(element.Children)
	}

	return results
}

// convertToMaps converts a slice of elements to a slice of maps
func convertToMaps(elements []*Element) []map[string]interface{} {
	results := []map[string]interface{}{}
	for _, e := range elements {
		results = append(results, convertToMap(e))
	}

	return results
}

// newElement creates a new element for the given xml start element
func newElement(element xml.StartElement) *Element {
	return &Element{
		Tag:        element.Name.Local,
		Attributes: getAttributes(element),
	}
}

// getAttributes returns the attributes of the given element
func getAttributes(element xml.StartElement) map[string]string {
	if len(element.Attr) == 0 {
		return nil
	}

	attributes := map[string]string{}
	for _, attr := range element.Attr {
		key := attr.Name.Local
		attributes[key] = attr.Value
	}

	return attributes
}

// getValue returns value of the given char data
func getValue(data xml.CharData) string {
	return string(bytes.TrimSpace(data))
}
110 changes: 110 additions & 0 deletions operator/builtin/parser/xml/xml.go
@@ -0,0 +1,110 @@
package xml

import (
	"context"
	"encoding/xml"
	"fmt"
	"io"
	"strings"

	"github.com/observiq/stanza/entry"
	"github.com/observiq/stanza/operator"
	"github.com/observiq/stanza/operator/helper"
)

func init() {
	operator.Register("xml_parser", func() operator.Builder { return NewXMLParserConfig("") })
}

// NewXMLParserConfig creates a new XML parser config with default values
func NewXMLParserConfig(operatorID string) *XMLParserConfig {
	return &XMLParserConfig{
		ParserConfig: helper.NewParserConfig(operatorID, "xml_parser"),
	}
}

// XMLParserConfig is the configuration of an XML parser operator.
type XMLParserConfig struct {
	helper.ParserConfig `yaml:",inline"`
}

// Build will build an XML parser operator.
func (c XMLParserConfig) Build(context operator.BuildContext) ([]operator.Operator, error) {
	parserOperator, err := c.ParserConfig.Build(context)
	if err != nil {
		return nil, err
	}

	xmlParser := &XMLParser{
		ParserOperator: parserOperator,
	}

	return []operator.Operator{xmlParser}, nil
}

// XMLParser is an operator that parses XML.
type XMLParser struct {
	helper.ParserOperator
}

// Process will parse an entry for XML.
func (x *XMLParser) Process(ctx context.Context, entry *entry.Entry) error {
	return x.ParserOperator.ProcessWith(ctx, entry, parse)
}

// parse will parse an xml value
func parse(value interface{}) (interface{}, error) {
	strValue, ok := value.(string)
	if !ok {
		return nil, fmt.Errorf("value passed to parser is not a string")
	}

	reader := strings.NewReader(strValue)
	decoder := xml.NewDecoder(reader)
	token, err := decoder.Token()
	if err != nil {
		return nil, fmt.Errorf("failed to decode as xml: %w", err)
	}

	// elements collects the top-level elements; parent and current track
	// the position in the tree while tokens are consumed.
	elements := []*Element{}
	var parent *Element
	var current *Element

	for token != nil {
		switch token := token.(type) {
		case xml.StartElement:
			// Descend: the previously open element becomes the parent.
			parent = current
			current = newElement(token)
			current.Parent = parent

			if parent != nil {
				parent.Children = append(parent.Children, current)
			} else {
				elements = append(elements, current)
			}
		case xml.EndElement:
			// Ascend one level.
			current = parent
			if parent != nil {
				parent = parent.Parent
			}
		case xml.CharData:
			// Each CharData token overwrites Content, so for mixed content
			// only the last trimmed text segment is kept.
			if current != nil {
				current.Content = getValue(token)
			}
		}

		// At the end of input Token returns (nil, io.EOF), which ends the loop.
		token, err = decoder.Token()
		if err != nil && err != io.EOF {
			return nil, fmt.Errorf("failed to get next xml token: %w", err)
		}
	}

	switch len(elements) {
	case 0:
		return nil, fmt.Errorf("no xml elements found")
	case 1:
		return convertToMap(elements[0]), nil
	default:
		return convertToMaps(elements), nil
	}
}
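
The `parent`/`current` bookkeeping in `parse` could also be expressed with an explicit stack of open elements. The following is a minimal standalone sketch of that alternative, not the code merged in this PR. `node` and `parseRoots` are hypothetical names, attributes are left out for brevity, and, unlike the merged loop, the sketch skips whitespace-only character data so that trailing whitespace inside an element does not replace text seen earlier.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// node is a hypothetical, reduced stand-in for the PR's Element type.
type node struct {
	Tag      string
	Content  string
	Children []*node
}

// parseRoots decodes input and returns its top-level elements.
func parseRoots(input string) ([]*node, error) {
	decoder := xml.NewDecoder(strings.NewReader(input))
	var roots []*node
	var stack []*node // currently open elements, innermost last

	for {
		token, err := decoder.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, fmt.Errorf("failed to get next xml token: %w", err)
		}

		switch t := token.(type) {
		case xml.StartElement:
			n := &node{Tag: t.Name.Local}
			if len(stack) == 0 {
				roots = append(roots, n)
			} else {
				top := stack[len(stack)-1]
				top.Children = append(top.Children, n)
			}
			stack = append(stack, n)
		case xml.EndElement:
			if len(stack) > 0 {
				stack = stack[:len(stack)-1]
			}
		case xml.CharData:
			// Assumption of this sketch: ignore whitespace-only text.
			if text := strings.TrimSpace(string(t)); text != "" && len(stack) > 0 {
				stack[len(stack)-1].Content = text
			}
		}
	}

	if len(roots) == 0 {
		return nil, fmt.Errorf("no xml elements found")
	}
	return roots, nil
}

func main() {
	roots, err := parseRoots("<worker><person>Jon Smith</person></worker>")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, root := range roots {
		fmt.Println(root.Tag, "with", len(root.Children), "child element(s)")
	}
}
```

The stack removes the need for a `Parent` pointer on each element, at the cost of a slice that grows with nesting depth. Whether skipping whitespace-only text is desirable depends on how mixed content should be reported, so treat that part as one possible choice rather than a recommendation.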