✨ Fix handler and text simplification tests
wesen committed Jan 26, 2025
1 parent 8c0273c commit f005cef
Showing 14 changed files with 1,848 additions and 0 deletions.
125 changes: 125 additions & 0 deletions cmd/tools/simplify-html/README.md
@@ -0,0 +1,125 @@
# HTML Simplification Tool

A command-line tool to simplify and minimize HTML documents by removing unnecessary elements, shortening content, and providing a clean YAML representation of the document structure.

## Features

- Strip script and style tags
- Remove SVG elements
- Shorten long text content
- Limit list items and table rows
- Filter elements using CSS and XPath selectors
- Simplify text-only nodes
- Compact attribute representation

## Installation

```bash
go install ./cmd/tools/simplify-html
```

## Usage

Basic usage:
```bash
simplify-html input.html > output.yaml
```

With configuration file:
```bash
simplify-html --config filters.yaml input.html > output.yaml
```

## Options

- `--strip-scripts` (default: true): Remove `<script>` tags
- `--strip-css` (default: true): Remove `<style>` tags and style attributes
- `--strip-svg` (default: true): Remove SVG elements
- `--shorten-text` (default: true): Shorten text content longer than 200 characters
- `--simplify-text` (default: true): Collapse nodes with only text/br children into a single text field
- `--compact-svg` (default: true): Simplify SVG elements by removing detailed attributes
- `--max-list-items` (default: 4): Maximum number of items to show in lists and select boxes (0 for unlimited)
- `--max-table-rows` (default: 4): Maximum number of rows to show in tables (0 for unlimited)
- `--config`: Path to YAML configuration file containing selectors to filter out
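
The two limiting options can be pictured with a small Go sketch. This is only an illustration under stated assumptions, not the tool's actual code: the names `limitItems`, `shortenText`, and `truncationMarker` are hypothetical, and whether the real tool counts the `...` marker toward the limit is not stated here (this sketch appends it in addition to the kept items).

```go
package main

import "fmt"

// truncationMarker is the placeholder for omitted content. The "..." value
// follows the truncation indicator in the README's output example; the
// constant name itself is hypothetical.
const truncationMarker = "..."

// limitItems keeps at most max items and appends the marker when items were
// dropped. A max of 0 (or less) means unlimited, mirroring --max-list-items=0.
func limitItems(items []string, max int) []string {
	if max <= 0 || len(items) <= max {
		return items
	}
	out := make([]string, 0, max+1)
	out = append(out, items[:max]...)
	return append(out, truncationMarker)
}

// shortenText cuts text longer than limit runes and appends the marker.
// Counting runes rather than bytes avoids splitting multi-byte characters.
func shortenText(text string, limit int) string {
	runes := []rune(text)
	if limit <= 0 || len(runes) <= limit {
		return text
	}
	return string(runes[:limit]) + truncationMarker
}

func main() {
	items := []string{"First item", "Second item", "Third item",
		"Fourth item", "Fifth item", "Sixth item"}
	fmt.Println(limitItems(items, 4))
	fmt.Println(shortenText("abcdefghij", 5))
}
```

The same `limitItems` shape would cover `--max-table-rows`, with rows in place of list items.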

## Configuration File Format

The configuration file uses YAML format and supports both CSS and XPath selectors:

```yaml
selectors:
  # CSS selectors
  - type: css
    selector: ".advertisement"
  - type: css
    selector: "#sidebar"

  # XPath selectors
  - type: xpath
    selector: "//*[@data-analytics]"
  - type: xpath
    selector: "//div[contains(@class, 'social-media')]"
```
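
One way to model these entries in Go is sketched below. The `SelectorConfig` type, its `yaml` struct tags, and the `Validate` method are assumptions for illustration; the tool's real config types are not shown in this README. The sketch only checks that each entry has a non-empty selector and one of the two documented types.

```go
package main

import (
	"errors"
	"fmt"
)

// SelectorConfig mirrors one entry under `selectors:` in the config file.
// The struct name and yaml tags are hypothetical.
type SelectorConfig struct {
	Type     string `yaml:"type"`
	Selector string `yaml:"selector"`
}

// Validate rejects entries that the documented format does not describe:
// an empty selector, or a type other than "css" or "xpath".
func (s SelectorConfig) Validate() error {
	if s.Selector == "" {
		return errors.New("selector must not be empty")
	}
	switch s.Type {
	case "css", "xpath":
		return nil
	default:
		return fmt.Errorf("unknown selector type %q (want css or xpath)", s.Type)
	}
}

func main() {
	entries := []SelectorConfig{
		{Type: "css", Selector: ".advertisement"},
		{Type: "xpath", Selector: "//*[@data-analytics]"},
		{Type: "regex", Selector: "foo"}, // not a documented type
	}
	for _, e := range entries {
		fmt.Printf("%s %q: %v\n", e.Type, e.Selector, e.Validate())
	}
}
```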

## Output Format

The tool outputs a YAML representation of the HTML document structure:

```yaml
tag: div
attrs: class=content
text: Simple text content  # For text-only nodes
children:  # For nodes with children
  - tag: p
    text: First paragraph
  - tag: ul
    children:
      - tag: li
        text: List item 1
      - tag: li
        text: List item 2
      - tag: li
        text: ...  # Truncation indicator
```

## Examples

The `examples/` directory contains sample HTML files demonstrating different features:

- `simple.html`: Basic text and inline elements
- `lists.html`: Various types of lists and nesting
- `table.html`: Tables with simple and complex content

Try them out:
```bash
# Basic simplification
simplify-html examples/simple.html
# Limit list items
simplify-html --max-list-items=2 examples/lists.html
# Complex table handling
simplify-html --max-table-rows=3 examples/table.html
```

## Text Simplification

The `--simplify-text` option collapses nodes that contain only text and `<br>` elements into a single text field. This helps reduce the complexity of the output while preserving the content and line breaks.

For example, this HTML:
```html
<div class="content">
    First line<br>
    Second line<br>
    Third line
</div>
```

Becomes:
```yaml
tag: div
attrs: class=content
text: "First line\nSecond line\nThird line"
```

Note: Text simplification applies only when a node contains exclusively text nodes and `<br>` elements. If the node contains any other element (such as a link or a formatting tag), its full structure is preserved.
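
The rule can be sketched in Go as follows. This is a minimal illustration, not the tool's implementation: the `node` type and `collapseText` function are hypothetical stand-ins for the parsed HTML tree, and trimming whitespace around each text run is an assumption chosen to reproduce the example output above.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for a parsed HTML node.
type node struct {
	Tag      string // empty for text nodes
	Text     string // set only for text nodes
	Children []*node
}

// collapseText returns the joined text and true when every child of n is a
// text node or a <br>. Each <br> becomes a newline. Any other child element
// makes the node ineligible, so the caller keeps the full structure.
func collapseText(n *node) (string, bool) {
	if len(n.Children) == 0 {
		return "", false
	}
	var b strings.Builder
	for _, c := range n.Children {
		switch c.Tag {
		case "": // text node
			b.WriteString(strings.TrimSpace(c.Text))
		case "br":
			b.WriteString("\n")
		default:
			return "", false
		}
	}
	return b.String(), true
}

func main() {
	div := &node{Tag: "div", Children: []*node{
		{Text: "\nFirst line"}, {Tag: "br"},
		{Text: "\nSecond line"}, {Tag: "br"},
		{Text: "\nThird line\n"},
	}}
	text, ok := collapseText(div)
	fmt.Printf("%q %v\n", text, ok)
}
```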
20 changes: 20 additions & 0 deletions cmd/tools/simplify-html/example-config.yaml
@@ -0,0 +1,20 @@
# Example configuration file for HTML simplification
# List of CSS and XPath selectors to remove from the HTML before processing
selectors:
  # CSS selectors
  - type: css
    selector: ".advertisement"  # Remove elements with class 'advertisement'
  - type: css
    selector: "#sidebar"  # Remove element with ID 'sidebar'
  - type: css
    selector: "div.navigation"  # Remove div elements with class 'navigation'
  - type: css
    selector: "footer"  # Remove footer elements

  # XPath selectors
  - type: xpath
    selector: "//*[@data-analytics]"  # Remove elements with data-analytics attribute
  - type: xpath
    selector: "//div[contains(@class, 'social-media')]"  # Remove social media divs
  - type: xpath
    selector: "//script[contains(@src, 'analytics')]"  # Remove analytics scripts
40 changes: 40 additions & 0 deletions cmd/tools/simplify-html/examples/footer.html
@@ -0,0 +1,40 @@
<!DOCTYPE html>
<html>
<head>
    <title>Footer Example</title>
</head>
<body>
    <section>
        <div class="container">
            <div class="row">
                <div class="col-lg-3 col-12 centered-lg">
                    <p>
                        <a href="https://www.nlm.nih.gov/web_policies.html" class="text-white">Web Policies</a><br>
                        <a href="https://www.nih.gov/institutes-nih/nih-office-director/office-communications-public-liaison/freedom-information-act-office" class="text-white">FOIA</a><br>
                        <a href="https://www.hhs.gov/vulnerability-disclosure-policy/index.html" class="text-white" id="vdp">HHS Vulnerability Disclosure</a>
                    </p>
                </div>
                <div class="col-lg-3 col-12 centered-lg">
                    <p>
                        <a class="supportLink text-white" href="https://support.nlm.nih.gov/">Help</a><br>
                        <a href="https://www.nlm.nih.gov/accessibility.html" class="text-white">Accessibility</a><br>
                        <a href="https://www.nlm.nih.gov/careers/careers.html" class="text-white">Careers</a>
                    </p>
                </div>
            </div>
            <div class="row">
                <div class="col-lg-12 centered-lg">
                    <nav class="bottom-links">
                        <ul class="mt-3">
                            <li><a class="text-white" href="//www.nlm.nih.gov/">NLM</a></li>
                            <li><a class="text-white" href="https://www.nih.gov/">NIH</a></li>
                            <li><a class="text-white" href="https://www.hhs.gov/">HHS</a></li>
                            <li><a class="text-white" href="https://www.usa.gov/">USA.gov</a></li>
                        </ul>
                    </nav>
                </div>
            </div>
        </div>
    </section>
</body>
</html>
52 changes: 52 additions & 0 deletions cmd/tools/simplify-html/examples/lists.html
@@ -0,0 +1,52 @@
<!DOCTYPE html>
<html>
<head>
    <title>List Examples</title>
</head>
<body>
    <!-- Unordered list -->
    <ul class="menu">
        <li>First item</li>
        <li>Second item</li>
        <li>Third item</li>
        <li>Fourth item</li>
        <li>Fifth item</li>
        <li>Sixth item</li>
    </ul>

    <!-- Ordered list -->
    <ol class="steps">
        <li>Step one</li>
        <li>Step two</li>
        <li>Step three</li>
        <li>Step four</li>
        <li>Step five</li>
    </ol>

    <!-- Select box -->
    <select name="options">
        <option value="1">Option 1</option>
        <option value="2">Option 2</option>
        <option value="3">Option 3</option>
        <option value="4">Option 4</option>
        <option value="5">Option 5</option>
    </select>

    <!-- Nested list -->
    <ul class="nested">
        <li>Parent 1
            <ul>
                <li>Child 1.1</li>
                <li>Child 1.2</li>
                <li>Child 1.3</li>
            </ul>
        </li>
        <li>Parent 2
            <ul>
                <li>Child 2.1</li>
                <li>Child 2.2</li>
            </ul>
        </li>
    </ul>
</body>
</html>
26 changes: 26 additions & 0 deletions cmd/tools/simplify-html/examples/simple.html
@@ -0,0 +1,26 @@
<!DOCTYPE html>
<html>
<head>
    <title>Simple Example</title>
</head>
<body>
    <!-- Simple text -->
    <p>This is a simple paragraph.</p>

    <!-- Text with link -->
    <p>This is a paragraph with a <a href="https://example.com">link</a> in it.</p>

    <!-- Text with multiple inline elements -->
    <p>This has <strong>bold</strong> and <em>italic</em> text.</p>

    <!-- Text with line breaks -->
    <div class="with-breaks">
        First line<br>
        Second line<br>
        Third line
    </div>

    <!-- Long text that should be shortened -->
    <p>This is a very long paragraph that should be shortened when the shorten-text option is enabled. It contains a lot of text that goes on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on.</p>
</body>
</html>
67 changes: 67 additions & 0 deletions cmd/tools/simplify-html/examples/table.html
@@ -0,0 +1,67 @@
<!DOCTYPE html>
<html>
<head>
    <title>Table Example</title>
</head>
<body>
    <!-- Simple table -->
    <table class="data-table">
        <thead>
            <tr>
                <th>ID</th>
                <th>Name</th>
                <th>Email</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>1</td>
                <td>John Doe</td>
                <td>john@example.com</td>
            </tr>
            <tr>
                <td>2</td>
                <td>Jane Smith</td>
                <td>jane@example.com</td>
            </tr>
            <tr>
                <td>3</td>
                <td>Bob Johnson</td>
                <td>bob@example.com</td>
            </tr>
            <tr>
                <td>4</td>
                <td>Alice Brown</td>
                <td>alice@example.com</td>
            </tr>
            <tr>
                <td>5</td>
                <td>Charlie Wilson</td>
                <td>charlie@example.com</td>
            </tr>
        </tbody>
    </table>

    <!-- Table with complex cells -->
    <table class="complex-table">
        <tr>
            <td>Simple text</td>
            <td>Text with <strong>bold</strong> and <em>italic</em></td>
            <td>Cell with <a href="#">link</a></td>
        </tr>
        <tr>
            <td>Multi-line<br>content</td>
            <td>
                <ul>
                    <li>List item 1</li>
                    <li>List item 2</li>
                </ul>
            </td>
            <td>
                <p>Paragraph in cell</p>
                <p>Another paragraph</p>
            </td>
        </tr>
    </table>
</body>
</html>
34 changes: 34 additions & 0 deletions pkg/htmlsimplifier/helpers_test.go
@@ -0,0 +1,34 @@
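package htmlsimplifier

import (
	"strings"
	"testing"

	"github.com/stretchr/testify/require"
	"golang.org/x/net/html"
)

// parseHTML parses htmlContent into a node tree. It uses require so the
// calling test stops immediately if parsing fails instead of continuing
// with a nil node.
func parseHTML(t *testing.T, htmlContent string) *html.Node {
	t.Helper() // report failures at the caller's line
	node, err := html.Parse(strings.NewReader(htmlContent))
	require.NoError(t, err)
	return node
}

// findFirstElement returns the first element named tag found in a
// depth-first, document-order walk starting at node, or nil if none exists.
func findFirstElement(node *html.Node, tag string) *html.Node {
	if node == nil {
		return nil
	}

	if node.Type == html.ElementNode && node.Data == tag {
		return node
	}

	// Recurse into each child in order and stop at the first match.
	for child := node.FirstChild; child != nil; child = child.NextSibling {
		if found := findFirstElement(child, tag); found != nil {
			return found
		}
	}

	return nil
}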
