generated from wesen/wesen-go-template
Commit
✨ Fix handler and text simplification tests
Showing 14 changed files with 1,848 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
# HTML Simplification Tool | ||
|
||
A command-line tool that simplifies and minimizes HTML documents by removing unnecessary elements and shortening content, then emits a clean YAML representation of the document structure. | ||
|
||
## Features | ||
|
||
- Strip script and style tags | ||
- Remove SVG elements | ||
- Shorten long text content | ||
- Limit list items and table rows | ||
- Filter elements using CSS and XPath selectors | ||
- Simplify text-only nodes | ||
- Compact attribute representation | ||
|
||
## Installation | ||
|
||
```bash | ||
go install ./cmd/tools/simplify-html | ||
``` | ||
|
||
## Usage | ||
|
||
Basic usage: | ||
```bash | ||
simplify-html input.html > output.yaml | ||
``` | ||
|
||
With configuration file: | ||
```bash | ||
simplify-html --config filters.yaml input.html > output.yaml | ||
``` | ||
|
||
## Options | ||
|
||
- `--strip-scripts` (default: true): Remove `<script>` tags | ||
- `--strip-css` (default: true): Remove `<style>` tags and style attributes | ||
- `--strip-svg` (default: true): Remove SVG elements | ||
- `--shorten-text` (default: true): Shorten text content longer than 200 characters | ||
- `--simplify-text` (default: true): Collapse nodes with only text/br children into a single text field | ||
- `--compact-svg` (default: true): Simplify SVG elements by removing detailed attributes | ||
- `--max-list-items` (default: 4): Maximum number of items to show in lists and select boxes (0 for unlimited) | ||
- `--max-table-rows` (default: 4): Maximum number of rows to show in tables (0 for unlimited) | ||
- `--config`: Path to YAML configuration file containing selectors to filter out | ||
|
||
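To make the `--shorten-text` behaviour concrete, here is a minimal sketch of length-based truncation. The helper name `shortenText`, the `...` suffix, and counting in runes rather than bytes are assumptions for illustration; the tool's actual implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// shortenText is a hypothetical helper, not the tool's actual function.
// It keeps at most maxLen runes and appends "..." when text was cut.
// A maxLen of 0 or less is treated as "no limit" (an assumption).
func shortenText(s string, maxLen int) string {
	runes := []rune(s) // count characters, not bytes, so UTF-8 text is not split
	if maxLen <= 0 || len(runes) <= maxLen {
		return s
	}
	return string(runes[:maxLen]) + "..."
}

func main() {
	long := strings.Repeat("x", 250)
	short := shortenText(long, 200)
	// 200 kept characters plus the 3-character marker.
	fmt.Println(len([]rune(short)))
}
```

Whether the real tool cuts at a word boundary or marks truncation differently is not stated here, so treat the suffix as a placeholder.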
## Configuration File Format | ||
|
||
The configuration file uses YAML format and supports both CSS and XPath selectors: | ||
|
||
```yaml | ||
selectors: | ||
  # CSS selectors | ||
  - type: css | ||
    selector: ".advertisement" | ||
  - type: css | ||
    selector: "#sidebar" | ||
|
||
  # XPath selectors | ||
  - type: xpath | ||
    selector: "//*[@data-analytics]" | ||
  - type: xpath | ||
    selector: "//div[contains(@class, 'social-media')]" | ||
``` | ||
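As a rough illustration of how this file could map onto Go types, the sketch below models the `selectors` list and checks each entry. The `Config` and `Selector` types, their `yaml` struct tags, and the `Validate` method are hypothetical names chosen for this example, not taken from the tool's source. Decoding the YAML itself would need a YAML library and is left out.

```go
package main

import "fmt"

// Selector mirrors one entry under `selectors:` (hypothetical type).
type Selector struct {
	Type     string `yaml:"type"`     // "css" or "xpath"
	Selector string `yaml:"selector"` // the selector expression
}

// Config mirrors the top-level document (hypothetical type).
type Config struct {
	Selectors []Selector `yaml:"selectors"`
}

// Validate rejects unknown selector types and empty expressions.
func (c Config) Validate() error {
	for i, s := range c.Selectors {
		if s.Type != "css" && s.Type != "xpath" {
			return fmt.Errorf("selector %d: unknown type %q", i, s.Type)
		}
		if s.Selector == "" {
			return fmt.Errorf("selector %d: empty selector", i)
		}
	}
	return nil
}

func main() {
	cfg := Config{Selectors: []Selector{
		{Type: "css", Selector: ".advertisement"},
		{Type: "xpath", Selector: "//*[@data-analytics]"},
	}}
	fmt.Println(cfg.Validate() == nil)
}
```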
## Output Format | ||
The tool outputs a YAML representation of the HTML document structure: | ||
```yaml | ||
tag: div | ||
attrs: class=content | ||
text: Simple text content # For text-only nodes | ||
children: # For nodes with children | ||
  - tag: p | ||
    text: First paragraph | ||
  - tag: ul | ||
    children: | ||
      - tag: li | ||
        text: List item 1 | ||
      - tag: li | ||
        text: List item 2 | ||
      - tag: li | ||
        text: ... # Truncation indicator | ||
``` | ||
## Examples | ||
The `examples/` directory contains sample HTML files demonstrating different features: | ||
|
||
- `simple.html`: Basic text and inline elements | ||
- `lists.html`: Various types of lists and nesting | ||
- `table.html`: Tables with simple and complex content | ||
|
||
Try them out: | ||
```bash | ||
# Basic simplification | ||
simplify-html examples/simple.html | ||
# Limit list items | ||
simplify-html --max-list-items=2 examples/lists.html | ||
# Complex table handling | ||
simplify-html --max-table-rows=3 examples/table.html | ||
``` | ||
|
||
## Text Simplification | ||
|
||
The `--simplify-text` option collapses nodes that contain only text and `<br>` elements into a single text field. This helps reduce the complexity of the output while preserving the content and line breaks. | ||
|
||
For example, this HTML: | ||
```html | ||
<div class="content"> | ||
First line<br> | ||
Second line<br> | ||
Third line | ||
</div> | ||
``` | ||
|
||
Becomes: | ||
```yaml | ||
tag: div | ||
attrs: class=content | ||
text: "First line\nSecond line\nThird line" | ||
``` | ||
|
||
Note: Text simplification applies only when a node contains exclusively text nodes and `<br>` elements. If a node contains any other element (such as a link or a formatting tag), its full structure is preserved. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
# Example configuration file for HTML simplification | ||
# List of CSS and XPath selectors to remove from the HTML before processing | ||
selectors: | ||
  # CSS selectors | ||
  - type: css | ||
    selector: ".advertisement" # Remove elements with class 'advertisement' | ||
  - type: css | ||
    selector: "#sidebar" # Remove element with ID 'sidebar' | ||
  - type: css | ||
    selector: "div.navigation" # Remove div elements with class 'navigation' | ||
  - type: css | ||
    selector: "footer" # Remove footer elements | ||
|
||
  # XPath selectors | ||
  - type: xpath | ||
    selector: "//*[@data-analytics]" # Remove elements with data-analytics attribute | ||
  - type: xpath | ||
    selector: "//div[contains(@class, 'social-media')]" # Remove social media divs | ||
  - type: xpath | ||
    selector: "//script[contains(@src, 'analytics')]" # Remove analytics scripts |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
<!DOCTYPE html> | ||
<html> | ||
<head> | ||
<title>Footer Example</title> | ||
</head> | ||
<body> | ||
<section> | ||
<div class="container"> | ||
<div class="row"> | ||
<div class="col-lg-3 col-12 centered-lg"> | ||
<p> | ||
<a href="https://www.nlm.nih.gov/web_policies.html" class="text-white">Web Policies</a><br> | ||
<a href="https://www.nih.gov/institutes-nih/nih-office-director/office-communications-public-liaison/freedom-information-act-office" class="text-white">FOIA</a><br> | ||
<a href="https://www.hhs.gov/vulnerability-disclosure-policy/index.html" class="text-white" id="vdp">HHS Vulnerability Disclosure</a> | ||
</p> | ||
</div> | ||
<div class="col-lg-3 col-12 centered-lg"> | ||
<p> | ||
<a class="supportLink text-white" href="https://support.nlm.nih.gov/">Help</a><br> | ||
<a href="https://www.nlm.nih.gov/accessibility.html" class="text-white">Accessibility</a><br> | ||
<a href="https://www.nlm.nih.gov/careers/careers.html" class="text-white">Careers</a> | ||
</p> | ||
</div> | ||
</div> | ||
<div class="row"> | ||
<div class="col-lg-12 centered-lg"> | ||
<nav class="bottom-links"> | ||
<ul class="mt-3"> | ||
<li><a class="text-white" href="//www.nlm.nih.gov/">NLM</a></li> | ||
<li><a class="text-white" href="https://www.nih.gov/">NIH</a></li> | ||
<li><a class="text-white" href="https://www.hhs.gov/">HHS</a></li> | ||
<li><a class="text-white" href="https://www.usa.gov/">USA.gov</a></li> | ||
</ul> | ||
</nav> | ||
</div> | ||
</div> | ||
</div> | ||
</section> | ||
</body> | ||
</html> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
<!DOCTYPE html> | ||
<html> | ||
<head> | ||
<title>List Examples</title> | ||
</head> | ||
<body> | ||
<!-- Unordered list --> | ||
<ul class="menu"> | ||
<li>First item</li> | ||
<li>Second item</li> | ||
<li>Third item</li> | ||
<li>Fourth item</li> | ||
<li>Fifth item</li> | ||
<li>Sixth item</li> | ||
</ul> | ||
|
||
<!-- Ordered list --> | ||
<ol class="steps"> | ||
<li>Step one</li> | ||
<li>Step two</li> | ||
<li>Step three</li> | ||
<li>Step four</li> | ||
<li>Step five</li> | ||
</ol> | ||
|
||
<!-- Select box --> | ||
<select name="options"> | ||
<option value="1">Option 1</option> | ||
<option value="2">Option 2</option> | ||
<option value="3">Option 3</option> | ||
<option value="4">Option 4</option> | ||
<option value="5">Option 5</option> | ||
</select> | ||
|
||
<!-- Nested list --> | ||
<ul class="nested"> | ||
<li>Parent 1 | ||
<ul> | ||
<li>Child 1.1</li> | ||
<li>Child 1.2</li> | ||
<li>Child 1.3</li> | ||
</ul> | ||
</li> | ||
<li>Parent 2 | ||
<ul> | ||
<li>Child 2.1</li> | ||
<li>Child 2.2</li> | ||
</ul> | ||
</li> | ||
</ul> | ||
</body> | ||
</html> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
<!DOCTYPE html> | ||
<html> | ||
<head> | ||
<title>Simple Example</title> | ||
</head> | ||
<body> | ||
<!-- Simple text --> | ||
<p>This is a simple paragraph.</p> | ||
|
||
<!-- Text with link --> | ||
<p>This is a paragraph with a <a href="https://example.com">link</a> in it.</p> | ||
|
||
<!-- Text with multiple inline elements --> | ||
<p>This has <strong>bold</strong> and <em>italic</em> text.</p> | ||
|
||
<!-- Text with line breaks --> | ||
<div class="with-breaks"> | ||
First line<br> | ||
Second line<br> | ||
Third line | ||
</div> | ||
|
||
<!-- Long text that should be shortened --> | ||
<p>This is a very long paragraph that should be shortened when the shorten-text option is enabled. It contains a lot of text that goes on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on.</p> | ||
</body> | ||
</html> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
<!DOCTYPE html> | ||
<html> | ||
<head> | ||
<title>Table Example</title> | ||
</head> | ||
<body> | ||
<!-- Simple table --> | ||
<table class="data-table"> | ||
<thead> | ||
<tr> | ||
<th>ID</th> | ||
<th>Name</th> | ||
<th>Email</th> | ||
</tr> | ||
</thead> | ||
<tbody> | ||
<tr> | ||
<td>1</td> | ||
<td>John Doe</td> | ||
<td>john@example.com</td> | ||
</tr> | ||
<tr> | ||
<td>2</td> | ||
<td>Jane Smith</td> | ||
<td>jane@example.com</td> | ||
</tr> | ||
<tr> | ||
<td>3</td> | ||
<td>Bob Johnson</td> | ||
<td>bob@example.com</td> | ||
</tr> | ||
<tr> | ||
<td>4</td> | ||
<td>Alice Brown</td> | ||
<td>alice@example.com</td> | ||
</tr> | ||
<tr> | ||
<td>5</td> | ||
<td>Charlie Wilson</td> | ||
<td>charlie@example.com</td> | ||
</tr> | ||
</tbody> | ||
</table> | ||
|
||
<!-- Table with complex cells --> | ||
<table class="complex-table"> | ||
<tr> | ||
<td>Simple text</td> | ||
<td>Text with <strong>bold</strong> and <em>italic</em></td> | ||
<td>Cell with <a href="#">link</a></td> | ||
</tr> | ||
<tr> | ||
<td>Multi-line<br>content</td> | ||
<td> | ||
<ul> | ||
<li>List item 1</li> | ||
<li>List item 2</li> | ||
</ul> | ||
</td> | ||
<td> | ||
<p>Paragraph in cell</p> | ||
<p>Another paragraph</p> | ||
</td> | ||
</tr> | ||
</table> | ||
</body> | ||
</html> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
package htmlsimplifier | ||
|
||
import ( | ||
	"strings" | ||
	"testing" | ||
|
||
	"github.com/stretchr/testify/assert" | ||
	"golang.org/x/net/html" | ||
) | ||
|
||
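// parseHTML parses htmlContent into a node tree and reports a test failure on a parse error. | ||
func parseHTML(t *testing.T, htmlContent string) *html.Node { | ||
	t.Helper() // attribute failures to the calling test, not this helper | ||
	node, err := html.Parse(strings.NewReader(htmlContent)) | ||
	assert.NoError(t, err) | ||
	return node | ||
} | ||
|
||
// findFirstElement returns the first element named tag found in a depth-first, | ||
// pre-order walk starting at node, or nil if there is none. | ||
func findFirstElement(node *html.Node, tag string) *html.Node { | ||
	if node == nil { | ||
		return nil | ||
	} | ||
|
||
	if node.Type == html.ElementNode && node.Data == tag { | ||
		return node | ||
	} | ||
|
||
	for child := node.FirstChild; child != nil; child = child.NextSibling { | ||
		if found := findFirstElement(child, tag); found != nil { | ||
			return found | ||
		} | ||
	} | ||
|
||
	return nil | ||
} | ||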
|