✨ Fix smaller things
wesen committed Jan 26, 2025
1 parent 0345ad4 commit 0a914e0
Showing 4 changed files with 73 additions and 24 deletions.
10 changes: 9 additions & 1 deletion changelog.md
@@ -632,4 +632,12 @@ Added comprehensive tutorial examples to demonstrate HTML selector usage:
- Tables and lists examples showing structured data extraction
- XPath examples showing advanced selection techniques

Each example includes both HTML and YAML files with detailed descriptions and comments.

# Raw Data Extraction Option

Added a new flag to allow extracting raw data without applying templates.

- Added --extract-data flag to skip template processing and output raw YAML data
- Updated documentation to reflect new option
- Maintains backwards compatibility with existing template functionality
67 changes: 46 additions & 21 deletions cmd/tools/test-html-selector/README.md
@@ -11,6 +11,7 @@ A command-line tool for testing CSS and XPath selectors against HTML documents.
- Parent context for each match
- Extract and print all matches for each selector
- HTML simplification options for cleaner output
- Template-based output formatting

## Installation

@@ -23,44 +24,63 @@ go install ./cmd/tools/test-html-selector
1. Create a YAML configuration file:

```yaml
description: |
  Description of what these selectors are trying to match
selectors:
  - name: product_titles
    selector: .product-card h2
    type: css
    description: Extracts product titles from cards
  - name: prices
    selector: //div[@class='price']
    type: xpath
    description: Extracts price elements
config:
  sample_count: 5
  context_chars: 100
  template: | # Optional Go template for formatting output
    {{- range $name, $matches := . }}
    ## {{ $name }}
    {{- range $matches }}
    - {{ . }}
    {{- end }}
    {{ end }}
```
2. Run the tool:
```bash
# Basic usage
# Basic usage with config file
test-html-selector --config config.yaml --input path/to/input.html

# Override sample count and context size
test-html-selector --config config.yaml --input path/to/input.html --sample-count 10 --context-chars 200
# Use individual selectors without config file
test-html-selector --input path/to/input.html \
--select-css ".product-card h2" \
--select-xpath "//div[@class='price']"

# Extract and print all matches
test-html-selector --config config.yaml --input path/to/input.html --extract
# Extract all matches with template formatting
test-html-selector --config config.yaml --input path/to/input.html \
--extract --extract-template template.tmpl

# Use HTML simplification options
# Show context and customize output
test-html-selector --config config.yaml --input path/to/input.html \
--strip-scripts --strip-css --simplify-text --markdown
--show-context --sample-count 10 --context-chars 200
```

## Configuration Options

### Command Line Flags

#### Basic Options
- `--config`: Path to YAML config file (required)
- `--config`: Path to YAML config file
- `--input`: Path to HTML input file (required)
- `--extract`: Extract and print all matches for each selector
- `--sample-count`: Maximum number of examples to show (default: 5)
- `--select-css`: CSS selectors to test (can be specified multiple times)
- `--select-xpath`: XPath selectors to test (can be specified multiple times)
- `--extract`: Extract all matches into a YAML map of selector name to matches (ignores sample-count limit)
- `--extract-template`: Go template file to render with extracted data
- `--show-context`: Show context around matched elements (default: false)
- `--show-path`: Show path to matched elements (default: true)
- `--sample-count`: Maximum number of examples to show in normal mode (default: 3)
- `--context-chars`: Number of characters of context to include (default: 100)

#### HTML Simplification Options
@@ -76,33 +96,38 @@ test-html-selector --config config.yaml --input path/to/input.html \

### YAML Configuration

- `selectors`: List of selectors to test
- `name`: Friendly name for the selector
- `selector`: CSS or XPath selector string
- `type`: Either "css" or "xpath"
- `config`:
- `sample_count`: Maximum number of examples to show (can be overridden by --sample-count)
- `context_chars`: Number of characters of context to include (can be overridden by --context-chars)
```yaml
description: String describing the purpose of these selectors
selectors:
  - name: Friendly name for the selector
    selector: CSS or XPath selector string
    type: "css" or "xpath"
    description: Description of what this selector matches
config:
  sample_count: Maximum number of examples to show
  context_chars: Number of characters of context to include
  template: Optional Go template for formatting extracted data
```
## Example Output
```yaml
- name: product_titles
selector: .product-card h2
type: css
count: 3
samples:
- html:
- tag: h2
text: "Awesome Product 1"
context:
context: # Only shown with --show-context
- tag: div.info
children:
- tag: h2
text: "Awesome Product 1"
- tag: div.price
text: "$19.99"
path: "html > body > div > div > div > h2"
# ... more samples ...
path: "html > body > div > div > div > h2" # Only shown with --show-path
```
The output shows the full HTML structure in a simplified YAML format. Both `html` and `context` fields contain arrays of documents, allowing for multiple elements to be represented in their full structure. When using `--markdown` or `--simplify-text`, the output will be converted to the appropriate format while preserving important elements.
When using `--extract` with a template, the output format will be determined by your template. The template has access to a map of selector names to their matches, containing ALL matches found (not limited by sample-count). The matches can be text content, markdown, or full document structures depending on your simplification settings.
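
Reviewer note: the README paragraph above says the template receives a map of selector names to their matches. The sketch below shows roughly what that looks like with Go's standard `text/template`, reusing the template from the sample config. It is a minimal illustration, not the tool's code: `renderMatches` is a hypothetical helper, and the matches are assumed to be plain strings (the README notes they may also be markdown or document structures).

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// outputTemplate mirrors the `template` value from the README's sample config.
const outputTemplate = `{{- range $name, $matches := . }}
## {{ $name }}
{{- range $matches }}
- {{ . }}
{{- end }}
{{ end }}`

// renderMatches is a hypothetical helper: it parses tmplText and executes it
// against a map of selector name to matches (assumed here to be strings).
func renderMatches(tmplText string, data map[string][]string) (string, error) {
	tmpl, err := template.New("extract").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Illustrative data only; real matches come from the selectors.
	data := map[string][]string{
		"product_titles": {"Awesome Product 1", "Awesome Product 2"},
		"prices":         {"$19.99"},
	}
	out, err := renderMatches(outputTemplate, data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

`text/template` visits map keys in sorted order, so the sections come out alphabetically by selector name regardless of the order in the config.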
18 changes: 17 additions & 1 deletion cmd/tools/test-html-selector/main.go
@@ -61,6 +61,7 @@ type TestHTMLSelectorSettings struct {
SelectXPath []string `glazed.parameter:"select-xpath"`
InputFile string `glazed.parameter:"input"`
Extract bool `glazed.parameter:"extract"`
ExtractData bool `glazed.parameter:"extract-data"`
ExtractTemplate string `glazed.parameter:"extract-template"`
ShowContext bool `glazed.parameter:"show-context"`
ShowPath bool `glazed.parameter:"show-path"`
@@ -113,6 +114,12 @@ It provides match counts and contextual examples to verify selector accuracy.`),
parameters.WithHelp("Extract all matches into a YAML map of selector name to matches"),
parameters.WithDefault(false),
),
parameters.NewParameterDefinition(
"extract-data",
parameters.ParameterTypeBool,
parameters.WithHelp("Extract raw data without applying any templates"),
parameters.WithDefault(false),
),
parameters.NewParameterDefinition(
"extract-template",
parameters.ParameterTypeString,
@@ -260,6 +267,10 @@ func (c *TestHTMLSelectorCommand) RunIntoWriter(
MaxTableRows: s.MaxTableRows,
})

sampleCount := s.SampleCount
if s.Extract || s.ExtractTemplate != "" {
	sampleCount = 0
}
tester, err := NewSelectorTester(&Config{
File: s.InputFile,
Selectors: selectors,
@@ -268,7 +279,7 @@ func (c *TestHTMLSelectorCommand) RunIntoWriter(
ContextChars int `yaml:"context_chars"`
Template string `yaml:"template"`
}{
SampleCount: s.SampleCount,
SampleCount: sampleCount,
ContextChars: s.ContextChars,
Template: "",
},
@@ -305,6 +316,11 @@ func (c *TestHTMLSelectorCommand) RunIntoWriter(
extractedData[result.Name] = matches
}

// If extract-data is true, output raw data regardless of templates
if s.ExtractData {
	return yaml.NewEncoder(w).Encode(extractedData)
}

// First try command line template
if s.ExtractTemplate != "" {
// Load and execute template
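
Reviewer note: the hunk above places the `--extract-data` check ahead of the template handling. The precedence this appears to produce can be sketched as a small pure function. This is a sketch under stated assumptions, not code from the commit: `chooseOutputMode` and the mode names are hypothetical, the config-file template fallback is inferred from the "First try command line template" comment, and the final default follows the `--extract` help text (a YAML map of selector name to matches).

```go
package main

import "fmt"

// outputMode names are hypothetical labels used only for this sketch.
type outputMode string

const (
	modeRawYAML        outputMode = "raw-yaml"
	modeCLITemplate    outputMode = "cli-template"
	modeConfigTemplate outputMode = "config-template"
)

// chooseOutputMode sketches the order in which the output format is decided:
// extract-data wins, then a command line template, then (assumed) a template
// from the config file, and otherwise plain YAML.
func chooseOutputMode(extractData bool, cliTemplatePath, configTemplate string) outputMode {
	switch {
	case extractData:
		return modeRawYAML
	case cliTemplatePath != "":
		return modeCLITemplate
	case configTemplate != "":
		return modeConfigTemplate
	default:
		return modeRawYAML
	}
}

func main() {
	// Even with both templates present, extract-data selects raw YAML.
	fmt.Println(chooseOutputMode(true, "template.tmpl", "{{ . }}"))
	fmt.Println(chooseOutputMode(false, "template.tmpl", "{{ . }}"))
	fmt.Println(chooseOutputMode(false, "", "{{ . }}"))
}
```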
2 changes: 1 addition & 1 deletion cmd/tools/test-html-selector/selector.go
@@ -169,7 +169,7 @@ func (st *SelectorTester) Run(ctx context.Context) ([]SelectorResult, error) {
totalCount := len(samples)

// Limit samples to configured count
if len(samples) > st.config.Config.SampleCount {
if st.config.Config.SampleCount > 0 && len(samples) > st.config.Config.SampleCount {
samples = samples[:st.config.Config.SampleCount]
}

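
Reviewer note: the `selector.go` change makes a sample count of 0 mean "no limit", which is what the `main.go` hunk relies on when it sets `sampleCount = 0` for extraction. Without the `> 0` guard, a limit of 0 would slice every non-empty result down to `samples[:0]`. The standalone sketch below illustrates the two pieces together; `effectiveSampleCount` and `limitSamples` are hypothetical names, and samples are simplified to strings.

```go
package main

import "fmt"

// effectiveSampleCount mirrors the main.go hunk: extraction modes disable
// the sample limit by returning 0.
func effectiveSampleCount(requested int, extract bool, extractTemplate string) int {
	if extract || extractTemplate != "" {
		return 0
	}
	return requested
}

// limitSamples mirrors the selector.go hunk: a positive limit truncates the
// slice, while 0 leaves it untouched.
func limitSamples(samples []string, limit int) []string {
	if limit > 0 && len(samples) > limit {
		return samples[:limit]
	}
	return samples
}

func main() {
	samples := []string{"a", "b", "c", "d"}

	normal := limitSamples(samples, effectiveSampleCount(3, false, ""))
	extracted := limitSamples(samples, effectiveSampleCount(3, true, ""))

	fmt.Println(len(normal))    // 3
	fmt.Println(len(extracted)) // 4
}
```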
