The unhtml utility removes HTML markup from a document and outputs plain text on stdout in the UTF-8 charset.
unhtml is an extractor, not a renderer; other tools such as w3m are more appropriate if the output is to be embellished with rendering based on HTML markup.
Content within <SCRIPT> and <STYLE> elements is ignored.
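The extractor behaviour described above can be illustrated with a short sketch. This is not how unhtml itself is implemented (it uses libxml2 and/or libgumbo); it is a rough Python analogue built on the standard library's html.parser, and the names TextExtractor and extract_text are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def extract_text(html):
    """Return the plain text of an HTML string, with no rendering applied."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Note that no layout is applied: adjacent paragraphs are simply concatenated, which is the difference between extracting text and rendering it.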
unhtml can be built with the libxml2 and/or libgumbo parsers to cover HTML, XHTML and HTML 5 tag-soup documents.
This version of unhtml is a complete rewrite of, and drop-in replacement for, the unhtml program (originally known as 'clean') written by Kevin Swan in 1998. The original program appears to be abandoned upstream and is currently maintained in Debian as unhtml at https://salsa.debian.org/debian/unhtml/-/tree/upstream?ref_type=heads
The current unhtml package in Debian has significant flaws which in my view aren't sensibly fixable by patching the old version. This new version uses proper HTML and XHTML parser libraries with complete capability to handle entities.
The major flaws of the old package are:
- Only understands and emits ISO-8859-1, not UTF-8.
- Has handling limitations with some constructs such as DOCTYPE, CDATA sections, comments and STYLE elements.
The existing package has a non-trivial popcon count, if I have understood the metric correctly, so there evidently is some demand for a standalone utility, even though it would probably be simpler to run a command like xmllint --html --xpath '//text()'. So I thought it would be a good idea to come up with a drop-in replacement that is at least fit for purpose for that constituency of users.
The known limitations of this version are:
- Does not convert output to the current locale - it will always be UTF-8.
- If the Gumbo parser for HTML 5 tag soup is used then the input must be UTF-8.
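Because of the Gumbo restriction, input in another charset has to be transcoded to UTF-8 before it is parsed. A minimal sketch of that step, assuming the source encoding is already known (ISO-8859-1 is used here purely as an example); the helper name to_utf8 is hypothetical and is not part of unhtml:

```python
def to_utf8(raw, source_encoding="iso-8859-1"):
    """Re-encode bytes from a known source encoding into UTF-8.

    The source encoding is an assumption supplied by the caller;
    this sketch does not attempt to detect it.
    """
    return raw.decode(source_encoding).encode("utf-8")
```

For example, to_utf8(b"caf\xe9") yields the UTF-8 bytes for "café", which could then be handed to a UTF-8-only parser.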
Contributions and criticisms welcome!
It is suggested that downstream packagers use signed tags from this repository as the canonical upstream form, rather than any tarball artefacts. Releases will be signed by the following PGP key:
pub rsa4096/0x4510339430FC9F34 2022-08-16 [C]
Key fingerprint = 06AB 786E 936C 6C73 F6D8 130C 4510 3394 30FC 9F34
This project has nothing in common with its predecessor apart from a backwards-compatible interface. It is licensed under the MIT licence and is Copyright (c) 2024, Andrew Bower andrew@bower.uk