Another tool of ours, txt2stix, takes a .txt file input and then extracts IoCs (indicators of compromise) and TTPs (tactics, techniques and procedures).
However, in many cases the file a user wants to process is not usually in structured plain text file format (e.g. is usually in pdf, docx, etc. formats).
These files also commonly contain images with text that are useful to extract too.
file2txt is a Python library takes common file formats and turns them into plain text (a .md
file) with Markdown styling to make it as nice as possible to read.
In addition to the printed text, file2txt can also extract text from images found in the input file.
Essentially file2txt is used by us to produce a text output that can be scanned for IoCs and TTPs (by txt2stix), but could be used for a variety of other use-cases as you see fit.
The general flow of the file2txt is as follows:
https://miro.com/app/board/uXjVKZXyIxA=/
# clone the latest code
git clone https://github.com/signalscorps/file2txt
# create a venv
cd file2txt
python3 -m venv file2txt-venv
source file2txt-venv/bin/activate
# install requirements
pip3 install -r requirements.txt
file2txt has various settings that are defined in an .env
file.
To create a template for the file:
cp .env.example .env
To see more information about how to set the variables, and what they do, read the .env.markdown
file.
file2txt uses the Marker API to process the following filetypes;
- PDF
.pdf
(application/pdf
) - Word
.doc
(application/msword
),.docx
(application/vnd.openxmlformats-officedocument.wordprocessingml.document
) - Powerpoint
.ppt
(application/vnd.ms-powerpoint
),.pptx
(application/vnd.openxmlformats-officedocument.presentationml.presentation
)
You only need a Marker API key if you intend to use one of the above modes.
Once it's generated add your API key using the MARKER_API_KEY
variable in the .env
file.
You do not need a Marker API key if you only intend to convert the following file types;
- HTML
html
(text/html
) - HTML Article
html_article
(text/html
) - CSV
csv
(text/csv
) - Image
jpg
(image/jpg
),.jpeg
(image/jpeg
),.png
(image/png
),.webp
(image/webp
)
file2txt uses Cloud Vision to text from images found in the input documents. This feature is optional. If you do not set a Cloud Vision key, you will not be able to use the extract_text_from_image
feature.
If you want to use this feature you must set your Cloud Vision credentials as follows...
Login to the GCP Console.
The project name can be anything you want. It will only be visible to you in the GCP Console.
This app requires the following Google API's to work:
Go to APIs and Services and create a new API Key. It's a good idea to limit the keys scope to the Cloud Vision API.
Once it's generated add your API key using the GOOGLE_VISION_API_KEY
variable in the .env
file.
You do not need a Google API key if you don't want to convert images to text.
python3 file2txt.py \
--mode mode \
--file path/to/file.extension \
--output my_document \
--defang boolean \
--extract_text_from_image boolean
To upload a new file to be processed to text the following flags are used;
--mode
(required, dictionary): must be supported mode. Mode must support the filetype being used, else an error will be returned.txt
md
image
csv
html
html_article
pdf
(requires marker api key)
word
(requires marker api key)powerpoint
(requires marker api key)
--file
(required, string): path to file to be converted to text. Note, if the filetype and mimetype of the document submitted does not match one of those supported by file2txt (and set formode
, an error will be returned.--output
(optional, string): name of output directory name. Default is{input_filename}/
.--defang
(optional, boolean): if output should be defanged. Default istrue
.--extract_text_from_image
(optional, boolean, required Google Vision api key): if images should be converted to text using OCR. Default istrue
. This flag MUST befalse
withcsv
mode and MUST betrue
withimage
mode.
The script will output all files to the output/
directory in the following structure;
output
├── {input_filename}
│ ├── {input_filename}.md
│ ├── EXTRACTED_IMAGE_1.FORMAT
│ └── EXTRACTED_IMAGE_2.FORMAT
To ensure images are not lost (in modes that support images), the script also extracts and stores a copy of all identified images in the directory created for the input file.
You can see the output from the commands below in the examples/
directory of this repository.
If you want to try with the same files I used, read how to download them in tests/README.md
Turn a CSV into markdown table;
python3 file2txt.py \
--mode csv \
--file tests/files/csv/csv-test.csv \
--output examples/csv_input \
--defang true \
--extract_text_from_image false
And a spreadsheet;
python3 file2txt.py \
--mode excel \
--file tests/files/xls/fanged_data.xlsx \
--output examples/xls_input \
--defang true \
--extract_text_from_image false
Convert a PDF document to human friendly markdown, extract text from images, and defang the text (the most common use-case for cyber-security);
python3 file2txt.py \
--mode pdf \
--file tests/files/pdf-real/bitdefender-rdstealer.pdf \
--output examples/pdf_input \
--defang true \
--extract_text_from_image true
Only convert the text in the main article on the webpage into markdown, also extract text from images, and defang the text;
python3 file2txt.py \
--mode html_article \
--file tests/files/html-real/unit42-Fighting-Ursa-Luring-Targets-With-Car-for-Sale.html \
--output examples/html_article_input \
--defang true \
--extract_text_from_image true
Now convert the entire HTML content, not just the article
python3 file2txt.py \
--mode html \
--file tests/files/html-real/unit42-Fighting-Ursa-Luring-Targets-With-Car-for-Sale.html \
--output examples/html_input \
--defang true \
--extract_text_from_image true
Do not defang this Word file;
python3 file2txt.py \
--mode word \
--file tests/files/doc/fanged_data.docx \
--output examples/word_input_defang_f \
--defang false \
--extract_text_from_image true
Defang this word file;
python3 file2txt.py \
--mode word \
--file tests/files/doc/fanged_data.docx \
--output examples/word_input_defang_t \
--defang true \
--extract_text_from_image true
Now try a Powerpoint
python3 file2txt.py \
--mode powerpoint \
--file tests/files/ppt/fanged_data.pptx \
--output examples/ppt_input \
--defang true \
--extract_text_from_image true
Extract data from an png image;
python3 file2txt.py \
--mode image \
--file tests/files/image/example-1.png \
--output examples/image_input \
--defang true \
--extract_text_from_image true
See how file2txt deals with markdown inputs;
python3 file2txt.py \
--mode md \
--file tests/files/markdown/threat-report.md \
--output examples/markdown_input \
--defang true \
--extract_text_from_image true
For more examples, you can also run our automated scripts to generate files.
python3 -m unittest tests/test_1_output_file_generation.py
You will need a Google Vision in your .env
file when running this test.
This script generates output files using a combination of file2txt settings.
You need to check the output manually to ensure it matches expectations.
python3 -m unittest tests/test_2_negative_tests.py
This will test invalid file input settings. All tests are expected to fail.
If the script is failing, you can examine the log file printed in logs/
to try and resolve any issues. Each run has its own log, named using execution time (e.g. file2txt_20231127-205228_846248.log
).
You can upload a range of filetypes to file2txt.
File extensions and mimetypes are validated on input for security, if they are not supported an error is returned.
The input file type determines how the files should be handled.
- Filetypes supported (mime-type):
txt
(text/plain
) - Embedded images processed using
image
mode and stored locally: FALSE - Supports paging: FALSE
- Python library used for conversion to markdown: n/a
- Filetypes supported (mime-type):
.md
(text/markdown
),.markdown
(text/markdown
) - Embedded images processed using
image
mode and stored locally: TRUE - Supports paging: FALSE
- Python library used for conversion to markdown: n/a
- Filetypes supported (mime-type):
jpg
(image/jpg
),.jpeg
(image/jpeg
),.png
(image/png
),.webp
(image/webp
) - Embedded images processed using
image
mode and stored locally: TRUE - Supports paging: FALSE
- Python library used for conversion to markdown: n/a
- Filetypes supported (mime-type):
csv
(text/csv
) - Embedded images processed using
image
mode and stored locally: FALSE - Supports paging: FALSE
- Python library used for conversion to markdown:
pandas
- Filetypes supported (mime-type):
xls
(application/vnd.ms-excel
),xlsx
(application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
) - Embedded images processed using
image
mode and stored locally: FALSE - Supports paging: FALSE (only considers first tab)
- Python library used for conversion to markdown:
pandas
- Filetypes supported (mime-type):
html
(text/html
) - Embedded images processed using
image
mode and stored locally: TRUE - Supports paging: FALSE
- Python library used for conversion to markdown:
python-markdownify
This will consider the entire HTML of the page (e.g. nav bars, footers, etc.). Generally you do not want this extra data (mainly problematic when exporting webpages as HTML from a browser).
Such HTML outputs can get very messy (stylesheets, javascript, etc). As such for exported pages we generally recommend users using html_article
mode, or a html to pdf tool (e.g. printfriendly or similar) and uploading the page in pdf_mode
for best results.
- Filetypes supported (mime-type):
html
(text/html
) - Embedded images processed using
image
mode and stored locally: TRUE - Supports paging: FALSE
- Python library used for conversion to markdown: see
html
mode
Many of our use-cases call for the actual article in the HTML of a website to be considered (versus the entirety of a page, e.g. the nav bar, the footer, advertisements, etc.). This mode will attempt to remove anything not considered the core content of the page.
- Filetypes supported (mime-type):
pdf
(application/pdf
) - Embedded images processed using
image
mode and stored locally: TRUE - Supports paging: TRUE
- Marker API
- Filetypes supported (mime-type):
docx
(application/vnd.openxmlformats-officedocument.wordprocessingml.document
),doc
(application/msword
) - Embedded images processed using
image
mode and stored locally: TRUE - Supports paging: TRUE
- Marker API
- Filetypes supported (mime-type):
ppt
(application/vnd.ms-powerpoint
),.jpeg
(application/vnd.openxmlformats-officedocument.presentationml.presentation
), - Embedded images processed using
image
mode and stored locally: TRUE - Supports paging: TRUE (one page = one slide)
- Marker API
Generally, where a filetype is not supportedt
Similarly, for other word processing formats, there is usually the option to save to pdf, which can then be uploaded to file2txt.
HTML can be printed in a variety of ways
- Encoded: e.g. contains
>
vs>
- Decoded Raw: standard HTML tags
- Decoded CDATA: the actual Decoded Raw HTML is inside
<![CDATA[Decoded Raw HTML]]>
tags (most common in XML/JSON docs containing HTML, most commonly RSS/ATOM feeds)
As an example, endcoded
>img src="https://cms.therecord.media/uploads/2023_0706_Ransomware_Tracker_Most_Prolific_Groups_6a567c11da.jpg"<
Which as decoded raw html looks as follows
<img src="https://cms.therecord.media/uploads/2023_0706_Ransomware_Tracker_Most_Prolific_Groups_6a567c11da.jpg">
Which as decoded CDATA looks like
<![CDATA[<img src="https://cms.therecord.media/uploads/2023_0706_Ransomware_Tracker_Most_Prolific_Groups_6a567c11da.jpg">]]>
IMPORTANT: file2txt only supports decoded HTML.
If your HTML data is encoded, or wrapped in CDATA tags, you must first convert it to decoded HTML before processing with file2txt otherwise you the output will not be converted correctly.
In some filetype inputs, file2txt will convert text found in embedded images to text, if enabled by the user.
Embedded images are defined images that exist inline with the text. That is; when you read the document, you can see the actual image.
Images are identified using image tags in the supported modes (HTML and PDF). file2txt will always store a copy of identified images locally, even if you do not enable image text extraction.
To make it clear where the text shown is from an embedded image (i.e. where the original image was found in doc) in the output, when file2txt detects an image in a document it will insert the following tags; [comment]: <> (===START EMBEDDED IMAGE EXTRACTION===)
and [comment]: <> (===END EMBEDDED IMAGE EXTRACTION===)
. Between the tags will be any text identified in an embedded image, e.g.
Once upon a time
[comment]: <> (===START IMAGE DETECTED===)
![](local/path/to/image_url.png)
[comment]: <> (===START EMBEDDED IMAGE EXTRACTION===)
<TEXT IN IMAGE>
[comment]: <> (===END EMBEDDED IMAGE EXTRACTION===)
[comment]: <> (===END IMAGE DETECTED===)
The end
Due to the way text extraction from images is performed, it means the text identified in an image is outputted into the final markdown with no structure (e.g. if a table can be seen in an image found in a doc, it will not be a markdown table in text output).
If extract text from images is set to false, a user will only see the pure markdown image ref in the output, e.g. for the above
![](local/path/to/image_url.png)
The image URL always points to the locally stored copy of the image.
PDF input types support paging. This is done by splitting a pdf of many pages, into individual pdf files per page.
file2txt will output a single text file with all the pdf page text extractions merged. Pages (i.e. each txt file for each pdf page produced) is shown in the output document with breaks to show where a page starts and ends.
For example;
[comment]: <> (===START PAGE 1===)
Some content
OK
[comment]: <> (===END PAGE 1===)
[comment]: <> (===START PAGE 2===)
Another page
[comment]: <> (===END PAGE 2===)
Note, if the input type does not support paging, page breaks are included, but will only ever show page 1 start and end.
Fanging obfuscates indicators into a safer representations so that a user reading a report does not accidentally click on a malicious URL or inadvertently run malicious code. Many cyber threat intelligence reports shared electronically employ fanging.
Typical types of fanged Observables include IPv4 addresses (e.g. 1.1.1[.]1
), IPv6 addresses (e.g. 2001:0db8:85a3:0000:0000:8a2e:0370[:]7334
), domain names (e.g. example[.]com
), URLs (e.g. https[:]//example.com/research/index.html
), email addresses (e.g. example[@]example.com
), file extensions (e.g. malicious[.]exe
), and directory paths (e.g. [C:]\\Windows\\System32
).
As file2txt was built for security reports, it is very likely the input will be fanged. Therefore, file2txt offers users the ability to defang all text in the input.
Unfortunately, there is no universal standard for fanging, although there are some common methods. Some samples of fanging I have observed include the following:
- Wrapping one or more special characters in
[
]
- e.g.
www[.]example[.]com
- e.g.
http[:]//example.com
- e.g.
http[://]example.com
- e.g.
1.1.1.1[/]24
- e.g.
- Wrapping one or more special characters in
{
}
- Wrapping one or more special characters in
(
)
- Replacing
http
andhxxp
- e.g.
hxxps://google.com
- e.g.
- Replacing
.
withdot
- e.g.
example@example dot com
- e.g.
http://example dot com
- e.g.
- Replacing
.
with[dot]
(or(dot)
, or{dot}
)- e.g.
example@example[dot]com
- e.g.
- Replacing
@
withat
- e.g.
example at example.com
- e.g.
- Replacing
@
with[at]
(or(at)
, or{at}
)- e.g.
example[at]example.com
- e.g.
A combination of the above techniques are also commonly implemented for defanging. For example replacing .
with dot
and replacing @
with at
for an email like so; fanged = example at example dot com
, defanged = example@example.com
Another example using even more fanging technique combinations for a URL; fanged = hxxps[:]//test\.example[.)com[/]path
, defanged = https://test.example.com/path
At the final step of processing, file2txt will defang all txt outputs using a find and replace stratergy in the following order to defang:
- replaces the text
hxxp
withhttp
- replaces the text
{dot}
with.
- replaces the text
{dot}
with.
- replaces the text
[dot]
with.
- replaces the text
[dot]
with.
- replaces the text
(dot)
with.
- replaces the text
(dot)
with.
- replaces the text
{at}
with@
- replaces the text
{at}
with@
- replaces the text
[at]
with@
- replaces the text
[at]
with@
- replaces the text
(at)
with@
- replaces the text
(at)
with@
- removes the square bracket characters (
[
and]
) around the following special characters;.
.
@
@
/
/
\
\
://
://
- removes the curly bracket characters (
{
and}
) around the following special characters;.
.
@
@
/
/
\
\
://
://
- removes the parentheses characters (
(
and)
) around the following special characters;.
.
@
@
/
/
\
\
://
://
As an example, /tests/txt/fanged_data_good.txt
processed in defang
mode goes from
...
C: (\) Windows (\) System32
[2002(:)(:)abcd(:)ffff(:)c0a8(:)101](:)80
[2002 (:) (:) abcd (:) ffff (:) c0a8 (:) 101] (:) 80
someone(@)example.com
someone (@) example.com
http(://)example.com
http (://) example.com
http (://) example[.]something{dot}other(dot)com(/)this[/]file{.}html
...
to
C:\Windows\System32
[2002::abcd:ffff:c0a8:101]:80
[2002::abcd:ffff:c0a8:101]:80
someone@example.com
someone@example.com
http://example.com
http://example.com
http://example.something.other.com/this/file.html
The output of the script is always markdown.
It will generally markdown title tags (e.g. #
, ##
, ###
), table, image tags, link tags, etc as the output is designed to closely match the styling of the input document.
However, it is not always perfect. As a general rule, the cleaner the styling for input (i.e, report type structure is best -> interactive webpages worst) the better file2txt will handle the output.
This does not apply to images (see embedded images).