Commit
refacto: align data api with edsnlp
percevalw committed Feb 7, 2024
1 parent ec083ed commit 78f634c
Showing 29 changed files with 2,306 additions and 1,190 deletions.
Binary file added docs/assets/images/multiprocessing.png
3 changes: 0 additions & 3 deletions docs/assets/images/multiprocessing.svg

This file was deleted.

10 changes: 4 additions & 6 deletions docs/index.md
@@ -99,12 +99,10 @@ See the [rule-based recipe](recipes/rule-based.md) for a step-by-step explanatio
If you use EDS-PDF, please cite us as below.

```bibtex
@software{edspdf,
author = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
doi = {10.5281/zenodo.6902977},
license = {BSD-3-Clause},
title = {{EDS-PDF: Smart text extraction from PDF documents}},
url = {https://github.com/aphp/edspdf}
@article{gerardin_wajsburt_pdf,
title={Bridging Clinical PDFs and Downstream Natural Language Processing: An Efficient Neural Approach to Layout Segmentation},
author={G{\'e}rardin, Christel Ducroz and Wajsburt, Perceval and Dura, Basile and Calliger, Alice and Mouchet, Alexandre and Tannier, Xavier and Bey, Romain},
journal={Available at SSRN 4587624}
}
```

108 changes: 78 additions & 30 deletions docs/inference.md
@@ -1,61 +1,109 @@
# Inference

Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from the disk, you can use it to make predictions on documents. This is referred to as inference.
Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from the disk, you can use it to make predictions on documents. This is referred to as inference. This page answers the following questions:

> How do we leverage computational resources to run a model on many documents?
> How do we connect to various data sources to retrieve documents?

## Inference on a single document

In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:
In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:

- a sequence of bytes
- or a [PDFDoc][edspdf.structures.PDFDoc] object
- a bytes string
- or a [PDFDoc][edspdf.structures.PDFDoc] object

```python
```{ .python .no-check }
from pathlib import Path
pipeline = ...
content = Path("path/to/.pdf").read_bytes()
doc = pipeline(content)
model = ...
pdf_bytes = b"..."
doc = model(pdf_bytes)
```

If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline. To leverage multiple GPUs, refer to the [multiprocessing accelerator][edspdf.accelerators.multiprocessing.MultiprocessingAccelerator] description below.
If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline.

```python
pipeline.to("cuda") # same semantics as pytorch
doc = pipeline(content)
```{ .python .no-check }
model.to("cuda") # same semantics as pytorch
doc = model(pdf_bytes)
```

## Inference on multiple documents
To leverage multiple GPUs when processing multiple documents, refer to the [multiprocessing backend][edspdf.processing.multiprocessing.execute_multiprocessing_backend] description below.

## Inference on multiple documents {: #edspdf.lazy_collection.LazyCollection }

When processing multiple documents, we can optimize the inference by parallelizing the computation on a single core, multiple cores and GPUs or even multiple machines.

### Lazy collection

These optimizations are enabled by performing *lazy inference*: the operations (e.g., reading a document, converting it to a PDFDoc, running the different pipes of a model or writing the result somewhere) are not executed immediately but are instead scheduled in a [LazyCollection][edspdf.lazy_collection.LazyCollection] object. It can then be executed by calling the `execute` method, iterating over it or calling a writing method (e.g., `to_pandas`). In fact, data connectors like `edspdf.data.read_files` return a lazy collection, as does the `model.pipe` method.

A lazy collection contains:

- a `reader`: the source of the data (e.g., a file, a database, a list of strings, etc.)
- the list of operations to perform, under a `pipeline` attribute that stores, for each operation, its name (if any), the function or pipe to apply, its keyword arguments and its context
- an optional `writer`: the destination of the data (e.g., a file, a database, a list of strings, etc.)
- the execution `config`, containing the backend to use and its configuration such as the number of workers, the batch size, etc.

All methods (`.map`, `.map_pipeline`, `.set_processing`) of the lazy collection are chainable, meaning that they return a new object (no in-place modification).
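
For instance, a minimal sketch of chaining (the model path and converter names are illustrative):

```{ .python .no-check }
import edspdf

model = edspdf.load("path/to/model")

# Each call below returns a *new* lazy collection; nothing is executed yet
data = edspdf.data.read_files("path/to/pdf/files", converter="...")
data = data.map_pipeline(model)
data = data.set_processing(num_cpu_workers=4)

# Execution is only triggered here, when iterating over the collection
docs = list(data)
```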

When processing multiple documents, it is usually more efficient to use the `pipeline.pipe(...)` method, especially when using deep learning components, since this allows matrix multiplications to be batched together. Depending on your computational resources and requirements, EDS-PDF comes with various "accelerators" to speed up inference (see the [Accelerators](#accelerators) section for more details). By default, the `.pipe()` method uses the [`simple` accelerator][edspdf.accelerators.simple.SimpleAccelerator] but you can switch to a different one by passing the `accelerator` argument.
For instance, the following code will load a model, read a folder of JSON files, apply the model to each document and write the result in a Parquet folder, using 4 CPUs and 2 GPUs.

```python
pipeline = ...
docs = pipeline.pipe(
[content1, content2, ...],
batch_size=16, # optional, default to the one defined in the pipeline
accelerator=my_accelerator,
```{ .python .no-check }
import edspdf
# Load or create a model
model = edspdf.load("path/to/model")
# Read some data (this is lazy, no data will be read until the end of this snippet)
data = edspdf.data.read_files("path/to/pdf/files", converter="...")
# Apply each pipe of the model to our documents
data = data.map_pipeline(model)
# or equivalently : data = model.pipe(data)
# Configure the execution
data = data.set_processing(
# 4 CPUs to parallelize rule-based pipes, IO and preprocessing
num_cpu_workers=4,
# 2 GPUs to accelerate deep-learning pipes
num_gpu_workers=2,
)
# Write the result, this will execute the lazy collection
data.write_parquet("path/to/output_folder", converter="...", write_in_worker=True)
```

The `pipe` method supports the following arguments:
### Applying operations to a lazy collection

To apply an operation to a lazy collection, you can use the `.map` method. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each element of the collection.

To apply a model, you can use the `.map_pipeline` method. It takes a model as input and will add every pipe of the model to the scheduled operations.

In both cases, the operations will not be executed immediately but will be scheduled to be executed when iterating over the collection, or when calling the `.execute`, `.to_*` or `.write_*` methods.
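
For instance, a minimal sketch of both methods (the `count_boxes` helper and its `attr` keyword are hypothetical):

```{ .python .no-check }
import edspdf

model = edspdf.load("path/to/model")
data = edspdf.data.read_files("path/to/pdf/files", converter="...")

# Hypothetical helper applied to each PDFDoc of the collection
def count_boxes(doc, attr):
    # assumes the PDFDoc exposes its text boxes as `content_boxes`
    # and allows setting extra attributes
    setattr(doc, attr, len(doc.content_boxes))
    return doc

# Schedule a plain function, with an optional dictionary of keyword arguments
data = data.map(count_boxes, kwargs={"attr": "n_boxes"})

# Schedule every pipe of the model
data = data.map_pipeline(model)
```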

### Execution of a lazy collection {: #edspdf.lazy_collection.LazyCollection.set_processing }

You can configure how the operations performed in the lazy collection are executed by calling its `set_processing(...)` method. The following options are available:

::: edspdf.pipeline.Pipeline.pipe
::: edspdf.lazy_collection.LazyCollection.set_processing
options:
heading_level: 3
only_parameters: true
only_parameters: "no-header"

## Accelerators
## Backends

### Simple accelerator {: #edspdf.accelerators.simple.SimpleAccelerator }
### Simple backend {: #edspdf.processing.simple.execute_simple_backend }

::: edspdf.accelerators.simple.SimpleAccelerator
::: edspdf.processing.simple.execute_simple_backend
options:
heading_level: 3
only_class_level: true
show_source: false

### Multiprocessing accelerator {: #edspdf.accelerators.multiprocessing.MultiprocessingAccelerator }
### Multiprocessing backend {: #edspdf.processing.multiprocessing.execute_multiprocessing_backend }

::: edspdf.accelerators.multiprocessing.MultiprocessingAccelerator
::: edspdf.processing.multiprocessing.execute_multiprocessing_backend
options:
heading_level: 3
only_class_level: true
show_source: false
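
The backend is normally inferred from the processing options, but it can also be selected explicitly; a sketch, assuming the `backend` parameter of `set_processing`:

```{ .python .no-check }
# Force the simple, sequential backend (convenient for debugging)
data = data.set_processing(backend="simple")

# Or request the multiprocessing backend explicitly
data = data.set_processing(backend="multiprocessing", num_cpu_workers=4)
```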
147 changes: 55 additions & 92 deletions docs/scripts/plugin.py
@@ -7,6 +7,8 @@
import mkdocs.structure.files
import mkdocs.structure.nav
import mkdocs.structure.pages
import regex
from mkdocs_autorefs.plugin import AutorefsPlugin

try:
from importlib.metadata import entry_points
@@ -128,9 +130,13 @@ def on_page_read_source(page, config):
return None


HREF_REGEX = r'href=(?:"([^"]*)"|\'([^\']*)|[ ]*([^ =>]*)(?![a-z]+=))'
HREF_REGEX = (
r"(?<=<\s*(?:a[^>]*href|img[^>]*src)=)"
r'(?:"([^"]*)"|\'([^\']*)|[ ]*([^ =>]*)(?![a-z]+=))'
)


# Maybe find something less specific?
PIPE_REGEX = r"(?<=[^a-zA-Z0-9._-])eds[.][a-zA-Z0-9._-]*(?=[^a-zA-Z0-9._-])"


@mkdocs.plugins.event_priority(-1000)
@@ -155,100 +161,57 @@ def on_post_page(
"""

autorefs = config["plugins"]["autorefs"]
edspdf_factories_entry_points = {
ep.name: ep.value for ep in entry_points()["edspdf_factories"]
autorefs: AutorefsPlugin = config["plugins"]["autorefs"]
factories_entry_points = {
ep.name: autorefs.get_item_url(ep.value.replace(":", "."))
for ep in entry_points()["edspdf_factories"]
}
factories_entry_points = {
k: "/" + v if not v.startswith("/") else v
for k, v in factories_entry_points.items()
}
factories_entry_points.update(
{
"mupdf-extractor": "https://aphp.github.io/edspdf-mupdf/latest/",
"poppler-extractor": "https://aphp.github.io/edspdf-poppler/latest/",
}
)

def get_component_url(name):
ep = edspdf_factories_entry_points.get(name)
if ep is None:
return None
try:
url = autorefs.get_item_url(ep.replace(":", "."))
except KeyError:
pass
else:
return url
return None
PIPE_REGEX_BASE = "|".join(regex.escape(name) for name in factories_entry_points)
PIPE_REGEX = f"""(?x)
(?<=")({PIPE_REGEX_BASE})(?=")
|(?<=&quot;)({PIPE_REGEX_BASE})(?=&quot;)
|(?<=')({PIPE_REGEX_BASE})(?=')
|(?<=<code>)({PIPE_REGEX_BASE})(?=</code>)
"""

def get_relative_link(url):
def replace_component(match):
name = match.group()
preceding = output[match.start(0) - 50 : match.start(0)]
if (
"DEFAULT:"
not in preceding
# and output[: match.start(0)].count("<code>")
# > output[match.end(0) :].count("</code>")
):
try:
ep_url = factories_entry_points[name]
except KeyError:
pass
else:
if ep_url.split("#")[0].strip("/") != page.file.url.strip("/"):
return "<a href={href}>{name}</a>".format(href=ep_url, name=name)
return name

def replace_link(match):
relative_url = url = match.group(1) or match.group(2) or match.group(3)
page_url = os.path.join("/", page.file.url)
if url.startswith("/"):
url = os.path.relpath(url, page_url)
return url

def replace_component_span(span):
content = span.text
if content is None:
return
link_url = get_component_url(content.strip("\"'"))
if link_url is None:
return
a = etree.Element("a", href="/" + link_url)
a.text = content
span.text = ""
span.append(a)

def replace_component_names(root):
# Iterate through all span elements
spans = list(root.iter("span", "code"))
for i, span in enumerate(spans):
prev = span.getprevious()
if span.getparent().tag == "a":
continue
# To avoid replacing default component name in parameter tables
if prev is None or prev.text != "DEFAULT:":
replace_component_span(span)
# if span.text == "add_pipe":
# next_span = span.getnext()
# if next_span is None:
# continue
# next_span = next_span.getnext()
# if next_span is None or next_span.tag != "span":
# continue
# replace_component_span(next_span)
# continue
# tokens = ["@", "factory", "="]
# while True:
# if len(tokens) == 0:
# break
# if span.text != tokens[0]:
# break
# tokens = tokens[1:]
# span = span.getnext()
# while span is not None and (
# span.text is None or not span.text.strip()
# ):
# span = span.getnext()
# if len(tokens) == 0:
# replace_component_span(span)

# Convert the modified tree back to a string
return root

def replace_absolute_links(root):
# Iterate through all a elements
for a in root.iter("a"):
href = a.get("href")
if href is None or href.startswith("http"):
continue
a.set("href", get_relative_link(href))
for img in root.iter("img"):
href = img.get("src")
if href is None or href.startswith("http"):
continue
img.set("src", get_relative_link(href))

# Convert the modified tree back to a string
return root
relative_url = os.path.relpath(url, page_url)
return f'"{relative_url}"'

# Replace absolute paths with path relative to the rendered page
from lxml.html import etree

root = etree.HTML(output)
root = replace_component_names(root)
root = replace_absolute_links(root)
doctype = root.getroottree().docinfo.doctype
res = etree.tostring(root, encoding="unicode", method="html", doctype=doctype)
return res
output = regex.sub(PIPE_REGEX, replace_component, output)
output = regex.sub(HREF_REGEX, replace_link, output)

return output
6 changes: 3 additions & 3 deletions docs/trainable-pipes.md
Expand Up @@ -97,12 +97,12 @@ class MyComponent(TrainablePipe):
"my-feature": ...(doc),
}

def collate(self, batch, device: torch.device) -> Dict:
def collate(self, batch) -> Dict:
# Collate the features of the "embedding" subcomponent
# and the features of this component as well
return {
"embedding": self.embedding.collate(batch["embedding"], device),
"my-feature": torch.as_tensor(batch["my-feature"], device=device),
"embedding": self.embedding.collate(batch["embedding"]),
"my-feature": torch.as_tensor(batch["my-feature"]),
}

def forward(self, batch: Dict, supervision=False) -> Dict:
1 change: 1 addition & 0 deletions edspdf/__init__.py
@@ -3,6 +3,7 @@
from .pipeline import Pipeline, load
from .registry import registry
from .structures import Box, Page, PDFDoc, Text, TextBox, TextProperties
from . import data

from . import utils # isort:skip
