Commit
refacto: align data api with edsnlp
percevalw committed Feb 7, 2024
1 parent ec083ed commit 78f634c
Showing 29 changed files with 2,306 additions and 1,190 deletions.
Binary file added docs/assets/images/multiprocessing.png
3 changes: 0 additions & 3 deletions docs/assets/images/multiprocessing.svg

This file was deleted.

10 changes: 4 additions & 6 deletions docs/index.md
@@ -99,12 +99,10 @@ See the [rule-based recipe](recipes/rule-based.md) for a step-by-step explanatio
If you use EDS-PDF, please cite us as below.

```bibtex
@software{edspdf,
author = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
doi = {10.5281/zenodo.6902977},
license = {BSD-3-Clause},
title = {{EDS-PDF: Smart text extraction from PDF documents}},
url = {https://github.com/aphp/edspdf}
@article{gerardin_wajsburt_pdf,
title={Bridging Clinical PDFs and Downstream Natural Language Processing: An Efficient Neural Approach to Layout Segmentation},
author={G{\'e}rardin, Christel Ducroz and Wajsburt, Perceval and Dura, Basile and Calliger, Alice and Mouchet, Alexandre and Tannier, Xavier and Bey, Romain},
journal={Available at SSRN 4587624}
}
```

108 changes: 78 additions & 30 deletions docs/inference.md
@@ -1,61 +1,109 @@
# Inference

Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from the disk, you can use it to make predictions on documents. This is referred to as inference.
Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from the disk, you can use it to make predictions on documents. This is referred to as inference. This page answers the following questions:

> How do we leverage computational resources to run a model on many documents?
> How do we connect to various data sources to retrieve documents?

## Inference on a single document

In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:
In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:

- a sequence of bytes
- or a [PDFDoc][edspdf.structures.PDFDoc] object
- a bytes string
- or a [PDFDoc][edspdf.structures.PDFDoc] object

```python
```{ .python .no-check }
from pathlib import Path
pipeline = ...
content = Path("path/to/.pdf").read_bytes()
doc = pipeline(content)
model = ...
pdf_bytes = b"..."
doc = model(pdf_bytes)
```

If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline. To leverage multiple GPUs, refer to the [multiprocessing accelerator][edspdf.accelerators.multiprocessing.MultiprocessingAccelerator] description below.
If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline.

```python
pipeline.to("cuda") # same semantics as pytorch
doc = pipeline(content)
```{ .python .no-check }
model.to("cuda") # same semantics as pytorch
doc = model(pdf_bytes)
```

## Inference on multiple documents
To leverage multiple GPUs when processing multiple documents, refer to the [multiprocessing backend][edspdf.processing.multiprocessing.execute_multiprocessing_backend] description below.

## Inference on multiple documents {: #edspdf.lazy_collection.LazyCollection }

When processing multiple documents, we can optimize the inference by parallelizing the computation on a single core, multiple cores and GPUs or even multiple machines.

### Lazy collection

These optimizations are enabled by performing *lazy inference*: the operations (e.g., reading a document, converting it to a PDFDoc, running the different pipes of a model or writing the result somewhere) are not executed immediately but are instead scheduled in a [LazyCollection][edspdf.lazy_collection.LazyCollection] object. It can then be executed by calling the `execute` method, iterating over it or calling a writing method (e.g., `to_pandas`). In fact, data connectors like `edspdf.data.read_files` return a lazy collection, as does the `model.pipe` method.

A lazy collection contains:

- a `reader`: the source of the data (e.g., a file, a database, a list of strings, etc.)
- the list of operations to perform, under a `pipeline` attribute that stores, for each operation, its name (if any), the function or pipe to apply, its keyword arguments and its context
- an optional `writer`: the destination of the data (e.g., a file, a database, a list of strings, etc.)
- the execution `config`, containing the backend to use and its configuration such as the number of workers, the batch size, etc.

All methods (`.map`, `.map_pipeline`, `.set_processing`) of the lazy collection are chainable, meaning that they return a new object (no in-place modification).
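
For instance, a minimal sketch of chaining (the model path and converter names are illustrative):

```{ .python .no-check }
import edspdf

model = edspdf.load("path/to/model")

# Each call below returns a *new* lazy collection; nothing is executed yet
data = edspdf.data.read_files("path/to/pdf/files", converter="...")
data = data.map_pipeline(model)
data = data.set_processing(num_cpu_workers=4)

# Execution is only triggered here, when iterating over the collection
docs = list(data)
```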

When processing multiple documents, it is usually more efficient to use the `pipeline.pipe(...)` method, especially when using deep learning components, since this allows matrix multiplications to be batched together. Depending on your computational resources and requirements, EDS-PDF comes with various "accelerators" to speed up inference (see the [Accelerators](#accelerators) section for more details). By default, the `.pipe()` method uses the [`simple` accelerator][edspdf.accelerators.simple.SimpleAccelerator] but you can switch to a different one by passing the `accelerator` argument.
For instance, the following code will load a model, read a folder of JSON files, apply the model to each document and write the result in a Parquet folder, using 4 CPUs and 2 GPUs.

```python
pipeline = ...
docs = pipeline.pipe(
[content1, content2, ...],
batch_size=16, # optional, default to the one defined in the pipeline
accelerator=my_accelerator,
```{ .python .no-check }
import edspdf
# Load or create a model
model = edspdf.load("path/to/model")
# Read some data (this is lazy, no data will be read until the end of this snippet)
data = edspdf.data.read_files("path/to/pdf/files", converter="...")
# Apply each pipe of the model to our documents
data = data.map_pipeline(model)
# or equivalently : data = model.pipe(data)
# Configure the execution
data = data.set_processing(
# 4 CPUs to parallelize rule-based pipes, IO and preprocessing
num_cpu_workers=4,
# 2 GPUs to accelerate deep-learning pipes
num_gpu_workers=2,
)
# Write the result, this will execute the lazy collection
data.write_parquet("path/to/output_folder", converter="...", write_in_worker=True)
```

The `pipe` method supports the following arguments:
### Applying operations to a lazy collection

To apply an operation to a lazy collection, you can use the `.map` method. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each element of the collection.

To apply a model, you can use the `.map_pipeline` method. It takes a model as input and will add every pipe of the model to the scheduled operations.

In both cases, the operations will not be executed immediately but will be scheduled to be executed when iterating over the collection, or when calling the `.execute`, `.to_*` or `.write_*` methods.
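
For instance, a minimal sketch of both methods (the `count_boxes` helper and its `attr` keyword are hypothetical):

```{ .python .no-check }
import edspdf

model = edspdf.load("path/to/model")
data = edspdf.data.read_files("path/to/pdf/files", converter="...")

# Hypothetical helper applied to each PDFDoc of the collection
def count_boxes(doc, attr):
    # assumes the PDFDoc exposes its text boxes as `content_boxes`
    # and allows setting extra attributes
    setattr(doc, attr, len(doc.content_boxes))
    return doc

# Schedule a plain function, with an optional dictionary of keyword arguments
data = data.map(count_boxes, kwargs={"attr": "n_boxes"})

# Schedule every pipe of the model
data = data.map_pipeline(model)
```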

### Execution of a lazy collection {: #edspdf.lazy_collection.LazyCollection.set_processing }

You can configure how the operations performed in the lazy collection are executed by calling its `set_processing(...)` method. The following options are available:

::: edspdf.pipeline.Pipeline.pipe
::: edspdf.lazy_collection.LazyCollection.set_processing
options:
heading_level: 3
only_parameters: true
only_parameters: "no-header"

## Accelerators
## Backends

### Simple accelerator {: #edspdf.accelerators.simple.SimpleAccelerator }
### Simple backend {: #edspdf.processing.simple.execute_simple_backend }

::: edspdf.accelerators.simple.SimpleAccelerator
::: edspdf.processing.simple.execute_simple_backend
options:
heading_level: 3
only_class_level: true
show_source: false

### Multiprocessing accelerator {: #edspdf.accelerators.multiprocessing.MultiprocessingAccelerator }
### Multiprocessing backend {: #edspdf.processing.multiprocessing.execute_multiprocessing_backend }

::: edspdf.accelerators.multiprocessing.MultiprocessingAccelerator
::: edspdf.processing.multiprocessing.execute_multiprocessing_backend
options:
heading_level: 3
only_class_level: true
show_source: false
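
The backend is normally inferred from the processing options, but it can also be selected explicitly; a sketch, assuming the `backend` parameter of `set_processing`:

```{ .python .no-check }
# Force the simple, sequential backend (convenient for debugging)
data = data.set_processing(backend="simple")

# Or request the multiprocessing backend explicitly
data = data.set_processing(backend="multiprocessing", num_cpu_workers=4)
```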
147 changes: 55 additions & 92 deletions docs/scripts/plugin.py
@@ -7,6 +7,8 @@
import mkdocs.structure.files
import mkdocs.structure.nav
import mkdocs.structure.pages
import regex
from mkdocs_autorefs.plugin import AutorefsPlugin

try:
from importlib.metadata import entry_points
@@ -128,9 +130,13 @@ def on_page_read_source(page, config):
return None


HREF_REGEX = r'href=(?:"([^"]*)"|\'([^\']*)|[ ]*([^ =>]*)(?![a-z]+=))'
HREF_REGEX = (
r"(?<=<\s*(?:a[^>]*href|img[^>]*src)=)"
r'(?:"([^"]*)"|\'([^\']*)|[ ]*([^ =>]*)(?![a-z]+=))'
)


# Maybe find something less specific?
PIPE_REGEX = r"(?<=[^a-zA-Z0-9._-])eds[.][a-zA-Z0-9._-]*(?=[^a-zA-Z0-9._-])"


@mkdocs.plugins.event_priority(-1000)
@@ -155,100 +161,57 @@ def on_post_page(
"""

autorefs = config["plugins"]["autorefs"]
edspdf_factories_entry_points = {
ep.name: ep.value for ep in entry_points()["edspdf_factories"]
autorefs: AutorefsPlugin = config["plugins"]["autorefs"]
factories_entry_points = {
ep.name: autorefs.get_item_url(ep.value.replace(":", "."))
for ep in entry_points()["edspdf_factories"]
}
factories_entry_points = {
k: "/" + v if not v.startswith("/") else v
for k, v in factories_entry_points.items()
}
factories_entry_points.update(
{
"mupdf-extractor": "https://aphp.github.io/edspdf-mupdf/latest/",
"poppler-extractor": "https://aphp.github.io/edspdf-poppler/latest/",
}
)

def get_component_url(name):
ep = edspdf_factories_entry_points.get(name)
if ep is None:
return None
try:
url = autorefs.get_item_url(ep.replace(":", "."))
except KeyError:
pass
else:
return url
return None
PIPE_REGEX_BASE = "|".join(regex.escape(name) for name in factories_entry_points)
PIPE_REGEX = f"""(?x)
(?<=")({PIPE_REGEX_BASE})(?=")
|(?<=&quot;)({PIPE_REGEX_BASE})(?=&quot;)
|(?<=')({PIPE_REGEX_BASE})(?=')
|(?<=<code>)({PIPE_REGEX_BASE})(?=</code>)
"""

def get_relative_link(url):
def replace_component(match):
name = match.group()
preceding = output[match.start(0) - 50 : match.start(0)]
if (
"DEFAULT:"
not in preceding
# and output[: match.start(0)].count("<code>")
# > output[match.end(0) :].count("</code>")
):
try:
ep_url = factories_entry_points[name]
except KeyError:
pass
else:
if ep_url.split("#")[0].strip("/") != page.file.url.strip("/"):
return "<a href={href}>{name}</a>".format(href=ep_url, name=name)
return name

def replace_link(match):
relative_url = url = match.group(1) or match.group(2) or match.group(3)
page_url = os.path.join("/", page.file.url)
if url.startswith("/"):
url = os.path.relpath(url, page_url)
return url

def replace_component_span(span):
content = span.text
if content is None:
return
link_url = get_component_url(content.strip("\"'"))
if link_url is None:
return
a = etree.Element("a", href="/" + link_url)
a.text = content
span.text = ""
span.append(a)

def replace_component_names(root):
# Iterate through all span elements
spans = list(root.iter("span", "code"))
for i, span in enumerate(spans):
prev = span.getprevious()
if span.getparent().tag == "a":
continue
# To avoid replacing default component name in parameter tables
if prev is None or prev.text != "DEFAULT:":
replace_component_span(span)
# if span.text == "add_pipe":
# next_span = span.getnext()
# if next_span is None:
# continue
# next_span = next_span.getnext()
# if next_span is None or next_span.tag != "span":
# continue
# replace_component_span(next_span)
# continue
# tokens = ["@", "factory", "="]
# while True:
# if len(tokens) == 0:
# break
# if span.text != tokens[0]:
# break
# tokens = tokens[1:]
# span = span.getnext()
# while span is not None and (
# span.text is None or not span.text.strip()
# ):
# span = span.getnext()
# if len(tokens) == 0:
# replace_component_span(span)

# Convert the modified tree back to a string
return root

def replace_absolute_links(root):
# Iterate through all a elements
for a in root.iter("a"):
href = a.get("href")
if href is None or href.startswith("http"):
continue
a.set("href", get_relative_link(href))
for img in root.iter("img"):
href = img.get("src")
if href is None or href.startswith("http"):
continue
img.set("src", get_relative_link(href))

# Convert the modified tree back to a string
return root
relative_url = os.path.relpath(url, page_url)
return f'"{relative_url}"'

# Replace absolute paths with path relative to the rendered page
from lxml.html import etree

root = etree.HTML(output)
root = replace_component_names(root)
root = replace_absolute_links(root)
doctype = root.getroottree().docinfo.doctype
res = etree.tostring(root, encoding="unicode", method="html", doctype=doctype)
return res
output = regex.sub(PIPE_REGEX, replace_component, output)
output = regex.sub(HREF_REGEX, replace_link, output)

return output
6 changes: 3 additions & 3 deletions docs/trainable-pipes.md
Expand Up @@ -97,12 +97,12 @@ class MyComponent(TrainablePipe):
"my-feature": ...(doc),
}

def collate(self, batch, device: torch.device) -> Dict:
def collate(self, batch) -> Dict:
# Collate the features of the "embedding" subcomponent
# and the features of this component as well
return {
"embedding": self.embedding.collate(batch["embedding"], device),
"my-feature": torch.as_tensor(batch["my-feature"], device=device),
"embedding": self.embedding.collate(batch["embedding"]),
"my-feature": torch.as_tensor(batch["my-feature"]),
}

def forward(self, batch: Dict, supervision=False) -> Dict:
1 change: 1 addition & 0 deletions edspdf/__init__.py
@@ -3,6 +3,7 @@
from .pipeline import Pipeline, load
from .registry import registry
from .structures import Box, Page, PDFDoc, Text, TextBox, TextProperties
from . import data

from . import utils # isort:skip
