# Inference

Once you have obtained a pipeline, either by composing rule-based components, training a model or loading a model from the disk, you can use it to make predictions on documents. This is referred to as inference. This page answers the following questions:

> How do we leverage computational resources to run a model on many documents?
>
> How do we connect to various data sources to retrieve documents?

## Inference on a single document

In EDS-PDF, computing the prediction on a single document is done by calling the pipeline on the document. The input can be either:

- a bytes string
- or a [PDFDoc][edspdf.structures.PDFDoc] object

```{ .python .no-check }
model = ...  # a loaded or freshly composed EDS-PDF pipeline
pdf_bytes = b"..."
doc = model(pdf_bytes)
```

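In practice, the raw bytes often come from a file on disk. The snippet below is a small self-contained sketch of that pattern; the file path and content are stand-ins created just for the example, and `model` is the pipeline from above.

```python
from pathlib import Path
import tempfile

# Stand-in PDF file; in practice, point this at a real document on disk.
pdf_path = Path(tempfile.mkdtemp()) / "doc.pdf"
pdf_path.write_bytes(b"%PDF-1.4 stand-in content")

pdf_bytes = pdf_path.read_bytes()
print(pdf_bytes[:5])  # b'%PDF-'
# doc = model(pdf_bytes)  # `model` being the pipeline loaded earlier
```
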
If you're lucky enough to have a GPU, you can use it to speed up inference by moving the model to the GPU before calling the pipeline.

```{ .python .no-check }
model.to("cuda")  # same semantics as PyTorch
doc = model(pdf_bytes)
```

To leverage multiple GPUs when processing multiple documents, refer to the [multiprocessing backend][edspdf.processing.multiprocessing.execute_multiprocessing_backend] description below.

## Inference on multiple documents {: #edspdf.lazy_collection.LazyCollection }

When processing multiple documents, we can optimize the inference by parallelizing the computation on a single core, multiple cores and GPUs, or even multiple machines.

### Lazy collection

These optimizations are enabled by performing *lazy inference*: the operations (e.g., reading a document, converting it to a PDFDoc, running the different pipes of a model or writing the result somewhere) are not executed immediately but are instead scheduled in a [LazyCollection][edspdf.lazy_collection.LazyCollection] object. It can then be executed by calling the `execute` method, iterating over it, or calling a writing method (e.g., `to_pandas`). In fact, data connectors like `edspdf.data.read_files` return a lazy collection, as does the `model.pipe` method.

A lazy collection contains:

- a `reader`: the source of the data (e.g., a file, a database, a list of strings, etc.)
- a `pipeline` attribute listing the operations to perform, with the name (if any), function / pipe, keyword arguments and context of each operation
- an optional `writer`: the destination of the data (e.g., a file, a database, a list of strings, etc.)
- the execution `config`, containing the backend to use and its configuration, such as the number of workers, the batch size, etc.

All methods (`.map`, `.map_pipeline`, `.set_processing`) of the lazy collection are chainable, meaning that they return a new object (no in-place modification).

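To make the scheduling and chaining behaviour concrete, here is a toy, self-contained sketch of the idea. This is *not* edspdf's actual `LazyCollection` implementation; the class name, attributes and logic are simplified stand-ins for illustration only.

```python
import dataclasses
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class TinyLazyCollection:
    """Toy stand-in for a lazy collection (illustration only)."""

    reader: list                # source of the data
    pipeline: tuple = ()        # scheduled (function, kwargs) operations
    config: dict = field(default_factory=dict)

    def map(self, fn: Callable, kwargs: Optional[dict] = None) -> "TinyLazyCollection":
        # Chainable: returns a NEW object, no in-place modification
        return dataclasses.replace(self, pipeline=self.pipeline + ((fn, kwargs or {}),))

    def set_processing(self, **config: Any) -> "TinyLazyCollection":
        return dataclasses.replace(self, config={**self.config, **config})

    def execute(self):
        # Only now are the scheduled operations actually applied
        for item in self.reader:
            for fn, kwargs in self.pipeline:
                item = fn(item, **kwargs)
            yield item


data = TinyLazyCollection(reader=[1, 2, 3])
lazy = data.map(lambda x, offset: x + offset, {"offset": 10}).set_processing(num_cpu_workers=4)
print(list(lazy.execute()))  # [11, 12, 13]
print(data.pipeline)         # () -- the original collection is unchanged
```

Nothing happens until `execute` is called: `map` and `set_processing` only record what should be done, which is what lets a real backend decide later how to distribute the work.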
For instance, the following code will load a model, read a folder of JSON files, apply the model to each document and write the result in a Parquet folder, using 4 CPUs and 2 GPUs.

```{ .python .no-check }
import edspdf

# Load or create a model
model = edspdf.load("path/to/model")

# Read some data (this is lazy, no data will be read until the end of this snippet)
data = edspdf.data.read_files("path/to/pdf/files", converter="...")

# Apply each pipe of the model to our documents
data = data.map_pipeline(model)
# or equivalently: data = model.pipe(data)

# Configure the execution
data = data.set_processing(
    # 4 CPUs to parallelize rule-based pipes, IO and preprocessing
    num_cpu_workers=4,
    # 2 GPUs to accelerate deep-learning pipes
    num_gpu_workers=2,
)

# Write the result; this will execute the lazy collection
data.write_parquet("path/to/output_folder", converter="...", write_in_worker=True)
```

### Applying operations to a lazy collection

To apply an operation to a lazy collection, you can use the `.map` method. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each element of the collection.

To apply a model, you can use the `.map_pipeline` method. It takes a model as input and will add every pipe of the model to the scheduled operations.

In both cases, the operations will not be executed immediately but will be scheduled to be executed when iterating over the collection, or when calling the `.execute`, `.to_*` or `.write_*` methods.

### Execution of a lazy collection {: #edspdf.lazy_collection.LazyCollection.set_processing }

You can configure how the operations performed in the lazy collection are executed by calling its `set_processing(...)` method. The following options are available:

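One of these options is the batch size, which controls how many documents are grouped together so that the deep-learning pipes can batch their matrix operations. The grouping itself can be sketched in plain Python; `batches` is a hypothetical helper written for this illustration, not part of the edspdf API.

```python
from itertools import islice


def batches(items, batch_size):
    """Group an iterable into lists of at most `batch_size` elements."""
    it = iter(items)
    while chunk := list(islice(it, batch_size)):
        yield chunk


# 7 documents, batches of 3: the last batch is simply smaller
print(list(batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```
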
::: edspdf.lazy_collection.LazyCollection.set_processing
    options:
        heading_level: 3
        only_parameters: "no-header"

## Backends

### Simple backend {: #edspdf.processing.simple.execute_simple_backend }

::: edspdf.processing.simple.execute_simple_backend
    options:
        heading_level: 3
        show_source: false

### Multiprocessing backend {: #edspdf.processing.multiprocessing.execute_multiprocessing_backend }

::: edspdf.processing.multiprocessing.execute_multiprocessing_backend
    options:
        heading_level: 3
        show_source: false