Merge remote-tracking branch 'origin/master' into trim_and_stem_in_search

dhdaines committed Sep 9, 2024
2 parents d61e45d + d07b60f commit 7fddf16
Showing 4 changed files with 121 additions and 10 deletions.
5 changes: 3 additions & 2 deletions .github/workflows/test-suite.yml
@@ -12,7 +12,7 @@ jobs:
name: "Python ${{ matrix.python-version }}"
runs-on: "ubuntu-latest"
env:
-USING_COVERAGE: "3.10"
+USING_COVERAGE: "3.11"

strategy:
matrix:
@@ -46,6 +46,7 @@ jobs:

- name: "Upload coverage to Codecov"
if: "contains(env.USING_COVERAGE, matrix.python-version)"
-uses: "codecov/codecov-action@v3"
+uses: "codecov/codecov-action@v4.5.0"
with:
fail_ci_if_error: true
token: ${{ secrets.CODECOV_TOKEN }}
38 changes: 36 additions & 2 deletions docs/customisation.md
@@ -43,8 +43,11 @@ token list, and the token list itself.

## Skip a pipeline function for specific field names

-The `Pipeline.skip()` method allows you to skip a pipeline function for specific field names.
-This example skips the `stop_word_filter` pipeline function for the field `fullName`.
The `Pipeline.skip()` method allows you to skip a pipeline function
for specific field names. It takes as arguments the function itself
(not its name or the name under which it was registered) and a list
of field names for which it should be skipped. This example skips
the `stop_word_filter` pipeline function for the field `fullName`.

```python
from lunr import lunr, get_default_builder, stop_word_filter
@@ -58,6 +61,37 @@ builder.pipeline.skip(stop_word_filter.stop_word_filter, ["fullName"])
idx = lunr(ref="id", fields=("fullName", "body"), documents=documents, builder=builder)
```

Importantly, if you are using language support, the above code will
not work: there is a separate builder for each language, and its
pipeline functions are generated dynamically, so they cannot be
imported. Instead, you can access them by name. For instance, to
skip the stop word filter and stemmer for French for the field
`titre`, you could do this:

```python
from lunr import lunr, get_default_builder

documents = [...]

builder = get_default_builder("fr")

for funcname in "stopWordFilter-fr", "stemmer-fr":
    builder.pipeline.skip(
        builder.pipeline.registered_functions[funcname], ["titre"]
    )

idx = lunr(ref="id", fields=("titre", "texte"), documents=documents, builder=builder)
```

The current language support registers the functions
`lunr-multi-trimmer-{lang}`, `stopWordFilter-{lang}` and
`stemmer-{lang}`, but these names are a convention only. You can
access the full registry through the `registered_functions` attribute
of the pipeline. Note that this is the set of all registered
functions, not necessarily the list of actual pipeline steps, which
is stored in a private field (though you can see the steps in the
string representation of the pipeline).


## Token meta-data

Lunr.py `Token` instances include meta-data information which can be used in
76 changes: 76 additions & 0 deletions docs/languages.md
@@ -72,6 +72,82 @@ If you have documents in multiple languages, pass a list of language codes:
[{'ref': 'c', 'score': 1.106, 'match_data': <MatchData "english">}]
```

## Folding to ASCII

It is often useful to allow for transliterated or unaccented
characters when indexing and searching. This is not implemented in
the language support but can be done by adding a pipeline stage which
"folds" the tokens to ASCII. There are
[various](https://pypi.org/project/text-unidecode/)
[libraries](https://pypi.org/project/Unidecode/) to do this in Python
as well as in [JavaScript](https://www.npmjs.com/package/unidecode).
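For instance, with `text-unidecode` (the other libraries behave similarly for accented Latin characters):

```python
from text_unidecode import unidecode

# Accented characters are replaced by their closest ASCII equivalents
print(unidecode("Règlement de démolition"))  # -> Reglement de demolition
```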

On the Python side, for example, to fold accents in French text using
`text-unidecode` or `unidecode` (depending on your licensing
preferences):

```python
import json
from lunr import lunr, get_default_builder
from lunr.pipeline import Pipeline
from text_unidecode import unidecode

def unifold(token, _idx=None, _tokens=None):
    def wrap_unidecode(text, _metadata):
        return unidecode(text)
    return token.update(wrap_unidecode)

Pipeline.register_function(unifold, "unifold")
builder = get_default_builder("fr")
builder.pipeline.add(unifold)
builder.search_pipeline.add(unifold)
index = lunr(
    ref="id",
    fields=["titre", "texte"],
    documents=[
        {"id": "1314-2023-DEM", "titre": "Règlement de démolition", "texte": "Texte"}
    ],
    languages="fr",
    builder=builder,
)
print(index.search("reglement de demolition"))
# [{'ref': '1314-2023-DEM', 'score': 0.4072935059634513, 'match_data': <MatchData "demolit,regl">}]
print(index.search("règlement de démolition"))
# [{'ref': '1314-2023-DEM', 'score': 0.4072935059634513, 'match_data': <MatchData "demolit,regl">}]
with open("index.json", "wt") as outfh:
    json.dump(index.serialize(), outfh)
```

Note that it is important to do folding on both the indexing and
search pipelines to ensure that users who have the right keyboard and
can remember which accents go where will still get the expected
results.

On the JavaScript side [the
API](https://lunrjs.com/docs/lunr.Pipeline.html) is of course quite
similar:

```js
const lunr = require("lunr");
const fs = require("fs");
const unidecode = require("unidecode");
require("lunr-languages/lunr.stemmer.support.js")(lunr);
require("lunr-languages/lunr.fr.js")(lunr);

lunr.Pipeline.registerFunction(token => token.update(unidecode), "unifold");
const index = lunr.Index.load(JSON.parse(fs.readFileSync("index.json", "utf8")));
console.log(JSON.stringify(index.search("reglement de demolition")));
// [{"ref":"1314-2023-DEM","score":0.4072935059634513,"matchData":{"metadata":{"regl":{"titre":{}},"demolit":{"titre":{}}}}}]
console.log(JSON.stringify(index.search("règlement de démolition")));
// [{"ref":"1314-2023-DEM","score":0.4072935059634513,"matchData":{"metadata":{"regl":{"titre":{}},"demolit":{"titre":{}}}}}]
```

There is also
[lunr-folding](https://www.npmjs.com/package/lunr-folding) for
JavaScript, but its folding differs from `unidecode`'s and may not be
fully compatible with the language support, so the method above is
recommended.

## Notes on language support

- Using multiple languages means the terms will be stemmed once per language. This can yield unexpected results.
12 changes: 6 additions & 6 deletions tox.ini
@@ -10,24 +10,24 @@ commands =
pytest -m "acceptance"

[testenv:black]
-basepython = python3.10
+basepython = python3.11
deps=
black
commands={envbindir}/black --check lunr tests

[testenv:flake8]
-basepython = python3.10
+basepython = python3.11
deps=
flake8
commands={envbindir}/flake8 lunr tests

[testenv:docs]
-basepython = python3.10
+basepython = python3.11
extras = docs
commands={envbindir}/sphinx-build docs docs/_build/html

[testenv:mypy]
-basepython = python3.10
+basepython = python3.11
deps = mypy
commands={envbindir}/mypy lunr

@@ -45,6 +45,6 @@ python =
3.7: py37
3.8: py38
3.9: py39
-3.10: py310,flake8,black,docs,mypy
-3.11: py311
+3.10: py310
+3.11: py311,flake8,black,docs,mypy
pypy3: pypy3
