Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to customize the ranking rules #128

Merged
merged 34 commits into from
Dec 9, 2024

Conversation

daun
Copy link
Contributor

@daun daun commented Dec 3, 2024

Tried my hand at implementing a withRankingRules config that mirrors Meilisearch ranking rules customization.

The default is: ['words', 'proximity', 'attribute'], in that order. Same as before. The weight of each rule is implicit and derived from a factor. I've had very similar results to the previous setup with 1.00.70.49 → etc. So that's what I went with for now.

Switched to SQLite native json functions in a few places for passing data to the custom relevance function, it's a bit more ergonomic. About as fast as the current implementation, even with json. Reverted back to string concatenation to keep things simple.

Usage

To ignore the proximity rule and prefer attribute weighing:

$configuration = Configuration::create()
    ->withSearchableAttributes(['title', 'content'])
    ->withRankingRules([
        'attribute',
        'words',
    ]);

Benchmarks

Queries are about 0.5% faster for simple queries, and about 0.75% slower for more complex queries. I could live with that cost for the added benefit of being able to control relevance ranking.

query develop branch this pull request
amakin dkywalker 164.1 ms ± 1.8 ms 165.2 ms ± 2.0 ms
famly dadd 676.4 ms ± 41.6 ms 679.9 ms ± 8.7 ms
once twice thrice 261.9 ms ± 2.2 ms 264.3 ms ± 8.1 ms
lighting 195.1 ms ± 5.0 ms 194.1 ms ± 1.8 ms

Closes #125

daun added 20 commits December 3, 2024 16:59
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
This reverts commit 1f28737.

Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Copy link
Contributor

@Toflar Toflar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the general approach of extracting the rankers into proper classes and having separate tests, so I'm 👍 on those! I'm not quite sure about the json_decode() part. I don't think it can be as fast as exploding because there's a lot more going on when decoding JSON so I'm eager to learn about your performance results.

@daun
Copy link
Contributor Author

daun commented Dec 4, 2024

I like the general approach of extracting the rankers into proper classes and having separate tests, so I'm 👍 on those! I'm not quite sure about the json_decode() part. I don't think it can be as fast as exploding because there's a lot more going on when decoding JSON so I'm eager to learn about your performance results.

Right, my intuition says sqlite's json functions are quite optimized, but as always, the only way to know is by properly benchmarking. Maybe I was just hoping to make this as simple as possible :)

@Toflar
Copy link
Contributor

Toflar commented Dec 4, 2024

Yeah, they sure are but you are using PHP's json_decode() which by nature cannot be as quick as explode() :)

@daun
Copy link
Contributor Author

daun commented Dec 4, 2024

Just benchmarked this branch against develop, it's as fast or faster. So I'd say we shouldn't worry about the json part. I'll add the benchmark results to the pr in a minute, just need to properly format them somehow.

I did find some other thing — apparently sqlite will call the loupe_relevance function twice, once for creating the value itself and once for sorting. I guess this could be optimized away by marking it as deterministic. Not sure how that works in detail, though...

@daun
Copy link
Contributor Author

daun commented Dec 4, 2024

@Toflar So these are the benchmarks. With 50 runs per branch, and outliers removed. Posted them above as well. This is with the full database of 32K movies. About as close as it gets, I would say. Maybe a 1% decrease in performance?

query develop branch this pull request
amakin dkywalker 164.1 ms ± 1.8 ms 165.2 ms ± 2.0 ms
famly dadd 676.4 ms ± 41.6 ms 677.3 ms ± 11.3 ms
once twice thrice 260.7 ms ± 3.1 ms 265.2 ms ± 10.1 ms
lighting 198.0 ms ± 4.2 ms 196.8 ms ± 2.4 ms

@daun
Copy link
Contributor Author

daun commented Dec 4, 2024

(Just a side-note that I looked a bit into avoiding duplicate work in the loupe_relevance function by a) marking the sqlite custom function as deterministic, as well as offloading it into a materialized CTE. Both had no effect or made things worse.)

daun and others added 2 commits December 4, 2024 22:10
Co-authored-by: Yanick Witschi <yanick.witschi@terminal42.ch>
Co-authored-by: Yanick Witschi <yanick.witschi@terminal42.ch>
Co-authored-by: Yanick Witschi <yanick.witschi@terminal42.ch>
@Toflar
Copy link
Contributor

Toflar commented Dec 5, 2024

It's indeed called twice. But how can it have no effect to get rid of those duplicate calls? :D

@daun
Copy link
Contributor Author

daun commented Dec 5, 2024

It's indeed called twice. But how can it have no effect to get rid of those duplicate calls? :D

I think in the first case it's because sqlite makes no promises about deterministic functions and it's more of an optional thing for it to honor or not.

In the second case it's probably because adding another CTE avoids duplicate function calls but makes the query a lot more complex to parse or run? A lot of guessing here on my end 🤠

It'd be interesting to see how much time is spent in the relevance function in relation to total query time.

@Toflar
Copy link
Contributor

Toflar commented Dec 5, 2024

It'd be interesting to see how much time is spent in the relevance function in relation to total query time.

That depends a lot on the total matches. If you have a query that matches many documents, for all of them the relevance has to be calculated. If you only have a few, then you won't see any noticeable difference. So if you're checking the deterministic case, I am pretty sure it should help but you have to test with a query that matches many documents 🤔

daun added 10 commits December 6, 2024 21:52
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
Signed-off-by: Philipp Daun <post@philippdaun.net>
This reverts commit 0647317.

Signed-off-by: Philipp Daun <post@philippdaun.net>
@daun
Copy link
Contributor Author

daun commented Dec 7, 2024

I've reverted back to simple string concatenation and implode/explode for the function parameters to keep things simple for now. JSON might be just as fast, but it muddies the picture somewhat here.

I've optimized a few loops and switched to passing all the arrays by reference. Queries are now about 0.5% faster for simple queries (not sure why), and about 0.75% slower for more complex queries (probably the added static method calls). I think that's about as good as it gets without doing further optimization elsewhere, e.g. in PR #130.

Ready for review :)

@daun daun marked this pull request as ready for review December 7, 2024 09:54
@Toflar Toflar changed the title Customize ranking rules Allow to customize the ranking rules Dec 9, 2024
@Toflar Toflar merged commit ba31c05 into loupe-php:develop Dec 9, 2024
18 checks passed
@Toflar
Copy link
Contributor

Toflar commented Dec 9, 2024

Thank you, @daun!!

@daun daun deleted the feat/ranking-rules branch December 9, 2024 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants