Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add index on timed_beliefs for faster search #167

Merged
25 changes: 12 additions & 13 deletions timely_beliefs/beliefs/classes.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,9 +30,9 @@
)
from sqlalchemy.ext.declarative import declared_attr
from sqlalchemy.ext.hybrid import hybrid_method, hybrid_property
from sqlalchemy.orm import Session, backref, has_inherited_table, relationship
from sqlalchemy.orm import Session, backref, declarative_mixin, relationship
from sqlalchemy.orm.util import AliasedClass
from sqlalchemy.schema import UniqueConstraint
from sqlalchemy.schema import Index
from sqlalchemy.sql.elements import BinaryExpression
from sqlalchemy.sql.expression import Selectable

Expand Down Expand Up @@ -174,6 +174,7 @@ def source_id(self):
return None


@declarative_mixin
class TimedBeliefDBMixin(TimedBelief):
"""
Mixin class for a table with beliefs.
Expand All @@ -182,17 +183,15 @@ class TimedBeliefDBMixin(TimedBelief):

@declared_attr
def __table_args__(cls):
if has_inherited_table(cls):
return (
UniqueConstraint(
"event_start",
"belief_horizon",
"sensor_id",
"source_id",
name="_one_belief_by_one_source_uc",
),
)
return None
return (
Index(
f"{cls.__tablename__}_quad_search_idx",
"event_start",
"source_id",
"sensor_id",
"belief_horizon",
Copy link
Collaborator

@Flix6x Flix6x Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understood it, the order matters here, and it should lead to some kind of a funnel from most unique values to fewest unique values? If that is the case, I'd expect the following ordering from most unique to fewest unique values in a typical database:

  • event start: data covers a large period and slowly but steadily grows over time
  • sensor: grows with the size of the system being serviced, but let's say with less than one (hourly) sensor per hour
  • belief horizon: not many unique values with respect to the previous two, and likely a quite constant number
  • source: may grow with the number of API users, but once you have the sensor, there are usually only a couple of sources

Just my two cents on the matter.

That said, I don't really understand why/how the order would matter and why the database doesn't take care of figuring out the best order of such things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I read somewhere that range columns (like event_start) are a good place to lead with, as well.

To be honest, I will simply add the query that Mike suggested in the end, but not because I see a performance difference. After #166, the performance for sensors in our dataset with little data is so fast, I suspect indexes are not visible. And for the sensor with > 50% of the data, Postgres seems to ignore the index by default.

Not listing belief_horizon in the index (but including it as column) makes sense, as we are using min() on it.

This PR is an improvement over the status quo, but we might revisit indexing when we have much more data.

),
)

event_start = Column(DateTime(timezone=True), primary_key=True, index=True)
belief_horizon = Column(Interval(), nullable=False, primary_key=True)
Expand Down
Loading