-
-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added DB tables for storing and interating with posts #313
Conversation
refs https://linear.app/ghost/issue/AP-714 We want to move away from storing raw ActivityPub objects in a KV store and instead start storing the data in the format which best suits our usecases. This means moving to a new schema where we're able to store our concept of a post and use joins to be able to look up information such as who has liked or reposted a post, as well as which posts belong in a users feed. This schema is the result of some long term performance testing as well as audits into our UI and the data it needs in order to render.
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, | ||
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, | ||
|
||
type TINYINT UNSIGNED NOT NULL CHECK (type IN (0,1,2,3,4)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using TINYINT
instead of BIT
- they would both take up a byte of space, and I've read that BIT
is difficult to work with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about using the CHECK
here? Maybe we leave this at the application level instead? 5 types was to have enough for growth
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeh I don't think BIT
would be suitable for this if we plan on having more than 2 types.
I'm not sure on the CHECK
, were probably going to end up doing this at the application level anyways right? But i suppose it does guarantee integrity. I guess it's just making sure its clear that allowing a new post type means updating the application level checks as well as the schema (which should be just dropping the constraint and re-applying?)
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, | ||
|
||
type TINYINT UNSIGNED NOT NULL CHECK (type IN (0,1,2,3,4)), | ||
audience TINYINT UNSIGNED NOT NULL CHECK (audience IN (0,1,2)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to store the audience of the post - currently thinking public, private & followers-only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would this affect the current feed queries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment it wouldn't at all - we'd query everything - but eventually we could filter based on audience
reading_time_minutes INT UNSIGNED DEFAULT 0 NOT NULL, | ||
|
||
ap_id VARCHAR(1024) NOT NULL, | ||
ap_id_hash BINARY(32) GENERATED ALWAYS AS (UNHEX(SHA2(ap_id, 256))) STORED UNIQUE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're using VARCHAR(1024)
to store all our URL's which is very big, but we choose that on purpose because we ran into long URL's already (> 512 IIRC)
We want to index the ap_id for lookups, but VARCHAR(1024) is a large column to index, the idea here is to store a hash and do the indexing on this instead.
The GENERATED ALWAYS...
stuff was suggested by ChatGPT - but we can do that at the application level instead if desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat idea 👍
title VARCHAR(256) NULL, | ||
excerpt VARCHAR(500) NULL, | ||
content TEXT NULL, | ||
url VARCHAR(1024) NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't put a UNIQUE
here to avoid 1. adding a large index and 2. are we 100% sure that URL's will be unique - ap_id
should be, but it might be possible to have different posts/post_types refer to same URL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we have 2 different posts/post_types referring to same URL? By definition if they are different they should have different URLs right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The url
property isn't an identifier (https://www.w3.org/TR/activitystreams-vocabulary/#dfn-url)
So I think it's possible that you could have both an Article
and Note
with the same url
representation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something here, but how can 1 URL resolve to 2 different things? (without additional context)
audience TINYINT UNSIGNED NOT NULL CHECK (audience IN (0,1,2)), | ||
|
||
author_id INT UNSIGNED NOT NULL, | ||
title VARCHAR(256) NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this long enough?
|
||
author_id INT UNSIGNED NOT NULL, | ||
title VARCHAR(256) NULL, | ||
excerpt VARCHAR(500) NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exceeds Ghost's custom_excerpt field so I think we're good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, i think Djordje mentioned on the frontend its expected not be greater than 400 chars max
id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, | ||
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, | ||
|
||
post_type TINYINT UNSIGNED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need for efficient queries when filtering feeds on post_type (feed vs inbox)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we enforce the same constraints as the posts
table for these columns?
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, | ||
|
||
post_type TINYINT UNSIGNED, | ||
audience TINYINT UNSIGNED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar idea for above - we might want to be able to filter feeds on whether or not they're private/public
closes https://linear.app/ghost/issue/AP-714
We want to move away from storing raw ActivityPub objects in a KV store and instead start storing the data in the format which best suits our usecases.
This means moving to a new schema where we're able to store our concept of a post and use joins to be able to look up information such as who has liked or reposted a post, as well as which posts belong in a users feed.
This schema is the result of some long term performance testing as well as audits into our UI and the data it needs in order to render.