Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-indexing terms on update (ex: slug) #750

Closed
kallehauge opened this issue Mar 2, 2017 · 10 comments
Closed

Re-indexing terms on update (ex: slug) #750

kallehauge opened this issue Mar 2, 2017 · 10 comments

Comments

@kallehauge
Copy link
Contributor

kallehauge commented Mar 2, 2017

Re-indexing terms on update (ex: slug)

Case

Hello ppl! A client changed a slug for a public term today and went to the new URL and no results showed up. What happens, like you probably know, is that the sync-manager only looks for changes on posts and not terms.

/**
 * Setup actions and filters
 *
 * @since 0.1.2
 */
public function setup() {
	add_action( 'wp_insert_post', array( $this, 'action_sync_on_update' ), 999, 3 );
	add_action( 'add_attachment', array( $this, 'action_sync_on_update' ), 999, 3 );
	add_action( 'edit_attachment', array( $this, 'action_sync_on_update' ), 999, 3 );
	add_action( 'delete_post', array( $this, 'action_delete_post' ) );
	add_action( 'delete_blog', array( $this, 'action_delete_blog_from_index') );
	add_action( 'make_spam_blog', array( $this, 'action_delete_blog_from_index') );
	add_action( 'archive_blog', array( $this, 'action_delete_blog_from_index') );
	add_action( 'deactivate_blog', array( $this, 'action_delete_blog_from_index') );
	add_action( 'updated_postmeta', array( $this, 'action_queue_meta_sync' ), 10, 4 );
	add_action( 'added_post_meta', array( $this, 'action_queue_meta_sync' ), 10, 4 );
	add_action( 'shutdown', array( $this, 'action_index_sync_queue' ) );
}

Why is this a problem?

A lot of 3. party plugins use slug's for tax_query's and this would make it fail because Elasticsearch still holds the old slug.
Although it would be possible to hook into pre_get_posts and alter the tax_query to fetch with term_id's. I feel like this solution would have a lot of edge-cases and it would make ElasticPress less "out of the box" ready – The worst case scenario is that archives/lists completely fail due to this.

I've looked in the the API and it's only possible to re-index entire posts which is extremely slow/heavy since it has to get and prepare all posts before and, in this example, there might be several thousand products for a term.

The question

Is this a scenario you have discussed before? And/or is it something you wish to support? If yes, then I already have some ideas of how this could be done and I would happily supply some of my time to cooperate on a PR.

In the meantime: have any of you done a workaround for this, previously? If yes, then I would be very interested in hearing how you handled it 😄

/ Best regards

@tlovett1
Copy link
Member

tlovett1 commented Mar 6, 2017

Definitely something we know about but have not fixed. I would love to see a PR for this :)

@mustafauysal
Copy link
Contributor

In the meantime: have any of you done a workaround for this, previously? If yes, then I would be very interested in hearing how you handled it 😄

not same issue but we handled some edge cases like this before by extending WP-CLI commands.

@kallehauge
Copy link
Contributor Author

kallehauge commented Mar 9, 2017

@tlovett1 have you discussed switching to a nested queries style? Reading their docs about joining queries, nested queries should be designed to scale better than "normal" relations and the way I see it, then I can only agree.

We would get another index called term like we have post right now but only reference the terms, from posts, by ID. So when we update a term, then it only happens one place and we can start instant-syncing terms like we're currently doing with posts.

Both ES2 and ES5 have nested query support:

@tlovett1
Copy link
Member

@kallehauge we have not. However, that would be a huge change.

@kallehauge
Copy link
Contributor Author

kallehauge commented Mar 16, 2017

I completely agree @tlovett1 - but the way I see it then there are 2 options:

  1. Relationships (okay, we might not have to make another index called term - it seems like nested objects are shadow objects. But it will still require a nested query).
  2. Do something like a _update_by_query call that includes a Groovy inline script that will loop through the hit's terms array for each post document and look for the term_id and replace the values if the condition is a success. The upside of this is that we can implement it as the plugin works now and it can all happen with one call to Elasticsearch. The downside is that we require 2 settings to be enabled on the Elasticsearch server ("update by query" and one that allows inline groovy scripts) and version 5.1+ (if I remember correctly. It might be 5.0 or 5.2) - this means that the "plug and play" style that exists now, disappears and require server setup.

Okay.. there is a 3. solution but this still doesn't scale very well:

  1. Listen to a terms hook and update all posts that include that term om edit. This might be done with something like a scroll API call to get the documents and then do the changes in a bulk update request with "partial updates" - I believe the index have to be locked while this performs which makes it impossible for anybody to update anything while the update happens (posts etc). And I guess this also requires the _update_by_query setting to be enabled.

What I've written here is something I looked up some time ago so take it as inspiration for areas to look into instead of actual facts. But, in my head, the ideal scenario would be something like relationships handled like this:

Term create: We cannot assume that all terms are created on the terms page. Ex: tags created directly in on posts. But I guess this isn't a real issue since they still go through the same term create hook. But we will have to create terms like we do with posts, here.
Term edit: Update independent term object referenced by all the posts so we only have to update it in one place despite having 100.000+ references.
Term delete: I guess this isn't a real issue since the term_id does not exist in the MySQL DB anymore so we will no longer get any requests for this ID and the posts will be cleanup up next time they're edited.

@kallehauge
Copy link
Contributor Author

@tlovett1 have you given this any additional thought? We have some clients that are still not THAT happy with the workflow of having to re-index their entire site when they alter terms, so I might be willing to slowly start building this feature in the near future if we can agree on a model ☺️

@ivankristianto
Copy link
Contributor

@kallehauge
I think this issue might be related to this #658
We have a plan to support Terms and Taxonomies to the query. But as @tlovett1 mention, this would be a huge change. We need to do an abstraction to the data model as currently it only support Post data.

@tlovett1
Copy link
Member

See #1443

@nickdaugherty
Copy link

Now that #1443 is merged, are there any plans to revisit this? Term deletions are also affected because posts remain indexed with now-deleted term and continue to match queries for it.

The scaling concerns raised in this thread are valid when many thousands of posts are attached to a single term - what if EP found all posts using the affected term, and if the count was over a threshold, it scheduled N async cron jobs to reindex those remaining posts (perhaps passing post ids to the cron job).

Alternatively, it could flag the posts as "needs reindex" in meta, or a better, a custom taxonomy, and have a recurring cron job that auto re-indexes those posts flagged as "dirty". Having this sort of system would be useful for more than just terms - any post that needed to be reindexed asynchronously could be flagged as such (such as during bulk operations), then cron would always be looking for them, achieving eventual consistency.

@felipeelia
Copy link
Member

Now that #2603 is merged we can go ahead and close this one. Thank you all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants