semaphore for async indexing and sync index in transaction after publish #10388
Conversation
Just a quick doc review.
Looks good. I guess the proof is in the QA to see how well this handles the issue.
thought: Do we need a counter, exposed as a metric, to see if there are threads piling up waiting for their turn? It's hard to get a feeling for the right number here. Ideally you want to see in your monitoring system a correlation between the setting, the number of threads waiting, and the loadavg on Solr... MicroProfile Metrics to the rescue here, if we want such a metric done easily.
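For illustration, a minimal sketch of such a metric with MicroProfile Metrics (the class, field, and metric names here are hypothetical, not code from this PR):

import java.util.concurrent.Semaphore;
import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.metrics.annotation.Gauge;

@ApplicationScoped
public class IndexSemaphoreMetrics {

    // assumed: the same Semaphore instance that gates async indexing
    private final Semaphore indexPermits = new Semaphore(4, true);

    // exposed under the /metrics endpoint, e.g. as application_index_queue_length
    @Gauge(name = "index_queue_length", unit = "none",
           description = "Threads waiting for an indexing permit")
    public int getQueueLength() {
        // an estimate of the number of threads blocked in acquire()
        return indexPermits.getQueueLength();
    }
}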
Sorry, what is the status of this PR? I see that it's been approved, but it is still sitting in the "Ready for Review" column, and it's showing the red "Merging is blocked" and "Changes Requested" banners. The latter appears to point to the change @poikilotherm requested the other day, which appears to have been addressed... Oliver, is it just waiting for you to approve the change? (Or do you actually want more to be done, in response to the question about threads?)
Thanks all, into QA it goes.
    ctxt.index().indexDataset(readyDataset, true);
} catch (SolrServerException | IOException e) {
    throw new CommandException("Indexing failed: " + e.getMessage(), this);
}
I am trying to understand what's going on here. I understand why we may want to change indexing from asynchronous to synchronous. But why are we moving the indexing from onSuccess() and back into the execute() method?
Is this just to address the original issue with timestamps (#10381)?
Are we positive we want to do this? - If I understand correctly, this would bring us back to a failure to index resulting in a failure to publish. Which I thought we specifically wanted to avoid.
I may easily be missing something of course.
@landreev @scolapasta I do not fully understand the issue myself. Moving the indexing back to execute() makes the indexing happen earlier, not later. Making it synchronous instead of async also makes it happen earlier, not later. If the issue is that the index did happen, but its timestamp is before the publish moment, this PR would make the problem worse, not better. I just do not see yet why the publish date would be greater than the index moment, unless the index was interrupted (e.g., by a circuit breaker), or the wrong version of the dataset was indexed (e.g., the draft and not the published version). I will go through the code again to see if I am missing something, and maybe the wrong version gets indexed, etc. As it turns out, we just migrated to version 6.1 (from 5.14) and a recently published dataset has the same problem. I think that the root cause is still not addressed. I will keep digging.
@landreev @scolapasta I did not see any errors on Solr, no failure log in Payara, etc. Also, the index looks fine, with the right version etc. I assume that indexing went fine after publish, only the indextime was not updated in the DB. I reverted to the async update in finalize-after-publish, and added em.flush() after the indextime update, which should fix the problem.
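A minimal sketch of that idea (entity and method names assumed, not the PR's literal diff):

import java.sql.Timestamp;
import java.util.Date;
import jakarta.persistence.EntityManager;

// record the index time and push the UPDATE out immediately
void touchIndexTime(EntityManager em, DvObject dvObject) {
    dvObject.setIndexTime(new Timestamp(new Date().getTime()));
    em.merge(dvObject);
    // flush() writes the pending UPDATE within the current transaction
    // instead of waiting for commit, so a later merge() of a stale
    // entity copy is less likely to overwrite the new value
    em.flush();
}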
I'm not sure if em.flush() changes anything; from what I understand the change should have been written anyway. I think I am still missing something. In the example on our server, the publishing took 20+ hours due to the checksum checks on 20+ TiB. I got the e-mail that the dataset was published the next day after publishing started. From the code, the index should happen directly after sending the notifications. I would assume that it either succeeded but the indextime did not update, or it failed, but then I should see it in the process-failures log, which is empty.
Now I think that the index was done and the correct indextime was written, but then it got overwritten by the merge in:
try {
    ExportService instance = ExportService.getInstance();
    instance.exportAllFormats(dataset);
    dataset = ctxt.datasets().merge(dataset);
}
I moved the indexing to the bottom; I am more confident now that it is fixed.
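Roughly, the reordering looks like this (a simplified sketch, not the PR's literal diff; the async indexing call, catch block, and logger are assumed):

try {
    ExportService instance = ExportService.getInstance();
    instance.exportAllFormats(dataset);
    // this merge used to run after indexing and could overwrite
    // the freshly written indextime with a stale value
    dataset = ctxt.datasets().merge(dataset);
} catch (ExportException ex) {
    logger.warning("Export failed: " + ex.getMessage());
}
// index last, after the final merge of the dataset entity
ctxt.index().asyncIndexDataset(dataset, true);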
It looks to me that it is safe to merge the current code of this PR and leave it like this until we see how it behaves in production. I saved the link to this issue as it contains some interesting discussion, especially on running batches and metrics. We can revisit these ideas at some point. As for adding the sleep calls, I am against that, since it would make the queue holding indexing jobs longer than necessary. I like the idea of adding the index logic to the timer for fixing exports, but I agree it is out of scope for this PR. We will pick it up later, if needed.
I just asked a question in dv-tech Slack about simplifying the whole scheme somewhat radically (but probably not in the context of this PR either).
I'm generally in favor of merging this PR as is (I just want to run a couple more experiments with multiple indexing attempts in parallel).
I would however consider including one small extra change: adding a few extra hours to the index timestamp in the test in the isIndexedVersion() method in DatasetPage.java:
return (workingVersion.getDataset().getIndexTime() != null) &&
        workingVersion.getDataset().getIndexTime().after(workingVersion.getReleaseTime());
i.e., something like (dataset.getIndexTime().getTime() + (a few hours' worth of milliseconds)) > version.getReleaseTime().getTime() instead?
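A sketch of what that could look like (the constant name and the exact size of the window are illustrative):

private static final long INDEX_TIME_GRACE_MS = 3L * 60 * 60 * 1000; // e.g. 3 hours

boolean isIndexedVersion(DatasetVersion workingVersion) {
    Timestamp indexTime = workingVersion.getDataset().getIndexTime();
    // tolerate an index time slightly before the release time
    return indexTime != null
            && indexTime.getTime() + INDEX_TIME_GRACE_MS
               > workingVersion.getReleaseTime().getTime();
}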
I have looked at the differences, but nothing struck me as suspicious.
In solrconfig.xml some request handlers are disabled, but I do not think these were used anyway. Those are the /browse, /spell, /update/extract, /tvrh and /elevate endpoints.
In schema.xml I did find it strange that the schema version in 6.1 is lower than that of 5.14, but I do not know what the impact of that version number is; it could be just informative. The other differences are replacing the now-obsolete solr.TrieXXXField with solr.XXXPointField, as one would expect for Solr 9. There are also some changes to the dynamic field definitions. AFAIK Dataverse does not use dynamic field types in Solr.
Could it be the move from Solr 8 to 9 itself that changed something dramatically?
From a brief look, I'm not seeing any information online suggesting 9 is dramatically slower than 8 in general... But it could be something Dataverse specifically relies on when indexing that behaves differently.
I moved the indexed time in the isIndexedVersion test by 3 hours as suggested. This should fix the problem for already published datasets with indexedtime before releasetime.
- Measure wait times for a permit: how many indexing requests are currently stuck because of high demand (so you can tune the number of available parallel indexing threads)
- Measure how long the actual indexing task runs, which might be an indication of lots of very large datasets or not enough resources for your index
May I suggest taking a look at ErykKul#10 (a PR to this PR), which adds timers? It's a very simple change but will provide a better way to understand whether there actually are any load issues. These might be used in production as well as for benchmarking, as it is a common "standard" for metrics exposure.
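For reference, a minimal sketch of what such timers can look like with MicroProfile Metrics, covering both measurements from the list above (names assumed, not the literal code from ErykKul#10):

import java.util.concurrent.Semaphore;
import jakarta.inject.Inject;
import org.eclipse.microprofile.metrics.MetricRegistry;
import org.eclipse.microprofile.metrics.Timer;

public class TimedIndexing {

    @Inject
    MetricRegistry metricRegistry;

    private final Semaphore indexPermits = new Semaphore(4, true);

    public void indexWithMetrics(Runnable indexTask) throws InterruptedException {
        Timer permitWait = metricRegistry.timer("index_permit_wait_time");
        Timer indexDuration = metricRegistry.timer("index_time");

        // 1) time spent waiting for a permit: a proxy for congestion
        Timer.Context waiting = permitWait.time();
        try {
            indexPermits.acquire();
        } finally {
            waiting.stop();
        }
        try {
            // 2) time spent on the actual indexing work
            Timer.Context indexing = indexDuration.time();
            try {
                indexTask.run();
            } finally {
                indexing.stop();
            }
        } finally {
            indexPermits.release();
        }
    }
}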
…fter index time update
Add timing metrics to measure indexing load
@poikilotherm Thanks! I have just merged your code, it looks great! I used Prometheus with Grafana in the past; it would be cool to have something like that. It would help with finding other issues as well.
Splendid! Thanks for merging! Two more questions:
…l/dataverse into 10381_index_after_publish
@poikilotherm a histogram would be more useful than raw data. Also, documentation would be great! Do we have any on how to access metrics? If it is a new thing, it might be out of scope for this PR. I think it deserves a new section in the documentation specifically on metrics. We should also add more metrics in other places.
There is
I started testing the effects of the semaphores on our test instance, running off a clone of the prod. database, with somewhat inconclusive results just yet. But I'm only doing it in parallel with having been press-ganged into working on something else prod.- and admin-related here.
I added a mention in the
The guide entry for the metrics is perfect.
What this PR does / why we need it:
Hopefully this fixes the indexing issues after publishing of datasets and lowers the risk of Solr getting overwhelmed.
Which issue(s) this PR closes:
Closes #10381
Special notes for your reviewer:
I do not have much experience with Java semaphores yet, but I have used them extensively in other languages. We need to be sure it works as intended before merging. I tried to balance this solution so as not to move the problem from relying on Solr to cope with the load to having too many threads sleeping and waiting on the semaphore. The new parameter for the maximum number of running async indexes should then be tuned for Harvard Dataverse; I think (guesstimate) 4 is OK for most instances, but Harvard might want a bigger number.
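For context, the gating idea in a minimal sketch (the class is illustrative; the permit count would come from the new property, with a default of 4 assumed here):

import java.util.concurrent.Semaphore;

public class IndexGate {

    private final Semaphore permits;

    public IndexGate(int maxParallelIndexes) {
        // 'fair' mode serves waiting threads in arrival order
        this.permits = new Semaphore(maxParallelIndexes, true);
    }

    public void runIndexJob(Runnable indexJob) throws InterruptedException {
        permits.acquire(); // block here instead of piling load onto Solr
        try {
            indexJob.run();
        } finally {
            permits.release(); // always free the permit, even on failure
        }
    }
}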
Suggestions on how to test this:
Ideally, under realistic load in an environment similar to production: publish and edit many datasets concurrently. Also, edit some collections/dataverses, as that appears to be the worst-case scenario: after editing a large collection, all of its datasets get reindexed asynchronously, causing a large load on the server and Solr. This might be the root cause of the problems on Harvard Dataverse.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.
Is there a release notes update needed for this change?:
Yes, it includes a new property.
Additional documentation:
Needs testing. I hope it will pass the unit and integration tests without problems. I will also try testing it on my local installation. Any help with testing it in a realistic setting is welcome.