semaphore for async indexing and sync index in transaction after publish #10388
Conversation
Just a quick doc review.
Looks good. I guess the proof is in the QA to see how well this handles the issue.
thought: Do we need a counter, exposed as a metric, to see if there are threads piling up waiting for their turn? It's hard to get a feeling for the right number here. Ideally you want to see in your monitoring system a correlation between the setting, the number of threads waiting, and the loadavg on Solr... MicroProfile Metrics to the rescue here, if we want such a metric done easily.
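For illustration, a minimal sketch of such a metric with MicroProfile Metrics (the class, field, and metric names here are hypothetical, not code from this PR):

import java.util.concurrent.Semaphore;
import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.metrics.annotation.Gauge;

@ApplicationScoped
public class IndexSemaphoreMetrics {

    // assumed: the same Semaphore instance that gates async indexing
    private final Semaphore indexPermits = new Semaphore(4, true);

    // exposed under the /metrics endpoint, e.g. as application_index_queue_length
    @Gauge(name = "index_queue_length", unit = "none",
           description = "Threads waiting for an indexing permit")
    public int getQueueLength() {
        // an estimate of the number of threads blocked in acquire()
        return indexPermits.getQueueLength();
    }
}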
Sorry, what is the status of this PR? I see that it's been approved, but it is still sitting in the "Ready for Review" column, and it's showing the red "Merging is blocked" and "Changes Requested" banners. The latter appears to point to the change @poikilotherm requested the other day, which appears to have been addressed... Oliver, is it just waiting for you to approve the change? (Or do you actually want more to be done, in response to the question about threads?)
Thanks all, into QA it goes.
    ctxt.index().indexDataset(readyDataset, true);
} catch (SolrServerException | IOException e) {
    throw new CommandException("Indexing failed: " + e.getMessage(), this);
}
I am trying to understand what's going on here. I understand why we may want to change indexing from asynchronous to synchronous. But why are we moving the indexing from onSuccess() and back into the execute() method?
Is this just to address the original issue with timestamps (#10381)?
Are we positive we want to do this? - If I understand correctly, this would bring us back to a failure to index resulting in a failure to publish. Which I thought we specifically wanted to avoid.
I may easily be missing something of course.
@landreev @scolapasta I do not fully understand the issue myself. Moving the indexing back to execute() makes the indexing happen earlier, not later. Making it synchronous instead of async also makes it happen earlier, not later. If the issue is that the index did happen, but its timestamp is before the publish moment, this PR would make the problem worse, not better. I just do not see yet why the publish date would be greater than the index moment, unless the index was interrupted (e.g., by a circuit breaker), or the wrong version of the dataset was indexed (e.g., the draft and not the published version). I will go through the code again to see if I am missing something, and maybe the wrong version gets indexed, etc. As it turns out, we just migrated to version 6.1 (from 5.14) and a recently published dataset has the same problem. I think that the root cause is still not addressed. I will keep digging.
@landreev @scolapasta I did not see any errors on Solr, no failure log in Payara, etc. Also, the index looks fine, with the right version etc. I assume that indexing went fine after publish, only the indextime was not updated in the DB. I reverted to the async update in finalize-after-publish, and added em.flush() after the indextime update, which should fix the problem.
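A minimal sketch of that idea (entity and method names assumed, not the PR's literal diff):

import java.sql.Timestamp;
import java.util.Date;
import jakarta.persistence.EntityManager;

// record the index time and push the UPDATE out immediately
void touchIndexTime(EntityManager em, DvObject dvObject) {
    dvObject.setIndexTime(new Timestamp(new Date().getTime()));
    em.merge(dvObject);
    // flush() writes the pending UPDATE within the current transaction
    // instead of waiting for commit, so a later merge() of a stale
    // entity copy is less likely to overwrite the new value
    em.flush();
}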
I'm not sure if em.flush() changes anything; from what I understand the change should have been written anyway. I think I am still missing something. In the example on our server, the publishing took 20+ hours due to the checksum checks on 20+ TiB. I got the e-mail that the dataset was published the next day after publishing started. From the code, the index should happen directly after sending the notifications. I would assume that it either succeeded but the indextime did not update, or it failed, but then I should see it in the process-failures log, which is empty.
Now I think that the index was done and the correct indextime was written, but then it got overwritten by the merge in:
try {
    ExportService instance = ExportService.getInstance();
    instance.exportAllFormats(dataset);
    dataset = ctxt.datasets().merge(dataset);
}
I moved the indexing to the bottom; I am more confident now that it is fixed.
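Roughly, the reordering looks like this (a simplified sketch, not the PR's literal diff; the async indexing call, catch block, and logger are assumed):

try {
    ExportService instance = ExportService.getInstance();
    instance.exportAllFormats(dataset);
    // this merge used to run after indexing and could overwrite
    // the freshly written indextime with a stale value
    dataset = ctxt.datasets().merge(dataset);
} catch (ExportException ex) {
    logger.warning("Export failed: " + ex.getMessage());
}
// index last, after the final merge of the dataset entity
ctxt.index().asyncIndexDataset(dataset, true);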
It looks to me that it is safe to merge the current code of this PR and leave it like this until we see how it behaves in production. I saved the link to this issue as it contains some interesting discussion, especially on running batches and metrics. We can revisit these ideas at some point. As for adding the sleep calls, I am against that, since it would make the queue holding indexing jobs longer than necessary. I like the idea of adding the index logic to the timer for fixing exports, but I agree it is out of scope for this PR. We will pick it up later, if needed.
I just asked a question in dv-tech Slack about simplifying the whole scheme somewhat radically (but probably not in the context of this PR either).
I'm generally in favor of merging this PR as is (I just want to run a couple more experiments with multiple indexing attempts in parallel).
I would however consider including one small extra change: adding a few extra hours to the index timestamp in the test in the isIndexedVersion() method in DatasetPage.java:
return (workingVersion.getDataset().getIndexTime() != null) &&
        workingVersion.getDataset().getIndexTime().after(workingVersion.getReleaseTime());
i.e., something like (dataset.getIndexTime().getTime() + (a few hours' worth of milliseconds)) > version.getReleaseTime().getTime() instead?
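A sketch of what that could look like (the constant name and the exact size of the window are illustrative):

private static final long INDEX_TIME_GRACE_MS = 3L * 60 * 60 * 1000; // e.g. 3 hours

boolean isIndexedVersion(DatasetVersion workingVersion) {
    Timestamp indexTime = workingVersion.getDataset().getIndexTime();
    // tolerate an index time slightly before the release time
    return indexTime != null
            && indexTime.getTime() + INDEX_TIME_GRACE_MS
               > workingVersion.getReleaseTime().getTime();
}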
I have looked at the differences, but nothing struck me as suspicious.
In solrconfig.xml some request handlers are disabled, but I do not think these were used anyway. Those are the /browse, /spell, /update/extract, /tvrh and /elevate endpoints.
In schema.xml I did find it strange that the schema version in 6.1 is lower than that of 5.14, but I do not know what the impact of that version number is; it could be just informative. The other differences are replacing the now-obsolete solr.TrieXXXField with solr.XXXPointField, as one would expect for Solr 9. There are also some changes to the dynamic field definitions. AFAIK Dataverse does not use dynamic field types in Solr.
Could it be the move from Solr 8 to 9 itself that changed something dramatically?
From a brief look, I'm not seeing any information online suggesting 9 is dramatically slower than 8 in general... But it could be something Dataverse specifically relies on when indexing that behaves differently.
I moved the indexed time in the isIndexedVersion test by 3 hours as suggested. This should fix the problem for already published datasets with indexedtime before releasetime.
- Measure wait times for a permit: how many indexing requests are currently stuck because of high demand (so you can tune the number of available parallel indexing threads)
- Measure how long the actual indexing task runs, which might be an indication of lots of very large datasets or not enough resources for your index
May I suggest taking a look at ErykKul#10 (a PR to this PR), which adds timers? It's a very simple change but will provide a better way to understand whether there actually are any load issues. These might be used in production as well as for benchmarking, as it is a common "standard" for metrics exposure.
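For reference, a minimal sketch of what such timers can look like with MicroProfile Metrics, covering both measurements from the list above (names assumed, not the literal code from ErykKul#10):

import java.util.concurrent.Semaphore;
import jakarta.inject.Inject;
import org.eclipse.microprofile.metrics.MetricRegistry;
import org.eclipse.microprofile.metrics.Timer;

public class TimedIndexing {

    @Inject
    MetricRegistry metricRegistry;

    private final Semaphore indexPermits = new Semaphore(4, true);

    public void indexWithMetrics(Runnable indexTask) throws InterruptedException {
        Timer permitWait = metricRegistry.timer("index_permit_wait_time");
        Timer indexDuration = metricRegistry.timer("index_time");

        // 1) time spent waiting for a permit: a proxy for congestion
        Timer.Context waiting = permitWait.time();
        try {
            indexPermits.acquire();
        } finally {
            waiting.stop();
        }
        try {
            // 2) time spent on the actual indexing work
            Timer.Context indexing = indexDuration.time();
            try {
                indexTask.run();
            } finally {
                indexing.stop();
            }
        } finally {
            indexPermits.release();
        }
    }
}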
…fter index time update
Add timing metrics to measure indexing load
@poikilotherm Thanks! I have just merged your code, it looks great! I used Prometheus with Grafana in the past; it would be cool to have something like that. It would help with finding other issues as well.
Splendid! Thanks for merging! Two more questions:
…l/dataverse into 10381_index_after_publish
@poikilotherm a histogram would be more useful than raw data. Also, documentation would be great! Do we have any on how to access metrics? If it is a new thing, it might be out of scope for this PR. I think it deserves a new section in the documentation specifically on metrics. We should also add more metrics in other places.
There is
I started testing the effects of the semaphores on our test instance, running off a clone of the prod. database, with somewhat inconclusive results just yet. But I'm only doing it in parallel with having been press-ganged into working on something else prod.- and admin-related here.
I added a mention in the
The guide entry for the metrics is perfect.
What this PR does / why we need it:
Hopefully this fixes the indexing issues after publishing of datasets and lowers the risk of Solr getting overwhelmed.
Which issue(s) this PR closes:
Closes #10381
Special notes for your reviewer:
I do not have much experience with Java semaphores yet, but I have used them extensively in other languages. We need to be sure it works as intended before merging. I tried to balance this solution so as not to move the problem from relying on Solr to cope with the load to having too many threads sleeping and waiting on the semaphore. The new parameter for the maximum number of running async indexes should then be tuned for Harvard Dataverse; I think (guesstimate) 4 is OK for most instances, but Harvard might want a bigger number.
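For context, the gating idea in a minimal sketch (the class is illustrative; the permit count would come from the new property, with a default of 4 assumed here):

import java.util.concurrent.Semaphore;

public class IndexGate {

    private final Semaphore permits;

    public IndexGate(int maxParallelIndexes) {
        // 'fair' mode serves waiting threads in arrival order
        this.permits = new Semaphore(maxParallelIndexes, true);
    }

    public void runIndexJob(Runnable indexJob) throws InterruptedException {
        permits.acquire(); // block here instead of piling load onto Solr
        try {
            indexJob.run();
        } finally {
            permits.release(); // always free the permit, even on failure
        }
    }
}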
Suggestions on how to test this:
Ideally, under realistic load in an environment similar to production: publish and edit many datasets concurrently. Also, edit some collections/dataverses, as that appears to be the worst-case scenario: after editing a large collection, all of its datasets get reindexed asynchronously, causing a large load on the server and Solr. This might be the root cause of the problems on Harvard Dataverse.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.
Is there a release notes update needed for this change?:
Yes, it includes a new property.
Additional documentation:
Needs testing. I hope it will pass the unit and integration tests without problems. I will also try testing it on my local installation. Any help with testing it in a realistic setting is welcome.