User Behavior Insights implementation for Apache Solr #2452

Draft
wants to merge 112 commits into
base: main

Conversation

epugh
Contributor

@epugh epugh commented May 9, 2024

Description

I am working with other folks, especially Stavros Macrakis (macrakis@gmail.com), to come up with a solution for understanding what users are doing in response to search results. We have great visibility and understanding of an incoming query, what we do with it, and then what docs are sent back. We do NOT have a way of tying that search to what the user does next, or of telling whether the following query is connected to the original one.

Many teams lean on GA or Snowplow or custom code for tracking click through, add to cart, etc. as signals, but there is nothing that is drop-dead simple to use and open source.

Solution

User Behavior Insights is a shared schema for tracking search related activities. There is a basic implementation for OpenSearch and this is a version for Apache Solr.

Tasks to be done:

  • Demonstrate providing a .expr file and using it to write to another Solr collection.
  • Look at the performance implications of every query generating a streaming expression.
  • Check we only record on the main node, not the replicas when sharding.
  • How can I load test this?
  • Write up Ref Guide Docs
  • Can we add it to techproducts as an example?
  • Add UBI to Admin UI as flag
  • Add UBI to SolrJ basic client
  • Add UBI to SolrJ JSON Query client

Tests

Bats test to demonstrate the end to end use of UBI.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@HoustonPutman
Contributor

I'd love to review here, but I think I need some more starting information either in a ref guide page or a JIRA, I'm kind of lost right now...

@epugh
Contributor Author

epugh commented May 10, 2024

> I'd love to review here, but I think I need some more starting information either in a ref guide page or a JIRA, I'm kind of lost right now...

Yeah... I'll go ahead and write up some ref guide docs! And finish the demo .bats script ;-)

@chatman
Contributor

chatman commented May 20, 2024

Usually, features like these are discussed in the dev@ list, or in JIRA or a SIP.
Most important question I have in mind is whether this needs to be in the core search engine? If not, can this not be a plugin/package, shipped outside of solr-core?

@epugh
Contributor Author

epugh commented May 20, 2024

This is definitely draft mode code... I opened it as a PR just to be able to track the work, and once it gets a bit further, I plan on opening a proper discussion about it. Module? Solr Sandbox? A Component? A full blown package? So many fun options...

@epugh
Contributor Author

epugh commented Nov 27, 2024

A question for the folks smarter than me. Should the classes UBIQuery and UBIQueryStream be added to the UBIComponent.java? UBIQuery is just a pojo... And UBIQueryStream wires the use of the component up to a streaming expression. I don't see either ever being used elsewhere....

@epugh
Contributor Author

epugh commented Nov 27, 2024

Just stubbed my toe on "Distributed processing is harder than single core processing"! With a two-node setup, I discovered that I am logging to a SINGLE userfiles/ubi_queries.jsonl file, and I log once for each shard instead of just once at the collector step.

{"query_id":"c4e40af6-67b7-4824-8b63-5aae70a485f6","timestamp":"2024-11-27T13:42:19.121Z"}
{"query_id":"5dfedf02-fd89-4e40-b3aa-7700c162800b","timestamp":"2024-11-27T13:42:19.121Z"}

Sigh.
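
One possible way to avoid the per-shard duplicates, sketched here as an assumption rather than what this PR does: Solr marks shard sub-requests with the `isShard=true` parameter, so a component could skip recording whenever it sees that flag. The plain `Map` below stands in for the request parameters, and `UbiRecordGuard` and `shouldRecord` are hypothetical names.

```java
import java.util.Map;

public class UbiRecordGuard {
    // Hypothetical helper: true only for the top-level (coordinator) request,
    // i.e. when the request does not carry isShard=true.
    public static boolean shouldRecord(Map<String, String> params) {
        return !Boolean.parseBoolean(params.getOrDefault("isShard", "false"));
    }

    public static void main(String[] args) {
        // Top-level request: record it.
        System.out.println(shouldRecord(Map.of("q", "ipod")));                    // prints true
        // Shard sub-request: skip it.
        System.out.println(shouldRecord(Map.of("q", "ipod", "isShard", "true"))); // prints false
    }
}
```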

The fact that we are calling .keySet may be a problem...  Because that means other components might be in a random order?  Maybe we shouldn't even use a map of string/class, it should just be a list of classes?
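
On the `.keySet` ordering worry: `Map.of` and `HashMap` leave iteration order unspecified, while `LinkedHashMap` iterates in insertion order. A minimal sketch, where `OrderedComponents` is a hypothetical class and the three entries are only illustrative, not Solr's actual default list or order:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OrderedComponents {
    // Hypothetical registry of name -> class name, kept in insertion order.
    public static Map<String, String> registry() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("query", "QueryComponent");
        m.put("facet", "FacetComponent");
        m.put("highlight", "HighlightComponent");
        return m;
    }

    // keySet() of a LinkedHashMap preserves the order the entries were added in.
    public static List<String> names() {
        return new ArrayList<>(registry().keySet());
    }

    public static void main(String[] args) {
        System.out.println(names()); // prints [query, facet, highlight]
    }
}
```
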
@epugh
Contributor Author

epugh commented Nov 28, 2024

Argh, a bit stuck. I can't figure out how to have the UBIComponent, during a distributed query, look up the final doc ids and record them before sending them back to the user. With a single node and a single shard it works great, but not in a distributed fashion.

I keep getting:

2024-11-28 12:36:31.368 ERROR (qtp428039780-40-localhost-11) [c:twoshard s:shard1 r:core_node4 x:twoshard_shard1_replica_n2 t:localhost-11] o.a.s.s.HttpSolrCall 500 Exception => java.lang.NullPointerException: Cannot read field "docList" because the return value of "org.apache.solr.handler.component.ResponseBuilder.getResults()" is null
	at org.apache.solr.handler.component.UBIComponent.doStuff(UBIComponent.java:315)
java.lang.NullPointerException: Cannot read field "docList" because the return value of "org.apache.solr.handler.component.ResponseBuilder.getResults()" is null
	at org.apache.solr.handler.component.UBIComponent.doStuff(UBIComponent.java:315) ~[?:?]
	at org.apache.solr.handler.component.UBIComponent.distributedProcess(UBIComponent.java:252) ~[?:?]
	at org.apache.solr.handler.component.SearchHandler.processComponents(SearchHandler.java:552) ~[?:?]
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:429) ~[?:?]
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:238) ~[?:?]
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2875) ~[?:?]

Member

@mkhludnev mkhludnev left a comment

Hi @epugh !

}
}

ResultContext rc = (ResultContext) rb.rsp.getResponse();
Member

To make it distributed, we need to split this method here, and execute parts per certain stage.
It seems, UBIComp needs to submit docids found. Right?
In a distributed context, do we need to record the before-merge per-shard ids or the resulting merged&cropped result ids? I bet the latter, please confirm.

Contributor Author

Thanks @mkhludnev for looking at this, gives me renewed energy to know someone else is looking at it! Yes, it is the merged&cropped result ids. We want to record in the plugin the candidate result ids that the user MIGHT have seen, which are later compared against clicks and impressions to identify which docs are NOT attractive to the user.
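
To make "merged&cropped" concrete, here is a toy sketch: each shard returns its own top hits, the coordinator merges them by score and keeps only the first `rows`, and those final ids are the ones to record. `Hit`, `mergeAndCrop`, and the sample data are hypothetical and do not use Solr's classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class MergeAndCrop {
    // Hypothetical stand-in for one scored document returned by a shard.
    public static final class Hit {
        final String id;
        final double score;

        public Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Merge per-shard hits, sort by descending score, keep the top `rows` ids.
    public static List<String> mergeAndCrop(List<List<Hit>> shardHits, int rows) {
        List<Hit> all = new ArrayList<>();
        shardHits.forEach(all::addAll);
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return all.stream().limit(rows).map(h -> h.id).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Hit> shard1 = List.of(new Hit("a", 3.0), new Hit("b", 1.0));
        List<Hit> shard2 = List.of(new Hit("c", 2.0), new Hit("d", 0.5));
        // Four per-shard candidates, but only the three merged winners are recorded.
        System.out.println(mergeAndCrop(List.of(shard1, shard2), 3)); // prints [a, c, b]
    }
}
```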

stream = constructStream(streamFactory, streamExpression);

streamContext.put("ubi-query", ubiQuery);
stream.setStreamContext(streamContext);
Member

I know nothing about streams, but is it possible to submit UbiQuery here, which isn't a subclass of any serializable framework? You know, passing a ref might work locally, but not in a remote/distributed env.

Contributor Author

Good question. I believe that the getTuple method will be run immediately, which means this code doesn't actually get run in a distributed sense. I.e., the ubiQuery object that we put into the streamContext is immediately read back out in the getTuple() method. That is the job of the UBIQueryStream class: to convert the UBIQuery found in the context into a Tuple that is used by streaming expressions.

I am going to try actually making UBIQuery and UBIQueryStream inner classes of UBIComponent to see how that looks...

Member

Contributor Author

That's a fair question... OMG... I can't believe that I forgot to add the damn doc ids... I suck.

Thanks for spotting that! How embarrassing! yeah...

Contributor Author

And fixed.

Pardon, I barely understand what's going on there.
names.add(TermsComponent.COMPONENT_NAME);

return names;
List<String> l = new ArrayList<String>(SearchComponent.STANDARD_COMPONENTS.keySet());
Member

As a result RTG occurs as a default component that causes a problem in an essential cloud test. 😭

Contributor Author

humm... Maybe I just back out this optimization...

Contributor Author

check out the changes I made to back this out, but still keep the name STANDARD_COMPONENTS...


Set<String> fields = Collections.singleton(schema.getUniqueKeyField().getName());
for (DocIterator iter = dl.iterator(); iter.hasNext(); ) {
sb.append(schema.printableUniqueKey(searcher.getDocFetcher().doc(iter.nextDoc(), fields)))
Member

what about commas in id values? Isn't it safer to use json array as a convention for this field?

Contributor Author

humm... maybe? Is that ever done? So, most folks are going to take this data and load it into a pandas dataframe or something like that... Now, since the doc is already JSON, maybe it's okay to make this a JSON array as well? Since you already have to parse some JSON to do anything with it......
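
A small sketch of the difference, assuming plain string ids: a comma-joined field becomes ambiguous once an id contains a comma, while a JSON array keeps each id separate. `DocIdsJson` and its methods are hypothetical, and the hand-rolled escaper only covers quotes, backslashes, and control characters.

```java
import java.util.List;

public class DocIdsJson {
    // Escape a string so it can sit inside a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                // Control characters become a six-character hex escape.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Render the ids as a JSON array of strings.
    public static String toJsonArray(List<String> ids) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append('"').append(escape(ids.get(i))).append('"');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<String> ids = List.of("doc,1", "doc2");
        System.out.println(String.join(",", ids)); // prints doc,1,doc2 (reads as three ids)
        System.out.println(toJsonArray(ids));      // prints ["doc,1","doc2"]
    }
}
```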

epugh and others added 9 commits December 9, 2024 13:45
Neither UBIQuery nor UBIQueryStream will ever be used outside of the UBIComponent.   Thought about a o.a.s.handler.component.ubi package as well, but this seems more specific...
This doesn't cover converting doc_ids into a JSON array of any type...
Except that we had to change to support TEN standard components by not using Map.of.  Also, I couldn't stand the lower case + underscore "standard_components" object name.
I suspect lots of places for fixing.  Like pulling out field names into a UBIParams.java file.   May want to name space ubi query params under "ubi."..  what about in JSON query?
5 participants