User Behavior Insights implementation for Apache Solr #2452
base: main
We have newer approaches such as the FileStoreAPI and the PackageManager.
… testing works with json query
I'd love to review here, but I think I need some more starting information either in a ref guide page or a JIRA, I'm kind of lost right now...
Yeah... I'll go ahead and write up some ref guide docs! And finish the demo .bats script ;-)
Usually, features like these are discussed in the dev@ list, or in JIRA or a SIP.
This is definitely draft mode code... I opened it as a PR just to be able to track the work, and once it gets a bit further, I plan on opening a proper discussion about it. Module? Solr Sandbox? A Component? A full blown package? So many fun options...
refer to the standard components using more normal pattern.
We are already in the UBI component!
A question for the smarter folks than me. Should the classes …
Just stubbed my toe on the "Distributed processing is harder than single core processing"! With a two node set up, I discovered that I am logging to a SINGLE …
Sigh.
… as we interleave data otherwise.
The fact that we are calling .keySet may be a problem... Because that means other components might be in a random order? Maybe we shouldn't even use a map of string/class, it should just be a list of classes?
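A minimal sketch of the ordering concern, not the PR's actual code: `Map.of` and `HashMap` make no promise about iteration order, so `keySet()` can hand back component names in any order, while a `LinkedHashMap` (or a plain list) keeps insertion order. The class, method, and component names below are hypothetical stand-ins for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ComponentOrderSketch {

  // Hypothetical stand-in for a name -> class registry of standard components.
  // LinkedHashMap iterates in insertion order, unlike Map.of(...) or HashMap.
  static Map<String, Class<?>> standardComponents() {
    Map<String, Class<?>> components = new LinkedHashMap<>();
    components.put("query", Object.class);
    components.put("facet", Object.class);
    components.put("highlight", Object.class);
    components.put("ubi", Object.class);
    return components;
  }

  // Copy the keys into a list; the order matches the put(...) calls above.
  static List<String> defaultComponentNames() {
    return new ArrayList<>(standardComponents().keySet());
  }

  public static void main(String[] args) {
    System.out.println(defaultComponentNames());
  }
}
```

If the class values are never looked up by name, a `List` of names or classes expresses the same ordering with less ceremony.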
Argh, a bit stuck. I can't figure out how to have the UBIComponent, during a distributed query, look up the final doc ids and record them before sending them back to the user. With a single node and a single shard it works great, but not in a distributed fashion. I keep getting:
Hi @epugh !
  }
}

ResultContext rc = (ResultContext) rb.rsp.getResponse();
To make it distributed, we need to split this method here, and execute parts per certain stage.
It seems, UBIComp needs to submit docids found. Right?
In a distributed context, do we need to record the before-merge per-shard ids or the resulting merged&cropped result ids? I bet the latter, please confirm.
Thanks @mkhludnev for looking at this, gives me renewed energy to know someone else is looking at it! Yes, it is the merged&cropped result ids. We want to record in the plugin the candidate result ids that the user MIGHT have seen, which are later compared against clicks and impressions to identify which docs are NOT attractive to the user.
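To make "merged&cropped" concrete, here is a rough sketch in plain Java, assuming each shard returns its own scored top hits and the coordinator keeps only the best `rows` overall. It deliberately avoids Solr's real distributed-search classes; `Hit` and `mergedAndCroppedIds` are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class MergedIdsSketch {

  // Hypothetical per-shard hit: a unique key plus its score.
  static final class Hit {
    final String id;
    final double score;

    Hit(String id, double score) {
      this.id = id;
      this.score = score;
    }
  }

  // Merge every shard's hits, sort by descending score, crop to `rows`,
  // and return only the ids: the candidates the user might actually see.
  static List<String> mergedAndCroppedIds(List<List<Hit>> perShardHits, int rows) {
    return perShardHits.stream()
        .flatMap(List::stream)
        .sorted(Comparator.comparingDouble((Hit h) -> h.score).reversed())
        .limit(rows)
        .map(h -> h.id)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Hit> shard1 = List.of(new Hit("a", 3.0), new Hit("b", 1.0));
    List<Hit> shard2 = List.of(new Hit("c", 2.0), new Hit("d", 0.5));
    System.out.println(mergedAndCroppedIds(List.of(shard1, shard2), 2));
  }
}
```

Recording the per-shard lists instead would log "b" and "d" as candidates even though, with `rows=2`, the user never had a chance to see them.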
stream = constructStream(streamFactory, streamExpression);

streamContext.put("ubi-query", ubiQuery);
stream.setStreamContext(streamContext);
I know nothing about streams, but is it possible to submit UbiQuery here, which isn't a subclass of any serializable framework? You know, passing a ref might work locally, but not in a remote/distributed env.
Good question. I believe that the getTuple method will be run immediately, which means this code doesn't actually get run in a distributed sense. I.e., the ubiQuery object that we put into the streamContext is immediately read back out in the getTuple() method. That is the job of the UBIQueryStream class: to convert the UBIQuery found in the context into a Tuple that is used by streaming expressions.
I am going to try actually making UBIQuery and UBIQueryStream inner classes of UBIComponent to see how that looks...
ok. got it. found toMap(). Shouldn't it yield docIds as well? I can't see refs to this field there https://github.com/epugh/solr/blob/99d6b7a7eb7b28a92f4cb36d4a525f8b901ba93c/solr/core/src/java/org/apache/solr/handler/component/UBIQuery.java#L103
That's a fair question... OMG... I can't believe that I forgot to add the damn doc ids... I suck.
Thanks for spotting that! How embarrassing! yeah...
And fixed.
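For readers following along, a hedged sketch of what a toMap() that also carries the doc ids could look like. This is not the PR's UBIQuery class: the class name, field set, and map keys (`query_id`, `user_query`, `doc_ids`) are assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UbiQuerySketch {
  private final String queryId;
  private final String userQuery;
  private final List<String> docIds;

  public UbiQuerySketch(String queryId, String userQuery, List<String> docIds) {
    this.queryId = queryId;
    this.userQuery = userQuery;
    this.docIds = docIds;
  }

  // Flatten the object into a map so it can be turned into a tuple or JSON.
  // The doc ids are included, which is the omission spotted in the review.
  public Map<String, Object> toMap() {
    Map<String, Object> map = new LinkedHashMap<>();
    map.put("query_id", queryId);
    map.put("user_query", userQuery);
    map.put("doc_ids", docIds);
    return map;
  }

  public static void main(String[] args) {
    System.out.println(new UbiQuerySketch("q-1", "memory", List.of("1", "2")).toMap());
  }
}
```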
Pardon, I barely understand what's going on there.
names.add(TermsComponent.COMPONENT_NAME);

return names;
List<String> l = new ArrayList<String>(SearchComponent.STANDARD_COMPONENTS.keySet());
As a result RTG occurs as a default component that causes a problem in an essential cloud test. 😭
humm... Maybe I just back out this optimization...
check out the changes I made to back this out, but still keep the name STANDARD_COMPONENTS
...

Set<String> fields = Collections.singleton(schema.getUniqueKeyField().getName());
for (DocIterator iter = dl.iterator(); iter.hasNext(); ) {
  sb.append(schema.printableUniqueKey(searcher.getDocFetcher().doc(iter.nextDoc(), fields)))
What about commas in id values? Isn't it safer to use a JSON array as the convention for this field?
humm... maybe? Is that ever done at all? So, most folks are going to take this data and load it into a pandas dataframe or something like that... Now, since the doc is already JSON.. maybe it's okay to make this an array of JSON as well? Since you already have to parse some JSON to do anything with it......
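A small sketch of the comma problem raised above, assuming the ids are plain strings. It hand-rolls minimal JSON string escaping so that it stays self-contained; a real implementation would more likely lean on whatever JSON writer is already on the classpath. `DocIdsJsonSketch` and its methods are hypothetical names.

```java
import java.util.List;
import java.util.stream.Collectors;

public class DocIdsJsonSketch {

  // Quote one id as a JSON string: escape quotes, backslashes, control chars.
  static String quote(String s) {
    StringBuilder sb = new StringBuilder("\"");
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '"') {
        sb.append("\\\"");
      } else if (c == '\\') {
        sb.append("\\\\");
      } else if (c < 0x20) {
        sb.append(String.format("\\u%04x", (int) c));
      } else {
        sb.append(c);
      }
    }
    return sb.append('"').toString();
  }

  // Encode the ids as a JSON array so a comma inside an id stays unambiguous.
  static String toJsonArray(List<String> ids) {
    return ids.stream()
        .map(DocIdsJsonSketch::quote)
        .collect(Collectors.joining(",", "[", "]"));
  }

  public static void main(String[] args) {
    List<String> ids = List.of("SKU-1", "Smith, John");
    // Comma-joined: a consumer splitting on "," sees three ids, not two.
    System.out.println(String.join(",", ids));
    // JSON array: the two ids survive intact.
    System.out.println(toJsonArray(ids));
  }
}
```

On the consuming side the array form costs one extra JSON parse of the field, which fits the observation that the surrounding document is already JSON.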
Neither UBIQuery nor UBIQueryStream will ever be used outside of the UBIComponent. Thought about an o.a.s.handler.component.ubi package as well, but this seems more specific...
This doesn't cover converting doc_ids into a JSON array of any type...
Except that we had to change to support TEN standard components by not using Map.of. Also, I couldn't stand the lower case + underscore "standard_components" object name.
I suspect lots of places for fixing. Like pulling out field names into a UBIParams.java file. May want to name space ubi query params under "ubi.".. what about in JSON query?
UBI goes distrib
Description
I am working with other folks, especially Stavros Macrakis (macrakis@gmail.com), to come up with a solution for understanding what users are doing in response to search results. We have great visibility and understanding of an incoming query, what we do with it, and then what docs are sent back. We do NOT have a way of tying that search to what the user does next, or of telling whether the following query is connected to the original one.
Many teams lean on GA or Snowplow or custom code for tracking click-through, add-to-cart, etc. as signals, but there is nothing that is drop-dead simple to use and open source.
Solution
User Behavior Insights is a shared schema for tracking search related activities. There is a basic implementation for OpenSearch and this is a version for Apache Solr.
Tasks to be done:
… .expr file and using it to write to another Solr collection.
… techproducts as an example?
Tests
Bats test to demonstrate the end to end use of UBI.
Checklist
Please review the following and check all that apply:
… main branch.
… ./gradlew check.