Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set default precisionStep for NumericField and NumericRangeFilter [LUCENE-1712] #2786

Closed
asfimport opened this issue Jun 22, 2009 · 22 comments
Closed

Comments

@asfimport
Copy link

asfimport commented Jun 22, 2009

This is a spinoff from #2775.

A user using Numeric* should not need to understand what's
"under the hood" in order to do their indexing & searching.

They should be able to simply:

doc.add(new NumericField("price", 15.50);

And have a decent default precisionStep selected for them.

Actually, if we add ctors to NumericField for each of the supported
types (so the above code works), we can set the default per-type. I
think we should do that?

4 for int and 6 for long was proposed as good defaults.

The default need not be "perfect", as advanced users can always
optimize their precisionStep, and for users experiencing slow
RangeQuery performance, NumericRangeQuery with any of the defaults we
are discussing will be much faster.


Migrated from LUCENE-1712 by Michael McCandless (@mikemccand), resolved Jul 14 2009
Attachments: LUCENE-1712.patch (versions: 2)

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

Should be:

doc.add(new NumericField("price").setFloatValue(15.50f));

With a direct constructor, there would be the problem of missing data type information.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

With a direct constructor, there would be the problem of missing data type information.

Sorry, what does that mean?

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

The methods setFloatValue() etc are for specifying the exact data type. If you would do it in the constructor, the resulting code would be very error-prone. e.g., new NumericField("price", 12.5): Does this mean 12.5 as double or float, is 15 alone meant as (in future) short, byte or int? OK users can add "f" in the float case to it, but this makes it very hard to prevent errors, because Java automatically casts all numeric types to each other suddenly. This is why I added these factories for NumericRangeQuery and the setters here.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, I agree it's dangerous to let javac auto cast.

So, can we set an across the board default of 4?

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

Am I misunderstanding something or the problem still persists?
Even if you use a common default, what is your base type - int or long? Are floats converted to ints, or to longs?

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

I think this is simple, we just overload all ctors and factories and remove the precisionStep there. These methods use NumericUtils.DEFAULT_PRECISION_STEP = 4 then. But thee should be a clear not, to do also tests with other step values.

additional variants:

  • NumericField(name), NumericField(name,store,index)
  • NumericTokenStream()
  • NumericRangeQuery.newXxxRange(field, min, max,...)

I am currentlly not sure (I was thinking the whole time during including into core) to also make NumericRangeQuery work like the other two classes: generic Constructor without datatype and then set the range explicit:

new NumericRangeQuery(fieldname[, precisionStep]).setFloatRange(....)

Not sure, in this case the API where similar and I have to override only one ctor for different construction parameters. My only problem is, that Queries are normally almost everywhere in Lucene static and unmodifable (beyond boost).

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

Even if you use a common default, what is your base type - int or long? Are floats converted to ints, or to longs?

Float are indexed like ints and doubles like longs.

The problem here is more that if you would specify the value direct in the constructor, you cannot for sure always give the right type (because Java auto-casts). This is why I have these setFloatValue(), setLongValue() and so on.

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

Aha! And each time you invoke setFloatValue/setDoubleValue it switches base type behind the scenes? Eeek.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

each time you invoke setFloatValue/setDoubleValue it switches base type behind the scenes?

I think that's an acceptable risk. I suppose we could add checking to catch you but I don't think that's needed (we should document clearly that you can't "mix types").

@asfimport
Copy link
Author

Chris M. Hostetter (@hossman) (migrated from JIRA)

behind the scenes precision changes based on which set*Value() method is called smells really wrong.

I'm not overly familiar with NumericField, but i'm i'm understanding the current suggestion, wouldn't that mkae situations like this come up...

NumericField a = new NumericField("price", MY_CUSTOM_PRECISION_STEP, ...);
a.setFloatValue(23.4f); // blows away my custom precision

NumericField b = new NumericField("price", ...);
b.setPrecisionStep(MY_CUSTOM_PRECISION_STEP);
b.setFloatValue(23.4f); // blows away my custom precision

NumericField c = new NumericField("price", ...);
c.setFloatValue(23.4f); 
c.setPrecisionStep(MY_CUSTOM_PRECISION_STEP); // only way to get my value used

...that seems sketchy, and really anoying if people try reusing NumericField instances.

If the goal is to have good "defaults" based on type then why not just have a constant per type that people can refer to explicitly? if they don't know what number to pick ... as well as a true "default" if they pick nothing.

int DEFAULT_STEP = ...;
int SUGGESTED_INT_STEP = ...;
int SUGGESTED_FLOAT_STEP = ...;

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

In my opinion, I would like to keep the precisionStep parameter required and give the 4 constants for each data type in NumericUtils.

On the other hand 4 is a (maybe) good default, so I propose, that all ctors/factories getting a precisionStep default it to 4, if left out. precisionStep is a final variable in NumericTokenStream (and so in NumericField), because it does not make sense to change it. If "field" is final, also precisionStep should be final (one field must always use the same precision step). In principle Mike is right, the type is also fixed after first calling setXxxValue, so I could throw an IAE, if somebody calles a setter for a different datatype after the first one. A IllegalStateEx is thrown, when the value was not initialized and the docinverter tries to use the token stream.

Here are two ideas to fix this the defaultPrecStep per type:

  1. Special value 0 as default precStep:
  • the no-precStep ctor sets the precStep in NumTokenStream to 0 (invalid value), if one is given it must be >0 and <=valsize
  • when delivering the tokens, NumTokenStream uses the default for this data type if precStep==0 and the given value in all other cases
    In this case the precStep is still final in NumericTokenStream, with 0 means "automatic".
  1. There is one other idea:
    NumericField/-TokenStream could have a required ctor param type that can be NumericField.Type.INT,... In this case the default could be choosen very simple at the beginning. And it also fixes the data type. If somebody calls setDoubleValue but has initialized the TokenStream with NumericField.Type.INT, he will get an UOE.

The javadocs should always clearly note, that one should check out a good precStep.


By the way: It is also a good idea to use valSize (32 or 64) as precisionStep in the case that you do not want to do range queries on the field (and use it only for sorting). RangeQueries would still work, but are as slow as conventional ones (current solr trunk contains this hint in its TrieField docs/schema)

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

Having half of your methods constantly fail with an exception depending on constructor parameter. That just screams - "Split me into two classes!"

@asfimport
Copy link
Author

asfimport commented Jun 23, 2009

Uwe Schindler (@uschindler) (migrated from JIRA)

Four classes! And with #2784 there will be six! Not a good idea. 6 classes for NumericTokenStream, 6 for NumericField and maybe 6 for NumRangeQuery/Filter. brrrrr

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

...and unmaintainable. I merged the two classes from contrib because of this. It was just duplicate code with some small variances. Always a problem for copy/paste operations.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

I propose, that all ctors/factories getting a precisionStep default it to 4

+1, with javadocs encouraging experimentation.

I think the ideas to conditionalize the default according to type add
spooky complexity

By the way: It is also a good idea to use valSize (32 or 64) as precisionStep in the case that you do not want to do range queries on the field (and use it only for sorting).

Let's be sure to call out this use-case in the javadocs!

@asfimport
Copy link
Author

Chris M. Hostetter (@hossman) (migrated from JIRA)

By the way: It is also a good idea to use valSize (32 or 64) as precisionStep in the case that you do not want to do range queries on the field (and use it only for sorting).

This screams out for additional (redundant) constants that are self documenting in their names...

int PRECISION_STEP_DEFAULT = 4; // i think?
int PRECISION_STEP_SUGGESTED_INT_RANGEANDSORT = ...; // no idea what this should be
int PRECISION_STEP_SUGGESTED_INT_SORTONLY = 32; // i think?
int PRECISION_STEP_SUGGESTED_FLOAT_RANGEANDSORT = ...; // no idea what this should be
int PRECISION_STEP_SUGGESTED_FLOAT_SORTONLY = 32; // i think?
...

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

How about we add PRECISION_STEP_DEFAULT=4, make that the default for all types, and then note in the javadocs the "interesting" values for precision step (ie for sorting only)?

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm assuming this one is yours Uwe!

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

Attached is a patch with the default precisionStep of 4. The javadocs of NumericRangeQuery list all possible and senseful values.

This patch also contains some cleanup in NumericUtils (rename constants) and a lot of other JavaDocs fixes. It also changes the token types of the TokenStream (no difference between 32/64 bit vals needed) and adds a test for them.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Uwe!

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

Some updates:

  • remove the limitation of the precisionStep to the value size (32 or 64). To index a non-trie-encoded numeric field, you can simply use Integer.MAX_VALUE or any other value>=64 as precStep. The lower limit of precStep is 1.
  • add a test for 64 bit values with recommended precStep=6
  • add simple test for unlimited precStep as above
    I will commit this shortly.

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed revision: 793823

Thanks Mike!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants