Loading CSV data by ETL override Float values #6839
Comments
If you know the types of the columns, you can configure the CSV extractor:
Thanks robfrank, you are absolutely right!
I set up a test for this case. The configuration is:
It represents a CSV like this:
So, the first line is "casted" to integer, the second to float. I guess you have configured ETL to create class properties setting
Your suggestion could be dangerous. The ETL tries to guess types on a best-effort basis, but if you need precision and you know your data, configuring columns is the best way to do it. ETL in general is dirty work :)
Sorry for my late response, but I beg to differ. The suggestion is not dangerous. We are talking about column-by-column data. If someone mixes integers, strings and floats in the very same column, i.e. property, he or she has a problem. That is not what I am talking about. Well, yes, ETL is dirty, but when working with instruments (mass cytometry) in cancer research, which produce millions of data points, this approach could make a difference. Please also take a look at my dry-run suggestion in #6872.
Configuring ETL with data types will speed up the ingestion process because ETL doesn't need to guess the data type for every line. Changing data types on the go implies a priority scheme for types: is float more important than integer? Maybe for your use case. Maybe for another user integer should be preferred.
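To make the "priority" question above concrete, here is a minimal Python sketch of one possible ordering, in which a column is widened from integer to float to string as its values require. The names `WIDENING_ORDER`, `infer_type`, `infer_column_type` and `load_column` are hypothetical, and this is not how OrientDB ETL is implemented; it only illustrates the idea under discussion.

```python
# Hypothetical widening order: a later entry can represent every value of an earlier one.
WIDENING_ORDER = [int, float, str]


def infer_type(text):
    """Guess a type from a single CSV cell: int first, then float, else str."""
    for caster in (int, float):
        try:
            caster(text)
            return caster
        except ValueError:
            continue
    return str


def infer_column_type(cells):
    """Return the widest type needed by any cell in the column."""
    widest = int
    for cell in cells:
        candidate = infer_type(cell)
        if WIDENING_ORDER.index(candidate) > WIDENING_ORDER.index(widest):
            widest = candidate
    return widest


def load_column(cells):
    """Convert every cell to the single type chosen for the whole column."""
    column_type = infer_column_type(cells)
    return [column_type(cell) for cell in cells]


print(load_column(["0", "1.274284"]))
```

Under this ordering a column holding `0` and `1.274284` is loaded entirely as floats. A user who prefers integers would need a different ordering or an explicit column type, which is the trade-off described above.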
Again, I disagree. OrientDB is marketed as a NoSQL database: schema-less, schema-full, hybrid, what have you. That is actually the beauty of it, great flexibility. I find the talk about priority hard to understand. If one measurement from one property (column) is 0, and the next row, same column/property, is 1.274284, priority is irrelevant. That property must be set to float, not integer as it is today. The current default behavior is a bug. You also argued that ETL is dirty work. Well, it doesn't have to be that way. You probably don't want to,
Hi @austx,
Many thanks for opening this issue and for your feedback. I have discussed this internally with @robfrank and we are considering supporting the following additional case:
This will handle the following situation: a column (property) where you have different data types, e.g. 0 will be imported in OrientDB in schema-less mode, with no errors, thus giving users a lot of flexibility.
In your specific case, you will have some integers and some decimal values (and no cast will happen), e.g. you will find in the database the following values: 0
Note that after the ETL is complete, you can still, if you want, create a DECIMAL property on this column.
I believe that the additional case we are considering will give more flexibility to our users. Obviously, one may still decide to define the data type before the import, and a cast will be done.
Warning: when using schema-less mode with no checks on the data type, users must know what they are doing, as they may end up with strings, integers, decimals and other types on the same property. Sometimes they may want exactly this, and at other times the lack of a schema will prevent the user from noticing that a "string" is a wrong value. I believe this is the kind of compromise we have to accept to have maximum flexibility.
Would this be a good compromise for you?
Many thanks,
Hi santo-it et al., thanks for getting back to me. I use schema less when loading new data from I could appreciate the fact that in some odd cases strings, e.g. "hello" Also, when it comes to bad data, the debug flag ERROR could perhaps be a Thanks for your understanding, very much appreciated. On 21 November 2016 at 12:04, santo-it notifications@github.com wrote:
Tore Austrått
OrientDB Version, operating system, or hardware.
Operating System
Expected behavior and actual behavior
If the first row (below the column headers) is 0, i.e. not 0.0, then the values in the following rows get set to integer as well. I used sed to work around this during load prep, but how else can this be handled?
Steps to reproduce the problem
Load a CSV file with 0 in the first line (below the title row) and a float value on the next line in the same column; e.g. 2.82942 will become 2.
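The reported symptom can be modelled with a short Python sketch. It assumes, purely for illustration, that the type guessed from the first data row is applied to every later row in the column; the function names are hypothetical and the actual ETL code path is not shown in this issue.

```python
def infer_type(text):
    """Guess a type from a single CSV cell: int first, then float, else str."""
    for caster in (int, float):
        try:
            caster(text)
            return caster
        except ValueError:
            continue
    return str


def load_column_first_row_wins(cells):
    """Hypothetical model: the first cell fixes the type for the whole column."""
    column_type = infer_type(cells[0])
    if column_type is int:
        # Going through float first mimics a lossy numeric cast to integer.
        return [int(float(cell)) for cell in cells]
    return [column_type(cell) for cell in cells]


print(load_column_first_row_wins(["0", "2.82942"]))    # [0, 2]
print(load_column_first_row_wins(["0.0", "2.82942"]))  # [0.0, 2.82942]
```

In this model, writing the first value as `0.0` instead of `0` is enough to keep the column as float, which matches the effect of the sed workaround described above.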