Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loading CSV data by ETL override Float values #6839

Closed
4 tasks
ghost opened this issue Oct 24, 2016 · 9 comments
Closed
4 tasks

Loading CSV data by ETL override Float values #6839

ghost opened this issue Oct 24, 2016 · 9 comments
Assignees
Labels

Comments

@ghost
Copy link

ghost commented Oct 24, 2016

OrientDB Version, operating system, or hardware.

  • v2.0 SNAPSHOT[ ] - .18[ ] .17[ ] .16[ ] .15[ ] .14[ ] .13[ ] .12[ ] .11[ ] .10[ ] .9[ ] .8[ ] .7[ ] .6[ ] .5[ ] .4[ ] .3[ ] .2[ ] .1[ ] .0[ ]
  • v2.1 SNAPSHOT[ ] - .16[ ] .15[ ] .14[ ] .13[ ] .12[ ] .11[ ] .10[ ] .9[ ] .8[ ] .7[ ] .6[ ] .5[ ] .4[ ] .3[ ] .2[ ] .1[ ] .0[ ]
  • v2.2 SNAPSHOT[ 11] - .rc1[ ] .beta2[ ] .beta1[ ]

Operating System

  • [ x] Linux
  • MacOSX
  • Windows
  • Other Unix
  • Other, name?

Expected behavior and actual behavior

If first row (below column headers) is 0, i.e. not 0.0, then the next row values gets set to integer also. Used sed to circumvent during load prep, but how can this be handled otherwise?

Steps to reproduce the problem

Load CSV file with 0 in first line (below title-row) and on next line, i.e. below the column with 0, a float value, e.g. 2.82942 will become 2

@lvca lvca added the bug label Oct 25, 2016
@robfrank
Copy link
Contributor

If you know the types of the columns, you can configure the CSV extractor:
http://orientdb.com/docs/last/Extractor.html#csv-extractor

{ "csv": 
    {  "predefinedFormat": "DEFAULT",
       "nullValue" : "N/A",
       "dateFormat" : "dd-MM-yyyy",
       "dateTimeFormat" : "dd-MM-yyyy HH:mm",
       "columns": ["name:string","createdAt:date","updatedAt:dateTime","value:long"]
    }
}

@ghost
Copy link
Author

ghost commented Oct 28, 2016

Thanks robfrank, you are absolutely right!
However, there is a snag here; the nature of the data makes it impractical to specify each column and add the type. Setting up ETL jobs in such detail is not practical in the use cases here.
In my mind the ETL process should auto-detect float and perhaps revert that property to float.
Thanks anyway!

@ghost ghost closed this as completed Oct 28, 2016
@robfrank
Copy link
Contributor

robfrank commented Oct 28, 2016

I setup a test for this case, configuration is:

{source: { content: { value: 'firstNumber
10
"10,78"'}  }, extractor : { csv: {} }, loader: { test: {} } }

Which represents a csv like that:

firstNumber
10
10.78

So, the first line is "casted" to integer, the second to float. I guess you have configured ETL to create class properties setting dbAutoCreateProperties on the loader to true. So, it will map the first line to integer and will create the property in the DB as integer. Then on the second line the CSV extractor parse 10.78 to float, but it is stored in the db as integer.

Your suggestion could be dangerous.
Suppose you have multiple lines, some lines with decimals and some without decimals?
What is the intention of the user? Should the parser read all as integers? Or as floats? Or maybe the user want these values to be parsed as String.

The ETL tries to guess types in the best effort way, but if you need precision and you know your data, configuring columns is be the best way to do it.
It gives more control to you with dirty data sets.

ETL in general is a dirty work :)

@ghost
Copy link
Author

ghost commented Nov 4, 2016

Sorry for my late response, but I beg to differ. The suggestion is not dangerous. We talk about column by column data. If someone mixes integers, strings and floats in the very same column, i.e. property, he or she got a problem. That is not what I am talking about. Well, yes ETL is dirty, by working with instruments(Mass Cytometry) in cancer research, which produces millions of data points, this approach could make a difference. Please also take a look at my dryrun suggestion in #6872

@ghost ghost reopened this Nov 4, 2016
@robfrank
Copy link
Contributor

robfrank commented Nov 9, 2016

Configuring ETL with data types will speed-up the ingestion process because ETL doesn't need to guess the data type for every line. Changing data types on the go implies a priority schema for types: is float more important than integer? Maybe for your use case. Maybe for another user integer should be preferred.

1 similar comment
@robfrank
Copy link
Contributor

robfrank commented Nov 9, 2016

Configuring ETL with data types will speed-up the ingestion process because ETL doesn't need to guess the data type for every line. Changing data types on the go implies a priority schema for types: is float more important than integer? Maybe for your use case. Maybe for another user integer should be preferred.

@ghost
Copy link
Author

ghost commented Nov 9, 2016

Again I disagree. OrientDB is marketed as a NoSQL database, schema less, schema full, hybrid, what have you. That is actually the beauty of it, great flexibility.
You are voicing in an opposite direction. Most scientific data will contain different number of parameters
and types, depending on the nature of the study/project in question.
Hence manually (or auto generating) ETL/JSON file in such detail before each ETL job
will be impractical/expensive and error prone.

I find the talk about priority hard to understand. If one measurement, from one property(column) is 0, and the next row, same column/property, is 1.274284, priority is irrelevant. That property must be set to float, not integer as it is today. Current default behavior is a bug.

You also argued that ETL is dirty work. Well, it doesn't have to be that way. You probably don't want to,
but effectively you make things harder.
There is no way anyone will enter scientific data from the keyboard, hence ETL is a critical path to success for OrientDB, or any database for that matter. Bioinformatics is definitely on the rise and the
volumes do not get smaller. Please reconsider.

@santo-it
Copy link

Hi @austx,

Many thanks for opening this issue and for your feedback

I have discussed internally with @robfrank and we are considering to support the following additional case:

  • schema less (so no properties defined)
  • no cast (the values will be imported as they are)

This will handle the case of the following situation: a column (property) where you have different data types, e.g.

0
2.8
"hello"

will be imported in OrientDB in schema-less mode, with no errors, thus giving the users lot of flexibility

In your specific case, you will have some integers and some decimal values (and no cast will happen), e.g. you fill find in the database the following values:

0
2.82942

Note that after the ETL is complete, you can still, if you want, create a DECIMAL property on this column

I believe that with this additional case we are considering to handle we will give more flexibility to our users. Obviously, one may still decide to define the data type before the import, and a cast will be done

Warning: when using schema-less with no checks on the data type, users must know what they are doing (as they may end up with having strings, integers, decimals and other types on the same property. Sometime they may want right this - and some other times lack of schema will prevent the user to find that that "string" is a wrong value. I believe this is kind of compromise we have to accept to have maximum flexibility)

Will it be a good compromise for you?

Many thanks,

@ghost
Copy link
Author

ghost commented Nov 21, 2016

Hi santo-it et al.,

thanks for getting back to me. I use schema less when loading new data from
mass cytometry.
I have included an example(converted to csv). I use a shell script when
preparing data for ETL,
so this is the raw version before my prep. The prep does several
operations, one is substituting
0 by 0.0.
Please have a go at the included file and consider the options.

I could appreciate the fact that in some odd cases strings, e.g. "hello"
would end up in a number column.
Personally I believe this should ring an alarm, but then again, there is a
limit to what OrientDB/ETL should
be required to perform.

Also, when it comes to bad data, the debug flag ERROR could perhaps be a
bit more verbose, what about
switching on some of the DEBUG features when an error in the data is
detected?

Thanks for your understanding, very much appreciated.

On 21 November 2016 at 12:04, santo-it notifications@github.com wrote:

Hi @austx https://github.com/austx,

Many thanks for opening this issue and for your feedback

I have discussed internally with @robfrank https://github.com/robfrank
and we are considering to support the following additional case:

  • schema less (so no properties defined)
  • no cast (the values will be imported as they are)

This will handle the case of the following situation: a column (property)
where you have different data types, e.g.

0
2.8
"hello"

will be imported in OrientDB in schema-less mode, with no errors, thus
giving the users lot of flexibility

In your specific case, you will have some integers and some decimal values
(and no cast will happen), e.g. you fill find in the database the following
values:

0
2.82942

Note that after the ETL is complete, you can still, if you want, create a
DECIMAL property on this column

I believe that with this additional case we are considering to handle we
will give more flexibility to our users. Obviously, one may still decide to
define the data type before the import, and a cast will be done

Warning: when using schema-less with no checks on the data type, users
must know what they are doing (as they may end up with having strings,
integers, decimals and other types on the same property. Sometime they may
want right this - and some other times lack of schema will prevent the user
to find that that "string" is a wrong value. I believe this is kind of
compromise we have to accept to have maximum flexibility)

Will it be a good compromise for you?

Many thanks,


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#6839 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ASkWeOmg1S8XzVK9Yxu6NZKAcrkwAb-aks5rAXrTgaJpZM4KfOoq
.

Tore Austrått
phone +47 90657231

@ghost ghost closed this as completed Apr 10, 2017
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

4 participants