ETL giving up after one error #6872

Closed
4 tasks
ghost opened this issue Nov 2, 2016 · 14 comments

ghost commented Nov 2, 2016

OrientDB Version, operating system, or hardware.

  • v2.0 SNAPSHOT[ ] - .18[ ] .17[ ] .16[ ] .15[ ] .14[ ] .13[ ] .12[ ] .11[ ] .10[ ] .9[ ] .8[ ] .7[ ] .6[ ] .5[ ] .4[ ] .3[ ] .2[ ] .1[ ] .0[ ]
  • v2.1 SNAPSHOT[ ] - .16[ ] .15[ ] .14[ ] .13[ ] .12[ ] .11[ ] .10[ ] .9[ ] .8[ ] .7[ ] .6[ ] .5[ ] .4[ ] .3[ ] .2[ ] .1[ ] .0[ ]
  • v2.2 SNAPSHOT[ 12] - .rc1[ ] .beta2[ ] .beta1[ ]

Operating System

  • [x] Linux
  • MacOSX
  • Windows
  • Other Unix
  • Other, name?

Error in Pipeline execution: com.orientechnologies.orient.core.exception.OCommandExecutionException: No edge has been created because no target vertices
Why give up (see the log below)? Can't the process just report the problem and continue? With millions of edges in question, this behavior is inefficient and wastes a lot of time. In the other ETL transformers you can specify "unresolvedLinkAction". Why not add this (or a similar feature) to the Command Transformer as well?

Expected behavior and actual behavior

Arguments are taken from a three-column CSV file (arg1,arg2,arg3) to create edges with one property.

The essence of the actual ETL json config is this:
"command" :
create edge from (select from ClassX where $input.arg1..) to (select from ClassX where $input.arg2..) set something=$input.arg3

Here is what happens if one target vertex is not found:

excerpt from logging:

  • extracted 7,996 rows (5 rows/sec) - 7,996 rows -> loaded 7,493 vertices (5 vertices/sec) Total time: 1662154ms [0 warnings, 0 errors]
  • extracted 8,000 rows (4 rows/sec) - 8,000 rows -> loaded 7,497 vertices (4 vertices/sec) Total time: 1663154ms [0 warnings, 0 errors]
  • extracted 8,004 rows (4 rows/sec) - 8,004 rows -> loaded 7,501 vertices (4 vertices/sec) Total time: 1664154ms [0 warnings, 0 errors]
  • extracted 8,008 rows (4 rows/sec) - 8,008 rows -> loaded 7,505 vertices (4 vertices/sec) Total time: 1665154ms [0 warnings, 0 errors]
    Error in Pipeline execution: com.orientechnologies.orient.core.exception.OCommandExecutionException: No edge has been created because no target vertices
    DB name="XXXX"
    DB name="XXXX"
  • extracted 8,010 rows (1 rows/sec) - 8,010 rows -> loaded 7,507 vertices (1 vertices/sec) Total time: 1666155ms [0 warnings, 1 errors]
  • extracted 8,010 rows (0 rows/sec) - 8,010 rows -> loaded 7,507 vertices (0 vertices/sec) Total time: 1667155ms [0 warnings, 1 errors]
  • extracted 8,010 rows (0 rows/sec) - 8,010 rows -> loaded 7,507 vertices (0 vertices/sec) Total time: 1668155ms [0 warnings, 1 errors]
  • extracted 8,010 rows (0 rows/sec) - 8,010 rows -> loaded 7,507 vertices (0 vertices/sec) Total time: 1669155ms [0 warnings, 1 errors]
  • extracted 8,010 rows (0 rows/sec) - 8,010 rows -> loaded 7,507 vertices (0 vertices/sec) Total time: 1670156ms [0 warnings, 1 errors]

Steps to reproduce the problem

You need a class containing a few thousand records and a list of the vertices to be linked (see above).

@robfrank robfrank self-assigned this Nov 2, 2016
@robfrank robfrank added the bug label Nov 2, 2016
Contributor

robfrank commented Nov 2, 2016

Can you provide me the configuration and sample data?

Author

ghost commented Nov 2, 2016

robfrank,

Linking items from the same class.

You need to edit the filename in the JSON and create your own config for the initial data to be linked.
The linking will fail once it hits a col2 item that does not exist in the initially loaded data.
You need to edit/create your own JSON to load the classfile.

etlsamples.tar.gz

Contributor

robfrank commented Nov 3, 2016

Sorry, but I don't understand how to use your samples. The config json seems truncated.
Could you please write down the entire procedure and provide a working configuration?

Author

ghost commented Nov 3, 2016

The json is deliberately truncated. I expect you to fill in the remaining stuff yourself, but here it is, passwords removed.

  1. Run ETL to load the classfile (the master data, if you like).
    {
      "config": {
        "parallel": false,
        "log": "INFO"
      },
      "source": { "file": { "path": "/path-to-/" } },
      "extractor": { "csv": { "separator": "," } },
      "transformers": [
        { "vertex": { "class": "HumanProtein" } }
      ],
      "loader": {
        "orientdb": {
          "dbType": "graph",
          "dbURL": "remote:localhost/XXXXX",
          "dbUser": "admin",
          "dbPassword": "admin",
          "serverUser": "root",
          "serverPassword": "********",
          "dbAutoCreate": false,
          "wal": false,
          "dbAutoCreateProperties": true,
          "classes": [
            { "name": "HumanProtein", "extends": "V" }
          ],
          "tx": false,
          "batchCommit": 1600
        }
      }
    }
  2. Run ETL with the JSON below.

    {
      "config": {
        "parallel": false,
        "log": "DEBUG"
      },
      "source": { "file": { "path": "/path-to-/" } },
      "extractor": {
        "row": {}
      },
      "transformers": [
        {
          "csv": {
            "separator": ","
          }
        },
        {
          "command": {
            "command": "create edge InteractWith from (select from HumanProtein where STRINGproteinId= '${input.protein1}') to (select from HumanProtein where STRINGproteinId= '${input.protein2}') set Score=${input.combined_score}",
            "output": "edge"
          }
        }
      ],
      "loader": {
        "orientdb": {
          "dbType": "graph",
          "dbURL": "remote:localhost/XXXXX",
          "dbUser": "admin",
          "dbPassword": "admin",
          "serverUser": "root",
          "serverPassword": "*****",
          "dbAutoCreate": false,
          "wal": false,
          "dbAutoCreateProperties": true,
          "classes": [
            { "name": "HumanProtein", "extends": "V" },
            { "name": "InteractWith", "extends": "E" }
          ],
          "tx": false
        }
      }
    }

  3. Edit the link data so that one of the items in column 2 does not exist in the HumanProtein class loaded in step 1, for example by changing one of the ENSPxxxx codes in column 2.
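
The edit to the link data could also be scripted rather than done by hand. This is only a minimal sketch: it assumes the link file is comma-separated with a header row naming the columns protein1, protein2 and combined_score (the names used in the command above). The helper name `break_one_link`, the placeholder ID `ENSP_MISSING`, and the sample row are made up for illustration.

```python
import csv
import io

def break_one_link(link_csv_text, row_index, bogus_id, column="protein2"):
    """Return the CSV text with one value in `column` replaced by `bogus_id`."""
    reader = csv.DictReader(io.StringIO(link_csv_text))
    rows = list(reader)
    # Overwrite a single cell so that this row points at a vertex that does not exist.
    rows[row_index][column] = bogus_id
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Made-up sample data; a real link file would be read from disk instead.
sample = (
    "protein1,protein2,combined_score\n"
    "ENSP00000000233,ENSP00000000412,150\n"
)
broken = break_one_link(sample, row_index=0, bogus_id="ENSP_MISSING")
print(broken)
```

Any ID that is guaranteed to be absent from the class file works as the placeholder.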

Hope this helps.

Contributor

robfrank commented Nov 3, 2016

Just to be sure, so that I don't waste my time:

  • first run using classfile.txt as input
  • second run using linkfile.txt as input, BUT I need to modify the file

Author

ghost commented Nov 3, 2016

That's correct. The idea is to make sure the linking fails. Do you understand the ETL concept?


Author

ghost commented Nov 4, 2016

The ETL process fails the same way, even with other transformers. I believe ODB should consider changing ETL to include a "dryrun".
'Dryrun' should just do basic checking while logging bad data rows with line numbers.
With jobs loading numerous CSV files, data glitches are inevitable, and cleaning up after a failed ETL is not an easy task.

{
"config": {
"log": "dryrun",

Thank you.

Contributor

robfrank commented Nov 4, 2016

Hi, I'll take care of this issue next week. Stay tuned.

Contributor

robfrank commented Nov 9, 2016

A silly question, did you try with the flag haltOnError set to false?
http://orientdb.com/docs/last/Configuration-File.html#configuration-variables

Author

ghost commented Nov 9, 2016

This is not a silly question. I did not use it and don't know if it works, so I have to test it. The question is whether it gives enough sensible information in the log to be useful. Running ETL with DEBUG leaves gigabytes of useless repeated data when loading millions of rows.
So, in essence, I maintain my request for a 'dry run'.

Author

ghost commented Nov 9, 2016

Unfortunately, setting haltOnError to false does not help.
Neither INFO nor ERROR helped me locate the error in question. That only leaves DEBUG (which makes too much noise).
By the way, why do you repeat the name of the database twice when things go wrong?

robfrank added a commit that referenced this issue Dec 28, 2016
- removes system out in favor of OLogManager
- adds specialized logging config file to etl scripts (bat/sh)
- renames classes adding OETL prefix to all
- refactors tests to use one in memory db for each test method

refs #6872
robfrank added a commit that referenced this issue Dec 28, 2016
robfrank added a commit that referenced this issue Dec 28, 2016
@robfrank
Contributor

Logging in case of errors is improved: it now includes the input data, the executed command and the exception message:

[445:command] ERROR exception=No edge has been created because no target vertices
	DB name="graph" - input={protein1:ENSP00000000233,protein2:ENSP00000270115,combined_score:204} - command=sql.create edge InteractWith from (select from HumanProtein where STRINGproteinId= 'ENSP00000000233') to (select from HumanProtein where STRINGproteinId='ENSP00000270115') set Score=204
Error in Pipeline execution: com.orientechnologies.orient.core.exception.OCommandExecutionException: No edge has been created because no target vertices
	DB name="graph"

WDYT?

I don't have any idea how to implement a dry run. How would a "dry run" interaction with the database work? My suggestion is to use a plocal or even a memory connection to test the ETL process.
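
Short of a built-in dry run, one possible workaround is to check the link file against the class file outside of ETL, before loading anything. The sketch below is not part of OrientDB. It assumes both files are comma-separated with a header row, that the class file has a column named STRINGproteinId, and that the link file uses protein1 and protein2 (the names that appear in the command above). The function name `find_unresolved_links` and the in-memory sample rows are made up for illustration.

```python
import csv
import io

def find_unresolved_links(class_rows, link_rows, key="STRINGproteinId",
                          source_col="protein1", target_col="protein2"):
    """Return (line_number, column, value) for each link endpoint missing from the class data."""
    known = {row[key] for row in class_rows}
    problems = []
    # Line 1 is the CSV header, so data rows start at line 2.
    for line_no, row in enumerate(link_rows, start=2):
        for col in (source_col, target_col):
            if row[col] not in known:
                problems.append((line_no, col, row[col]))
    return problems

# Made-up in-memory stand-ins for the class file and the link file.
class_csv = io.StringIO("STRINGproteinId\nENSP00000000233\nENSP00000000412\n")
link_csv = io.StringIO(
    "protein1,protein2,combined_score\n"
    "ENSP00000000233,ENSP00000000412,150\n"
    "ENSP00000000233,ENSP00000270115,204\n"
)

problems = find_unresolved_links(csv.DictReader(class_csv), csv.DictReader(link_csv))
for line_no, col, value in problems:
    print(f"line {line_no}: {col}={value} not found in class data")
```

This only catches missing endpoints. It does not exercise the database, so other failure modes would still surface during the real load.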

@robfrank robfrank added this to the 2.2.x (next hotfix) milestone Dec 28, 2016
robfrank added a commit to orientechnologies/orientdb-docs that referenced this issue Dec 28, 2016
Author

ghost commented Dec 28, 2016 via email

@robfrank
Contributor

When the error is detected. This fix will be part of 2.2.15, but you can download a snapshot version of etl and use it.
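
With the input row included in each error entry, failed rows could be collected from the log for later repair. The following is a rough sketch based only on the single sample entry shown earlier in this thread. The regular expression, the function name `parse_error`, and the reading of the leading number as a row counter are assumptions, not a documented log format.

```python
import re

# Pattern inferred from the one sample entry in this thread (an assumption).
ERROR_ENTRY = re.compile(
    r"\[(?P<row>\d+):(?P<step>\w+)\] ERROR exception=(?P<exception>.*?)"
    r" - input=\{(?P<input>.*?)\} - command=(?P<command>.*)",
    re.DOTALL,  # the exception message spans two lines in the sample
)

def parse_error(entry):
    """Split one error entry into its row number, step, input fields and command."""
    match = ERROR_ENTRY.search(entry)
    if match is None:
        return None
    # Naive split: assumes input values contain no commas or extra colons in keys.
    fields = dict(pair.split(":", 1) for pair in match.group("input").split(","))
    return {
        "row": int(match.group("row")),
        "step": match.group("step"),
        "input": fields,
        "command": match.group("command").strip(),
    }

entry = (
    "[445:command] ERROR exception=No edge has been created because no target vertices\n"
    "\tDB name=\"graph\" - input={protein1:ENSP00000000233,protein2:ENSP00000270115,"
    "combined_score:204} - command=sql.create edge InteractWith from "
    "(select from HumanProtein where STRINGproteinId= 'ENSP00000000233') to "
    "(select from HumanProtein where STRINGproteinId='ENSP00000270115') set Score=204"
)

parsed = parse_error(entry)
print(parsed["row"], parsed["input"])
```

The pattern would need adjusting if the actual log layout differs from the sample.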
