JDBC Connection to SPARQL Endpoint Randomly Throws 404 Error #565

Closed
OpenDataAlex opened this issue Jul 18, 2016 · 11 comments

@OpenDataAlex

I've written a process using Pentaho Data Integration (6.0) and the Jena JDBC driver (3.0.0) that connects to the SPARQL endpoint of Virtuoso version 7.20.3215. The process normalizes data into triples and then substitutes each triple into one of two INSERT statements (see the sample queries below). At random points while loading the triples, I receive the following error:

Error occurred during SPARQL update evaluation
at org.pentaho.di.core.database.Database.execStatement(Database.java:1506)
at org.pentaho.di.core.database.Database.execStatements(Database.java:1606)
at org.pentaho.di.trans.steps.sql.ExecSQL.processRow(ExecSQL.java:210)
... 2 more
Caused by: java.sql.SQLException: Error occurred during SPARQL update evaluation
at org.apache.jena.jdbc.statements.JenaStatement.executeUpdate(JenaStatement.java:478)
at org.apache.jena.jdbc.statements.JenaStatement.execute(JenaStatement.java:284)
at org.pentaho.di.core.database.Database.execStatement(Database.java:1480)
... 4 more
Caused by: org.apache.jena.atlas.web.HttpException: 404 - File not found
at org.apache.jena.riot.web.HttpOp.exec(HttpOp.java:1107)
at org.apache.jena.riot.web.HttpOp.execHttpPost(HttpOp.java:718)
at org.apache.jena.riot.web.HttpOp.execHttpPost(HttpOp.java:548)
at org.apache.jena.riot.web.HttpOp.execHttpPost(HttpOp.java:518)
at org.apache.jena.sparql.modify.UpdateProcessRemote.execute(UpdateProcessRemote.java:79)
at org.apache.jena.jdbc.statements.JenaStatement.executeUpdate(JenaStatement.java:457)

Sample queries:

INSERT
{
     GRAPH  <graph-address> { ?newSubject ? ?}
}
WHERE
{
     BIND(URI(<uri-string>) AS ?newSubject) .
}

AND

INSERT
{
     GRAPH  <graph-address> { ?newSubject ? ?newObject}
}
WHERE
{
     BIND(URI(<uri-string>) AS ?newSubject) .
     BIND(URI(<uri-string>) AS ?newObject) .
}
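
For readers following along, here is a minimal sketch (not the actual PDI transformation) of how a single triple could be substituted into the two templates above. It assumes the `?` slots hold an IRI predicate and, in the first template, a plain string literal object. The function names `escape_literal` and `build_insert` and the example.org URIs are hypothetical.

```python
def escape_literal(text: str) -> str:
    """Escape characters that would break a double-quoted SPARQL string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def build_insert(graph: str, subject: str, predicate: str, obj: str,
                 object_is_uri: bool) -> str:
    """Render one triple as an INSERT shaped like the sample queries.

    Assumption: the predicate is an IRI, and the object is either a URI
    (second template) or a plain string literal (first template).
    """
    binds = [f'BIND(URI("{escape_literal(subject)}") AS ?newSubject) .']
    if object_is_uri:
        binds.append(f'BIND(URI("{escape_literal(obj)}") AS ?newObject) .')
        object_term = "?newObject"
    else:
        object_term = f'"{escape_literal(obj)}"'
    where = "\n".join("     " + bind for bind in binds)
    return (
        "INSERT\n{\n"
        f"     GRAPH <{graph}> {{ ?newSubject <{predicate}> {object_term} }}\n"
        "}\nWHERE\n{\n"
        f"{where}\n"
        "}"
    )


if __name__ == "__main__":
    print(build_insert(
        "http://example.org/graph",
        "http://example.org/subject/1",
        "http://example.org/predicate/name",
        'A "quoted" value',
        object_is_uri=False,
    ))
```

Escaping the literal before substitution matters here because each triple is sent as its own update, so one unescaped quote would produce a malformed request rather than a 404.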
@HughWilliams
Collaborator

Do you have any pseudocode showing how the Jena JDBC driver (3.0.0) actually specifies its connection to the Virtuoso SPARQL endpoint? We need to be clear about whether it is connecting via the /sparql HTTP interface or via the JDBC interface.

How often do these random errors occur, and are any messages written to the "virtuoso.log" file when they occur?

@OpenDataAlex
Author

OpenDataAlex commented Jul 21, 2016

Sure thing:

jdbc:jena:remote:query=http://<virtuoso-instance>/sparql&update=http://<virtuoso-instance>/sparql

They happen whenever I try to load the dataset through the SPARQL endpoint, usually within the first hour or two. The queries that are running when the 404 error occurs are syntactically correct. I verified this by running the same query directly in the Virtuoso Conductor's SPARQL editor.

Here is the Virtuoso log covering the time when PDI reported the error above. The PDI process was kicked off on 7/13 at 13:33 and hit the 404 error on 7/13 at 14:20. I've included all of Virtuoso's 7/13 log below, which includes a manual restart on my part in an attempt at troubleshooting.

            Wed Jul 13 2016
00:43:17 Checkpoint started
00:43:17 Checkpoint finished, log reused
01:43:18 Checkpoint started
01:43:18 Checkpoint finished, log reused
02:43:18 Checkpoint started
02:43:18 Checkpoint finished, log reused
03:43:19 Checkpoint started
03:43:19 Checkpoint finished, log reused
04:43:19 Checkpoint started
04:43:19 Checkpoint finished, log reused
05:43:20 Checkpoint started
05:43:20 Checkpoint finished, log reused
06:43:20 Checkpoint started
06:43:20 Checkpoint finished, log reused
07:43:20 Checkpoint started
07:43:21 Checkpoint finished, log reused
08:43:21 Checkpoint started
08:43:21 Checkpoint finished, log reused
09:43:21 Checkpoint started
09:43:21 Checkpoint finished, log reused
10:43:22 Checkpoint started
10:43:22 Checkpoint finished, log reused
11:43:22 Checkpoint started
11:43:22 Checkpoint finished, log reused
12:43:23 Checkpoint started
12:43:23 Checkpoint finished, log reused
13:43:23 Checkpoint started
13:43:23 Checkpoint finished, log reused
14:43:24 Checkpoint started
14:43:24 Checkpoint finished, log reused
15:43:24 Checkpoint started
15:43:24 Checkpoint finished, log reused
16:43:24 Checkpoint started
16:43:24 Checkpoint finished, log reused
17:25:23 Server received signal 15
17:25:23 Initiating quick shutdown
17:25:23 Server shutdown complete

            Wed Jul 13 2016

17:25:46 { Loading plugin 1: Type `plain', file `wikiv' in `/opt/virtuoso-opensource/lib/virtuoso/hosting'
17:25:46   FAILED  plugin 1: Unable to locate file }
17:25:46 { Loading plugin 2: Type `plain', file `mediawiki' in `/opt/virtuoso-opensource/lib/virtuoso/hosting'
17:25:46   FAILED  plugin 2: Unable to locate file }
17:25:46 { Loading plugin 3: Type `plain', file `creolewiki' in `/opt/virtuoso-opensource/lib/virtuoso/hosting'
17:25:46   FAILED  plugin 3: Unable to locate file }
17:25:46 OpenLink Virtuoso Universal Server
17:25:46 Version 07.20.3215-pthreads for Linux as of Feb 26 2016
17:25:46 uses parts of OpenSSL, PCRE, Html Tidy
17:25:48 Database version 3126
17:25:48 SQL Optimizer enabled (max 1000 layouts)
17:25:49 Compiler unit is timed at 0.000145 msec
17:25:51 Roll forward started
17:25:51     71 transactions, 8917 bytes replayed (100 %)
17:25:51 Roll forward complete
17:25:51 Checkpoint started
17:25:51 Checkpoint finished, log reused
17:25:51 HTTP/WebDAV server online at 8890
17:25:51 Server online at 1111 (pid 53229)
18:25:52 Checkpoint started
18:25:52 Checkpoint finished, log reused
19:25:54 Checkpoint started
19:25:54 Checkpoint finished, log reused
20:25:54 Checkpoint started
20:25:54 Checkpoint finished, log reused
21:25:55 Checkpoint started
21:25:55 Checkpoint finished, log reused
22:25:55 Checkpoint started
22:25:55 Checkpoint finished, log reused
23:25:56 Checkpoint started
23:25:56 Checkpoint finished, log reused

@HughWilliams
Collaborator

HughWilliams commented Jul 24, 2016

@OpenDataAlex: jdbc:jena:remote:query=http://<virtuoso-instance>/sparql&update=http://<virtuoso-instance>/sparql doesn't really tell me much about how the updates are being performed, other than that you seem to be connecting to the /sparql endpoint. How is the actual data being passed for insertion?

Also, you previously indicated that you have not granted the SPARQL_UPDATE role to the /sparql endpoint, which has SPARQL_SELECT (i.e., read-only) access by default, so no update operations should be allowed through it?

@openlink openlink assigned openlink and smalinin and unassigned openlink Jul 25, 2016
@OpenDataAlex
Author

OpenDataAlex commented Jul 25, 2016

@HughWilliams The connection is opened by PDI, and data is passed one triple at a time through the templates in the original post. I'm using the dba account since this is just a test/proof of concept of the process. I haven't changed any permissions, so the dba account has whatever the defaults are.

@HughWilliams
Collaborator

HughWilliams commented Jul 28, 2016

@OpenDataAlex: It is still unclear to me how you are authenticating and performing SPARQL update operations against the default http://<virtuoso-instance>/sparql endpoint, which is read-only and has no authentication. Unless you have specifically changed it to allow SPARQL_UPDATE operations and to authenticate, as said previously, I would not expect any update/insert/delete operations to be allowed. The Virtuoso /sparql-auth endpoint is the default to use for SPARQL update, and it does challenge for authentication, as detailed in SPARQL Web Services & APIs.

I assume this is the Jena JDBC driver being used and configured for remote access to a SPARQL endpoint?

So in conclusion, for a default Virtuoso installation, based on the connection string you provided above, I would expect it to be something more like:

jdbc:jena:remote:query=http://<virtuoso-instance>/sparql-auth&update=http://<virtuoso-instance>/sparql-auth
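
As a small illustration of the difference between the two connection strings discussed here, the following sketch assembles the URL from a host and an endpoint path. The helper name `jena_remote_url` is hypothetical, and `localhost:8890` is only an example host (8890 is the HTTP port shown in the log above); authentication settings are deliberately left out.

```python
def jena_remote_url(host: str, endpoint: str = "/sparql-auth") -> str:
    """Build a Jena JDBC remote URL that uses one endpoint for query and update.

    Assumption: both the query and update parameters point at the same
    Virtuoso endpoint, as in the connection strings quoted in this thread.
    """
    if not endpoint.startswith("/"):
        raise ValueError("endpoint must start with '/'")
    base = f"http://{host}{endpoint}"
    return f"jdbc:jena:remote:query={base}&update={base}"


if __name__ == "__main__":
    # Default (authenticated) endpoint versus the read-only default endpoint.
    print(jena_remote_url("localhost:8890"))
    print(jena_remote_url("localhost:8890", "/sparql"))
```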

@OpenDataAlex
Author

@HughWilliams I appreciate that, but I don't know what to tell you. I use the connection string mentioned earlier, and inserts, deletes, and selects all work fine (at least until the 404 hits). This could be a result of the AWS image that's available, but I'm not sure. In any case, all I have done to get PDI talking to the Virtuoso instance is add the Jena JDBC driver and give it the connection string from my previous comments. I'm happy to provide more information; just let me know what you need.

In the meantime, I'll try switching to the connection string you recommended and will let you know if there is a change in performance.

@HughWilliams
Collaborator

HughWilliams commented Jul 31, 2016

@OpenDataAlex: Note that we have this post on the virtuoso-users mailing list, which implies that 404 errors are seen when operations are performed against the Virtuoso SPARQL endpoint during a checkpoint. I wonder if this might be what you are experiencing as well?

Can you check the CheckpointInterval setting in your virtuoso.ini, set it to 0 to turn off checkpointing, and then see whether the 404 errors still occur?

A manual checkpoint can be performed to commit transactions to the database. Alternatively, the checkpoint_interval(-1) command can be used to turn off checkpointing while performing the update operations, after which it can be turned back on, as detailed in the checkpoint_interval() function documentation.
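
A rough sketch of the virtuoso.ini change, in case it helps to script it: the snippet below rewrites the CheckpointInterval value in an INI text. It assumes the setting lives in the [Parameters] section and that the file parses cleanly with Python's `configparser`; the function name `set_checkpoint_interval` and the sample INI content are hypothetical.

```python
import configparser
import io


def set_checkpoint_interval(ini_text: str, minutes: int) -> str:
    """Return ini_text with [Parameters] CheckpointInterval set to `minutes`.

    Note: configparser discards comments when writing, so keep a backup of
    the original file before replacing it with this output.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve key case, e.g. "CheckpointInterval"
    parser.read_string(ini_text)
    parser["Parameters"]["CheckpointInterval"] = str(minutes)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    sample = "[Parameters]\nServerPort = 1111\nCheckpointInterval = 60\n"
    print(set_checkpoint_interval(sample, 0))
```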

@OpenDataAlex
Author

OpenDataAlex commented Sep 9, 2016

@HughWilliams That did the trick for the 404 errors, but the load then ran out of memory.

The CheckpointInterval was set to 60, and changing it to 0 allowed 10 million triples to load before hitting memory limits.

@HughWilliams
Collaborator

HughWilliams commented Sep 12, 2016

@OpenDataAlex: What is running out of memory? Is a specific error being reported by Virtuoso on the client or server (log) side?

silvae86 added a commit to feup-infolab/dendro that referenced this issue Oct 6, 2017
…ntInterval = 0 parameter would fix the 404 errors as per the Issue on virtuoso Github: openlink/virtuoso-opensource#565
@TallTed
Collaborator

TallTed commented Jul 22, 2019

@OpenDataAlex - Were you able to resolve the "ran out of memory" error you mentioned? We should probably address that in a fresh issue, if not. In either case, would you consider this issue (#565) resolved?

@OpenDataAlex
Author

We were never able to resolve it, and unfortunately I've had to move on to other projects since this issue was first noted. I believe the issue boils down to configuration and environment: we were trying to load more data into memory than was available. We can consider the issue closed, but if I ever run into it again I'll reference this one and create a new issue.
