This repository has been archived by the owner on Oct 30, 2018. It is now read-only.

Search lucene refactoring #1

Merged: 51 commits, Aug 1, 2014


@butonic butonic commented Feb 11, 2014

This is owncloud-archive/apps#1624 moved to this repo. It does not change database.xml, but it changes the way background jobs are used to index new files and update changed files.

It also starts adding unit tests and an initial README.md. I will enable Travis testing and add more tests in a future PR.

@DeepDiver1975 @karlitschek @kabum see commit messages for detailed changes.

@alexboss

Thanks @butonic for the new release. I installed it on my server (ownCloud 6.0.1, packaged for Debian), and here is my feedback.

First of all, as a reminder: because I previously had problems with the cron process for indexing, I switched it to the Ajax method.

So after deploying the latest version and accessing my owncloud (with 2 different users), here is what I got in the owncloud.log:

{"app":"PHP","message":"Argument 1 passed to OCA\\Search_Lucene\\Hooks::doIndexFile() must be of the type array, string given at \/var\/www\/owncloud\/apps\/search_lucene\/lib\/hooks.php#125","level":2,"time":"2014-02-11T20:28:32+00:00"}
{"app":"search_lucene","message":"background job optimizing index for xxx.xxx","level":0,"time":"2014-02-11T20:30:01+00:00"}
{"app":"search_lucene","message":"deleting deprecated index document for 2:\/LiberKey\/LiberKeyTools\/LiberKeyMenu\/data\/Menu\/bak\/recent_2014-01-30_10-59-09.xml","level":0,"time":"2014-02-11T20:30:02+00:00"}
[..]
{"app":"search_lucene","message":"deleting deprecated index document for 196:\/LiberKey\/Apps\/Chromium\/App\/Chromium\/debug.log","level":0,"time":"2014-02-11T20:30:03+00:00"}
{"app":"search_lucene","message":"optimizing index","level":0,"time":"2014-02-11T20:30:03+00:00"}

After browsing a bit on the ownCloud interface:

{"app":"search_lucene","message":"optimize job did not receive user in arguments: {\"user\":false}","level":0,"time":"2014-02-11T20:32:14+00:00"}
{"app":"search_lucene","message":"background job indexing 0 files for xxx.xxx","level":0,"time":"2014-02-11T20:44:36+00:00"}
{"app":"search_lucene","message":"background job indexing 0 files for xxx.xxx","level":0,"time":"2014-02-11T20:46:18+00:00"}
{"app":"search_lucene","message":"background job indexing 0 files for xxx.xxx","level":0,"time":"2014-02-11T20:48:07+00:00"}

Or doing the same with another user:

{"app":"search_lucene","message":"background job optimizing index for yyy.yyy","level":0,"time":"2014-02-11T20:48:19+00:00"}
{"app":"search_lucene","message":"deleting deprecated index document for 0:\/lorem.txt","level":0,"time":"2014-02-11T20:48:19+00:00"}
{"app":"search_lucene","message":"optimizing index","level":0,"time":"2014-02-11T20:48:19+00:00"}
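For anyone reproducing this: the owncloud.log excerpts above are one JSON object per line, so they can be filtered by app and level with a few lines of Python. This is a minimal sketch, not part of the PR. The function name `filter_log` and its parameters are made up for illustration, and the sample lines are copied from the excerpts above. It assumes that a higher `level` value means a more severe entry, which matches the PHP warning logged at level 2.

```python
import json

# Lines copied from the excerpts above; raw strings keep the JSON escapes intact.
SAMPLE = [
    r'{"app":"PHP","message":"Argument 1 passed to OCA\\Search_Lucene\\Hooks::doIndexFile() must be of the type array, string given at \/var\/www\/owncloud\/apps\/search_lucene\/lib\/hooks.php#125","level":2,"time":"2014-02-11T20:28:32+00:00"}',
    '[..]',
    r'{"app":"search_lucene","message":"optimizing index","level":0,"time":"2014-02-11T20:30:03+00:00"}',
    r'{"app":"search_lucene","message":"optimize job did not receive user in arguments: {\"user\":false}","level":0,"time":"2014-02-11T20:32:14+00:00"}',
]


def filter_log(lines, app=None, min_level=0):
    """Yield parsed log entries that match the app name and minimum level."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and elisions such as "[..]"
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or otherwise malformed lines
        if app is not None and entry.get("app") != app:
            continue
        if entry.get("level", 0) < min_level:
            continue
        yield entry


if __name__ == "__main__":
    # Show only entries at level 2 or above.
    for entry in filter_log(SAMPLE, min_level=2):
        print(entry["time"], entry["app"], entry["message"])
```

To use it on a real log, pass an open file handle instead of `SAMPLE`, for example `filter_log(open("owncloud.log"), app="search_lucene")`.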

The table oc_jobs looks like this:

INSERT INTO `oc_jobs` VALUES(1, 'OC\\Cache\\FileGlobalGC', 'null', 1392152493);
INSERT INTO `oc_jobs` VALUES(2, 'OC\\BackgroundJob\\Legacy\\RegularJob', '["\\\\OC\\\\Files\\\\Cache\\\\BackgroundWatcher","checkNext"]', 1392152493);
INSERT INTO `oc_jobs` VALUES(793, '\\OCA\\Search_Lucene\\OptimizeJob', '{"user":"xxx.xxx"}', 1392150601);
INSERT INTO `oc_jobs` VALUES(794, '\\OCA\\Search_Lucene\\OptimizeJob', '{"user":false}', 1392150734);
INSERT INTO `oc_jobs` VALUES(795, '\\OCA\\Search_Lucene\\IndexJob', '{"user":"xxx.xxx"}', 1392152493);
INSERT INTO `oc_jobs` VALUES(796, '\\OCA\\Search_Lucene\\OptimizeJob', '{"user":"yyy.yyy"}', 
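Row 794 stores `{"user":false}` as its argument, which matches the "optimize job did not receive user in arguments" log message. One plausible cause, stated here as an assumption rather than a diagnosis, is that the job was queued while no user was logged in. The sketch below shows the kind of check that would keep such a row from being acted on. `job_user` is a hypothetical helper, not a function from search_lucene.

```python
import json


def job_user(argument_json):
    """Return the user name stored in a job argument, or None if it is unusable."""
    try:
        args = json.loads(argument_json)
    except (TypeError, ValueError):
        return None  # not a JSON string at all
    if not isinstance(args, dict):
        return None  # e.g. 'null' or a legacy list-style argument
    user = args.get("user")
    # Accept only a non-empty string; false, null, and "" are all rejected.
    return user if isinstance(user, str) and user else None


if __name__ == "__main__":
    for argument in ('{"user":"xxx.xxx"}', '{"user":false}', 'null'):
        print(argument, "->", job_user(argument))
```

The same check could be applied before a job is queued, so that a row like 794 is never written in the first place.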

Table oc_locks is empty.

Table oc_lucene_status has entries like these (all set to N):

INSERT INTO `oc_lucene_status` VALUES(45127, 'N');
[...]
INSERT INTO `oc_lucene_status` VALUES(46578, 'N');

And on the file system:

root@xxx:/var/www/owncloud/data/xxx.xxx/lucene_index# ll
total 16
drwxr-xr-x  2 www-data www-data 4096 Feb 11 21:30 ./
drwxr-xr-x 11 www-data www-data 4096 Feb 13 15:29 ../
-rw-rw-rw-  1 www-data www-data    0 Feb 11 21:30 optimization.lock.file
-rw-rw-rw-  1 www-data www-data    0 Feb 13 15:28 read.lock.file
-rw-rw-rw-  1 www-data www-data    0 Feb 11 21:30 read-lock-processing.lock.file
-rw-rw-rw-  1 www-data www-data   20 Feb 11 21:30 segments_cc
-rw-rw-rw-  1 www-data www-data   20 Feb 11 21:30 segments.gen
-rw-rw-rw-  1 www-data www-data    0 Feb 11 21:30 write.lock.file

The only error I now have in the PHP error log is the one below, but I don't think it is related to the search_lucene app, because it was raised after my first set of tests:

[13-Feb-2014 09:32:16 UTC] PHP Fatal error:  Method OC_L10N_String::__toString() must not throw an exception in /var/www/owncloud/lib/private/defaults.php on line 154

Unfortunately, I have the feeling it's not indexing at the moment. I still have to run additional tests using the test cases I described in #4.

Alexandre

@alexboss

Hello guys, I ran a new test with the test case I described in #4:

1/ Edit through owncloud interface a file sonnets-pour-helene.txt
2/ Check owncloud.log

{"app":"core","message":"Generating preview for \"\/sonnets-pour-helene.txt\" with \"OC\\Preview\\TXT\"","level":0,"time":"2014-02-22T13:39:09+00:00"}
{"app":"core","message":"Generating preview for \"\/sonnets-pour-helene.txt\" with \"OC\\Preview\\Unknown\"","level":0,"time":"2014-02-22T13:39:09+00:00"}
{"app":"core","message":"OC_Image->fixOrientation() No readable file path set.","level":0,"time":"2014-02-22T13:39:09+00:00"}
{"app":"core","message":"OC_Image->fixOrientation() Orientation: -1","level":0,"time":"2014-02-22T13:39:09+00:00"}
{"app":"OC\\Files\\Cache\\Scanner","message":"!!! No reuse of etag for 'files\/sonnets-pour-helene.txt' !!! \ncache: Array\n(\n    [fileid] => 50449\n    [storage] => home::xxx.xxx\n    [path] => files\/sonnets-pour-helene.txt\n    [parent] => 2\n    [name] => sonnets-pour-helene.txt\n    [mimetype] => text\/plain\n    [mimepart] => text\n    [size] => 0\n    [mtime] => 1393076349\n    [storage_mtime] => 1393076349\n    [encrypted] => 1\n    [unencrypted_size] => 686\n    [etag] => 5308a87d0b38e\n)\n \ndata: Array\n(\n    [mimetype] => text\/plain\n    [mtime] => 1393076361\n    [size] => 940\n    [etag] => 5308a88906fcf\n    [storage_mtime] => 1393076361\n)\n","level":0,"time":"2014-02-22T13:39:21+00:00"}

3/ Check oc_jobs

INSERT INTO `oc_jobs` VALUES(1, 'OC\\Cache\\FileGlobalGC', 'null', 1392152493);
INSERT INTO `oc_jobs` VALUES(2, 'OC\\BackgroundJob\\Legacy\\RegularJob', '["\\\\OC\\\\Files\\\\Cache\\\\BackgroundWatcher","checkNext"]', 1392152493);
INSERT INTO `oc_jobs` VALUES(793, '\\OCA\\Search_Lucene\\OptimizeJob', '{"user":"xxx.xxx"}', 1392150601);
INSERT INTO `oc_jobs` VALUES(794, '\\OCA\\Search_Lucene\\OptimizeJob', '{"user":false}', 1392150734);
INSERT INTO `oc_jobs` VALUES(795, '\\OCA\\Search_Lucene\\IndexJob', '{"user":"xxx.xxx"}', 1392152493);
INSERT INTO `oc_jobs` VALUES(796, '\\OCA\\Search_Lucene\\OptimizeJob', '{"user":"yyy.yyy"}', 1392151699);

4/ Go back to owncloud interface - check owncloud.log

{"app":"PHP","message":"filemtime(): stat failed for \/var\/www\/owncloud\/data\/xxx.xxx\/files_trashbin\/files at \/var\/www\/owncloud\/lib\/private\/files\/storage\/local.php#127","level":2,"time":"2014-02-22T13:41:46+00:00"}
{"app":"core","message":"Generating preview for \"\/\/sonnets-pour-helene.txt\" with \"OC\\Preview\\TXT\"","level":0,"time":"2014-02-22T13:41:47+00:00"}
{"app":"core","message":"OC_Image->fixOrientation() No readable file path set.","level":0,"time":"2014-02-22T13:41:47+00:00"}
{"app":"core","message":"OC_Image->fixOrientation() Orientation: -1","level":0,"time":"2014-02-22T13:41:47+00:00"}
{"app":"OC\\Files\\Cache\\Scanner","message":"!!! Path 'cache\/132012632' is not readable !!!","level":0,"time":"2014-02-22T13:41:47+00:00"}

5/ Wait for background job to execute and check owncloud.log

{"app":"search_lucene","message":"background job indexing 0 files for xxx.xxx","level":0,"time":"2014-02-22T13:53:53+00:00"}

6/ Table oc_jobs contains no QueuedJob; the content of sonnets-pour-helene.txt is NOT indexed

7/ Check table oc_lucene_status - no new record (some existing records, but all with status N)

8/ Content of lucene_index

root@xxx:/var/www/owncloud/data# ll xxx.xxx/lucene_index/
total 16
drwxr-xr-x  2 www-data www-data 4096 Feb 11 21:30 ./
drwxr-xr-x 10 www-data www-data 4096 Feb 22 14:15 ../
-rw-rw-rw-  1 www-data www-data    0 Feb 11 21:30 optimization.lock.file
-rw-rw-rw-  1 www-data www-data    0 Feb 22 14:56 read.lock.file
-rw-rw-rw-  1 www-data www-data    0 Feb 11 21:30 read-lock-processing.lock.file
-rw-rw-rw-  1 www-data www-data   20 Feb 11 21:30 segments_cc
-rw-rw-rw-  1 www-data www-data   20 Feb 11 21:30 segments.gen
-rw-rw-rw-  1 www-data www-data    0 Feb 11 21:30 write.lock.file

I then ran a second test with an .odt document, but it was not indexed either. Content of owncloud.log:

{"app":"OC\\Files\\Cache\\Scanner","message":"!!! Path 'files\/New Document.odt' is not readable !!!","level":0,"time":"2014-02-22T13:45:15+00:00"}
{"app":"OC\\Files\\Cache\\Scanner","message":"!!! No reuse of etag for 'files_versions' !!! \ncache: Array\n(\n    [fileid] => 62\n    [storage] => home::xxx.xxx\n    [path] => files_versions\n    [parent] => 1\n    [name] => files_versions\n    [mimetype] => httpd\/unix-directory\n    [mimepart] => httpd\n    [size] => 1177369700\n    [mtime] => 1392622722\n    [storage_mtime] => 1392622722\n    [encrypted] => \n    [unencrypted_size] => 0\n    [etag] => 5301bc8a4400e\n)\n \ndata: Array\n(\n    [mimetype] => httpd\/unix-directory\n    [mtime] => 1393076361\n    [size] => -1\n    [etag] => 5308aa073eb8e\n    [storage_mtime] => 1393076361\n)\n","level":0,"time":"2014-02-22T13:45:43+00:00"}

... but this time no PHP error.

So unfortunately, at least for me, indexing does not work anymore at all.

The content of oc_jobs doesn't really change (just the last_run column), and the content of the lucene_index folder doesn't change either.

Waiting now for further instructions ;-)

Thanks,

Alexandre 8)


butonic commented Jul 2, 2014

@alexboss sorry to keep you waiting so long. I finally found some time to bring this PR up to speed for OC7. It would be great if you could test this again (well ... requires OC7) ... especially migrating from OC6 to OC7 with the search_lucene app enabled.

Also pulling @MorrisJobke @PVince81 @schiesbn @icewind1991 and @DeepDiver1975 here. Please take some time to try the new version. It now uses the file id as the key in the lucene index, which requires a complete reindex (will work automagically in the background). But that allows me to cleanly solve several problems.
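To illustrate why keying the index by file id can help, here is a toy sketch. It is not the Lucene code from this PR. `FileIdIndex` and its methods are hypothetical, and the benefits it shows (re-indexing overwrites instead of duplicating, and a rename only touches metadata) are assumptions about which problems the change is meant to solve. The file id 50449 is the one reported for sonnets-pour-helene.txt in the log above; the indexed text is placeholder content.

```python
class FileIdIndex:
    """Toy in-memory index that maps a stable file id to its document."""

    def __init__(self):
        self._docs = {}

    def index(self, file_id, path, text):
        # Indexing the same id again replaces the old document.
        self._docs[file_id] = {"path": path, "text": text}

    def rename(self, file_id, new_path):
        # The key is unchanged, so the content does not need to be re-extracted.
        self._docs[file_id]["path"] = new_path

    def delete(self, file_id):
        self._docs.pop(file_id, None)

    def search(self, term):
        term = term.lower()
        return sorted(
            doc["path"] for doc in self._docs.values() if term in doc["text"].lower()
        )


if __name__ == "__main__":
    idx = FileIdIndex()
    idx.index(50449, "/sonnets-pour-helene.txt", "placeholder text, first version")
    idx.index(50449, "/sonnets-pour-helene.txt", "placeholder text, edited version")
    idx.rename(50449, "/poems/sonnets-pour-helene.txt")
    print(idx.search("edited"))
```

With a path as the key, the rename step would instead require deleting the old document and indexing a new one, and a missed delete would leave a stale entry behind.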

  // the cache already knows mime and other basic stuff
- $data = $view->getFileInfo($path);
+ $data = $this->view->getFileInfo($path);
  if (isset($data['mimetype'])) {

Since you're touching this anyway, it would be better to switch to using the fileinfo as an object ($data->getMimetype() etc.)

@icewind1991

Overall this seems like a great improvement

$user = OCP\User::getUser();

OC_Util::tearDownFS();
OC_Util::setupFS($user);

What is the reason behind this? The filesystem should already be set up properly for the logged-in user.


butonic commented Jul 6, 2014

@icewind1991 could you recheck the search_lucene refactoring? I moved to the new public files API.

butonic added a commit that referenced this pull request Aug 1, 2014
@butonic butonic merged commit 139f882 into master Aug 1, 2014
@butonic butonic deleted the search_lucene_refactoring branch August 1, 2014 12:30