forked from synhershko/clucene
-
Notifications
You must be signed in to change notification settings - Fork 0
/
INSTALL
258 lines (209 loc) · 10.7 KB
/
INSTALL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
* There are packages available for most linux distributions through the usual channels.
* The Clucene Sourceforge website also has some distributions available.
Also in this document is information how to build from source, troubleshooting,
performance, and how to create a new distribution.
Building from source:
--------------------
Dependencies:
* CMake version 2.4.2 or later.
* A functioning and fairly new C++ compiler. We test mostly on GCC and Visual Studio 6+.
Anything other than that may not work.
* Something to unzip/untar the source code.
Build instructions:
1.) Download the latest sourcecode from http://www.sourceforge.net/projects/clucene
[Choose stable if you want the 'time tested' version of code. However, often
the unstable version will suite your needs more since it is newer and has had
more work put into it. The decision is up to you.]
2.) Unpack the tarball/zip/bzip/whatever
3.) Open a command prompt, terminal window, or cygwin session.
4.) Change directory into the root of the sourcecode (from now on referred to as <clucene>)
# cd <clucene>
5.) Create and change directory into an 'out-of-source' directory for your build.
[This is by far the easiest way to build, it has the benefit of being able to
create different types of builds in the same source-tree.]
# mkdir <clucene>/build-name
# cd <clucene>/build-name
6.) Configure using cmake. This can be done many different ways, but the basic syntax is
# cmake [-G "Script name"] ..
[Where "Script name" is the name of the scripts to build (e.g. Visual Studio 8 2005).
A list of supported build scripts can be found by]
# cmake --help
7.) You can configure several options such as the build type, debugging information,
mmap support, etc, by using the CMake GUI or by calling
# ccmake ..
Make sure you call configure again if you make any changes.
8.) Start the build. This depends on which build script you specified, but it would be something like
# make
or
# nmake
Or open the solution files with your IDE.
[You can also specify to just build a certain target (such as cl_test, cl_demo,
clucene-core (shared library), clucene-core-static (static library).]
9.) The binary files will be available in <clucene>build-name/bin
10.)Test the code. (After building the tests - this is done by default, or by calling make cl_test)
# ctest -V
11.)At this point you can install the library:
# make install
[There are options to do this from the IDE, but I find it easier to create a
distribution (see instructions below) and install that instead.]
or
# make cl_demo
[This creates the demo application, which demonstrates a simple text indexing and searching].
or
Adjust build values using ccmake or the Cmake GUI and rebuild.
12.)Now you can develop your own code. This is beyond the scope of this document.
Read the README for information about documentation or to get help on the mailinglist.
Other platforms:
----------------
Some platforms require specific actions to get cmake working. Here are some general tips:
Solaris:
I had problems when using the standard stl library. Using the -stlport4 switch worked. Had
to specify compiler from the command line: cmake -DCXX_COMPILER=xxx -stlport4
Installing:
-----------
CLucene is installed in CMAKE_INSTALL_PREFIX by default.
CLucene used to put config headers next to the library. this was done
because these headers are generated and are relevant to the library.
CMAKE_INSTALL_PREFIX was for system-independent files. the idea is that
you could have several versions of the library installed (ascii version,
ucs2 version, multithread, etc) and have only one set of headers.
in version 0.9.24+ we allow this feature, but you have to use
LUCENE_SYS_INCLUDES to specify where to install these files.
Troubleshooting:
----------------
'Too many open files'
Some platforms don't provide enough file handles to run CLucene properly.
To solve this, increase the open file limit:
On Solaris:
ulimit -n 1024
set rlim_fd_cur=1024
GDB - GNU debugging tool (linux only)
------------------------
If you get an error, try doing this. More information on GDB can be found on the internet
#gdb bin/cl_test
# gdb> run
when gdb shows a crash run
# gdb> bt
a backtrace will be printed. This may help to solve any problems.
Code layout
--------------
File locations:
* clucene-config.h is required and is distributed next to the library, so that multiple libraries can exist on the
same machine, but use the same header files.
* _HeaderFile.h files are private, and are not to be used or distributed by anything besides the clucene-core library.
* _clucene-config.h should NOT be used, it is also internal
* HeaderFile.h are public and are distributed and the classes within should be exported using CLUCENE_EXPORT.
* The exception to the internal/public conventions is if you use the static library. In this case the internal
symbols will be available (this is the way the tests program tests internal code). However this is not recommended.
Memory management
------------------
Memory in CLucene has been a bit of a difficult thing to manage because of the
unclear specification about who owns what memory. This was mostly a result of
CLucene's java-esque coding style resulting from porting from java to c++ without
too much re-writing of the API. However, CLucene is slowly improving
in this respect and we try and follow these development and coding rules (though
we dont guarantee that they are all met at this stage):
1. Whenever possible the caller must create the object that is being filled. For example:
IndexReader->getDocument(id, document);
As opposed to the old method of document = IndexReader->getDocument(id);
2. Clone always returns a new object that must be cleaned up manually.
Questions:
1. What should be the convention for an object taking ownership of memory?
Some documenting is available on this, but not much
Working with valgrind
----------------------
Valgrind reports memory leaks and memory problems. Tests should always pass
valgrind before being passed.
#valgrind --leak-check=full <program>
Memory leak tracking with dmalloc
---------------------------------
dmalloc (http://dmalloc.com/) is also a nice tool for finding memory leaks.
To enable, set the ENABLE_DMALLOC flag to ON in cmake. You will of course
have to have the dmalloc lib installed for this to work.
The cl_test file will by default print a low number of errors and leaks into
the dmalloc.log.txt file (however, this has a tendency to print false positives).
You can override this by setting your environment variable DMALLOC_OPTIONS.
See http://dmalloc.com/ or dmalloc --usage for more information on how to use dmalloc
For example:
# DMALLOC_OPTIONS=medium,log=dmalloc.log.txt
# export DMALLOC_OPTIONS
UPDATE: when i upgrade my machine to Ubuntu 9.04, dmalloc stopped working (caused
clucene to crash).
Performance with callgrind
--------------------------
Really simple
valgrind --tool=callgrind <command: e.g. bin/cl_test>
this will create a file like callgrind.out.12345. you can open this with kcachegrind or some
tool like that.
Performance with gprof
----------------------
Note: I recommend callgrind, it works much better.
Compile with gprof turned on (ENABLE_GPROF in cmake gui or using ccmake).
I've found (at least on windows cygwin) that gprof wasn't working over
dll boundaries, running the cl_test-pedantic monolithic build worked better.
This is typically what I use to produce some meaningful output after a -pg
compiled application has exited:
# gprof bin/cl_test-pedantic.exe gmon.out >gprof.txt
Code coverage with gcov
-----------------------
To create a code coverage report of the test, you can use gcov. Here are the
steps I followed to create a nice html report. You'll need the lcov package
installed to generate html. Also, I recommend using an out-of-source build
directory as there are lots of files that will be generated.
NOTE: you must have lcov installed for this to work
* It is normally recommended to compile with no optimisations, so change CMAKE_BUILD_TYPE
to Debug.
* I have created a cl_test-gcov target which contains the necessary gcc switches
already. So all you need to do is
# make test-gcov
If everything goes well, there will be a directory called code-coverage containing the report.
If you want to do this process manually, then:
# lcov --directory ./src/test/CMakeFiles/cl_test-gcov.dir/__/core/CLucene -c -o clucene-coverage.info
# lcov --remove clucene-coverage.info "/usr/*" > clucene-coverage.clean
# genhtml -o clucene-coverage clucene-coverage.clean
If both those commands pass, then there will be a clucene coverage report in the
clucene-coverage directory.
Benchmarks
----------
Very little benchmarking has been done on clucene. Andi Vajda posted some
limited statistics on the clucene list a while ago with the following results.
There are 250 HTML files under $JAVA_HOME/docs/api/java/util for about
6108kb of HTML text.
org.apache.lucene.demo.IndexFiles with java and gcj:
on mac os x 10.3.1 (panther) powerbook g4 1ghz 1gb:
. running with java 1.4.1_01-99 : 20379 ms
. running with gcj 3.3.2 -O2 : 17842 ms
. running clucene 0.8.9's demo : 9930 ms
I recently did some more tests and came up with these rough tests:
663mb (797 files) of Guttenberg texts
on a Pentium 4 running Windows XP with 1 GB of RAM. Indexing max 100,000 fields
- Jlucene: 646453ms. peak mem usage ~72mb, avg ~14mb ram
- Clucene: 232141. peak mem usage ~60, avg ~4mb ram
Searching indexing using 10,000 single word queries
- Jlucene: ~60078ms and used ~13mb ram
- Clucene: ~48359ms and used ~4.2mb ram
Distribution
------------
CPack is used for creating distributions.
* Create a out-of-source build as per usual
* Make sure the version number is correct (see <clucene>/CMakeList.txt, right at the top of the file)
* Make sure you are compiling in the correct release mode (check ccmake or the cmake gui)
* Make sure you enable ENABLE_PACKAGING (check ccmake or the cmake gui)
* Next, check that the package is compliant using several tests (must be done from a linux terminal, or cygwin):
# cd <clucene>/build-name
# ../dist-check.sh
* Make sure the source directory is clean. Make sure there are no unknown svn files:
# svn stat ..
* Run the tests to make sure that the code is ok (documented above)
* If all tests pass, then run
# make package
for the binary package (and header files). This will only create a tar.gz package.
and/or
# make package_source
for the source package. This will create a ZIP on windows, and tar.bz2 and tar.gz packages on other platforms.
There are also options for create RPM, Cygwin, NSIS, Debian packages, etc. It depends on your version of CPack.
Call
# cpack --help
to get a list of generators.
Then create a special package by calling
# cpack -G <GENERATOR> CPackConfig.cmake