Hello everyone.
I downloaded the first dump file, enwiki-20210220-pages-articles1.xml-p1p41242.bz2, from the wiki server. Running the script produced the extracted text successfully. However, the extracted text seems to omit the table information in the wiki pages, i.e. the wikitables.
Am I missing something, or do the dump files not contain the table information at all?
Thanks!
I think I've found the answer myself. The dump files do contain the wikitable information, but as raw wiki markup rather than as rendered tables.
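For anyone who wants to check this on their own dump, here is a minimal sketch of how table blocks can be spotted in the raw wikitext of a page. It assumes the standard table markup, where a table opens with `{|` at the start of a line and closes with `|}` at the start of a line. The function name `find_wikitables` and the sample text are made up for illustration, and the pattern does not handle nested tables.

```python
import re
from typing import List

# A table block: a line starting with "{|" up to the next line starting
# with "|}". Nested tables are NOT handled by this simple pattern.
TABLE_RE = re.compile(r"^\{\|.*?^\|\}", re.MULTILINE | re.DOTALL)


def find_wikitables(wikitext: str) -> List[str]:
    """Return the raw markup of every non-nested table found in `wikitext`."""
    return TABLE_RE.findall(wikitext)


# Hypothetical page text, written by hand for this example.
sample = """Intro paragraph.
{| class="wikitable"
|-
! Name !! Year
|-
| Alpha || 1999
|}
Closing paragraph."""

tables = find_wikitables(sample)
for table in tables:
    print(table)
```

Running this kind of check on the `<text>` element of a page from the dump should show whether the table markup is present before extraction.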
Adding the --html argument may help get the wikitables more directly, but the code seems to have a bug when converting XML to HTML.
It raises a KeyError as follows:
File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/storage/fbzhu/yc/wikiextractor/wikiextractor/WikiExtractor.py", line 467, in extract_process
Extractor(*job[:-1]).extract(out, html_safe) # (id, urlbase, title, page)
File "/storage/wikiextractor/wikiextractor/extract.py", line 857, in extract
text = self.clean_text(text, html_safe=html_safe)
File "/storage/wikiextractor/wikiextractor/extract.py", line 847, in clean_text
text = compact(text, mark_headers=mark_headers)
File "/storage/wikiextractor/wikiextractor/extract.py", line 256, in compact
page.append(listItem[n] % line)
KeyError: '&'
I am using the XML files dumped on 20 Feb 2021 and wikiextractor version 3.0.5.
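The failing line in the traceback, `listItem[n] % line`, looks like a lookup of a list-marker character in a table of HTML templates, with `'&'` being a key the table does not have. Below is a minimal sketch of that failure and of one defensive alternative. It is an assumption about the shape of the code, not the actual wikiextractor implementation: `LIST_ITEM`, `render_list_item` and `failing_key` are hypothetical names, and whether falling back to the raw line is the right fix for the project is not something I have verified.

```python
# Hypothetical stand-in for the template table used in compact();
# the real table in wikiextractor may differ.
LIST_ITEM = {
    "*": "<li>%s</li>",
    "#": "<li>%s</li>",
    ";": "<dt>%s</dt>",
    ":": "<dd>%s</dd>",
}


def render_list_item(marker: str, line: str) -> str:
    """Format `line` with the template for `marker`.

    Unknown markers (such as '&') return the line unchanged instead of
    raising KeyError.
    """
    template = LIST_ITEM.get(marker)
    if template is None:
        return line
    return template % line


def failing_key() -> str:
    """Reproduce the direct lookup and return the key that was missing."""
    try:
        LIST_ITEM["&"] % "text"
    except KeyError as exc:
        return exc.args[0]
    return ""


print(render_list_item("*", "first item"))
print(render_list_item("&", "amp; not a list item"))
print(failing_key())
```

The direct lookup fails for any character outside the table, while the `.get` variant passes such lines through untouched.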