read_html does not correctly parse table cells with commas #5029
Comments
for now pass seems to be an issue with comma parsing |
@cpcloud It also looks like the infer types is doing something weird to the row headings. Not only do the values look better with
|
Also the ordering of the tables seems somewhat arbitrary. Using the page above as an example, the html for |
@cpcloud any idea about this table ordering issue? |
Not sure what the issue is. What type of ordering are you expecting? I can't really think of a way to generally say "this table should come before this other one" other than the obvious "this one comes before this other one in the parse tree" |
Okay. Is this consistent then with the parse tree used?
|
I took a look at what lxml parses and the ordering still seems wrong:
|
You can't depend on the ordering, mostly because of invalid HTML (there might be other reasons that I can't think of right now). I'm not exactly sure how the parsers I use here "fix" invalid markup. I don't follow how your example demonstrates that there's an issue with the order of the tables in the page. Can you be more explicit about what the expected input/output is? |
In this case I use lxml so I would imagine the page is relatively valid html. The example shows the index of the table containing the cell I am searching for. I should have searched for "table" rather than "tr". When looking at all tr in the document, the tr in table 17 comes before the tr in table 16. |
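The check described above (finding the document-order index of the table that contains a given cell) can be sketched with the standard library alone. This is only an illustration, not the lxml session from the thread: `TableCellLocator` is a hypothetical helper, the HTML snippet is made up, and nested tables are handled only roughly.

```python
from html.parser import HTMLParser


class TableCellLocator(HTMLParser):
    """Record each cell's text with the document-order index of its table."""

    def __init__(self):
        super().__init__()
        self.table_count = 0   # number of <table> start tags seen so far
        self.open_tables = []  # stack of indices of currently open tables
        self.in_cell = False   # True while inside a <td> or <th>
        self.cells = []        # (table_index, cell_text) in document order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.open_tables.append(self.table_count)
            self.table_count += 1
        elif tag in ("td", "th") and self.open_tables:
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.open_tables.pop()
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_cell and self.open_tables and text:
            self.cells.append((self.open_tables[-1], text))


# Made-up input: two tables, searched for the cell text "Departures".
html = """
<table><tr><td>Arrivals</td><td>1,234</td></tr></table>
<table><tr><td>Departures</td><td>5,678</td></tr></table>
"""
locator = TableCellLocator()
locator.feed(html)
indices = [i for i, text in locator.cells if text == "Departures"]
print(indices)
```

Comparing these indices with the positions in the list returned by `read_html` is one way to see whether the returned order follows the document.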
@cancan101 Few things:
So, I don't see the ordering as a problem, but I'd be happy to document this. |
Okay. At the very least, the fact that ordering is unreliable should be documented. Perhaps the return should even be changed to a This issue makes #4469 more interesting. In the case that I do not specify a parser, is it possible to see what parser was actually used? |
No. By default it tries to use |
Okay. It appears that |
Ah. I've figured it out! I convert to a set when parsing bs4 tables ... thus the different ordering. Thanks @cancan101 for pointing this out, that's actually a buglet that I'll fix |
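To illustrate why a set conversion can scramble the order, here is a minimal sketch. It is not the actual pandas code or fix: the list of placeholder strings stands in for matched tables, and it assumes the same table may be matched more than once.

```python
# Placeholder names standing in for tables matched in document order,
# with some tables matched more than once (an assumption for illustration).
tables = ["table_0", "table_1", "table_2", "table_1", "table_0", "table_3"]

# A set removes duplicates, but its iteration order is not tied to the
# order in which the items were found.
unordered = set(tables)

# dict.fromkeys also removes duplicates, and dicts keep insertion order
# (guaranteed since Python 3.7), so document order survives.
ordered_unique = list(dict.fromkeys(tables))
print(ordered_unique)
```

Any order-preserving de-duplication would do; `dict.fromkeys` is just the shortest way to write one in current Python.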
@cpcloud That is good to hear. It makes extracting a fixed table from a given page much easier. |
read_html finds the correct table and parses the structure of the table (including row and header labels), but does not parse the data:
tables = pd.read_html("http://www.camacau.com/changeLang?lang=en_US&url=/statistic_list")
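For readers hitting the same problem, the conversion at stake can be sketched as follows. This is a hedged illustration of turning comma-formatted cell text into numbers, not how `read_html` works internally; `parse_cell` is a hypothetical helper and the sample row is made up.

```python
def parse_cell(text, thousands=","):
    """Return an int or float if the cell looks numeric, else the stripped text."""
    stripped = text.strip()
    # Drop thousands separators only for the purpose of numeric conversion.
    cleaned = stripped.replace(thousands, "")
    for cast in (int, float):
        try:
            return cast(cleaned)
        except ValueError:
            continue
    # Not numeric: keep the original text, commas included.
    return stripped


# Made-up row mixing a label, comma-formatted numbers, and a non-numeric cell.
row = ["Passengers", "1,234", "5,678.5", "n/a"]
parsed = [parse_cell(cell) for cell in row]
print(parsed)
```

Applying a helper like this to the string columns of a returned DataFrame is one possible workaround when the numeric cells come back unparsed.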