read_html does not correctly parse table cells with commas #5029

cancan101 · 2013-09-29T00:51:11Z

read_html finds the correct table and parses its structure (including row and header labels), but does not parse the data:

tables = pd.read_html("http://www.camacau.com/changeLang?lang=en_US&url=/statistic_list")

In [119]: tables[7]
Out[119]: 
                     0     1     2     3     4     5     6
0                  NaT  2013  2012  2011  2010  2009  2008
1  2013-01-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
2  2013-02-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
3  2013-03-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
4  2013-04-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
5  2013-05-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
6  2013-06-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
7  2013-07-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
8  2013-08-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
9  2013-09-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
10 2013-10-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
11 2013-11-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
12 2013-12-28 00:00:00   NaN   NaN   NaN   NaN   NaN   NaN
13                 NaT   NaN   NaN   NaN   NaN   NaN   NaN

cpcloud · 2013-09-29T01:27:41Z

for now pass infer_types=False and manually parse the results

seems to be an issue with comma parsing

cancan101 · 2013-09-29T22:45:51Z

@cpcloud It also looks like infer_types is doing something weird to the row headings.

Not only do the values look better with infer_types=False, but so do the row headings:

            0       1       2       3       4       5       6
0                2013    2012    2011    2010    2009    2008
1     January   3,925   3,463   3,289   3,184   3,488   4,568
2    February   3,632   2,983   2,902   3,053   3,347   4,527
3       March   3,909   3,166   3,217   3,175   3,636   4,594
4       April   3,903   3,258   3,146   3,023   3,709   4,574
5         May   4,075   3,234   3,266   3,033   3,603   4,511
6        June   4,038   3,272   3,316   2,909   3,057   4,081
7        July           3,661   3,359   3,062   3,354   4,215
8      August           3,942   3,417   3,077   3,395   4,139
9   September           3,703   3,169   3,095   3,100   3,752
10    October           3,727   3,469   3,179   3,375   3,874
11   November           3,722   3,145   3,159   3,213   3,567
12   December           3,866   3,251   3,199   3,324   3,362
13      Total  23,482  41,997  38,946  37,148  40,601  49,764
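
A minimal sketch of the "manually parse the results" step, assuming the cells come back as strings like the ones above once infer_types=False is passed. The helper name parse_count is hypothetical and not part of pandas; it only strips the thousands separator and treats blank cells as missing.

```python
def parse_count(cell):
    """Convert a cell such as '3,925' to an int; return None for blank cells."""
    text = str(cell).strip()
    if not text:
        return None
    # Drop the thousands separator before converting.
    return int(text.replace(",", ""))


# Two rows copied from the infer_types=False output (first three columns only).
rows = [
    ["June", "4,038", "3,272"],
    ["July", "", "3,661"],
]

# Keep the row label as-is and convert the remaining cells.
parsed = [[row[0]] + [parse_count(c) for c in row[1:]] for row in rows]
print(parsed)  # [['June', 4038, 3272], ['July', None, 3661]]
```

The same function could be applied cell by cell to the data columns of the returned frame.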

cancan101 · 2013-09-29T22:55:59Z

Also the ordering of the tables seems somewhat arbitrary. Using the page above as an example, the html for tables[17] comes before tables[16].

cancan101 · 2013-10-01T05:23:17Z

@cpcloud any idea about this table ordering issue?

cpcloud · 2013-10-01T05:58:08Z

Not sure what the issue is. What type of ordering are you expecting? I can't really think of a way to generally say "this table should come before this other one" other than the obvious "this one comes before this other one in the parse tree"

cancan101 · 2013-10-01T11:46:57Z

Okay. Is this consistent with the parse tree used?

cancan101 · 2013-10-01T12:21:17Z

I took a look at what lxml parses and the ordering still seems wrong:

tree_tr = tree.findall(".//tr")

# Cell from table[16]
In [153]: [i for i,y in enumerate([x.text_content() for x  in tree_tr]) if "536" in (y)]
Out[153]: [155]

# Cell from table[17]
In [151]: [i for i,y in enumerate([x.text_content() for x  in tree_tr]) if "37,148" in (y)]
Out[151]: [114]

cpcloud · 2013-10-01T12:49:34Z

@cancan101

You can't depend on the ordering, mostly because of invalid HTML (there might be other reasons that I can't think of right now). I'm not exactly sure how the parsers I use here "fix" invalid markup.

I don't follow how your example demonstrates that there's an issue with the order of the tables in the page. Can you be more explicit about what the expected input/output is?

cancan101 · 2013-10-01T12:56:34Z

In this case I use lxml so I would imagine the page is relatively valid html.

The example shows the index of the table containing the cell I am searching for.
Pandas returns those two tables as numbers 16 and 17.

I should have searched for "table" rather than "tr". When looking at all tr in the document, the tr in table 17 comes before the tr in table 16.
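
A rough sketch of the "search for table rather than tr" check, using the standard library's html.parser instead of lxml so that it runs without extra dependencies. The class and function names and the sample HTML are all made up for illustration; nested tables are folded into their outermost table.

```python
from html.parser import HTMLParser


class TableTextCollector(HTMLParser):
    """Collect the text of each top-level <table> in document order."""

    def __init__(self):
        super().__init__()
        self.tables = []  # text of each top-level table, in document order
        self._depth = 0   # current nesting depth of <table> tags

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._depth == 0:
                self.tables.append("")
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Only text inside a table is recorded.
        if self._depth > 0:
            self.tables[-1] += data


def tables_containing(html, needle):
    """Return the document-order indices of tables whose text contains needle."""
    collector = TableTextCollector()
    collector.feed(html)
    return [i for i, text in enumerate(collector.tables) if needle in text]


# Hypothetical two-table page, not the page from this issue.
html = """
<table><tr><td>37,148</td></tr></table>
<table><tr><td>536</td></tr></table>
"""
print(tables_containing(html, "37,148"))  # [0]
print(tables_containing(html, "536"))     # [1]
```

Comparing these indices with the positions in the list returned by read_html would show whether document order is preserved.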

cpcloud · 2013-10-01T13:28:30Z

@cancan101 Few things:

Never assume your HTML is valid. In fact, it would be reasonable to assume that it's invalid. google.com has invalid markup. Here is an interesting read on the validity of web pages. Only 4.13% of pages validated passed the W3C's validator.
lxml doesn't behave in a sane way, for all cases, when it comes across invalid markup. For example, it will sometimes remove a node instead of trying to keep it. That is, IMHO, a bad solution. html5lib on the other hand, tries very hard to keep everything.
There isn't a meaningful way to assign an order to arbitrary HTML tables. You should define one if that's what you're interested in doing, but read_html's result should essentially be treated as a set. There is an ordering but it depends on the order in which the underlying parser returns tables. That may or may not be consistent across parsers.

So, I don't see the ordering as a problem, but I'd be happy to document this.

cancan101 · 2013-10-01T13:46:32Z

Okay. At the very least, the fact that the ordering is unreliable should be documented. Perhaps the return should even be changed to a set?

This issue makes #4469 more interesting.

In the case that I do not specify a parser, is it possible to see what parser was actually used?

cpcloud · 2013-10-01T13:55:15Z

> In the case that I do not specify a parser, is it possible to see what parser was actually used?

No. By default it tries to use lxml, but makes lxml use strict validation. If that raises an exception, bs4 + html5lib is tried.
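
The fallback described here can be sketched generically as below. This is not the read_html implementation: parse_with_fallback, strict_parser and lenient_parser are hypothetical stand-ins. The one difference worth noting is that the sketch returns the name of the parser that succeeded, which is the information asked about.

```python
def parse_with_fallback(html, parsers):
    """Try each (name, parse_func) pair in order.

    Returns (name, result) for the first parser that does not raise.
    """
    errors = []
    for name, parse in parsers:
        try:
            return name, parse(html)
        except Exception as exc:  # a strict parser is expected to raise on bad markup
            errors.append((name, exc))
    raise ValueError(f"no parser succeeded: {errors}")


def strict_parser(html):
    # Stand-in for a strict parser: reject markup with an unclosed table.
    if html.count("<table>") != html.count("</table>"):
        raise ValueError("unbalanced <table> tags")
    return ["strict result"]


def lenient_parser(html):
    # Stand-in for a forgiving parser that accepts anything.
    return ["lenient result"]


parsers = [("strict", strict_parser), ("lenient", lenient_parser)]
used, tables = parse_with_fallback("<table><tr><td>1</td></tr>", parsers)
print(used)  # lenient
```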

cancan101 · 2013-10-02T00:06:18Z

Okay. It appears that lxml (with recover=False) is unable to parse that page, so I guess it falls back to an alternative.

cpcloud · 2013-10-02T18:38:32Z

Ah. I've figured it out! I convert to a set when parsing bs4 tables ... thus the different ordering. Thanks @cancan101 for pointing this out; that's actually a buglet that I'll fix.
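
For illustration only, here is one common way to drop duplicates without losing document order. This is an assumption about how such a bug could be avoided, not necessarily the fix that went into pandas; unique_in_order is a hypothetical helper and the strings stand in for table nodes.

```python
def unique_in_order(items):
    """Remove duplicates while keeping first-seen order (items must be hashable)."""
    # dict preserves insertion order, unlike set.
    return list(dict.fromkeys(items))


# Stand-ins for table nodes as they appear in the document; "t1" is found twice.
found = ["t0", "t1", "t2", "t1"]

print(unique_in_order(found))  # ['t0', 't1', 't2']

# By contrast, list(set(found)) has the same members but no guaranteed order.
assert set(found) == set(unique_in_order(found))
```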

cancan101 · 2013-10-02T19:10:44Z

@cpcloud That is good to hear. It makes extracting a fixed table from a given page much easier.

cpcloud mentioned this issue Sep 29, 2013

REF/BUG/ENH/API: refactor read_html to use TextParser #4770

Merged

1 task

ghost assigned cpcloud Sep 29, 2013

cpcloud closed this as completed in #4770 Oct 3, 2013

wesm unassigned cpcloud Oct 12, 2016
