bug/_convert_table_to_text index out of range #709

Closed
igoforth opened this issue Jun 9, 2023 · 8 comments · Fixed by #982
Labels: bug, docx

Comments

@igoforth

igoforth commented Jun 9, 2023

Describe the bug
A list index out of range occurs in _convert_table_to_text during docx parsing.

To Reproduce
I was operating on 1360 docx files from this source: https://www.3gpp.org/ftp/Specs/latest/Rel-17
For .doc files, I first converted them to .docx using the command below:

C:\"Program Files"\LibreOffice\program\soffice.exe --headless --convert-to docx --outdir out in\<filename>
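For anyone scripting the same conversion, here is a minimal Python sketch that builds the command above and runs it once per file. The function names `build_convert_command` and `convert_all` are illustrative, not part of any library, and the `SOFFICE` path is simply the one from the command above; adjust it for your installation.

```python
import subprocess
from pathlib import Path

# Path taken from the command above; an assumption about the local install.
SOFFICE = r"C:\Program Files\LibreOffice\program\soffice.exe"


def build_convert_command(src: Path, out_dir: Path) -> list[str]:
    """Return the argument list for a headless .doc -> .docx conversion."""
    return [
        SOFFICE,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_all(in_dir: Path, out_dir: Path) -> None:
    """Convert every .doc file in in_dir, one soffice call per file."""
    for src in sorted(in_dir.glob("*.doc")):
        # check=True raises CalledProcessError if soffice exits non-zero.
        subprocess.run(build_convert_command(src, out_dir), check=True)
```

Passing the arguments as a list avoids shell quoting problems with the space in "Program Files".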

Expected behavior
_convert_table_to_text should correctly convert all docx tables.

Screenshots
Capture

Desktop (please complete the following information):

  • OS: Windows 10
  • Python version: 3.10.10

Additional context
21101-h10.docx
21202-h00.docx
21205-h10.docx
21905-h10.docx
22003-h00.docx
22004-h00.docx
22011-h60.docx
22022-h00.docx
22030-h00.docx
22031-h00.docx
22032-h00.docx
22041-h00.docx
22042-h00.docx
22057-h00.docx
22071-h00.docx
22072-h00.docx
22081-h00.docx
22084-h00.docx
22087-h00.docx
22090-h00.docx
22094-h00.docx
22096-h00.docx
22097-h00.docx
22101-h50.docx
22104-h70.docx
22115-h10.docx
22119-h00.docx
22125-h60.docx
22135-h00.docx
22142-h00.docx
22146-h00.docx
22182-h00.docx
22185-h00.docx
22186-h00.docx
22220-h00.docx
22226-h00.docx
22242-h00.docx
22246-h00.docx
22259-h00.docx
22261-hb0.docx
22263-h40.docx
22268-h00.docx
22278-h20.docx
22279-h00.docx
22282-h00.docx
22346-h00.docx
22368-h00.docx
22468-h01.docx
22519-h00.docx
22826-h20.docx
22829-h10.docx
22832-h40.docx
22836-h10.docx
22866-h10.docx
22873-020.docx
22881-020_cl.docx
22889-h40.docx
22912-h00.docx
22936-h00.docx
22944-h00.docx
22948-h00.docx
22967-h00.docx
22973-h00.docx
22978-h00.docx
22979-h00.docx
22986-h00.docx
22987-h00.docx
23035-h00.docx
23041-h40.docx
23172-h00.docx
23222-h70.docx
23281-h60.docx
23303-h00.docx
23379-h90.docx
23402-h00.docx
23554-h20.docx
23558-h70.docx
23744-h10.docx
23755-h00.docx
23758-h00.docx
23783-1a0_sAnnex_A.docx
23783-1a0_sAnnex_D.docx
23783-1a0_sAnnex_E.docx
24002-h00.docx
24022-h00.docx
24166-h00.docx
24250-h00.docx
24322-h00.docx
24323-h10.docx
24333-h00.docx
24341-h10.docx
24371-h10.docx
24391-h00.docx
24483-h70.docx
25102-h00.docx
25113-h00.docx
25116-h00.docx
25153-h00.docx
25171-h00.docx
25172-h00.docx
25173-h00.docx
25213-h00.docx
25214-h00.docx
25221-h00.docx
25222-h00.docx
25224-h00.docx
25304-h10.docx
25305-h00.docx
25306-h10.docx
25321-h00.docx
25322-h00.docx
25323-h00.docx
25327-h00.docx
25401-h00.docx
25411-h00.docx
25412-h00.docx
25413-h00.docx
25420-h00.docx
25421-h00.docx
25422-h00.docx
25423-h00.docx
25424-h00.docx
25430-h00.docx
25431-h00.docx
25435-h00.docx
25442-h00.docx
25444-h00.docx
25446-h00.docx
25450-h00.docx
25453-h00.docx
25461-h00.docx
25470-h00.docx
25912-h00.docx
25914-h00.docx
25943-h00.docx
25951-h00.docx
25963-h00.docx
25967-h00.docx
25968-h00.docx
25993-h00.docx
26074-h01.docx
26090-h00.docx
26091-h00.docx
26101-h00.docx
26102-h00.docx
26103-h00.docx
26110-h00.docx
26117-h00.docx
26131-h30.docx
26132-h20.docx
26140-h00.docx
26141-h00.docx
26142-h00.docx
26150-h00.docx
26173-h11.docx
26177-h00.docx
26179-h00.docx
26193-h00.docx
26201-h00.docx
26204-h10.docx
26231-h00.docx
26234-h00.docx
26243-h00.docx
26245-h00.docx
26247-h30.docx
26267-h00.docx
26268-h00.docx
26273-h00.docx
26347-h20.docx
26403-h00.docx
26404-h00.docx
26410-h01.docx
26411-h00.docx
26412-h00.docx
26430-h00.docx
26445-h00_1_s05_s0501.docx
26445-h00_2_s0502_s050203.docx
26445-h00_4_s050206.docx
26445-h00_5_s0503.docx
26445-h00_6_s0504_s0506.docx
26445-h00_9_s0602_s0607.docx
26445-h00_a_s0608_sHistory.docx
26446-h00.docx
26447-h10.docx
26448-h00.docx
26450-h00.docx
26452-h00.docx
26511-h10.docx
26903-h00.docx
26904-h00.docx
26907-h00.docx
26911-h00.docx
26918-h00.docx
26923-h00.docx
26925-h10.docx
26937-h00.docx
26938-h00.docx
26943-h00.docx
26944-h00.docx
26946-h00.docx
26947-h00.docx
26949-h00.docx
26952-h00.docx
26957-h00.docx
26959-h00.docx
26967-h00.docx
26980-h00.docx
27002-h00.docx
27003-h00.docx
27010-h00.docx
28302-h00.docx
28303-h00.docx
28305-h00.docx
28308-h00.docx
28310-h50.docx
28311-h00.docx
28402-h00.docx
28403-h00.docx
28404-h40.docx
28405-h40.docx
28510-h00.docx
28511-h00.docx
28513-h00.docx
28520-h00.docx
28521-h00.docx
28525-h00.docx
28526-h00.docx
28528-h00.docx
28530-h40.docx
28531-h70.docx
28533-h30.docx
28540-h30.docx
28550-h10.docx
28623-h51.docx
28626-h00.docx
28628-h00.docx
28629-h00.docx
28631-h00.docx
28656-h00.docx
28657-h00.docx
28658-h10.docx
28662-h00.docx
28667-h00.docx
28668-h00.docx
28669-h00.docx
28672-h00.docx
28681-h00.docx
28682-h00.docx
28683-h00.docx
28701-h00.docx
28702-h00.docx
28707-h00.docx
28708-h00.docx
28731-h00.docx
28732-h00.docx
28735-h00.docx
28751-h00.docx
28812-h10.docx
29007-h00.docx
29108-h00.docx
29153-h00.docx
29164-h00.docx
29215-h00.docx
29217-h00.docx
29250-h00.docx
29251-h20.docx
29343-h00.docx
29368-h00.docx
29414-h00.docx
29486-h60.docx
29507-h90.docx
29508-ha0.docx
29512-ha0.docx
29517-h90.docx
29523-h80.docx
29549-h70.docx
29554-h40.docx
29594-h50.docx
29658-h00.docx
29675-h70.docx
29949-h00.docx
32111-1-h00.docx
32111-2-h00.docx
32111-6-h00.docx
32121-h00.docx
32122-h00.docx
32126-h00.docx
32130-h60.docx
32153-h00.docx
32154-h00.docx
32157-h00.docx
32158-h40.docx
32160-h70.docx
32181-h00.docx
32182-h00.docx
32250-h00.docx
32253-h00.docx
32254-h30.docx
32255-h90.docx
32256-h20.docx
32270-h00.docx
32271-h00.docx
32274-h20.docx
32275-h30.docx
32280-h00.docx
32290-h60.docx
32293-h00.docx
32300-h00.docx
32301-h00.docx
32306-h00.docx
32312-h00.docx
32321-h00.docx
32331-h00.docx
32336-h00.docx
32341-h00.docx
32356-h00.docx
32361-h00.docx
32371-h00.docx
32381-h00.docx
32386-h00.docx
32391-h00.docx
32404-h00.docx
32407-h00.docx
32408-h00.docx
32409-h00.docx
32411-h00.docx
32421-h40.docx
32425-h10.docx
32436-h00.docx
32442-h00.docx
32446-h00.docx
32450-h00.docx
32452-h10.docx
32453-h00.docx
32501-h00.docx
32506-h00.docx
32531-h00.docx
32536-h00.docx
32541-h00.docx
32572-h00.docx
32581-h00.docx
32582-h00.docx
32583-h00.docx
32592-h10.docx
32594-h00.docx
32600-h00.docx
32601-h00.docx
32602-h00.docx
32612-h00.docx
32690-h00.docx
32901-h00.docx
33102-h00.docx
33106-h00.docx
33110-h00.docx
33117-h30.docx
33122-h10.docx
33187-h00.docx
33203-h10.docx
33204-h00.docx
33210-h10.docx
33216-h00.docx
33221-h00.docx
33234-h00.docx
33246-h00.docx
33250-h00.docx
33259-h00.docx
33303-h10.docx
33310-h60.docx
33320-h00.docx
33402-h00.docx
33511-h31.docx
33513-h10.docx
33514-h00.docx
33515-h00.docx
33518-h00.docx
33824-h00.docx
33916-h00.docx
33937-h00.docx
33995-h00.docx
34109-h00.docx
34926-h00.docx
35201-h00.docx
35204-h00.docx
35207-h00.docx
35216-h00.docx
35217-h00.docx
35218-h00.docx
35222-h00.docx
35232-h00.docx
35233-h00.docx
35935-h00.docx
35936-h00.docx
36360-h00.docx
36361-h00.docx
36414-h00.docx
36422-h00.docx
36425-h00.docx
36441-h00.docx
36442-h00.docx
36443-h01.docx
36444-h00.docx
36455-h10.docx
36456-h01.docx
36457-h00.docx
36462-h00.docx
36463-h00.docx
36903-h00.docx
36904-h00.docx
36905-h00.docx
36913-h00.docx
37460-h00.docx
37470-h00.docx
37481-h00.docx
38201-h00.docx
38411-h00.docx
38412-h00.docx
38414-h00.docx
38422-h00.docx
38462-h00.docx
38463-h00.docx
38913-h00.docx
41101-h00.docx
42068-h00.docx
42069-h00.docx
43010-h00.docx
43020-h00.docx
43026-h00.docx
43030-h00.docx
43055-h00.docx
43058-h00.docx
43059-h00.docx
43064-h00.docx
43129-h00.docx
43246-h00.docx
43318-h00.docx
43902-h00.docx
44004-h00.docx
44012-h00.docx
44014-h00.docx
44060-h00.docx
44071-h00.docx
44901-h00.docx
45002-h00.docx
45008-h00.docx
45010-h00.docx
45050-h00.docx
45056-h00.docx
45903-h00.docx
45912-h00.docx
45913-h00.docx
45926-h00.docx
46001-h00.docx
46002-h00.docx
46007-h00.docx
46008-h00.docx
46011-h00.docx
46012-h00.docx
46020-h00.docx
46021-h00.docx
46054-h00.docx
46061-h00.docx
46081-h00.docx
48001-h00.docx
48006-h00.docx
48008-h00.docx
48014-h00.docx
48016-h00.docx
48018-h00.docx
48031-h00.docx
48049-h00.docx
48052-h00.docx
48054-h00.docx
48056-h00.docx
48058-h00.docx
48061-h00.docx
48103-h00.docx
49031-h00.docx
49995-h00.docx
51021-h00.docx
51026-h00.docx
52008-h00.docx
52402-h00.docx
55205-h00.docx
55217-h00.docx
55226-h00.docx
55236-h00.docx
55241-h00.docx
55243-h00.docx
55252-h00.docx
Readme_VAD2_TV_h01.docx
22.890-040_rm.docx

@igoforth added the bug label Jun 9, 2023
@igoforth
Author

igoforth commented Jun 9, 2023

Using the following prevents the script from breaking:

    @property
    def _cells(self):
        """
        A sequence of |_Cell| objects, one for each cell of the layout grid.
        If the table contains a span, one or more |_Cell| object references
        are repeated.
        """
        col_count = self._column_count
        cells = []
        for tc in self._tbl.iter_tcs():
            for grid_span_idx in range(tc.grid_span):
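                # Workaround: the vertical-merge branch is commented out because
                # cells[-col_count] appears to be what raised the IndexError.
                # if tc.vMerge == ST_Merge.CONTINUE:
                #     cells.append(cells[-col_count])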
                if grid_span_idx > 0:
                    cells.append(cells[-1])
                else:
                    cells.append(_Cell(tc, self))
        return cells
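To illustrate why the commented-out branch can raise, here is a simplified stand-alone model. It is not python-docx itself: `Tc`, `flatten_cells`, and the `handle_v_merge` switch are hypothetical names used only for this sketch. The point is that `cells[-col_count]` looks one full row back, which only works if every earlier row contributed exactly `col_count` cells.

```python
from dataclasses import dataclass


@dataclass
class Tc:
    """Hypothetical stand-in for a table cell element."""
    text: str
    grid_span: int = 1
    v_merge_continue: bool = False


def flatten_cells(rows, col_count, handle_v_merge=True):
    """Flatten rows into one list, mimicking the uniform-grid assumption."""
    cells = []
    for row in rows:
        for tc in row:
            for grid_span_idx in range(tc.grid_span):
                if handle_v_merge and tc.v_merge_continue:
                    # Looks one full row back; raises IndexError if fewer
                    # than col_count cells have been collected so far.
                    cells.append(cells[-col_count])
                elif grid_span_idx > 0:
                    cells.append(cells[-1])
                else:
                    cells.append(tc.text)
    return cells


# The first row has only 2 cells although the grid declares 3 columns.
rows = [
    [Tc("a"), Tc("b")],
    [Tc("", v_merge_continue=True), Tc("c"), Tc("d")],
]
```

Calling `flatten_cells(rows, col_count=3)` raises `IndexError`, while `handle_v_merge=False` (the equivalent of the workaround) completes but no longer repeats the cell above a vertical merge.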

@qued
Contributor

qued commented Jun 13, 2023

Thanks for submitting this, @igoforth! I'm attempting to reproduce from the list of files you provided. My initial take is that the error is occurring in the docx library, so it's possible an issue needs to be submitted there. I'm going to look into it, though, to see if some changes are appropriate on our side.

@qued
Contributor

qued commented Jun 13, 2023

I've parsed a couple hundred of these docs now without errors. Can you by any chance point to a specific document that gives you the error you got?

@igoforth
Author

I wish I could, but as you might understand, I didn't want to try to debug a Python program that, in this case, took over two days to finish lol.

If the failure isn't random, knowing the order in which unstructured processes the files could clue us in. Perhaps file #286?

Reference my gist here for how I converted originally https://gist.github.com/igoforth/80b86cc4a256db502b5d8bed3b857113

It's worth noting that my memory and CPU usage were both close to full. Could there be a timing issue?

After the original error, I commented out the block which looked like an edge case. It then ran great for me.

Aside from that, LibreOffice could've done something funky which messed with whatever tc.vMerge is. Did you try both 7.5.4 and 7.4.7? I probably used the stable branch. Apologies for not having looked into it further yet.
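One way to pin down which document triggers the error without aborting a long run is to wrap each file in its own try/except and record the failures. This is a minimal sketch; `find_failures` is a hypothetical helper, and `parse` is whatever callable you use to process a single file.

```python
from pathlib import Path
from typing import Callable


def find_failures(paths, parse: Callable[[Path], object]):
    """Run parse on each path and collect (path, exception) for failures."""
    failures = []
    for path in paths:
        try:
            parse(path)
        except Exception as exc:  # broad on purpose: we only want to log
            failures.append((path, exc))
    return failures
```

With unstructured, `parse` could be something like `lambda p: partition_docx(filename=str(p))`, assuming `partition_docx` is imported from `unstructured.partition.docx`. The returned list then names the exact files to attach to the issue.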

@qued
Contributor

qued commented Jun 14, 2023

Ok it looks like you're running into the issue referenced here and here. It looks like this issue has been around for a while in the python-docx library with no fix.

@MthwRobinson
Contributor

Reopening per a report from the community Slack. @scanny, per @qued's comment this might stem from python-docx; thought you'd know better than us 😄

@MthwRobinson reopened this Apr 25, 2024
@scanny
Collaborator

scanny commented Apr 25, 2024

Ah, right, this can happen when a Word table becomes non-uniform, that is, when not all rows contain the same number of cells (after accounting for merged cells). Unfortunately, Word itself can produce this situation in certain table-editing scenarios where row endings don't line up. I'll change the docx partitioner to not assume tables are uniform.
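As a rough illustration of "not assuming tables are uniform" (a sketch of the general idea, not the actual change made in unstructured), each row can be handled on its own so that a short or long row never indexes into a neighbor. `rows_to_text` and `pad_rows` are hypothetical helpers operating on plain lists of cell strings.

```python
def rows_to_text(rows):
    """Join each row's cell texts with tabs, one line per row."""
    return "\n".join("\t".join(cell.strip() for cell in row) for row in rows)


def pad_rows(rows, fill=""):
    """Pad every row to the width of the widest row."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


# Rows of different lengths, as in a non-uniform table.
ragged = [["Name", "Value", "Unit"], ["x", "1"], ["y", "2", "m", "extra"]]
```

Neither helper relies on a global column count, so ragged input like `ragged` is converted without an index lookup into a previous row.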

@scanny self-assigned this Apr 25, 2024
@scanny added the docx label Apr 25, 2024
@scanny
Collaborator

scanny commented May 2, 2024

This is fixed on unstructured@main. It should appear in v0.13.7 which should be released within a week or so.

If you want to try it out in the meantime one option is:

  1. clone unstructured
  2. run $ make install in the repo root directory, once you have activated the target virtualenv.

@scanny closed this as completed May 2, 2024