-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to detect a hyperlink run and extract the URL? #1113
Comments
Hey, So each hyperlink run has a The following snippet will enumerate the relationship id and target (URL) of all hyperlink relationships:
On the actual runs/hyperlink elements, you'll need to access to
This code makes a bunch of assumptions:
|
This functionality is added in v1.0.0 released circa Oct 1, 2023. There are two approaches depending on whether you care about where the hyperlink is in the paragraph or not. The simplest is this: for paragraph in document.paragraphs:
if not paragraph.hyperlinks:
continue
for hyperlink in paragraph.hyperlinks:
print(hyperlink.address) If you need to know where they fit in with the runs of the paragraph: for paragraph in document.paragraphs:
if not paragraph.hyperlinks:
continue
for item in paragraph.iter_inner_content():
if isinstance(item, Run):
... do the run thing ...
elif isinstance(item, Hyperlink):
... do the hyperlink thing ... |
I am trying to get functionality such that when I go over the runs of a paragraph I can extract the text from a hyperlink but also the URL. Currently, a hyperlink run is totally omitted and not at all caught by python-docx. Digging through the issues I have found the following code snippet here
`from docx.oxml.shared import qn
def GetParagraphRuns(paragraph):
def _get(node, parent):
for child in node:
if child.tag == qn('w:r'):
yield Run(child, parent)
if child.tag == qn('w:hyperlink'):
yield from _get(child, parent)
return list(_get(paragraph._element, paragraph))
Paragraph.runs = property(lambda self: GetParagraphRuns(self))`
However, this simply converts the hyperlink into plain text and I lose the URL. Is there any way to extract the hyperlink text and URL and return it in a run with my own html tags surrounding the extracted hyperlink?
For example:
hyperlink run
would be extracted to
<a href="https://github.com/python-openxml/python-docx/issues/85">hyperlink run</a>
The use case for this is for a news website where authors submit their articles as .docx files and the articles then have any HTML markdown added to it before being pushed to the website.
The text was updated successfully, but these errors were encountered: