v1.5.0 various fixes + more test files
jgclark committed Jan 29, 2024
1 parent dee8623 commit c05ebb3
Showing 3 changed files with 43 additions and 25 deletions.
8 changes: 7 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,10 @@
# Bible Gateway to Markdown Script CHANGELOG
# Bible Gateway to Markdown Script

## CHANGELOG

### v1.5.0, 29.1.2024
- [Fix] Cope with BibleGateway's change to their output pages, which meant missing heading and unwanted extra text on the end of bg2md's output
- [Fix] With some options, verses could be omitted when the range included two successive chapters

### v1.4.7, 12.8.2023
- [Fix] Extend request fetch timeouts for slower connections and connectivity issues
47 changes: 26 additions & 21 deletions bg2md.rb
@@ -3,11 +3,10 @@
# BibleGateway passage lookup and parser to Markdown
# - Jonathan Clark, v1.5.0, 29.1.2024
#------------------------------------------------------------------------------
# Uses BibleGateway.com's passage lookup tool to find a passage and turn it into
# Markdown usable in other ways. It passes 'reference' through to the BibleGateway
# parser to work out what range of verses should be included.
# The reference term is concatenated to remove spaces, meaning it doesn't need to be
# 'quoted'. It does not yet support multiple passages.
# Uses BibleGateway.com's passage lookup tool to find a passage and turn it into Markdown usable in other ways.
# It passes 'reference' through to the BibleGateway parser to work out what range of verses should be included.
# The reference term is concatenated to remove spaces, meaning it doesn't need to be 'quoted'.
# It does not yet support multiple passages.
#
# The Markdown output includes:
# - passage reference
@@ -30,7 +29,6 @@
# - all <h2> meta-chapter titles, <hr />, most <span>s
#------------------------------------------------------------------------------
# TODO:
# - Allow spanning of more than one chapter
# - Decide whether to support returning more than one passage (e.g. "Mt1.1;Jn1.1")
#------------------------------------------------------------------------------
# Ruby String manipulation docs: https://ruby-doc.org/core-2.7.1/String.html#method-i-replace
@@ -51,11 +49,12 @@

# Regular expressions used to detect various parts of the HTML to keep and use
START_READ_CONTENT_RE = '<h1 class=[\'"]passage-display[\'"]>'.freeze # seem to see both versions of this -- perhaps Jude is an outlier?
END_READ_CONTENT_RE = '^<script '.freeze
END_READ_CONTENT_RE = '<section class="other-resources">|<section class="sponsors">'.freeze
# Match parts of lines which actually contain passage text
PASSAGE_RE = '(<p>\s*<span id=|<p class=|<p>\s?<span class=|<h3).*?(?:<\/p>|<\/h3>)'.freeze
# Match parts of lines which actually contain passage text -- this uses non-matching groups to allow both options and capture
MATCH_PASSAGE_RE = '((?:<p>\s*<span id=|<p class=|<p>\s?<span class=|<h3).*?(?:<\/p>|<\/h3>))'.freeze
# Match lines that give the reference and version info in a displayable form
REF_RE = '(<div class=\'bcv\'><div class="dropdown-display"><div class="dropdown-display-text">|<span class="passage-display-bcv">).*?(<\/div>|<\/span>)'.freeze
MATCH_REF_RE = '(?:<div class=\'bcv\'><div class="dropdown-display"><div class="dropdown-display-text">|<span class="passage-display-bcv">)(.*?)(?:<\/div>|<\/span>)'.freeze
VERSION_RE = '(<div class=\'translation\'><div class="dropdown-display"><div class="dropdown-display-text">|<span class="passage-display-version">).*?(<\/div>|<\/span>)'.freeze
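The switch from `REF_RE` to `MATCH_REF_RE` moves the capture from the surrounding tags to the text between them, by making the outer groups non-capturing. A minimal sketch of the difference, using a made-up sample line rather than live BibleGateway output:

```ruby
# Patterns copied from the diff above
REF_RE = '(<div class=\'bcv\'><div class="dropdown-display"><div class="dropdown-display-text">|<span class="passage-display-bcv">).*?(<\/div>|<\/span>)'.freeze
MATCH_REF_RE = '(?:<div class=\'bcv\'><div class="dropdown-display"><div class="dropdown-display-text">|<span class="passage-display-bcv">)(.*?)(?:<\/div>|<\/span>)'.freeze

# Hypothetical sample line (an assumption, not copied from a real page)
sample = '<h1 class="passage-display"><span class="passage-display-bcv">John 3:1-3</span></h1>'

# With the old pattern, capture group 1 is the opening tag
old_first = sample[Regexp.new(REF_RE), 1]
# With the new pattern, capture group 1 is the reference text itself
ref = sample[Regexp.new(MATCH_REF_RE), 1]

puts old_first
puts ref
```

With the non-capturing form, the reference can be read straight from group 1 without stripping the tags afterwards.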
@@ -229,6 +228,7 @@
line = input_lines[n]
n += 1
# add line to 'lump' if it's not one of hundreds of version options
# Note: join with space, for reasons I now don't remember
lump = lump + ' ' + line.strip if line !~ %r{<option.*</option>}
end
puts "Pass 1: 'Interesting' text = #{input_line_count} lines, #{lump.size} bytes." if opts[:verbose]
@@ -289,7 +289,7 @@
end
n += 1
end
puts if opts[:verbose]
# puts if opts[:verbose]

# Only continue if we have found the passage
if passage.empty?
@@ -310,8 +310,8 @@
#---------------------------------------
# Now process the main passage text
#---------------------------------------
# remove UNICODE U+00A0 (NBSP) characters (they are only used in BG for formatting not content)
passage.gsub!(/\u00A0/, '') # FIXME: ?? error as getting ASCII-8BIT string when using live data
# remove UNICODE U+00A0 (NBSP) characters (they are only used in BG for formatting not content) -- this was hard to find!
passage.gsub!(/\u00A0/, '')
# replace HTML &nbsp; and &amp; elements with ASCII equivalents
passage.gsub!(/&nbsp;/, ' ')
passage.gsub!(/&amp;/, '&')
@@ -323,30 +323,34 @@
# replace em dash with Markdown equivalent
passage.gsub!(/—/, '--')

# ignore a particular string in NIV
passage.gsub!(%r{<h3>More on the NIV</h3>}, '')
# ignore <h1> as it doesn't always appear (e.g. Jude)
passage.gsub!(%r{<h1.*?</h1>\s*}, '')
# ignore all <h2>book headings</h2>
passage.gsub!(%r{<h2>.*?</h2>}, '')
# ignore all <hr />
passage.gsub!(%r{<hr />}, '')

# simplify verse/chapters numbers (or remove entirely if that option set)
if opts[:numbering]
# Now see whether to start chapters and verses as H5 or H6
if opts[:newline]
# Extract the contents of the 'versenum' class (which should just be numbers, but we're not going to be strict)
passage.gsub!(%r{<sup class=".*?versenum.*?">\s*(\d+-?\d?)\s*</sup>}, "\n###### \\1 ")
passage.gsub!(%r{<sup\sclass="[^"]*?versenum[^"]*?">\s*?(\d+-?\d?)\s*?</sup>}, "\n###### \\1 ")
# verse number '1' seems to be omitted if start of a new chapter, and the chapter number is given.
passage.gsub!(%r{<span class=".*?chapternum.*?">\s*(\d+)\s*</span>}, "\n##### Chapter \\1\n###### 1 ")
passage.gsub!(%r{<span class="[^"]*?chapternum[^"]*?">\s*?(\d+)\s*?</span>}, "\n##### Chapter \\1\n###### 1 ")
else
# Extract the contents of the 'versenum' class (which should just be numbers, but we're not going to be strict)
passage.gsub!(%r{<sup class=".*?versenum.*?">\s*(\d+-?\d?)\s*</sup>}, '\1 ')
# Extract the contents of the 'versenum' class (either numbers or number range (for MSG))
passage.gsub!(%r{<sup\sclass="[^"]*?versenum[^"]*?">\s*?(\d+-?\d?)\s*?</sup>}, "\\1 ")
# verse number '1' seems to be omitted if start of a new chapter, and the chapter number is given.
passage.gsub!(%r{<span class=".*?chapternum.*?">\s*(\d+)\s*</span>}, '\1:1 ')
passage.gsub!(%r{<span class="[^"]*?chapternum[^"]*?">\s*?(\d+)\s*?</span>}, "\\1:1 ")
end
else
passage.gsub!(%r{<sup class=".*?versenum.*?">.*?</sup>}, '')
passage.gsub!(%r{<span class=".*?chapternum.*?">.*?</span>}, '')
passage.gsub!(%r{<sup class="[^"]*?versenum[^"]*?">.*?</sup>}, '')
passage.gsub!(%r{<span class="[^"]*?chapternum[^"]*?">.*?</span>}, '')
end

# Modify various things to their markdown equivalent
passage.gsub!(/<p.*?>/, "\n") # needs double quotes otherwise it doesn't turn this into newline
passage.gsub!(%r{</p>}, '')
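The regex changes in this hunk replace `.*?` inside the `class="..."` attribute with `[^"]*?`, which stops the match from running past the closing quote into later elements. The sketch below shows how the looser pattern can swallow preceding text. The HTML fragment is hypothetical, and whether this is the exact cause of the omitted verses mentioned in the CHANGELOG is an assumption.

```ruby
# Hypothetical fragment: a span with a class attribute precedes the chapter marker
html = '<span class="text Gen-1-31">God saw all that he had made.</span> ' \
       '<span class="chapternum">2 </span>Thus the heavens'

# Old and new chapter-number patterns, copied from the diff above
old_re = %r{<span class=".*?chapternum.*?">\s*(\d+)\s*</span>}
new_re = %r{<span class="[^"]*?chapternum[^"]*?">\s*?(\d+)\s*?</span>}

# The old pattern starts at the first <span class=" and runs on to the chapter
# marker, so the earlier verse text is replaced along with it
old_out = html.gsub(old_re, '\1:1 ')
# The new pattern cannot cross a double quote, so only the chapter span matches
new_out = html.gsub(new_re, '\1:1 ')

puts old_out
puts new_out
```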
@@ -367,14 +371,14 @@
passage.gsub!(%r{<span style="font-variant: small-caps" class="small-caps">Lord</span>}, 'LORD')
# Change the red text for Words of Jesus to be bold instead (if wanted)
passage.gsub!(%r{<span class="woj">(.*?)</span>}, '**\1**') if opts[:boldwords]
# simplify footnotes (or remove if that option set). Complex so do in several stages.
# simplify footnotes (or remove if that option set). Complex so do in several stages
if opts[:footnotes]
passage.gsub!(/<sup data-fn=\'.*?>/, '<sup>')
passage.gsub!(%r{<sup>\[<a href.*?>(.*?)</a>\]</sup>}, '[^\1]')
else
passage.gsub!(%r{<sup data-fn.*?<\/sup>}, '')
end
# simplify cross-references (or remove if that option set).
# simplify cross-references (or remove if that option set)
if opts[:crossrefs]
passage.gsub!(%r{<sup class='crossreference'.*?See cross-reference (\w+).*?</sup>}, '[^\1]')
else
@@ -414,7 +418,7 @@

# Create an alphabetical hash of numbers (Mod 26) to mimic their
# footnote numbering scheme (a..zz). Taken from
# https://stackoverflow.com/questions/14632304/generate-letters-to-represent-number-using[math - Generate letters to represent number using ruby? - Stack Overflow](https://stackoverflow.com/questions/14632304/generate-letters-to-represent-number-using-ruby)
# [math - Generate letters to represent number using ruby? - Stack Overflow](https://stackoverflow.com/questions/14632304/generate-letters-to-represent-number-using-ruby)
hf = {}
('a'..'zz').each_with_index { |w, i| hf[i + 1] = w }
# Create an alphabetical hash of numbers (Mod 26) to mimic their
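For reference, the `hf` hash built above maps 1-based positions to the letter labels `a` to `zz`, relying on Ruby's string range to step from `z` to `aa`. A small standalone sketch of what it contains:

```ruby
# Same construction as in the script: position => letter label
hf = {}
('a'..'zz').each_with_index { |w, i| hf[i + 1] = w }

puts hf[1]   # first label
puts hf[26]  # last single-letter label
puts hf[27]  # first two-letter label
puts hf.size # 26 single letters + 26 * 26 pairs
```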
@@ -445,7 +449,8 @@
end
output_text += copyright.to_s if opts[:copyright]

# Then write out text to screen
# Then write out text
puts
puts output_text
# And also copy it to clipboard
Clipboard.copy(output_text)
13 changes: 10 additions & 3 deletions bg_HTML_structure.md
@@ -30,16 +30,22 @@ There is a _tremendous_ amount of guff in the file. The key parts, which I have
<sup data-fn='...' class='footnote' ... >
Pharisee<sup data-fn='#fen-NET-26112a' class='footnote' data-link='[&lt;a href=&quot;#fen-NET-26112a&quot; title=&quot;See footnote a&quot;&gt;a&lt;/a&gt;]'>[<a href="#fen-NET-26112a" title="See footnote a">a</a>]</sup> named Nicodemus, who was a member of the Jewish ruling council,<sup data-fn='#fen-NET-26112b' class='footnote' data-link='[&lt;a href=&quot;#fen-NET-26112b&quot; title=&quot;See footnote b&quot;&gt;b&lt;/a&gt;]'>[<a href="#fen-NET-26112b" title="See footnote b">b</a>]</sup> </span> <span id="en-NET-26113" class="text John-3-2"><sup class="versenum">2 </sup>came to Jesus<sup data-fn='#fen-NET-26113c' class='footnote' data-link='[&lt;a href=&quot;#fen-NET-26113c&quot; title=&quot;See footnote c&quot;&gt;c&lt;/a&gt;]'>[<a href="#fen-NET-26113c" title="See footnote c">c</a>]</sup> at night<sup data-fn='#fen-NET-26113d' class='footnote' data-link='[&lt;a href=&quot;#fen-NET-26113d&quot; title=&quot;See footnote d&quot;&gt;d&lt;/a&gt;]'>[<a href="#fen-NET-26113d" title="See footnote d">d</a>]</sup> and said to him, “Rabbi, we know that you are a teacher who has come from God. 
For no one could perform the miraculous signs<sup data-fn='#fen-NET-26113e' class='footnote' data-link='[&lt;a href=&quot;#fen-NET-26113e&quot; title=&quot;See footnote e&quot;&gt;e&lt;/a&gt;]'>[<a href="#fen-NET-26113e" title="See footnote e">e</a>]</sup> that you do unless God is with him.” </span> <span id="en-NET-26114" class="text John-3-3"><sup class="versenum">3 </sup>Jesus replied,<sup data-fn='#fen-NET-26114f' class='footnote' data-link='[&lt;a href=&quot;#fen-NET-26114f&quot; title=&quot;See footnote f&quot;&gt;f&lt;/a&gt;]'>[<a href="#fen-NET-26114f" title="See footnote f">f</a>]</sup> “I tell you the solemn truth,<sup data-fn='#fen-NET-26114g' class='footnote' data-link='[&lt;a href=&quot;#fen-NET-26114g&quot; title=&quot;See footnote g&quot;&gt;g&lt;/a&gt;]'>[<a href="#fen-NET-26114g" title="See footnote g">g</a>]</sup> unless a person is born from above,<sup data-fn='#fen-NET-26114h' class='footnote' data-link='[&lt;a href=&quot;#fen-NET-26114h&quot; title=&quot;See footnote h&quot;&gt;h&lt;/a&gt;]'>[<a href="#fen-NET-26114h" title="See footnote h">h</a>]</sup> he cannot see the kingdom of God.”<sup data-fn='#fen-NET-26114i' class='footnote' data-link='[&lt;a href=&quot;#fen-NET-26114i&quot; title=&quot;See footnote i&quot;&gt;i&lt;/a&gt;]'>[<a href="#fen-NET-26114i" title="See footnote i">i</a>]</sup> </span> </p>
```
- Within that, the Verse numbers are:
- Chapters are marked as:
```html
<span class="chapternum">4 </span>
```
Note: verse number '1' seems to be omitted if start of a new chapter, and the chapter number is given.
- Verse numbers are marked as:
```html
<span id="en-NLT-28073" class="text Rom-7-20"><sup class="versenum">20 </sup>
```
Except for version MSG:
```html
<sup class="versenum">5-8</sup>"
```
Note: verse number '1' seems to be omitted if start of a new chapter, and the chapter number is given.
- Words of Jesus (where available) are annotated:
Note: The extra space after the verse number is actually a Unicode NBSP character, not a standard space.
Note: Sometimes it seems that the character after `<sup` is also not a standard space character.
- The words of Jesus (where available) are marked as:
```html
<span class="woj">...</span>
```
@@ -53,6 +59,7 @@ There is a _tremendous_ amount of guff in the file. The key parts, which I have
```html
<div class="publisher-info-bottom ... <a href="...">New English Translation</a> (NET)</strong> <p>NET Bible® copyright ©1996-2017 by Biblical Studies Press, L.L.C. http://netbible.com All rights reserved.</p></div></div>
```
- Other, non-passage content starts at `<section class="other-resources">` (as of 2024); earlier it started at `<section class="sponsors">`
Other important notes:
- The character after the verse number in `<sup class="versenum">20 </sup>` is actually Unicode Character U+00A0 No-Break Space (NBSP), not a standard space. This was a tough one to find! bg2md removes these characters before processing the passage text.
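The NBSP matters because Ruby's `\s` matches only ASCII whitespace, so the verse-number pattern does not match until the U+00A0 characters are stripped. A minimal sketch using the verse-number regex from bg2md.rb and a hand-built sample string (an assumption, not real page output):

```ruby
# Hand-built sample: the character after "20" is U+00A0, not an ASCII space
verse = "<sup class=\"versenum\">20\u00A0</sup>"

# Verse-number pattern as it appears in bg2md.rb after this commit
versenum_re = %r{<sup\sclass="[^"]*?versenum[^"]*?">\s*?(\d+-?\d?)\s*?</sup>}

# \s does not match NBSP, so the raw string is left unchanged
raw_out = verse.gsub(versenum_re, '\1 ')

# Strip NBSP first (as the script does), then the pattern matches
cleaned = verse.gsub(/\u00A0/, '')
clean_out = cleaned.gsub(versenum_re, '\1 ')

puts raw_out == verse
puts clean_out
```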