Refactor search query with quotes #2491

Merged Dec 1, 2022 (24 commits)
Commits
60b1aaa
Simplify stopwords file ingest
mrharpo Nov 15, 2022
529e5bc
Refactor query_to_terms_array to helper functions
mrharpo Nov 16, 2022
373f400
Only grab quotes if there are an even number of double quotes
mrharpo Nov 16, 2022
197d73e
Strip punctuation and split quoted terms
mrharpo Nov 17, 2022
79d4f75
Unquoted terms each in own array
mrharpo Nov 17, 2022
3db8da5
Format with rufo
mrharpo Nov 18, 2022
edf4d0c
Rename get_stopwords -> stopwords
mrharpo Nov 18, 2022
8b60d76
Migrate to lib/query_to_terms_array.rb
mrharpo Nov 18, 2022
94e9139
Remove stopwords before nesting arrays
mrharpo Nov 18, 2022
9451dee
Fix spec to use new QueryToTermsArray
mrharpo Nov 18, 2022
45b1cef
Fix formatting with rubocop
mrharpo Nov 21, 2022
30d5daa
Add more tests to confirm behavior is as expected
mrharpo Nov 21, 2022
1561c40
Better handling of edge cases
mrharpo Nov 22, 2022
3d8c149
Add additional tests for double quotes
mrharpo Nov 22, 2022
74953e3
Add inline documentation
mrharpo Nov 22, 2022
b91794a
Simplify ruby notation
mrharpo Nov 23, 2022
cd5dc37
Fix extra space to make rubocop happy
mrharpo Nov 23, 2022
3460f02
Adds spec for QueryToTermsArray
afred Nov 30, 2022
95efd5f
Adds inline docs
afred Nov 30, 2022
8ad595b
Remove ArgumentError on empty query, fix tests
mrharpo Nov 30, 2022
0e1c556
Fix %W() -> %w() to please rubocop
mrharpo Nov 30, 2022
86f3260
Appease rubocop by fixing missing space
mrharpo Nov 30, 2022
3307f29
rubocop
afred Nov 30, 2022
ec50136
Migrate [:alpha:] -> [:alnum:], add test
mrharpo Nov 30, 2022
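
The last commit swaps the POSIX bracket class `[:alpha:]` for `[:alnum:]`. A minimal sketch of why that matters for queries containing digits; the sample string is illustrative, and the `[:alpha:]` pattern is an assumption about the earlier revision, which this diff does not show:

```ruby
# Illustrative input; any query mixing letters and digits would do.
sample = 'LOST YEAR 1958-59'

# Keeping only letters and spaces turns every digit into a space.
alpha_only = sample.gsub(/[^[:alpha:] ]/, ' ').squeeze(' ').strip

# Keeping letters, digits, and spaces preserves the years.
alnum_kept = sample.gsub(/[^[:alnum:] ]/, ' ').squeeze(' ').strip

puts alpha_only # LOST YEAR
puts alnum_kept # LOST YEAR 1958 59
```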
2 changes: 1 addition & 1 deletion app/controllers/catalog_controller.rb
@@ -194,7 +194,7 @@ def index

# pull this out because we're going to mutate it inside terms_array method
@query = params[:q].dup
-@terms_array = query_to_terms_array(@query)
+@terms_array = QueryToTermsArray.new(@query).terms_array

if !params[:f] || !params[:f][:access_types]
# Sets Access Level
2 changes: 1 addition & 1 deletion app/controllers/snippets_controller.rb
@@ -17,7 +17,7 @@ def show
snippet_data = {}

# make array of words from users search query
-terms_array = query_to_terms_array(params["query"])
+terms_array = QueryToTermsArray.new(params["query"]).terms_array

# do, a search
solr_docs = query_from_solr(solr_q)
37 changes: 2 additions & 35 deletions app/helpers/application_helper.rb
@@ -10,37 +10,6 @@ def convert_timestamp_to_seconds(timestamp)
nil
end

-def query_to_terms_array(query)
-return [] if !query || query.empty?
-
-stopwords = Rails.cache.fetch("stopwords") do
-sw = []
-File.read(Rails.root.join('jetty', 'solr', 'blacklight-core', 'conf', 'stopwords.txt')).each_line do |line|
-next if line.start_with?('#') || line.empty?
-sw << line.upcase.strip
-end
-sw
-end
-
-terms_array = if query.include?(%("))
-# pull out double quoted terms!
-quoteds = query.scan(/"([^"]*)"/)
-
-# now remove them from the remaining query
-quoteds.each { |q| query.remove!(q.first) }
-query = query.gsub(/[[:punct:]]/, '').upcase
-
-# put it all together (removing any term thats just a stopword)
-# and remove punctuation now that we've used our ""
-quoteds.flatten.map(&:upcase) + (query.split(" ").delete_if { |term| stopwords.any? { |stopword| stopword == term } })
-else
-query.split(" ").delete_if { |term| stopwords.any? { |stopword| stopword == term } }
-end
-
-# remove extra spaces and turn each term into word array
-terms_array.map { |term| term.upcase.strip.gsub(/[^\w\s]/, "").split(" ") }
-end

def get_last_day(month)
if %w(04 06 09 11).include?(month)
'30'
@@ -55,15 +24,13 @@ def handle_date_string(date_val, type)
# type => before, after, index
# 0000-00-00
if /\A\d{4}\-\d{1,2}\-\d{1,2}\z/ =~ date_val

year, month, day = date_val.scan(/\A(\d{4})\-(\d{1,2})\-(\d{1,2})\z/).flatten

-# 0000-00
+# 0000-00
elsif /\A\d{4}\-\d{1,2}\z/ =~ date_val

year, month = date_val.scan(/\A(\d{4})\-(\d{1,2})\z/).flatten

-# 0000
+# 0000
elsif /\A\d{4}\z/ =~ date_val
date_was_reset = true
year = date_val
2 changes: 1 addition & 1 deletion app/views/catalog/index.html.erb
@@ -45,7 +45,7 @@
$(document).ready(function() {
<% if @query.present? && @snippets && @snippets.keys.present? %>
var guids = <%= raw(@snippets.keys).to_s %>
-var q = "<%= @query %>"
+var q = `<%= raw @query %>`
getSnippets(guids, q)
<% end %>
})
108 changes: 108 additions & 0 deletions lib/query_to_terms_array.rb
@@ -0,0 +1,108 @@
# Service class for converting a text query into a specially formatted array
# of terms to be used in highlighting text in transcript snippets.
#
# The class converts a text query into an array of terms according to the
# following rules:
# * Query terms are capitalized
# * Query terms not in double quotes are put into single-element arrays.
# * Query terms within double quotes are put into an array where each element
# is a term within the double-quoted phrase.
# * Stopwords are not removed from double-quoted phrases.
# * Stopwords are removed from unquoted query terms.
# * Special characters (any non-alphanumeric, non-space character) are removed.
#
# @see ./spec/lib/query_to_terms_array_spec.rb
# @see SnippetHelper::Snippet - class that uses output from
# QueryToTermsArray#terms_array
#
# @example
# query = "the french chef with Julia Child"
# QueryToTermsArray.new(query).terms_array
# => [["FRENCH"], ["CHEF"], ["JULIA"], ["CHILD"]]
#
# query = '"the french chef" with Julia Child'
# QueryToTermsArray.new(query).terms_array
# => [["THE", "FRENCH", "CHEF"], ["JULIA"], ["CHILD"]]
class QueryToTermsArray
attr_reader :query

# @param [String] query The search query
def initialize(query)
@query = query.to_s.upcase
end

# @return [Array<Array>] array where the first elements are arrays containing
# each term from a quoted phrase, including stopwords, but excluding special
# characters, followed by each unquoted term in the query, excluding both
# stopwords and special chars.
def terms_array
quoted_terms_arrays + unquoted_terms_arrays
end

private

# Get cached list of stopwords from stopwords.txt
# @return [Array<String>] array of stopwords
def stopwords
Rails.cache.fetch('stopwords') do
sw = File.readlines(Rails.root.join('jetty', 'solr', 'blacklight-core', 'conf', 'stopwords.txt'), chomp: true).map(&:upcase)

# Remove comments and empty lines
sw.reject do |word|
word =~ /^#/ || word.empty?
end
end
end

# @return [Array<String>] double-quoted phrases from #query.
def quoted_phrases
query.
# Match any double quoted phrase and capturing the stuff in between,
scan(/"([^"]*)"/).
# and grab the first (and only) thing captured.
map(&:first)
end

# @param [String]
# @return [String] the original string minus all non-alphanumeric, non-space
# characters, and all repeated whitespace collapsed into single space, and
# whitespace stripped from front and back.
def strip_special_chars(str)
str.
# Replace any non-alphanumeric, non-space character with a single space,
gsub(/[^[:alnum:] ]/, ' ').
# and collapse multiple whitespace down to single space,
gsub(/\s+/, ' ').
# and strip whitespace off front and back of string.
strip
end

# @return [Array<Array>] array of single-element arrays, each of which contain
# a single term from #unquoted_terms.
def unquoted_terms_arrays
unquoted_terms.map { |term| Array(term) }
end

# @return [Array] all terms from the original query that are not contained
# within double quotes.
def unquoted_terms
strip_special_chars(unquoted_query).split - stopwords
end

# @return [String] the original query minus any double-quoted phrases.
def unquoted_query
query_copy = query.dup
quoted_phrases.each do |quoted_phrase|
query_copy.remove!(quoted_phrase)
end
query_copy
end

# @return [Array<Array>] list of double-quoted phrases where each phrase has
# been converted into an array of terms.
def quoted_terms_arrays
quoted_phrases.map do |quoted_phrase|
strip_special_chars(quoted_phrase).split
end
end
end
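
The rules documented in the class header can be tried outside Rails with a sketch like the one below. It is not the code from this PR: `STOPWORDS` is a hypothetical stand-in for the cached contents of stopwords.txt, the top-level `terms_array` method is an illustrative flattening of the class, and `gsub` replaces ActiveSupport's `String#remove!`.

```ruby
# Hypothetical stand-in for the cached contents of stopwords.txt.
STOPWORDS = %w(A IS OF THE WITH).freeze

# Same substitutions as strip_special_chars in the diff above.
def strip_special_chars(str)
  str.gsub(/[^[:alnum:] ]/, ' ').gsub(/\s+/, ' ').strip
end

# Plain-Ruby sketch of what QueryToTermsArray#terms_array computes.
def terms_array(raw_query)
  query = raw_query.to_s.upcase
  quoted_phrases = query.scan(/"([^"]*)"/).map(&:first)

  # String#remove! comes from ActiveSupport; gsub with an empty
  # replacement is the plain-Ruby equivalent used here.
  unquoted_query = quoted_phrases.reduce(query) { |q, phrase| q.gsub(phrase, '') }

  # Quoted phrases keep their stopwords; unquoted terms lose them.
  quoted = quoted_phrases.map { |phrase| strip_special_chars(phrase).split }
  unquoted = (strip_special_chars(unquoted_query).split - STOPWORDS).map { |term| [term] }
  quoted + unquoted
end

p terms_array('"the french chef" with Julia Child')
# => [["THE", "FRENCH", "CHEF"], ["JULIA"], ["CHILD"]]
p terms_array('lost year "1958-59" 1960')
# => [["1958", "59"], ["LOST"], ["YEAR"], ["1960"]]
```

Quoted phrases always come first in the result, regardless of where they appear in the query, because the two lists are built separately and then concatenated.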
10 changes: 0 additions & 10 deletions spec/helpers/snippet_helper_spec.rb
@@ -77,16 +77,6 @@
end
end

-describe 'query to terms array' do
-# it 'removes punctuation from and capitalizes the user query' do
-# expect(clean_query_for_snippet(query_with_punctuation)).to eq(test_array)
-# end
-
-it 'uses stopwords.txt to remove words not used in actual search' do
-expect(query_to_terms_array(%(extremist is cheddar "president of the Eisenhower"))).to eq([%w(PRESIDENT OF THE EISENHOWER), ["EXTREMIST"], ["CHEDDAR"]])
-end
-end

describe 'view snippet helpers' do
it 'creates a timecode transcript snippet for the frontend' do
expect(transcript_snippet(transcript_snippet_1.snippet, "Moving Image", transcript_snippet_1.url_at_timecode)).to eq(%(\n <span class=\"index-data-title\">From Transcript</span>:\n <p style=\"margin-top: 0;\"> FOR THIS 15TH ANNIVERSARY CELEBRATION AND DEDICATION CEREMONY IS MR GEORGE CAMPBELL CHAIRMAN OF THE <mark>ARKANSAS</mark> EDUCATIONAL TELEVISION COMMISSION GOOD AFTERNOON DISTINGUISHED GUESTS LADIES AND GENTLEMEN \n \n <a href=\"/catalog/cpb-aacip-111-21ghx7d6?term=ARKANSAS&proxy_start_time=50.24\">\n <button type=\"button\" class=\"btn btn-default snippet-link\">Watch from here</button>\n </a>\n \n </p>\n ))
92 changes: 92 additions & 0 deletions spec/lib/query_to_terms_array_spec.rb
@@ -0,0 +1,92 @@
require 'rails_helper'

RSpec.describe QueryToTermsArray do
describe '#terms_array' do
let(:terms_array) { QueryToTermsArray.new(query).terms_array }

context 'when query is empty' do
let(:query) { '' }
it 'returns an empty array' do
expect(terms_array).to eq []
end
end

context 'when query is not empty' do
context 'and when query contains no quoted terms' do
let(:query) { 'query without quoted terms' }
it 'returns an array where each element is a single-element array ' \
'containing each unquoted term from the query' do
expect(terms_array).to eq [["QUERY"], ["WITHOUT"], ["QUOTED"], ["TERMS"]]
end

context 'and query contains punctuation' do
let(:query) { %(`show_, ^me %+/- ice ? $@* cream}) }
it 'returns an array containing each term without punctuation' do
expect(terms_array).to eq [["SHOW"], ["ME"], ["ICE"], ["CREAM"]]
end
end
end

context 'and when query contains only a single quoted phrase' do
let(:query) { '"quoted phrase"' }
it 'returns an array where first and only element is an array of ' \
'terms from the quoted phrase' do
expect(terms_array).to eq [%w(QUOTED PHRASE)]
end
end

context 'when query contains a mix of unquoted terms and quoted phrases' do
let(:query) { 'unquoted stuff "quoted phrase"' }
it 'returns an array where the first element is an array containing ' \
'each term from the quoted phrase, and the remaining elements are ' \
'single-element arrays containing the unquoted terms from the query' do
expect(terms_array).to eq [%w(QUOTED PHRASE), ['UNQUOTED'], ['STUFF']]
end
end

context 'when query contains multiple quoted phrases' do
let(:query) { '"quoted phrase one" "another quoted phrase"' }
it 'returns an array where each element is an array containing the ' \
'terms from quoted phrases' do
expect(terms_array).to eq [%w(QUOTED PHRASE ONE), %w(ANOTHER QUOTED PHRASE)]
end
end

context 'when there are an odd number of quotation marks' do
let(:query) { %("broken quotation" marks") }
it 'ignores the last odd quotation mark' do
expect(terms_array).to eq([%w(BROKEN QUOTATION), ["MARKS"]])
end
end

context 'when query contains a quoted phrase with non-alphanumeric characters' do
let(:query) { %("This` is_, a^quoted %+/- phrase ? $@*") }
it 'returns an array where the elements are arrays of terms from the ' \
'quoted phrase with all non-alphanumeric chars removed' do
expect(terms_array).to eq [%w(THIS IS A QUOTED PHRASE)]
end
end

context 'when query contains an unquoted stopword' do
let(:query) { %(a search with no stopworda or stopwordb) }
it 'uses stopwords.txt to remove words not used in actual search' do
expect(terms_array).to eq([%w(SEARCH)])
end
end

context 'when query contains a quoted stopword' do
let(:query) { %(extremist is cheddar "president of the Eisenhower") }
it 'preserves the stopword in the search' do
expect(terms_array).to eq([%w(PRESIDENT OF THE EISENHOWER), ["EXTREMIST"], ["CHEDDAR"]])
end
end

context 'when query contains numbers' do
let(:query) { %(lost year "1958-59" 1960) }
it 'leaves numbers in quoted and unquoted terms' do
expect(terms_array).to eq([%w(1958 59), ["LOST"], ["YEAR"], ["1960"]])
end
end
end
end
end
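
The odd-quotation-mark case in the spec follows from how `String#scan` pairs quotes from left to right. A small plain-Ruby sketch with an illustrative, already upcased input (using `gsub` in place of ActiveSupport's `remove!`):

```ruby
query = '"BROKEN QUOTATION" MARKS"'

# scan consumes quotes in pairs from the left; the trailing, unpaired
# quote never matches, so it stays in the unquoted remainder.
phrases = query.scan(/"([^"]*)"/).map(&:first)
p phrases # => ["BROKEN QUOTATION"]

# Removing the matched phrase text leaves the quote characters behind;
# the special-character stripping step later turns them into spaces.
leftover = query.gsub(phrases.first, '')
p leftover # => "\"\" MARKS\""
```

Because pairing starts at the left, a stray quote earlier in the query shifts which text counts as quoted, so this behavior only "ignores" an unpaired quote when it is the last one.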