Split downloads by size (#582)
* can break up bagit zips by a hardcoded record amount

* breaking bags into folders appropriately

* splitting bagit zips by size

* missed an "end"

* zip using only the new method

* add rerun_all_exporters task

* move a puts

* multiple zips show up on the exporter index page

* wip: new download button

* able to download zips from index page again!
co-author: shana@scientist.com

* able to download zips from the exporter show page
co-author: shana@scientist.com

* splitting download sizes for csv exports too

* style the download buttons
co-author: shana@scientist.com

* lint fix

* more linting

* specs pass

* fix extra count issue

* sort download options in ui

* handle works with more than one file_set, move 1000 to a method

* always find or create the file set entry on export

* sort the exported zip files properly

* fixing specs

* add bagit parser spec for #find_child_file_sets

* 100% coverage for #find_child_file_sets

* remove byebug

* add spec for #records_split_count

* create exporter specs

* create spec for #setup_export_file

* add bagit parser specs for new folder system

* consistent use of "subject" instead of "parser" in the bagit parser specs

* all specs passing

* add a spec for #write_files

* linting

* refactor #find_child_file_sets

Co-authored-by: kirkkwang <k3wang@gmail.com>
alishaevn and kirkkwang authored Jul 15, 2022
1 parent 199fd9c commit 4fe1840
Showing 14 changed files with 282 additions and 77 deletions.
2 changes: 1 addition & 1 deletion app/controllers/bulkrax/exporters_controller.rb
@@ -127,7 +127,7 @@ def add_exporter_breadcrumbs
# Download methods

def file_path
-@exporter.exporter_export_zip_path
+"#{@exporter.exporter_export_zip_path}/#{params['exporter']['exporter_export_zip_files']}"
end
end
end
17 changes: 15 additions & 2 deletions app/models/bulkrax/exporter.rb
@@ -124,9 +124,13 @@ def exporter_export_path
end

def exporter_export_zip_path
-@exporter_export_zip_path ||= File.join(parser.base_path('export'), "export_#{self.id}_#{self.exporter_runs.last.id}.zip")
+@exporter_export_zip_path ||= File.join(parser.base_path('export'), "export_#{self.id}_#{self.exporter_runs.last.id}")
rescue
-@exporter_export_zip_path ||= File.join(parser.base_path('export'), "export_#{self.id}_0.zip")
+@exporter_export_zip_path ||= File.join(parser.base_path('export'), "export_#{self.id}_0")
end

def exporter_export_zip_files
@exporter_export_zip_files ||= Dir["#{exporter_export_zip_path}/**"].map { |zip| Array(zip.split('/').last) }
end

def export_properties
@@ -137,5 +141,14 @@ def export_properties
def metadata_only?
export_type == 'metadata'
end

def sort_zip_files(zip_files)
zip_files.sort_by do |item|
number = item.split('_').last.match(/\d+/)&.[](0) || 0.to_s
sort_number = number.rjust(4, "0")

sort_number
end
end
end
end
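
The `sort_zip_files` method above orders the download options by the number at the end of each zip name, zero-padded to four digits so that `10` sorts after `2`. A minimal standalone sketch of the same idea follows; the file names are hypothetical and only assume the `export_<exporter id>_<run id>_<folder>.zip` pattern produced by this commit.

```ruby
# Standalone sketch of the zero-padded sort (not the exact method from the commit).
def sort_zip_files(zip_files)
  zip_files.sort_by do |name|
    # Take the last underscore-separated segment, e.g. "10.zip", and pull out its digits.
    number = name.split('_').last[/\d+/] || '0'
    # Left-pad to four characters so string comparison matches numeric order.
    number.rjust(4, '0')
  end
end

# Hypothetical file names for illustration only.
names = ['export_1_2_10.zip', 'export_1_2_2.zip', 'export_1_2_1.zip']
puts sort_zip_files(names)
```

A plain `names.sort` would place `export_1_2_10.zip` before `export_1_2_2.zip`. Note that four-character padding only preserves numeric order up to 9999 folders.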
22 changes: 0 additions & 22 deletions app/models/concerns/bulkrax/export_behavior.rb
@@ -7,9 +7,6 @@ module ExportBehavior

def build_for_exporter
build_export_metadata
-# TODO(alishaevn): determine if the line below is still necessary
-# the csv and bagit parsers also have write_files methods
-write_files if export_type == 'full' && !importerexporter.parser_klass.include?('Bagit')
rescue RSolr::Error::Http, CollectionsCreatedError => e
raise e
rescue StandardError => e
@@ -26,25 +23,6 @@ def hyrax_record
@hyrax_record ||= ActiveFedora::Base.find(self.identifier)
end

-def write_files
-return if hyrax_record.is_a?(Collection)
-
-file_sets = hyrax_record.file_set? ? Array.wrap(hyrax_record) : hyrax_record.file_sets
-file_sets << hyrax_record.thumbnail if hyrax_record.thumbnail.present? && hyrax_record.work? && exporter.include_thumbnails
-file_sets.each do |fs|
-path = File.join(exporter_export_path, 'files')
-FileUtils.mkdir_p(path)
-file = filename(fs)
-require 'open-uri'
-io = open(fs.original_file.uri)
-next if file.blank?
-File.open(File.join(path, file), 'wb') do |f|
-f.write(io.read)
-f.close
-end
-end
-end
-
# Prepend the file_set id to ensure a unique filename and also one that is not longer than 255 characters
def filename(file_set)
return if file_set.original_file.blank?
12 changes: 8 additions & 4 deletions app/parsers/bulkrax/application_parser.rb
@@ -260,10 +260,14 @@ def unzip(file_to_unzip)
end

def zip
-FileUtils.rm_rf(exporter_export_zip_path)
-Zip::File.open(exporter_export_zip_path, create: true) do |zip_file|
-Dir["#{exporter_export_path}/**/**"].each do |file|
-zip_file.add(file.sub("#{exporter_export_path}/", ''), file)
+FileUtils.mkdir_p(exporter_export_zip_path)
+
+Dir["#{exporter_export_path}/**"].each do |folder|
+zip_path = "#{exporter_export_zip_path.split('/').last}_#{folder.split('/').last}.zip"
+Zip::File.open(File.join("#{exporter_export_zip_path}/#{zip_path}"), create: true) do |zip_file|
+Dir["#{folder}/**/**"].each do |file|
+zip_file.add(file.sub("#{folder}/", ''), file)
+end
end
end
end
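
With this change `exporter_export_zip_path` is a directory rather than a single `.zip` file, and each numbered export folder gets its own archive inside it. The sketch below reproduces only the naming step, without the rubyzip calls; `zip_file_paths` is a hypothetical helper and the paths are made up for illustration.

```ruby
# Sketch of how one zip path per export folder is derived (naming only, no archiving).
# zip_file_paths is a hypothetical helper, not part of Bulkrax.
def zip_file_paths(zip_dir, folders)
  prefix = File.basename(zip_dir) # e.g. "export_1_2"
  folders.map do |folder|
    # Each folder name ("1", "2", ...) becomes the suffix of its zip file.
    File.join(zip_dir, "#{prefix}_#{File.basename(folder)}.zip")
  end
end

# Hypothetical directories for illustration only.
puts zip_file_paths('tmp/exports/export_1_2', ['tmp/exports/1/2/1', 'tmp/exports/1/2/2'])
```

The trailing folder number in each name is what the exporter's `sort_zip_files` method later uses to order the download list.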
54 changes: 38 additions & 16 deletions app/parsers/bulkrax/bagit_parser.rb
@@ -113,6 +113,9 @@ def current_record_ids
when 'importer'
set_ids_for_exporting_from_importer
end

find_child_file_sets(@work_ids) if importerexporter.export_from == 'collection' || importerexporter.export_from == 'worktype'

@work_ids + @collection_ids + @file_set_ids
end

@@ -122,18 +125,27 @@ def current_record_ids
def write_files
require 'open-uri'
require 'socket'

folder_count = 1
records_in_folder = 0

importerexporter.entries.where(identifier: current_record_ids)[0..limit || total].each do |entry|
record = ActiveFedora::Base.find(entry.identifier)
next unless Hyrax.config.curation_concerns.include?(record.class)
bag = BagIt::Bag.new setup_bagit_folder(entry.identifier)

bag_entries = [entry]
file_set_entries = Bulkrax::CsvFileSetEntry.where(importerexporter_id: importerexporter.id).where("parsed_metadata LIKE '%#{record.id}%'")
file_set_entries.each { |fse| bag_entries << fse }

record.file_sets.each do |fs|
if @file_set_ids.present?
file_set_entry = Bulkrax::CsvFileSetEntry.where("parsed_metadata LIKE '%#{fs.id}%'").first
bag_entries << file_set_entry unless file_set_entry.nil?
end
records_in_folder += bag_entries.count
if records_in_folder > records_split_count
folder_count += 1
records_in_folder = bag_entries.count
end

bag ||= BagIt::Bag.new setup_bagit_folder(folder_count, entry.identifier)

record.file_sets.each do |fs|
file_name = filename(fs)
next if file_name.blank?
io = open(fs.original_file.uri)
@@ -148,17 +160,21 @@ def write_files
end
end

-CSV.open(setup_csv_metadata_export_file(entry.identifier), "w", headers: export_headers, write_headers: true) do |csv|
+CSV.open(setup_csv_metadata_export_file(folder_count, entry.identifier), "w", headers: export_headers, write_headers: true) do |csv|
bag_entries.each { |csv_entry| csv << csv_entry.parsed_metadata }
end
-write_triples(entry)
+
+write_triples(folder_count, entry)
bag.manifest!(algo: 'sha256')
end
end
# rubocop:enable Metrics/MethodLength, Metrics/AbcSize

-def setup_csv_metadata_export_file(id)
-File.join(importerexporter.exporter_export_path, id, 'metadata.csv')
+def setup_csv_metadata_export_file(folder_count, id)
+path = File.join(importerexporter.exporter_export_path, folder_count.to_s)
+FileUtils.mkdir_p(path) unless File.exist?(path)
+
+File.join(path, id, 'metadata.csv')
end

def key_allowed(key)
@@ -167,21 +183,27 @@ def key_allowed(key)
key != source_identifier.to_s
end

-def setup_triple_metadata_export_file(id)
-File.join(importerexporter.exporter_export_path, id, 'metadata.nt')
+def setup_triple_metadata_export_file(folder_count, id)
+path = File.join(importerexporter.exporter_export_path, folder_count.to_s)
+FileUtils.mkdir_p(path) unless File.exist?(path)
+
+File.join(path, id, 'metadata.nt')
end

-def setup_bagit_folder(id)
-File.join(importerexporter.exporter_export_path, id)
+def setup_bagit_folder(folder_count, id)
+path = File.join(importerexporter.exporter_export_path, folder_count.to_s)
+FileUtils.mkdir_p(path) unless File.exist?(path)
+
+File.join(path, id)
end

-def write_triples(e)
+def write_triples(folder_count, e)
sd = SolrDocument.find(e.identifier)
return if sd.nil?

req = ActionDispatch::Request.new({ 'HTTP_HOST' => Socket.gethostname })
rdf = Hyrax::GraphExporter.new(sd, req).fetch.dump(:ntriples)
-File.open(setup_triple_metadata_export_file(e.identifier), "w") do |triples|
+File.open(setup_triple_metadata_export_file(folder_count, e.identifier), "w") do |triples|
triples.write(rdf)
end
end
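
The bagit `write_files` change keeps a running record count and starts a new numbered folder whenever adding the next bag would push the count past `records_split_count`. The following is a simplified sketch of that counter logic on plain integers; `assign_folders` and the sample sizes are hypothetical and stand in for the work entry plus its file set entries.

```ruby
# Sketch of the folder-rollover counter, using bag sizes instead of real entries.
# assign_folders is a hypothetical helper for illustration.
def assign_folders(bag_sizes, split_count)
  folder_count = 1
  records_in_folder = 0

  bag_sizes.map do |size|
    records_in_folder += size
    if records_in_folder > split_count
      # This bag does not fit: open the next folder and restart the count with it.
      folder_count += 1
      records_in_folder = size
    end
    folder_count
  end
end

# Each number is how many records one bag contributes; the limit is 7 records per folder.
p assign_folders([3, 4, 2, 5, 1], 7)
```

A bag is never split across folders under this scheme, so a single bag larger than the limit still lands in one folder by itself.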
53 changes: 48 additions & 5 deletions app/parsers/bulkrax/csv_parser.rb
@@ -4,6 +4,7 @@
module Bulkrax
class CsvParser < ApplicationParser # rubocop:disable Metrics/ClassLength
include ErroredEntries
include ExportBehavior
attr_writer :collections, :file_sets, :works

def self.export_supported?
@@ -207,6 +208,13 @@ def current_record_ids
@work_ids + @collection_ids + @file_set_ids
end

# find the related file set ids so entries can be made for export
def find_child_file_sets(work_ids)
work_ids.each do |id|
ActiveFedora::Base.find(id).file_set_ids.each { |fs_id| @file_set_ids << fs_id }
end
end

# Set the following instance variables: @work_ids, @collection_ids, @file_set_ids
# @see #current_record_ids
def set_ids_for_exporting_from_importer
@@ -283,6 +291,10 @@ def total
@total = 0
end

def records_split_count
1000
end

# @todo - investigate getting directory structure
# @todo - investigate using perform_later, and having the importer check for
# DownloadCloudFileJob before it starts
@@ -307,9 +319,37 @@ def retrieve_cloud_files(files)
# export methods

def write_files
-CSV.open(setup_export_file, "w", headers: export_headers, write_headers: true) do |csv|
-importerexporter.entries.where(identifier: current_record_ids)[0..limit || total].each do |e|
-csv << e.parsed_metadata
require 'open-uri'
folder_count = 0

importerexporter.entries.where(identifier: current_record_ids)[0..limit || total].in_groups_of(records_split_count, false) do |group|
folder_count += 1

CSV.open(setup_export_file(folder_count), "w", headers: export_headers, write_headers: true) do |csv|
group.each do |entry|
csv << entry.parsed_metadata
next if importerexporter.metadata_only? || entry.type == 'Bulkrax::CsvCollectionEntry'

store_files(entry.identifier, folder_count.to_s)
end
end
end
end

def store_files(identifier, folder_count)
record = ActiveFedora::Base.find(identifier)
file_sets = record.file_set? ? Array.wrap(record) : record.file_sets
file_sets << record.thumbnail if exporter.include_thumbnails && record.thumbnail.present? && record.work?
file_sets.each do |fs|
path = File.join(exporter_export_path, folder_count, 'files')
FileUtils.mkdir_p(path) unless File.exist? path
file = filename(fs)
io = open(fs.original_file.uri)
next if file.blank?

File.open(File.join(path, file), 'wb') do |f|
f.write(io.read)
f.close
end
end
end
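
The CSV `write_files` change batches entries with ActiveSupport's `in_groups_of(records_split_count, false)`, where the `false` argument means the last group is not padded. Outside Rails, plain Ruby's `each_slice` behaves the same way for this purpose. The sketch below uses it to show how entries map to numbered folders; `group_into_folders` is a hypothetical helper and the integers stand in for entries.

```ruby
# Sketch of the batching step using core Ruby instead of ActiveSupport's in_groups_of.
# group_into_folders is a hypothetical helper for illustration.
def group_into_folders(entries, split_count)
  entries.each_slice(split_count).each_with_index.map do |group, index|
    # Folders are numbered from 1, matching the folder_count counter in the commit.
    [index + 1, group]
  end.to_h
end

# Five stand-in entries with a limit of two per folder.
p group_into_folders([1, 2, 3, 4, 5], 2)
```

In the commit each group becomes one CSV file (plus a `files` directory for full exports) inside its numbered folder, with `records_split_count` returning 1000.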
@@ -356,8 +396,11 @@ def sort_headers(headers)
end

# in the parser as it is specific to the format
-def setup_export_file
-File.join(importerexporter.exporter_export_path, "export_#{importerexporter.export_source}_from_#{importerexporter.export_from}.csv")
+def setup_export_file(folder_count)
+path = File.join(importerexporter.exporter_export_path, folder_count.to_s)
+FileUtils.mkdir_p(path) unless File.exist?(path)
+
+File.join(path, "export_#{importerexporter.export_source}_from_#{importerexporter.export_from}_#{folder_count}.csv")
end

# Retrieve file paths for [:file] mapping in records
8 changes: 8 additions & 0 deletions app/views/bulkrax/exporters/_downloads.html.erb
@@ -0,0 +1,8 @@
<%= form.select :exporter_export_zip_files,
exporter.sort_zip_files(form.object.exporter_export_zip_files.flatten),
{},
{
class: 'btn btn-default form-control',
style: 'width: 200px'
}
%>
7 changes: 5 additions & 2 deletions app/views/bulkrax/exporters/index.html.erb
@@ -21,7 +21,7 @@
<th scope="col">Name</th>
<th scope="col">Status</th>
<th scope="col">Date Exported</th>
-<th scope="col"></th>
+<th scope="col">Downloadable Files</th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
@@ -35,7 +35,10 @@
<td><%= exporter.created_at %></td>
<td>
<% if File.exist?(exporter.exporter_export_zip_path) %>
-<%= link_to raw('<span class="glyphicon glyphicon-download"></span>'), exporter_download_path(exporter) %>
+<%= simple_form_for(exporter, method: :get, url: exporter_download_path(exporter)) do |form| %>
+<%= render 'downloads', exporter: exporter, form: form %>
+<%= form.button :submit, value: 'Download', data: { disable_with: false } %>
+<% end %>
<% end%>
</td>
<td><%= link_to raw('<span class="glyphicon glyphicon-info-sign"></span>'), exporter_path(exporter) %></td>
11 changes: 4 additions & 7 deletions app/views/bulkrax/exporters/show.html.erb
@@ -8,10 +8,11 @@
<div class='panel-body'>

<% if File.exist?(@exporter.exporter_export_zip_path) %>
-<p class='bulkrax-p-align'>
+<%= simple_form_for @exporter, method: :get, url: exporter_download_path(@exporter), html: { class: 'form-inline bulkrax-p-align' } do |form| %>
<strong>Download:</strong>
-<%= link_to raw('<span class="glyphicon glyphicon-download"></span>'), exporter_download_path(@exporter) %>
-</p>
+<%= render 'downloads', exporter: @exporter, form: form %>
+<%= form.button :submit, value: 'Download', data: { disable_with: false } %>
+<% end %>
<% end %>

<p class='bulkrax-p-align'>
@@ -135,10 +136,6 @@
<%= page_entries_info(@work_entries) %><br>
<%= paginate(@work_entries, param_name: :work_entries_page) %>
<br>
-<% if File.exist?(@exporter.exporter_export_zip_path) %>
-<%= link_to 'Download', exporter_download_path(@exporter) %>
-|
-<% end %>
<%= link_to 'Edit', edit_exporter_path(@exporter) %>
|
<%= link_to 'Back', exporters_path %>
32 changes: 28 additions & 4 deletions lib/tasks/bulkrax_tasks.rake
@@ -1,6 +1,30 @@
# frozen_string_literal: true

-# desc "Explaining what the task does"
-# task :bulkrax do
-# # Task goes here
-# end
namespace :bulkrax do
desc "Remove old exported zips and create new ones with the new file structure"
task rerun_all_exporters: :environment do
if defined?(::Hyku)
Account.find_each do |account|
puts "=============== updating #{account.name} ============"
next if account.name == "search"
switch!(account)

rerun_exporters_and_delete_zips

puts "=============== finished updating #{account.name} ============"
end
else
rerun_exporters_and_delete_zips
end
end

def rerun_exporters_and_delete_zips
begin
Bulkrax::Exporter.all.each { |e| Bulkrax::ExporterJob.perform_later(e.id) }
rescue => e
puts "(#{e.message})"
end

Dir["tmp/exports/**.zip"].each { |zip_path| FileUtils.rm_rf(zip_path) }
end
end
2 changes: 1 addition & 1 deletion spec/jobs/bulkrax/importer_job_spec.rb
@@ -30,7 +30,7 @@ module Bulkrax
importer_job.perform(importer.id)

expect(importer.current_run.total_work_entries).to eq(10)
-expect(importer.current_run.total_collection_entries).to eq(427)
+expect(importer.current_run.total_collection_entries).to eq(428)
end
end
