RISM Marcxml is a command line utility for managing MARCXML-files.
This program has the following options:
- --analyze: reports the tags of a MARCXML-file and how often each occurs
- --filter: builds a subset of records (e.g. from the complete XML open dataset of sources at http://opac.rism.info)
- --help: shows the help text
- --merge: merges multiple MARCXML-files into one file
- --report: creates a report
- --split: splits large files into chunks
- --transform: transforms MARCXML-files
- --validate: validates a MARCXML-file
The --analyze option creates a report of all fields in the input-file, with the number of occurrences of each.
Optional: --with-content: add sample content at end of line
Example call: marcxml -i input.xml -c config.yaml -o output.txt --with-content
Example output:
100$0: 798888 (30020630)
100$a: 798888 (Schmitt, Jakob)
100$d: 765593 (1799-1853)
100$j: 500026 (Conjectural)
130$0: 612 (40000133)
130$a: 40222 (8 Minuets)
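The counting behind such a report can be sketched in a few lines of Ruby. This is only an illustration, not the tool's implementation: it assumes the records have already been parsed into arrays of `[key, value]` pairs (a hypothetical in-memory shape), `field_report` is a made-up helper name, and taking the first value seen as sample content is an assumption.

```ruby
# Sketch only: each record is assumed to be an array of ["tag$code", value] pairs.
def field_report(records, with_content: false)
  counts = Hash.new(0)
  samples = {}
  records.each do |record|
    record.each do |key, value|
      counts[key] += 1
      samples[key] ||= value # keep the first value seen as sample content
    end
  end
  # One line per field, sorted by key, optionally with the sample in parentheses.
  counts.keys.sort.map do |key|
    line = "#{key}: #{counts[key]}"
    with_content ? "#{line} (#{samples[key]})" : line
  end
end

records = [
  [['100$a', 'Schmitt, Jakob'], ['100$d', '1799-1853']],
  [['100$a', 'Bach, Johann Sebastian'], ['130$a', '8 Minuets']]
]
puts field_report(records, with_content: true)
# 100$a: 2 (Schmitt, Jakob)
# 100$d: 1 (1799-1853)
# 130$a: 1 (8 Minuets)
```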
The --filter option creates a subset of records from the complete XML open dataset of sources at http://opac.rism.info.
Required: -c [Yaml-config-file]
Optional: --with-linked: select also linked parent/child entries
Optional: --with-disjunct: select with logical disjunction
Example call: marcxml --filter -i input.xml -c config.yaml --with-disjunct
Filtering rules are defined by key-value pairs in a YAML-configuration file (default: conf/query.yaml). The '--with-linked' flag also selects dependent records in a collection. Query parameters (one per line) are combined with "AND" logic by default; use '--with-disjunct' to combine them with "OR" logic instead.
Example: Yaml-config for all new 2015 records of Bach in the Berlin State Library:
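The difference between the default conjunction and '--with-disjunct' can be illustrated with a small Ruby sketch. It is not the tool's code: the record shape (a Hash of field keys to lists of values) and the helper name `record_selected?` are assumptions made for the example.

```ruby
# Sketch only: a record is assumed to be a Hash of "tag$code" => array of values,
# and a query a Hash of "tag$code" => Regexp.
def record_selected?(record, query, disjunct: false)
  # One boolean per query parameter: does any value of that field match?
  checks = query.map do |key, pattern|
    Array(record[key]).any? { |value| pattern.match?(value) }
  end
  # AND by default, OR when disjunct is set.
  disjunct ? checks.any? : checks.all?
end

record = { '100$a' => ['Bach, Johann Sebastian'], '852$a' => ['D-Dl'] }
query  = { '100$a' => /Bach, Johann Sebastian/, '852$a' => /^D-B$/ }

puts record_selected?(record, query)                 # false
puts record_selected?(record, query, disjunct: true) # true
```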
# query.yaml
"005": "^2015"
"100$a":
- "Bach, Johann Sebastian"
"852$a": "^D-B$"
This gives an output-XML-file with a subset of 476 records (as of October 2015).
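As a rough illustration of how such a config could be read, the following Ruby sketch loads the YAML text and turns every value into a list of regular expressions. The helper name `compile_query` and the normalisation of single values into one-element lists are assumptions for this example, not the tool's actual behaviour.

```ruby
require 'yaml'

QUERY_YAML = <<~'YAML'
  "005": "^2015"
  "100$a":
    - "Bach, Johann Sebastian"
  "852$a": "^D-B$"
YAML

# Turn every value (a single string or a list of strings) into a list of Regexp objects.
def compile_query(yaml_text)
  YAML.safe_load(yaml_text).transform_values do |value|
    Array(value).map { |source| Regexp.new(source) }
  end
end

compile_query(QUERY_YAML).each do |field, patterns|
  puts "#{field}: #{patterns.map(&:source).join(', ')}"
end
# 005: ^2015
# 100$a: Bach, Johann Sebastian
# 852$a: ^D-B$
```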
Semantic structure:
- Key is the Marc21 field (e.g. "100$a" or "005")
- Value is a regular expression (e.g. "Mozart.*"). Hint: for negative matching use a negative lookahead, e.g. "^(?!.*term).*$"; see: http://www.regular-expressions.info/lookaround.html.
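The negative-matching hint can be tried out directly in Ruby; the sample strings below are made up for the illustration.

```ruby
# Matches only strings that do NOT contain "term" anywhere.
WITHOUT_TERM = /^(?!.*term).*$/

puts WITHOUT_TERM.match?('Sonata in C')    # true
puts WITHOUT_TERM.match?('a term appears') # false
```

Note that in Ruby `^` and `$` anchor at line boundaries; `\A` and `\z` anchor at the start and end of the whole string.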
The --merge option merges a list of MARCXML-files into one output-file.
Required: -i [list of input files].
Example call: marcxml --merge -i input1.xml input2.xml -o result.xml
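Conceptually, merging means collecting the record elements of all inputs under one new collection element. The following Ruby sketch shows the idea with REXML; it is not the tool's implementation, `merge_collections` is a hypothetical helper, and it assumes the inputs declare the MARC namespace as the default namespace on their root element.

```ruby
require 'rexml/document'

MARC_NS = 'http://www.loc.gov/MARC21/slim'

# Sketch only: copy the <record> children of several MARCXML strings
# into one new <collection>.
def merge_collections(xml_strings)
  merged = REXML::Document.new
  collection = merged.add_element('collection', 'xmlns' => MARC_NS)
  xml_strings.each do |xml|
    doc = REXML::Document.new(xml)
    records = doc.root.elements.to_a.select { |element| element.name == 'record' }
    records.each { |record| collection.add_element(record.deep_clone) }
  end
  merged
end

# Usage (file names as in the example call above):
# merged = merge_collections(['input1.xml', 'input2.xml'].map { |f| File.read(f) })
# File.write('result.xml', merged.to_s)
```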
The --report option writes a report of the input-file to stdout. The output can also be in xls- or csv-format.
Optional: --with-tag: defines the MARC field used for aggregation.
Example call: marcxml -i input.xml --xls --with-tag='100$a'
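One plausible reading of "aggregation" is counting how often each value of the chosen field occurs. The Ruby sketch below illustrates that reading only; the record shape, the helper names `aggregate_by_tag` and `to_csv_lines`, and the sort order are assumptions, not the tool's documented behaviour.

```ruby
# Sketch only: records are assumed to be Hashes of "tag$code" => array of values.
def aggregate_by_tag(records, tag)
  counts = records.flat_map { |record| Array(record[tag]) }
                  .each_with_object(Hash.new(0)) { |value, acc| acc[value] += 1 }
  # Most frequent values first, ties in alphabetical order.
  counts.sort_by { |value, count| [-count, value] }
end

# Render the rows as simple CSV lines, quoting the value column.
def to_csv_lines(rows)
  rows.map { |value, count| %("#{value.gsub('"', '""')}",#{count}) }
end

records = [
  { '100$a' => ['Bach, Johann Sebastian'] },
  { '100$a' => ['Mozart, Wolfgang Amadeus'] },
  { '100$a' => ['Bach, Johann Sebastian'] }
]
puts to_csv_lines(aggregate_by_tag(records, '100$a'))
# "Bach, Johann Sebastian",2
# "Mozart, Wolfgang Amadeus",1
```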
The --split option splits the input-file into chunks. The chunk size is set with the '--with-limit' flag. The output files are numbered in sequence (000000+.xml).
Optional: --with-limit: specifies the number of records per chunk
Example call: marcxml --split -i input.xml --with-limit=10000
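The chunking itself can be pictured with a short Ruby sketch. It is only an illustration: `chunk_plan` is a hypothetical helper, and the file naming (a six-digit counter that starts at 000000 and increases by one per file) is an assumption about what "000000+.xml" means.

```ruby
# Sketch only: split a list of records into chunks of at most `limit` records
# and pair each chunk with a zero-padded, six-digit file name.
def chunk_plan(records, limit)
  records.each_slice(limit).each_with_index.map do |chunk, index|
    [format('%06d.xml', index), chunk]
  end
end

plan = chunk_plan((1..25).to_a, 10)
plan.each { |name, chunk| puts "#{name}: #{chunk.length} records" }
# 000000.xml: 10 records
# 000001.xml: 10 records
# 000002.xml: 5 records
```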
The --transform option replaces MARC21 datafield tags and subfield codes according to rules defined in a YAML-file.
Required: -c [Yaml-config-file]
Example: marcxml --transform -i input.xml -c config.yaml -o output.xml
Structure of the Yaml-conf is:
#Optional
Class: MuscatSource
Mapping:
# Moving 772 to 762
- "772": "762"
# Dropping 772$a
- "772$a": ~
# Moving subfield $a to subfield $d
- "690$a": "d"
# Moving subfield $d to datafield 852
- "591$d": "852"
More elaborate transform logic can be built with your own classes defined in the lib-folder (see below); the class-name must then be declared in the Yaml-conf.
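To make the four kinds of mapping rules concrete, here is a hedged Ruby sketch that applies them to a simplified record. The record shape, the helper names `apply_rule` and `apply_mapping`, and details such as removing emptied datafields or appending to an existing target datafield are assumptions for illustration; the tool's own mapping code may behave differently.

```ruby
# Sketch only. A record is assumed to be an array of fields such as
# { tag: '690', subfields: [['a', 'value']] }; this is not the tool's data model.
def apply_rule(record, source, target)
  tag, code = source.split('$')
  if code.nil?
    # "772": "762" -> rename the datafield tag
    record.each { |field| field[:tag] = target if field[:tag] == tag }
  elsif target.nil?
    # "772$a": ~ -> drop the subfield
    record.each do |field|
      field[:subfields].reject! { |c, _| c == code } if field[:tag] == tag
    end
  elsif target.length == 1
    # "690$a": "d" -> rename the subfield code
    record.each do |field|
      next unless field[:tag] == tag
      field[:subfields].each { |subfield| subfield[0] = target if subfield[0] == code }
    end
  else
    # "591$d": "852" -> move the subfield to another datafield
    moved = []
    record.each do |field|
      next unless field[:tag] == tag
      kept, taken = field[:subfields].partition { |c, _| c != code }
      field[:subfields] = kept
      moved.concat(taken)
    end
    unless moved.empty?
      record.reject! { |field| field[:subfields].empty? } # assumption: drop emptied fields
      destination = record.find { |field| field[:tag] == target }
      if destination
        destination[:subfields].concat(moved)
      else
        record << { tag: target, subfields: moved }
      end
    end
  end
  record
end

# Rules are applied in the order they are listed.
def apply_mapping(record, mapping)
  mapping.each { |rule| rule.each { |source, target| apply_rule(record, source, target) } }
  record
end

record = [
  { tag: '772', subfields: [['a', 'parent']] },
  { tag: '690', subfields: [['a', 'catalog']] },
  { tag: '591', subfields: [['d', 'shelf']] }
]
apply_mapping(record, [{ '772' => '762' }, { '690$a' => 'd' }, { '591$d' => '852' }])
puts record.map { |field| field[:tag] }.inspect
# ["762", "690", "852"]
```

Because the rules run in order in this sketch, a rule such as "772$a" placed after "772": "762" would no longer find a 772 datafield.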
The --validate option validates the input-file according to the official standard.
Example call: marcxml --validate -i input.xml
Clone this repository with git clone https://github.com/rism-t3/marcxml-tools.git
and execute 'bundle install'.
Define environment variables if you would like to use the --muscat-flag (using Oracle-DB).
### Requirements
- Probably Linux / Ubuntu
- Ruby
New transformator classes should inherit from the Transformator class in the Marcxml module. Every new transformator should declare an array named methods containing the methods to execute, in order.
This array should also include :map if mappings are defined in the Yaml-conf; the new transformator class must also be declared in the Yaml-conf.
Example lib/mytransformator.rb:
module Marcxml
  class MyTransformator < Transformator
    attr_accessor :node, :namespace, :methods

    def initialize(node, namespace = { 'marc' => "http://www.loc.gov/MARC21/slim" })
      @namespace = namespace
      @node = node
      # Methods are executed in this order; :map applies the mappings from the Yaml-conf
      @methods = [:add_isil, :change_cataloging_source, :repair_leader, :map]
    end

    # add_isil and repair_leader have to be defined in the same way
    def change_cataloging_source
      # function code
    end
  end
end
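How a list of method names can drive the execution order is shown in the following self-contained Ruby sketch. The `Sketch` module, its stand-in `Transformator` base class and the `run` method are hypothetical and only mimic the idea; they are not the real Marcxml::Transformator.

```ruby
# Sketch only: a stand-in base class, not the real Marcxml::Transformator.
module Sketch
  class Transformator
    # As in the README, the attribute is called `methods`;
    # this shadows Ruby's built-in Object#methods on these objects.
    attr_accessor :node, :methods

    def initialize(node)
      @node = node
      @methods = []
    end

    # Hypothetical runner: call each listed method in order.
    def run
      methods.each { |name| public_send(name) }
      node
    end

    # Placeholder for the YAML-driven mapping step.
    def map
      node << 'map'
    end
  end

  class MyTransformator < Transformator
    def initialize(node)
      super
      @methods = [:change_cataloging_source, :map]
    end

    def change_cataloging_source
      node << 'change_cataloging_source'
    end
  end
end

puts Sketch::MyTransformator.new([]).run.inspect
# ["change_cataloging_source", "map"]
```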
A database-connection can be used to get external values directly from a database. So far, an Oracle connector is provided in the lib-folder. Credentials have to be declared in the environment.
- RISM Opendata: https://opac.rism.info/index.php?id=8&L=0
- MARC21 Documentation: http://www.loc.gov/marc/bibliographic/
- Regular Expression: https://en.wikipedia.org/wiki/Regular_expression