class MARC::Reader

A class for reading MARC binary (ISO 2709) files.

Character Encoding

In ruby 1.9+, every String is tagged with a character encoding. Certain operations raise an exception when they encounter bytes that are illegal for that encoding, and a String tagged with the wrong encoding is fairly likely to contain such bytes.
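
The effect of a wrong tag can be seen in plain ruby, without MARC::Reader. The following is a minimal sketch; the helper names `tag`, `latin1_cafe` and `transcode_error` are made up for illustration and are not part of ruby-marc.

```ruby
# Tag raw bytes with an encoding label; the bytes themselves are unchanged.
def tag(bytes, encoding)
  bytes.b.force_encoding(encoding)
end

# "café" as ISO-8859-1 bytes. The final byte 0xE9 is legal in ISO-8859-1
# but is an incomplete sequence when read as UTF-8.
def latin1_cafe
  "caf\xE9".b
end

# Returns the class of the exception raised while transcoding, or nil.
def transcode_error(string)
  string.encode("UTF-16LE")
  nil
rescue Encoding::InvalidByteSequenceError => e
  e.class
end

puts tag(latin1_cafe, "ISO-8859-1").valid_encoding?      # correctly tagged
puts tag(latin1_cafe, "UTF-8").valid_encoding?           # wrongly tagged
puts transcode_error(tag(latin1_cafe, "UTF-8")).inspect  # transcoding trips on the bad byte
```

The same bytes are valid or invalid depending only on the label, which is why the label given to MARC::Reader matters.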

So when reading binary MARC data with the MARC::Reader, it’s important that you let it know the expected encoding:

MARC::Reader.new("path/to/file.mrc", :external_encoding => "UTF-8")

If you leave off ‘external_encoding’, it will use the ruby environment Encoding.default_external, which is usually UTF-8 but may depend on your environment.

Even if you expect your data to be (e.g.) UTF-8, it may include bad/illegal bytes. By default MARC::Reader will leave these in the produced Strings, which will probably raise an exception later in your program. It is better to catch this early and ask MARC::Reader to raise immediately on illegal bytes:

MARC::Reader.new("path/to/file.mrc", :external_encoding => "UTF-8",
  :validate_encoding => true)

Alternatively, you can have MARC::Reader replace illegal bytes with the Unicode Replacement Character, or with a string of your choice (including the empty string, which simply omits the bad bytes):

MARC::Reader.new("path/to/file.mrc", :external_encoding => "UTF-8",
   :invalid => :replace)
MARC::Reader.new("path/to/file.mrc", :external_encoding => "UTF-8",
   :invalid => :replace, :replace => "")
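
To get a feel for what replacement does to a value, here is a rough plain-ruby analogue. It uses String#scrub rather than MARC::Reader itself, so treat it as an illustration of the idea and not of the library's internals; `replace_invalid` and `bad_utf8` are hypothetical names.

```ruby
# Replace bytes that are illegal for the String's tagged encoding.
# With no argument, String#scrub uses U+FFFD for Unicode encodings.
def replace_invalid(string, replacement = nil)
  replacement.nil? ? string.scrub : string.scrub(replacement)
end

# A UTF-8 tagged String containing one stray ISO-8859-1 byte (0xE9).
def bad_utf8
  "caf\xE9 au lait".b.force_encoding("UTF-8")
end

puts replace_invalid(bad_utf8)      # bad byte becomes U+FFFD
puts replace_invalid(bad_utf8, "")  # bad byte is dropped
```

Either way the result is a String that is valid for its encoding, so later operations will not raise on it.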

If you supply an :external_encoding argument, MARC::Reader will always assume that encoding – if you leave it off, MARC::Reader will use the encoding tagged on any input you pass in, such as Strings or File handles.

# marc data will have same encoding as string.encoding:
MARC::Reader.decode( string )

# Same, values will have encoding of string.encoding:
MARC::Reader.new(StringIO.new(string))

# data values will have cp866 encoding, per external_encoding of
# File object passed in
MARC::Reader.new(File.new("myfile.marc", "r:cp866"))

# explicitly tell MARC::Reader the encoding
MARC::Reader.new("myfile.marc", :external_encoding => "cp866")
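
The precedence described above can be sketched in plain ruby. `effective_encoding` is a hypothetical helper written for this illustration; it is not a ruby-marc method and only mirrors the rule as stated: an explicit option wins, otherwise the tag on the input is used.

```ruby
require "stringio"

# Hypothetical helper (not part of ruby-marc): pick the encoding the way the
# text describes, explicit option first, then the input's own tag.
def effective_encoding(input, options = {})
  explicit = options[:external_encoding]
  return Encoding.find(explicit) if explicit
  return input.external_encoding if input.respond_to?(:external_encoding)
  input.encoding
end

cp866_string = "\x8F\xE0".b.force_encoding("cp866")

puts effective_encoding(cp866_string)                                 # from the String's tag
puts effective_encoding(StringIO.new(cp866_string))                   # from the IO object
puts effective_encoding(cp866_string, :external_encoding => "UTF-8")  # explicit option wins
```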

MARC-8

The legacy MARC-8 encoding needs to be handled differently, because there is no built-in support in ruby for MARC-8.

You can specify “MARC-8” as an external encoding. This triggers transcoding to UTF-8 (NFC-normalized) for the internal ruby strings.

MARC::Reader.new("marc8.mrc", :external_encoding => "MARC-8")
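
Two points from this section can be checked in plain ruby: that ruby itself has no encoding named MARC-8, and what NFC normalization does to a letter followed by a combining mark. This is a small sketch with hypothetical helper names (`ruby_knows_encoding?`, `nfc`); it does not perform any MARC-8 transcoding.

```ruby
# Ruby has no encoding named MARC-8, so it cannot tag or transcode it natively.
def ruby_knows_encoding?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  false
end

# NFC normalization composes a base letter and a combining mark into one
# code point where a precomposed form exists.
def nfc(string)
  string.unicode_normalize(:nfc)
end

decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

puts ruby_knows_encoding?("MARC-8")
puts decomposed.length       # two code points before normalization
puts nfc(decomposed).length  # one code point after normalization
```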

For external_encoding “MARC-8”, :validate_encoding is always true; there’s no way to ignore bad bytes in MARC-8 when transcoding to unicode. However, just as with other encodings, the `:invalid => :replace` and `:replace => "string"` options can be used to replace bad bytes instead of raising.

If you want your MARC-8 to be transcoded internally to something other than UTF-8, you can use the :internal_encoding option, which works with any encoding in MARC::Reader.

MARC::Reader.new("marc8.mrc",
  :external_encoding => "MARC-8",
  :internal_encoding => "UTF-16LE")

If you want to read in MARC-8 without transcoding, leaving the internal Strings in MARC-8, the only way to do that is with ruby’s ‘binary’ (aka “ASCII-8BIT”) encoding, since ruby has no MARC-8 encoding of its own. This will work:

MARC::Reader.new("marc8.mrc", :external_encoding => "binary")
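
The reason ‘binary’ works for untranscoded data is that ruby treats every byte sequence as valid under that encoding. A short plain-ruby check, with the hypothetical helper `binary_tagged`:

```ruby
# Return a copy of the bytes tagged as binary (ASCII-8BIT).
def binary_tagged(bytes)
  bytes.b
end

raw = binary_tagged("\xE1\xFF\x00abc")

puts raw.encoding == Encoding.find("binary")  # "binary" is an alias for ASCII-8BIT
puts raw.valid_encoding?                      # no byte is illegal in binary
```

The flip side is that ruby cannot tell you anything about bad bytes in such Strings, so any checking is up to your own code.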

Please note that MARC::Reader does not currently have any facilities for guessing encoding from MARC21 leader byte 9; that byte is ignored.
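
If you want to make that guess yourself before constructing a reader, one possible approach is sketched here. `encoding_from_leader` is a hypothetical helper, not part of ruby-marc. It relies on the MARC21 definition of leader position 09, where "a" means UCS/Unicode and a blank means MARC-8, and it assumes you trust the leader, which real-world data does not always justify.

```ruby
# Hypothetical helper (not part of ruby-marc): map MARC21 leader position 09
# to a value usable for :external_encoding. Returns nil when undecided.
def encoding_from_leader(leader)
  case leader.b[9]
  when "a" then "UTF-8"
  when " " then "MARC-8"
  end
end

# Sample 24-byte leaders, made up for illustration.
unicode_leader = "00714cam a2200205 a 4500"
marc8_leader   = "00714cam  2200205 a 4500"

puts encoding_from_leader(unicode_leader)
puts encoding_from_leader(marc8_leader)
```

The returned value could then be passed as :external_encoding. The leader of the first record in a file can be obtained by reading its first 24 bytes in binary mode.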

Complete Encoding Options

These options can all be used on MARC::Reader.new or MARC::Reader.decode to specify external encoding, ask for a transcode to a different encoding on read, or validate or replace bad bytes in source.

:external_encoding

What encoding to consider the MARC record’s values to be in. This option takes precedence over the File handle or String argument’s encodings.

:internal_encoding

Ask MARC::Reader to transcode to this encoding in memory after reading the file in.

:validate_encoding

If you pass in `true`, MARC::Reader will promise to raise an Encoding::InvalidByteSequenceError if there are illegal bytes in the source for the :external_encoding. There is a performance penalty for this check. Without this option, an exception may or _may not_ be raised, and whether an exception is raised (or what class the exception has) may change in future ruby-marc versions without warning.

:invalid

Just like String#encode: set this to :replace and any bytes in the source data that are illegal for the source encoding will be replaced with the unicode replacement character (in unicode encodings), or else ‘?’. Overrides :validate_encoding. This can help you sanitize your input and avoid ruby “invalid UTF-8 byte” exceptions later.

:replace

Just like String#encode: combine with `:invalid => :replace` to set your own replacement string for invalid bytes. You may use the empty string to simply eliminate invalid bytes.

Warning on ruby File’s own :internal_encoding, and unsafe transcoding from ruby

Be careful with using an explicit File object with the File’s own :internal_encoding set – it can cause ruby to transcode your data before MARC::Reader gets it, changing the byte count and making the marc record unreadable in some cases. This applies to Encoding.default_internal too!

# May in some cases result in unreadable marc and an exception
MARC::Reader.new(  File.new("marc_in_cp866.mrc", "r:cp866:utf-8") )

# May in some cases result in unreadable marc and an exception
Encoding.default_internal = "utf-8"
MARC::Reader.new(  File.new("marc_in_cp866.mrc", "r:cp866") )

# However this should be safe:
MARC::Reader.new(  "marc_in_cp866.mrc", :external_encoding => "cp866")

# And this should be safe, if you do want to transcode:
MARC::Reader.new(  "marc_in_cp866.mrc", :external_encoding => "cp866",
   :internal_encoding => "utf-8")

# And this should ALWAYS be safe, with or without an internal_encoding
MARC::Reader.new( File.new("marc_in_cp866.mrc", "r:binary:binary"),
   :external_encoding => "cp866",
   :internal_encoding => "utf-8")
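
The byte count matters because an ISO 2709 record states its own length in the leader and locates its fields by byte offsets in the directory. A plain-ruby sketch of how transcoding shifts those counts follows; `byte_counts` is a hypothetical helper used only for this illustration.

```ruby
# Transcode the given bytes and report the byte count before and after.
def byte_counts(bytes, from, to)
  source = bytes.b.force_encoding(from)
  [source.bytesize, source.encode(to).bytesize]
end

# Two Cyrillic letters: one byte each in cp866, two bytes each in UTF-8.
before, after = byte_counts("\x8F\xE0", "cp866", "UTF-8")
puts "cp866: #{before} bytes, UTF-8: #{after} bytes"
```

If ruby performs this transcoding before MARC::Reader sees the data, the lengths and offsets written in the record no longer match the bytes that arrive.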

jruby note

In the past, jruby encoding-related bugs have caused problems with our encoding treatments. See for example: jira.codehaus.org/browse/JRUBY-6637

We recommend using the latest version of jruby, and at least jruby 1.7.6.