module Sanzang::Formatting

This module formats text data, chiefly to prepare it for direct translation. It reformats and reflows the text so that words are not divided between lines and the output is easy for people to read. For readability, each line of text to be translated should be succinct and easily comprehensible.

Public Class Methods

reflow_cjk(s)

Given a string of CJK text, reformat it for greater compatibility with direct translation and reflow it based on its punctuation. First, any CBETA-style margins at the beginning of each line, which end with the double-bar character (“║”, U+2551), are removed. Next, each short line (1 to 15 characters after the leading space) that begins with an ideographic space has an ideographic space appended; such a line may be part of a poem and should be kept separate from the surrounding prose. All newlines are then removed, and the text is broken into lines again according to the remaining punctuation and spacing. The string is modified in place and returned in its original encoding.
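The first two steps described above can be tried in isolation. The following is a minimal sketch, not the library's own code: the line identifiers before each “║” are made up for illustration, and the variable names (page, stripped, marked) are hypothetical.

```ruby
# Illustrative input: the line identifiers before each "║" are made up.
page = "T00n0000_p0001a01║\u3000諸行無常\n" \
       "T00n0000_p0001a02║\u3000是生滅法\n"

# Step 1: drop everything up to and including the margin bar on each line.
stripped = page.gsub(/^.*║/, "")

# Step 2: a short line that starts with an ideographic space (U+3000)
# gets another ideographic space appended, marking it as possible verse.
marked = stripped.gsub(/^(\u3000)(.{1,15})$/, "\\1\\2\u3000")

# Show each resulting line with its leading and trailing spaces visible.
p marked.lines.map(&:chomp)
```

The sketch uses the non-destructive gsub rather than gsub!, so the sample string itself is left unchanged.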

# File lib/sanzang/formatting.rb, line 36
def reflow_cjk(s)
  source_encoding = s.encoding
  s.encode!(Encoding::UTF_8)

  # Strip all CBETA-style margins
  s.gsub!(/^.*║/, "")

  # Starts with Hanzi space and short line: add Hanzi space at the end.
  # This is used for avoiding conflicts between poetry and prose.
  # \u3000 is the ideographic (Hanzi) space.
  s.gsub!(/^(\u3000)(.{1,15})$/, "\\1\\2\u3000")

  # Collapse all vertical whitespace.
  using_crlf = s.include?("\r")
  s.gsub!(/(\r|\n)/, "")

  # Ender followed by non-ender: newline in between.
  s.gsub!(/([：，；。？！」』.;:\?])([^：，；。？！」』.;:\?])/,
    "\\1\n\\2")

  # Non-starter, non-ender, followed by a starter: newline in between.
  # \u3000 is the ideographic (Hanzi) space.
  s.gsub!(/([^「『\u3000\t：，；。？！」』.;:\?\n])([「『\u3000\t])/,
    "\\1\n\\2")

  # Make sure the result ends with a newline.
  s << "\n" unless s.end_with?("\n")

  s.gsub!("\n", "\r\n") if using_crlf
  s.encode!(source_encoding)
end
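
The joining and re-breaking steps, together with the CRLF handling, can be illustrated with a reduced, standalone sketch. The helper name rebreak_sketch is hypothetical and is not part of Sanzang; it assumes UTF-8 input, applies only the "ender followed by non-ender" rule with full-width punctuation, and works on a copy rather than modifying its argument.

```ruby
# Hypothetical helper, not part of Sanzang: joins lines, then re-breaks
# them after sentence punctuation, keeping the input's newline style.
def rebreak_sketch(text)
  s = text.dup                      # leave the caller's string untouched
  using_crlf = s.include?("\r")     # remember the original newline style
  s.gsub!(/[\r\n]/, "")             # remove every line break

  # Ender followed by a non-ender: put a newline between them.
  s.gsub!(/([：，；。？！」』])([^：，；。？！」』])/, "\\1\n\\2")

  s << "\n" unless s.end_with?("\n")
  s.gsub!("\n", "\r\n") if using_crlf
  s
end

# Sample text wrapped at arbitrary points, with CRLF line endings.
wrapped = "如是我聞，一時\r\n佛在舍衛國。祇樹\r\n給孤獨園\r\n"
puts rebreak_sketch(wrapped)
```

Because reflow_cjk itself relies on encode! and gsub!, it changes the string it is given; callers who need the original text afterwards can pass a copy, for example reflow_cjk(text.dup).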