module Sanzang::Formatting
This module handles the formatting of text data, in particular preparing text for direct translation. This involves reformatting and reflowing the text so that words are not divided between lines and the output is well suited to human readers. For readability, each line of text to be translated should be succinct and easily comprehensible.
Public Class Methods
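reflow_cjk(s)
Given a CJK string of text, reformat the string for greater compatibility with direct translation, and reflow the text based on its punctuation. The string is modified in place, and its original encoding and line-ending style (LF or CRLF) are restored at the end. The first step is to remove any CBETA-style margins at the beginning of each line, which are indicated by the double-bar character (“║”, U+2551). An ideographic space is then appended to each short line (1 to 15 characters after the indent) that begins with an ideographic space, since such a line may be part of a poem and should be kept separate from the surrounding prose. Following this, all newlines are removed, and the text is broken into lines again according to the remaining punctuation and spacing: a line break is placed after each run of closing punctuation, and before each opening quotation mark, ideographic space, or tab.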
# File lib/sanzang/formatting.rb, line 36
def reflow_cjk(s)
  source_encoding = s.encoding
  s.encode!(Encoding::UTF_8)

  # Strip all CBETA-style margins
  s.gsub!(/^.*║/, "")

  # Starts with Hanzi space and short line: add Hanzi space at the end.
  # This is used for avoiding conflicts between poetry and prose.
  s.gsub!(/^(　)(.{1,15})$/, "\\1\\2　")

  # Collapse all vertical whitespace.
  using_crlf = s.include?("\r")
  s.gsub!(/(\r|\n)/, "")

  # Ender followed by non-ender: newline in between.
  s.gsub!(/([：，；。？！」』.;:\?])([^：，；。？！」』.;:\?])/, "\\1\n\\2")

  # Non-starter, non-ender, followed by a starter: newline in between.
  s.gsub!(/([^「『　\t：，；。？！」』.;:\?\n])([「『　\t])/, "\\1\n\\2")

  if s[-1] != "\n"
    s << "\n"
  end
  s.gsub!("\n", "\r\n") if using_crlf
  s.encode!(source_encoding)
end
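To make the effect of these steps concrete, the following is a minimal, self-contained sketch rather than the library's own code. The name reflow_sketch is hypothetical, the margin labels ("a01", "a02") are invented for illustration, and the full-width punctuation and ideographic space (written as \u3000) are assumed to match what the method above intends. Unlike reflow_cjk, the sketch returns a new string, assumes UTF-8 input, and skips the CRLF and encoding restoration.

```ruby
# Hypothetical helper (not part of Sanzang): a non-mutating sketch of the
# reflow steps described above, assuming UTF-8 input with LF line endings.
def reflow_sketch(text)
  s = text.encode(Encoding::UTF_8)

  # 1. Strip CBETA-style margins: everything up to the double bar.
  s = s.gsub(/^.*║/, "")

  # 2. Short line starting with an ideographic space: append another one,
  #    so that poetry lines stay separate after the lines are joined.
  s = s.gsub(/^(\u3000)(.{1,15})$/, "\\1\\2\u3000")

  # 3. Join everything into a single line.
  s = s.gsub(/[\r\n]/, "")

  # 4. Closing punctuation followed by anything else: break after it.
  s = s.gsub(/([：，；。？！」』.;:?])([^：，；。？！」』.;:?])/, "\\1\n\\2")

  # 5. Ordinary character followed by an opening quote, ideographic
  #    space, or tab: break before the opener.
  s = s.gsub(/([^「『\u3000\t：，；。？！」』.;:?\n])([「『\u3000\t])/, "\\1\n\\2")

  # Always finish with a trailing newline.
  s.end_with?("\n") ? s : s + "\n"
end

# Two source lines that split a phrase in the middle (margins are made up).
sample = "a01║觀自在菩薩，行深般若\na02║波羅蜜多時，照見五蘊皆空。\n"
puts reflow_sketch(sample)
# Prints:
# 觀自在菩薩，
# 行深般若波羅蜜多時，
# 照見五蘊皆空。
```

The phrase that was divided across the two margin-prefixed source lines is rejoined, and each output line now ends at a punctuation mark, which is the property the module description asks for.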