class Hermeneutics::Entities

Translate HTML and XML character entities: "&" to "&amp;" and vice versa.

What actually happens

HTML pages usually arrive with some characters encoded as entities: &lt; for < and &euro; for €.

Further, they may contain a meta tag in the header like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta charset="utf-8" />                        (HTML5)

or

<?xml version="1.0" encoding="UTF-8" ?>         (XHTML)

When the charset is utf-8 and the file contains the byte sequence "\303\244"/"\xc3\xa4", the character "ä" is displayed.

When the charset is iso8859-15 and the file contains the byte sequence "\344"/"\xe4", the character "ä" is displayed, too.

The sequence "&auml;" will produce an "ä" in any case.
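
The byte sequences above can be checked in plain Ruby. This is a minimal sketch that uses only core String methods and does not touch this module; the variable names are illustrative.

```ruby
# The same character, written as raw bytes in two encodings.
utf8   = "\xc3\xa4".dup.force_encoding("UTF-8")
latin9 = "\xe4".dup.force_encoding("ISO-8859-15")

utf8_bytes   = utf8.bytes     # two bytes: 0xc3, 0xa4
latin9_bytes = latin9.bytes   # one byte:  0xe4

# Transcoding the ISO-8859-15 string to UTF-8 yields the same character.
same_char = (latin9.encode("UTF-8") == utf8)
```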

What you should do

When you generate your own HTML pages, you are always safe if you write non-ASCII characters only as entities, such as &auml; and &euro;, or &#x00e4; and &#x20ac; respectively.
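
Why this is safe can be illustrated with core Ruby alone: a page that contains only entities is pure ASCII, so its bytes are the same under either charset. The fragment below is a made-up example, not output of this module.

```ruby
# Hypothetical page fragment that uses entities for every non-ASCII character.
fragment = "Preis: 5 &euro;, gr&ouml;&szlig;er"

ascii_only = fragment.ascii_only?

# The bytes are identical in both charsets, so a wrong or missing
# charset declaration cannot garble the text.
same_bytes = fragment.encode("UTF-8").bytes == fragment.encode("ISO-8859-15").bytes
```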

What this module does

This module translates strings to an HTML-masked version. The encoding is not changed, and you may ask to keep 8-bit characters unmasked.

Examples

Entities.encode "<"                           #=> "&lt;"
Entities.decode "&lt;"                        #=> "<"
Entities.encode "äöü"                         #=> "&auml;&ouml;&uuml;"
Entities.decode "&auml;&ouml;&uuml;"          #=> "äöü"

Attributes

keep_8bit[RW]

Public Class Methods

decode( str) → str

Replace HTML-style masks by normal characters:

Entities.decode "&lt;"                       #=> "<"
Entities.decode "&auml;&ouml;&uuml;"         #=> "äöü"

Unmasked 8-bit characters ("ä" instead of "&auml;") are kept. Decoded entities are converted to the encoding of the source string, so the result has one consistent encoding.

s = "ä &ouml; ü"
s.encode! "utf-8"
Entities.decode s                            #=> "ä ö ü"

s = "\xe4 &ouml; \xfc &#x20ac;"
s.force_encoding "iso-8859-15"
Entities.decode s                            #=> "ä ö ü €"
                                                 (in iso8859-15)
# File lib/hermeneutics/escape.rb, line 200
def decode str
  str.gsub /&(.+?);/ do
    (named_decode $1) or (numeric_decode $1) or $&
  end
end
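
To experiment with this logic without the library installed, the following self-contained sketch mirrors the source above. NAMES_SKETCH and decode_sketch are hypothetical names, and the table holds only a handful of entries, whereas the library's NAMES table is assumed to be much larger.

```ruby
# Hypothetical, tiny stand-in for the library's NAMES table.
NAMES_SKETCH = {
  "lt" => "<", "gt" => ">", "amp" => "&",
  "auml" => "ä", "ouml" => "ö", "uuml" => "ü"
}.freeze

def decode_sketch(str)
  str.gsub(/&(.+?);/) do
    name  = $1   # save the captures before the next match overwrites them
    whole = $&
    if (c = NAMES_SKETCH[name])
      c.encode(str.encoding)                  # named entity
    elsif name =~ /\A#(?:(\d+)|x([0-9a-f]+))\z/i
      code = $1 ? $1.to_i : $2.to_i(16)       # decimal or hexadecimal
      code.chr(Encoding::UTF_8).encode(str.encoding)
    else
      whole                                   # unknown entity: leave untouched
    end
  end
end

plain  = decode_sketch("&lt;b&gt; &auml; &#x20ac; &#228; &bogus;")
latin9 = decode_sketch("\xe4 &ouml; &#x20ac;".dup.force_encoding("ISO-8859-15"))
```

In this sketch an unknown entity such as &bogus; passes through unchanged, and the result keeps the encoding of the input string.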
encode(str)
# File lib/hermeneutics/escape.rb, line 175
def encode str
  std.encode str
end
new( keep_8bit: bool) → ent

Creates an Entities converter.

ent = Entities.new keep_8bit: true
# File lib/hermeneutics/escape.rb, line 125
def initialize keep_8bit: nil
  @keep_8bit = keep_8bit
end
std()
# File lib/hermeneutics/escape.rb, line 171
def std
  @std ||= new
end

Private Class Methods

named_decode(s)
# File lib/hermeneutics/escape.rb, line 208
def named_decode s
  c = NAMES[ s]
  if c then
    if c.encoding != s.encoding then
      c.encode s.encoding
    else
      c
    end
  end
end
numeric_decode(s)
# File lib/hermeneutics/escape.rb, line 219
def numeric_decode s
  if s =~ /\A#(?:(\d+)|x([0-9a-f]+))\z/i then
    c = ($1 ? $1.to_i : ($2.to_i 0x10)).chr Encoding::UTF_8
    c.encode! s.encoding
  end
end
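
The two core calls used here, Integer#chr and String#encode, can be tried on their own. The sketch below uses only core Ruby. Judging from the source above, a numeric entity whose character does not exist in the string's encoding would presumably raise in the same way, but that is an inference and has not been checked against the library.

```ruby
# Code point -> UTF-8 character, as in numeric_decode.
euro = 0x20ac.chr(Encoding::UTF_8)

# ISO-8859-15 has a euro sign, so transcoding succeeds.
latin9_bytes = euro.encode("ISO-8859-15").bytes

# ISO-8859-1 has none, so transcoding raises.
begin
  euro.encode("ISO-8859-1")
  raised = false
rescue Encoding::UndefinedConversionError
  raised = true
end
```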

Public Instance Methods

decode(str)
# File lib/hermeneutics/escape.rb, line 163
def decode str
  self.class.decode str
end
encode( str) → str

Create a string whose characters are masked HTML-style:

ent = Entities.new
ent.encode "&<\""    #=> "&amp;&lt;&quot;"
ent.encode "äöü"     #=> "&auml;&ouml;&uuml;"

The result has the same encoding as the source, even if it contains no 8-bit characters (it can contain them only when keep_8bit is set).

ent = Entities.new keep_8bit: true

uml = "<ä>".encode "UTF-8"
ent.encode uml             #=> "&lt;\xc3\xa4&gt;" in UTF-8

uml = "<ä>".encode "ISO-8859-1"
ent.encode uml             #=> "&lt;\xe4&gt;"     in ISO-8859-1
# File lib/hermeneutics/escape.rb, line 150
def encode str
  r = str.new_string
  r.gsub! RE_ASC do |x| "&#{SPECIAL_ASC[ x]};" end
  unless @keep_8bit then
    r.gsub! /[^\0-\x7f]/ do |c|
      c.encode! __ENCODING__
      s = SPECIAL[ c] || ("#x%04x" % c.ord)
      "&#{s};"
    end
  end
  r
end
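
The two masking stages and the effect of keep_8bit can be tried with the stand-alone sketch below. MaskSketch and its two small tables are hypothetical stand-ins for the library's SPECIAL_ASC and SPECIAL constants, whose real contents are not shown on this page. The sketch also uses plain gsub instead of the library's new_string helper.

```ruby
class MaskSketch
  # Hypothetical excerpts of the library's lookup tables.
  SPECIAL_ASC = { "&" => "amp", "<" => "lt", ">" => "gt", '"' => "quot" }.freeze
  SPECIAL     = { "ä" => "auml", "ö" => "ouml", "ü" => "uuml" }.freeze

  def initialize(keep_8bit: nil)
    @keep_8bit = keep_8bit
  end

  def encode(str)
    # Stage 1: always mask the special ASCII characters.
    r = str.gsub(/[&<>"]/) { |x| "&#{SPECIAL_ASC[x]};" }
    # Stage 2: mask everything outside ASCII unless asked to keep it.
    unless @keep_8bit
      r = r.gsub(/[^\0-\x7f]/) do |c|
        u = c.encode("UTF-8")   # look up by the UTF-8 form of the character
        "&#{SPECIAL[u] || format('#x%04x', u.ord)};"
      end
    end
    r
  end
end

masked = MaskSketch.new.encode("<ä>")
kept   = MaskSketch.new(keep_8bit: true).encode("<ä>")
euro   = MaskSketch.new.encode("€")
latin1 = MaskSketch.new(keep_8bit: true).encode("<ä>".encode("ISO-8859-1"))
```

Characters without a named entry fall back to a hexadecimal numeric entity, and the result keeps the encoding of the input.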