class Hermeneutics::Entities
Translate HTML and XML character entities: "&" to "&amp;" and vice versa.
What actually happens
HTML pages usually arrive with characters encoded as entities: &lt; for < and &euro; for €.
Further, they may contain a meta tag in the header like this:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta charset="utf-8" />     (HTML5)

or

  <?xml version="1.0" encoding="UTF-8" ?>     (XHTML)

When charset is utf-8 and the file contains the byte sequence "\303\244" / "\xc3\xa4", the character "ä" will be displayed.

When charset is iso8859-15 and the file contains the byte "\344" / "\xe4", the character "ä" will be displayed as well.
The sequence "&auml;" will produce an "ä" in any case.
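The charset dependence described above can be reproduced in plain Ruby. The following is a minimal sketch that uses only core String methods; the variable names are illustrative and not part of Hermeneutics.

```ruby
# Two different byte sequences, both meaning "ä" once the charset is known.
utf8_bytes   = [0xc3, 0xa4].pack("C*")   # "\xc3\xa4", still binary
latin9_bytes = [0xe4].pack("C*")         # "\xe4", still binary

# Tag each byte string with the encoding the page declares.
from_utf8   = utf8_bytes.force_encoding(Encoding::UTF_8)
from_latin9 = latin9_bytes.force_encoding(Encoding::ISO_8859_15)

# Transcode the ISO-8859-15 string to UTF-8 so that both can be compared.
same = from_utf8 == from_latin9.encode(Encoding::UTF_8)

puts same                    # true
puts from_utf8.bytesize      # 2
puts from_latin9.bytesize    # 1
```

Without the declared charset, the bytes alone do not say which character is meant; that ambiguity is what entity notation avoids.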
What you should do
When generating your own HTML pages, you will always be safe if you produce only entity tags such as &auml; and &euro;, or &#x00e4; and &#x20ac; respectively.
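As an illustration of this advice, the sketch below replaces every non-ASCII character with a hexadecimal numeric entity, so the output is pure ASCII and independent of the page's charset. The helper name to_numeric_entities is hypothetical and not part of Hermeneutics; it also deliberately leaves "&" and "<" alone.

```ruby
# Hypothetical helper, not part of Hermeneutics: replace every non-ASCII
# character by a hexadecimal numeric entity such as "&#x00e4;".
def to_numeric_entities(str)
  # Work in UTF-8 so that String#ord yields the Unicode code point.
  utf8 = str.encode(Encoding::UTF_8)
  utf8.gsub(/[^\x00-\x7f]/) { |c| format("&#x%04x;", c.ord) }
end

puts to_numeric_entities("ä and €")   # &#x00e4; and &#x20ac;
```

Because the input is transcoded to UTF-8 first, the same entities come out whether the source string was UTF-8 or ISO-8859-15.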
What this module does
This module translates strings to an HTML-masked version. The encoding will not be changed, and you may ask to keep 8-bit characters.
Examples

  Entities.encode "<"                     #=> "&lt;"
  Entities.decode "&lt;"                  #=> "<"
  Entities.encode "äöü"                   #=> "&auml;&ouml;&uuml;"
  Entities.decode "&auml;&ouml;&uuml;"    #=> "äöü"

Attributes
Public Class Methods
Replace HTML-style masks by normal characters:

  Entities.decode "&lt;"                  #=> "<"
  Entities.decode "&auml;&ouml;&uuml;"    #=> "äöü"

Unmasked 8-bit characters ("ä" instead of "&auml;") will be kept; decoded entities are converted to the string's encoding, so the result has one uniform encoding.

  s = "ä &ouml; ü"
  s.encode! "utf-8"
  Entities.decode s      #=> "ä ö ü"

  s = "\xe4 &ouml; \xfc &euro;"
  s.force_encoding "iso-8859-15"
  Entities.decode s      #=> "ä ö ü €"  (in iso8859-15)

  # File lib/hermeneutics/escape.rb, line 200
  def decode str
    str.gsub /&(.+?);/ do
      (named_decode $1) or (numeric_decode $1) or $&
    end
  end
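
The fallback chain in decode (try a lookup, otherwise keep the original match) can be tried out in isolation. The sketch below is a simplified stand-in, not the library's implementation: NAMES_SKETCH is a tiny illustrative table, decode_sketch is a hypothetical name, and numeric references are left out.

```ruby
# Illustrative subset; the library's own NAMES table is not reproduced here.
NAMES_SKETCH = { "lt" => "<", "gt" => ">", "amp" => "&", "auml" => "ä" }.freeze

# Hypothetical stand-in for decode: look the entity name up in the table
# and leave unknown entities exactly as they were found ($& is the match).
def decode_sketch(str)
  str.gsub(/&(.+?);/) { NAMES_SKETCH[$1] || $& }
end

puts decode_sketch("&lt;b&gt; &auml; &bogus;")   # <b> ä &bogus;
```

Returning the whole match for unknown names means malformed or unsupported entities pass through unchanged instead of being dropped.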
  # File lib/hermeneutics/escape.rb, line 175
  def encode str
    std.encode str
  end
Creates an Entities converter.

  ent = Entities.new keep_8bit: true

  # File lib/hermeneutics/escape.rb, line 125
  def initialize keep_8bit: nil
    @keep_8bit = keep_8bit
  end
  # File lib/hermeneutics/escape.rb, line 171
  def std
    @std ||= new
  end
Private Class Methods
  # File lib/hermeneutics/escape.rb, line 208
  def named_decode s
    c = NAMES[ s]
    if c then
      if c.encoding != s.encoding then
        c.encode s.encoding
      else
        c
      end
    end
  end
  # File lib/hermeneutics/escape.rb, line 219
  def numeric_decode s
    if s =~ /\A#(?:(\d+)|x([0-9a-f]+))\z/i then
      c = ($1 ? $1.to_i : ($2.to_i 0x10)).chr Encoding::UTF_8
      c.encode! s.encoding
    end
  end
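
To see how the regular expression in numeric_decode separates decimal from hexadecimal references, here is a standalone sketch. The name numeric_decode_sketch is hypothetical; unlike the method above, it always returns UTF-8 and does not transcode to the caller's encoding.

```ruby
# Hypothetical stand-in for numeric_decode: turn the text between "&" and ";"
# into a character, or return nil when it is not a numeric reference.
def numeric_decode_sketch(ref)
  if ref =~ /\A#(?:(\d+)|x([0-9a-f]+))\z/i
    # $1 holds decimal digits, $2 holds hexadecimal digits.
    codepoint = $1 ? $1.to_i : $2.to_i(16)
    codepoint.chr(Encoding::UTF_8)
  end
end

puts numeric_decode_sketch("#228")      # ä
puts numeric_decode_sketch("#x20ac")    # €
p    numeric_decode_sketch("auml")      # nil
```

Note that Integer#chr raises a RangeError for values that are not valid code points, so a caller handling untrusted input may want to rescue that.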
Public Instance Methods
  # File lib/hermeneutics/escape.rb, line 163
  def decode str
    self.class.decode str
  end
Create a string whose characters are masked HTML-style:

  ent = Entities.new
  ent.encode "&<\""    #=> "&amp;&lt;&quot;"
  ent.encode "äöü"     #=> "&auml;&ouml;&uuml;"

The result will be in the same encoding as the source, even if it does not contain any 8-bit characters (it can contain them only when keep_8bit is set).

  ent = Entities.new keep_8bit: true
  uml = "<ä>".encode "UTF-8"
  ent.encode uml         #=> "&lt;\xc3\xa4&gt;" in UTF-8
  uml = "<ä>".encode "ISO-8859-1"
  ent.encode uml         #=> "&lt;\xe4&gt;" in ISO-8859-1

  # File lib/hermeneutics/escape.rb, line 150
  def encode str
    r = str.new_string
    r.gsub! RE_ASC do |x| "&#{SPECIAL_ASC[ x]};" end
    unless @keep_8bit then
      r.gsub! /[^\0-\x7f]/ do |c|
        c.encode! __ENCODING__
        s = SPECIAL[ c] || ("#x%04x" % c.ord)
        "&#{s};"
      end
    end
    r
  end
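
The two-pass structure of encode (ASCII specials first, then non-ASCII characters unless keep_8bit is set) can be sketched without the library. Everything below is an assumption made for illustration: the class name EntitySketch, the two small tables standing in for SPECIAL_ASC and SPECIAL, and the set of ASCII characters that get escaped.

```ruby
class EntitySketch
  # Illustrative subsets standing in for the library's lookup tables.
  ASCII_NAMES = { "&" => "amp", "<" => "lt", ">" => "gt", '"' => "quot" }.freeze
  NAMES       = { "ä" => "auml", "ö" => "ouml", "ü" => "uuml" }.freeze

  def initialize(keep_8bit: nil)
    @keep_8bit = keep_8bit
  end

  def encode(str)
    # Work in UTF-8 so table lookups and String#ord behave uniformly.
    r = str.encode(Encoding::UTF_8)
    # Pass 1: mask the ASCII characters that are special in HTML.
    r = r.gsub(/[&<>"]/) { |x| "&#{ASCII_NAMES[x]};" }
    # Pass 2: mask everything outside ASCII, by name if known, else by number.
    unless @keep_8bit
      r = r.gsub(/[^\x00-\x7f]/) { |c| "&#{NAMES[c] || format('#x%04x', c.ord)};" }
    end
    # Hand the result back in the caller's encoding.
    r.encode(str.encoding)
  end
end

puts EntitySketch.new.encode("<äöü & €>")
# &lt;&auml;&ouml;&uuml; &amp; &#x20ac;&gt;
puts EntitySketch.new(keep_8bit: true).encode("<ä>")
# &lt;ä&gt;
```

Masking "&" in the first pass, before any entities are generated, is what keeps the "&" of the entities produced in the second pass from being escaped again.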