class Sanitize
Constants
- REGEX_DATA_ATTR
Matches a valid HTML5 data attribute name. The unicode ranges included here are a conservative subset of the full range of characters that are technically allowed, with the intent of matching the most common characters used in data attribute names while excluding uncommon or potentially misleading characters, or characters with the potential to be normalized into unsafe or confusing forms.
If you need data attr names with characters that aren't included here (such as combining marks, full-width characters, or CJK), please consider creating a custom transformer to validate attributes according to your needs.
- REGEX_PROTOCOL
Matches an attribute value that could be treated by a browser as a URL with a protocol prefix, such as “http:” or “javascript:”. Any string of zero or more characters followed by a colon is considered a match, even if the colon is encoded as an entity and even if it's an incomplete entity (which IE6 and Opera will still parse).
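The matching rule described above can be sketched in plain Ruby. The pattern below is a hypothetical illustration written for this example (`PROTOCOL_SKETCH` and `protocol_of` are assumed names), and it is not guaranteed to be identical to the library's actual REGEX_PROTOCOL.

```ruby
# Hypothetical pattern for illustration only; not necessarily the value of
# Sanitize::REGEX_PROTOCOL. It captures any run of characters (excluding "/"
# and "#") that is followed by a colon, a decimal colon entity (&#58), or a
# hex colon entity (&#x3a), with or without the closing semicolon.
PROTOCOL_SKETCH = /\A([^\/#]*?)(?::|&#0*58|&#x0*3a)/i

# Returns the lowercased protocol-like prefix, or nil if there is none.
def protocol_of(value)
  match = PROTOCOL_SKETCH.match(value)
  match && match[1].downcase
end

protocol_of('http://example.com/')        # => "http"
protocol_of('JavaScript:alert(1)')        # => "javascript"
protocol_of('javascript&#58;alert(1)')    # => "javascript" (entity-encoded colon)
protocol_of('javascript&#x3aalert(1)')    # => "javascript" (incomplete entity)
protocol_of('/relative/path:with-colon')  # => nil
```

Because zero leading characters are allowed, a value such as `":foo"` also matches in this sketch, with an empty string as the captured prefix.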
- VERSION
Attributes
Public Class Methods
Returns a sanitized copy of html, using the settings in config if specified.
    # File lib/sanitize.rb, line 66
    def self.clean(html, config = {})
      Sanitize.new(config).clean(html)
    end
Performs Sanitize#clean in place, returning html, or nil if no changes were made.
    # File lib/sanitize.rb, line 72
    def self.clean!(html, config = {})
      Sanitize.new(config).clean!(html)
    end
Performs Sanitize#clean using a full-document HTML parser instead of the default fragment parser. This will add a DOCTYPE and html tag unless they are already present.
    # File lib/sanitize.rb, line 79
    def self.clean_document(html, config = {})
      Sanitize.new(config).clean_document(html)
    end
Performs Sanitize#clean_document in place, returning html, or nil if no changes were made.
    # File lib/sanitize.rb, line 85
    def self.clean_document!(html, config = {})
      Sanitize.new(config).clean_document!(html)
    end
Sanitizes the specified Nokogiri::XML::Node and all its children.
    # File lib/sanitize.rb, line 90
    def self.clean_node!(node, config = {})
      Sanitize.new(config).clean_node!(node)
    end
Returns a new Sanitize object initialized with the settings in config.
    # File lib/sanitize.rb, line 99
    def initialize(config = {})
      @config = Config::DEFAULT.merge(config)

      @transformers = {
        :breadth => Array(@config[:transformers_breadth].dup),
        :depth => Array(@config[:transformers]) + Array(@config[:transformers_depth])
      }

      # Default depth transformers. These always run at the end of the chain,
      # after any custom transformers.
      @transformers[:depth] << Transformers::CleanComment unless @config[:allow_comments]

      @transformers[:depth] <<
        Transformers::CleanCDATA <<
        Transformers::CleanElement.new(@config)
    end
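The way the constructor assembles the depth transformer chain can be sketched without the library. In the example below, `DEFAULT_SKETCH` and `build_depth_chain` are hypothetical names and symbols stand in for the real transformer objects. The actual defaults live in Config::DEFAULT and are not reproduced here.

```ruby
# Hypothetical stand-in for Config::DEFAULT; the real defaults are not shown.
DEFAULT_SKETCH = {
  :allow_comments => false,
  :transformers => [],
  :transformers_breadth => [],
  :transformers_depth => []
}

# Mirrors the ordering in #initialize: custom transformers first, then the
# built-in ones, which always run at the end of the chain. Symbols stand in
# for Transformers::CleanComment, CleanCDATA and CleanElement.
def build_depth_chain(user_config)
  config = DEFAULT_SKETCH.merge(user_config)
  chain = Array(config[:transformers]) + Array(config[:transformers_depth])
  chain << :clean_comment unless config[:allow_comments]
  chain << :clean_cdata << :clean_element
  chain
end

build_depth_chain({ :transformers => [:my_transformer] })
# => [:my_transformer, :clean_comment, :clean_cdata, :clean_element]

build_depth_chain({ :transformers => :my_transformer, :allow_comments => true })
# => [:my_transformer, :clean_cdata, :clean_element]
```

`Array(...)` is what lets a single transformer be passed without wrapping it in an array, as the second call shows.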
Public Instance Methods
Returns a sanitized copy of the given html fragment.
    # File lib/sanitize.rb, line 117
    def clean(html)
      if html
        dupe = html.dup
        clean!(dupe) || dupe
      end
    end
Performs clean in place, returning html, or nil if no changes were made.
    # File lib/sanitize.rb, line 126
    def clean!(html, parser = Nokogiri::HTML::DocumentFragment)
      fragment = parser.parse(html)
      clean_node!(fragment)

      output_method_params = {:encoding => @config[:output_encoding], :indent => 0}

      if @config[:output] == :xhtml
        output_method = fragment.method(:to_xhtml)
        output_method_params[:save_with] = Nokogiri::XML::Node::SaveOptions::AS_XHTML
      elsif @config[:output] == :html
        output_method = fragment.method(:to_html)
      else
        raise Error, "unsupported output format: #{@config[:output]}"
      end

      result = output_method.call(output_method_params)

      return result == html ? nil : html[0, html.length] = result
    end
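The in-place convention used by clean! (mutate the argument, return nil when nothing changed) and the copy-then-fall-back idiom used by clean can be sketched with a trivial stand-in. `squeeze_spaces!` and `squeeze_spaces` are hypothetical helpers that collapse repeated spaces; they only illustrate the return-value contract and do no HTML processing.

```ruby
# Hypothetical stand-in for the parse/serialize round trip in clean!.
def squeeze_spaces!(text)
  result = text.gsub(/ {2,}/, ' ')
  # Slice assignment overwrites the contents of the same String object, and
  # the assignment expression evaluates to the assigned value.
  result == text ? nil : (text[0, text.length] = result)
end

# Non-destructive variant, following the same shape as #clean: work on a
# copy and fall back to the untouched copy when the bang method returns nil.
def squeeze_spaces(text)
  copy = text.dup
  squeeze_spaces!(copy) || copy
end

input = 'a  b'.dup
squeeze_spaces!(input)        # => "a b"; input itself now holds "a b"
squeeze_spaces!('a b'.dup)    # => nil, because nothing changed
squeeze_spaces('x   y')       # => "x y"; the argument is left untouched
```

The nil return value makes it cheap for callers to detect whether anything was removed, in the same spirit as Ruby's own `String#gsub!`.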
Returns a sanitized copy of the given full html document.
    # File lib/sanitize.rb, line 147
    def clean_document(html)
      unless html.nil?
        clean_document!(html.dup) || html
      end
    end
Performs clean_document in place, returning html, or nil if no changes were made.
    # File lib/sanitize.rb, line 155
    def clean_document!(html)
      if !@config[:elements].include?('html') && !@config[:remove_contents]
        raise 'You must have the HTML element whitelisted to call #clean_document unless remove_contents is set to true'
        # otherwise Nokogiri will raise for having multiple root nodes when
        # it moves its children to the root document context
      end

      clean!(html, Nokogiri::HTML::Document)
    end
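The precondition checked at the top of clean_document! can be expressed as a small predicate, which may help when validating a config before calling it. `document_cleanable?` is a hypothetical helper written for this example and is not part of the library; the config hashes are minimal illustrations.

```ruby
# Hypothetical helper: true when the config would pass the guard in
# clean_document!, i.e. 'html' is whitelisted or :remove_contents is set.
def document_cleanable?(config)
  config[:elements].include?('html') || !!config[:remove_contents]
end

document_cleanable?({ :elements => %w[html head body p] })              # => true
document_cleanable?({ :elements => %w[p], :remove_contents => true })   # => true
document_cleanable?({ :elements => %w[p], :remove_contents => false })  # => false
```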
Sanitizes the specified Nokogiri::XML::Node and all its children.
    # File lib/sanitize.rb, line 166
    def clean_node!(node)
      raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)

      node_whitelist = Set.new

      unless @transformers[:breadth].empty?
        traverse_breadth(node) {|n| transform_node!(n, node_whitelist, :breadth) }
      end

      traverse_depth(node) {|n| transform_node!(n, node_whitelist, :depth) }
      node
    end
Private Instance Methods
    # File lib/sanitize.rb, line 181
    def transform_node!(node, node_whitelist, mode)
      @transformers[mode].each do |transformer|
        result = transformer.call({
          :config => @config,
          :is_whitelisted => node_whitelist.include?(node),
          :node => node,
          :node_name => node.name.downcase,
          :node_whitelist => node_whitelist,
          :traversal_mode => mode
        })

        if result.is_a?(Hash) && result[:node_whitelist].respond_to?(:each)
          node_whitelist.merge(result[:node_whitelist])
        end
      end

      node
    end
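Judging from the source above, a transformer is any object that responds to call, receives an environment Hash, and may return a Hash whose :node_whitelist entry is merged into the shared whitelist. The sketch below reproduces that contract in plain Ruby. `NodeSketch`, `WHITELIST_IMAGES` and `run_chain` are hypothetical names, and a Struct stands in for a Nokogiri node; only a subset of the environment keys is passed.

```ruby
require 'set'

# Hypothetical stand-in for a Nokogiri node; only #name is needed here.
NodeSketch = Struct.new(:name)

# A transformer that whitelists every <img> node it sees and ignores the rest.
WHITELIST_IMAGES = lambda do |env|
  if env[:node_name] == 'img'
    { :node_whitelist => [env[:node]] }
  end
end

# Simplified version of the loop in transform_node!.
def run_chain(transformers, node, node_whitelist)
  transformers.each do |transformer|
    result = transformer.call({
      :is_whitelisted => node_whitelist.include?(node),
      :node => node,
      :node_name => node.name.downcase,
      :node_whitelist => node_whitelist
    })

    if result.is_a?(Hash) && result[:node_whitelist].respond_to?(:each)
      node_whitelist.merge(result[:node_whitelist])
    end
  end
  node
end

whitelist = Set.new
seen = []
report = lambda { |env| seen << [env[:node_name], env[:is_whitelisted]]; nil }

run_chain([WHITELIST_IMAGES, report], NodeSketch.new('IMG'), whitelist)
run_chain([WHITELIST_IMAGES, report], NodeSketch.new('P'), whitelist)
seen  # => [["img", true], ["p", false]]
```

Because :is_whitelisted is recomputed before each transformer runs, a later transformer in the chain sees nodes whitelisted by an earlier one. Note that a Struct compares by value, so two `NodeSketch` objects with the same name count as the same set member; real nodes would not behave that way.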
Performs breadth-first traversal, operating first on the root node, then traversing downwards.
    # File lib/sanitize.rb, line 202
    def traverse_breadth(node, &block)
      block.call(node)
      node.children.each {|child| traverse_breadth(child, &block) }
    end
Performs depth-first traversal, operating first on the deepest nodes in the document, then traversing upwards to the root.
    # File lib/sanitize.rb, line 209
    def traverse_depth(node, &block)
      node.children.each {|child| traverse_depth(child, &block) }
      block.call(node)
    end
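The visiting order of the two traversals is easiest to see on a small tree. The sketch below copies both methods and runs them over a hypothetical `TreeNode` Struct instead of Nokogiri nodes; `visit_order` and `TREE` are likewise assumed names for this example.

```ruby
# Hypothetical stand-in for a node; only #name and #children are needed.
TreeNode = Struct.new(:name, :children)

# Parent first, then each child subtree in turn.
def traverse_breadth(node, &block)
  block.call(node)
  node.children.each { |child| traverse_breadth(child, &block) }
end

# Each child subtree first, then the parent.
def traverse_depth(node, &block)
  node.children.each { |child| traverse_depth(child, &block) }
  block.call(node)
end

# Collects node names in the order the given traversal yields them.
def visit_order(root, traversal)
  names = []
  send(traversal, root) { |node| names << node.name }
  names
end

# <div><p><b/></p><ul><li/></ul></div>
TREE = TreeNode.new('div', [
  TreeNode.new('p', [TreeNode.new('b', [])]),
  TreeNode.new('ul', [TreeNode.new('li', [])])
])

visit_order(TREE, :traverse_breadth)  # => ["div", "p", "b", "ul", "li"]
visit_order(TREE, :traverse_depth)    # => ["b", "p", "li", "ul", "div"]
```

As written, traverse_breadth finishes the whole `p` subtree before moving on to `ul`, so its order is a top-down (pre-order) walk in which parents always come before their descendants, not a strict level-by-level walk. The depth traversal guarantees the opposite: every descendant is processed before its parent.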