class PrettyProxy

The PrettyProxy class aggregates and validates the configuration of a proxy based on simple, pretty-URL-oriented rewriting rules. It is also a Rack app, and offers an abstract method for rewriting the responses returned by the proxy. (X)HTML responses are rewritten so that the hyperlinks point to the proxy version of the page, if one exists.

@example A terrible example

# You can run this example with 'rake heresy_example' in the gem folder
# and see the result in localhost:9292/proxy/
require 'pretty_proxy'

class Heresy < PrettyProxy
  def sugared_rewrite_response(triplet, requested_to_proxy_env, rewritten_env)
    status, headers, page = triplet
    page = page.gsub(/(MTG )?Magic(: The Gathering)?/, 'Yu-Gi-Oh')
    [status, headers, page]
  end
end

run Heresy.new('/proxy/', 'http://magiccards.info', '/')

If you want to make a Rack app that uses the proxy to point to another path of the same app, you have to run the server in multithreaded mode; otherwise requests to the proxy will end in a deadlock. The proxy requests the original page, but the server doesn’t respond because it is waiting for the proxy request to be resolved. The proxy request doesn’t finish because it needs the original page. A timeout error occurs.
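The deadlock described above can be reproduced with a toy model that uses only the standard library. This is a sketch, not PrettyProxy or Rack code: the "server" is a pool of worker threads reading lambdas from a queue, and `proxy_request_outcome` is a hypothetical name chosen for the illustration.

```ruby
require 'timeout'

# Toy model of a server: a fixed pool of worker threads that take
# requests (lambdas) from a shared queue. Nothing here is PrettyProxy API.
def proxy_request_outcome(worker_count)
  requests = Queue.new
  workers = Array.new(worker_count) do
    Thread.new do
      loop { requests.pop.call }
    end
  end

  answer = Queue.new
  # The request to the proxy: while it is being handled, it asks the
  # same server for the original page and waits for the reply.
  requests << lambda do
    original_page = Queue.new
    requests << -> { original_page << 'original page' }
    answer << original_page.pop
  end

  # The client gives up after half a second.
  Timeout.timeout(0.5) { answer.pop }
rescue Timeout::Error
  :timeout
ensure
  workers.each(&:kill) if workers
end

single_worker = proxy_request_outcome(1) # the only worker waits on itself
two_workers   = proxy_request_outcome(2) # a second worker serves the original page
```

With one worker the inner request is never picked up, so the call ends in a timeout; with two workers it completes. A real server in multithreaded (or multi-process) mode plays the role of the second worker.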

What this class can’t do yet, but may do in the future: smart handling of 3xx status responses and of chunked encoding (the chunks are concatenated in the proxy and the transfer-encoding header is removed); support for content encodings other than deflate and gzip; exception classes that carry more than a message.

The exception classes (except Error) inherit from Error, and Error inherits from ArgumentError. They are empty for now and only carry a message.

Glossary:

‘a valid proxy url/path’: The path (or the path of the url) starts with the proxy_path and is followed by an original_path.

‘in(side)/out(side) the proxy control’: The url has (or doesn’t have) a path starting with an original_path, and its scheme, port and host are the same as those of the original_domain.

CHANGELOG:

4.0.0
  * proxify_hyperlink doesn't take relative paths or urls anymore, only
    absolute urls. This is because the proxy url was used for a double
    purpose (to know the proxy scheme+host+port and to resolve relative
    hyperlinks). This can lead to the mistake of believing that the
    base url to resolve relative links in the page is the page url
    (which is false if the page has a base tag). See more in:
    http://www.w3.org/TR/html5/infrastructure.html#base-urls
  * proxify_html (and other methods that use it, such as #call) uses the
    base tag from the page to determine the base url, and adds a base
    tag (if the page doesn't have one) to simplify the proxification of
    assets. All a[href] are changed to absolute urls.
  * rspec-html-matchers added as development dependency
3.0.0
  * unproxify_url returns a String (and no longer a URI).
    Because this is a change in the API (and can break code) the major
    version is now 3; if you don't use this method you can safely upgrade
  * depends on the addressable gem
  * correctly handles URIs without a scheme (but with a host)
    like '//duckduckgo.com/' (spec added for that)

@author Henrique Becker

Attributes

ignore_html_errors[RW]

Public Class Methods

new(proxy_path, original_domain, original_paths, ignore_html_errors = false)

Create a new PrettyProxy instance or raise a ConfigError. The arguments are cloned.

@param proxy_path [String] Starts and ends with a slash; represents the path in the proxy site that maps to the proxy app (and, in consequence, to another path in the same or another site).

@param original_domain [String, URI] A URL without path (no trailing slash), query or fragment (it can have scheme (http[s]), domain and port); the site to which the proxy maps.

@param original_paths [String, each] The path (or the paths) to be mapped right inside the proxy_path (each one has to begin with a slash).

@param ignore_html_errors [TrueClass, FalseClass] If true, #proxify_html tries to ignore some exceptions that can be caused by malformed (X)HTML and continues. It doesn't silence #{sugared_,}rewrite_response. Experimental.

@note See the specs {file:../spec/pretty_proxy_spec.rb} for examples and the complete definition of invalid args.

@return [PrettyProxy] a new instance

@raise PrettyProxy::ConfigError

# File lib/pretty_proxy.rb, line 108
def initialize(proxy_path, original_domain, original_paths, ignore_html_errors = false)
  Utils.validate_proxy_path(proxy_path)
  Utils.validate_original_domain_and_paths(original_domain, original_paths)

  @ignore_html_errors = ignore_html_errors
  @proxy_path = proxy_path.clone
  @original_domain = Addressable::URI.parse(original_domain.clone)
  @original_paths = Set.new 
  if original_paths.respond_to? :each
    original_paths.each { | value | @original_paths << value.clone }
  else
    @original_paths << original_paths.clone
  end
end
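The last branch of the constructor is what lets original_paths be either a single String or a collection. A minimal sketch of that normalization, runnable without the gem, follows; `collect_original_paths` is a hypothetical helper name and the paths are made-up values.

```ruby
require 'set'

# Hypothetical helper mirroring the respond_to?(:each) branch of #initialize.
# A String does not respond to #each, so it takes the else branch.
def collect_original_paths(original_paths)
  collected = Set.new
  if original_paths.respond_to?(:each)
    original_paths.each { |path| collected << path.clone }
  else
    collected << original_paths.clone
  end
  collected
end

single  = collect_original_paths('/')
several = collect_original_paths(['/cards/', '/sets/', '/cards/'])

puts single.to_a.inspect
puts several.to_a.inspect
```

Because the paths end up in a Set, a path given twice is stored once.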

Public Instance Methods

call(env)

Makes this class a Rack app. It is overridden to pass to rewrite_response both the original Rack environment (the request to the proxy) and the rewritten env (modified to point to the original page request). If you don’t know the parameters and return value of this method, please read {rack.rubyforge.org/doc/SPEC.html}.

# File lib/pretty_proxy.rb, line 392
def call(env)
  # in theory we only need to repass the rewritten_env, any original env info
  #  needed can be passed as a environment application variable
  #  example: (env['app_name.original_path'] = env['PATH_INFO'])
  #  but to avoid this to be a common idiom we repass the original env too
  rewritten_env = rewrite_env(env)
  rewrite_response(perform_request(rewritten_env), env, rewritten_env)
end

inside_proxy_control?(uri)

Check whether the URI::HTTP(S) is a page that can be accessed through the proxy.

# File lib/pretty_proxy.rb, line 408
def inside_proxy_control?(uri)
  same_domain_as_original?(uri) &&
    valid_path_for_proxy?(@proxy_path + uri.path[1..-1])
end

original_domain=(original_domain)

# File lib/pretty_proxy.rb, line 139
def original_domain=(original_domain)
  Utils.validate_original_domain_and_paths(original_domain, @original_paths)
  @original_domain = original_domain
end

original_paths=(original_paths)

# File lib/pretty_proxy.rb, line 144
def original_paths=(original_paths)
  Utils.validate_original_domain_and_paths(@original_domain, original_paths)
  @original_paths = original_paths
end

point_to_a_proxy_page?(hyperlink, proxy_domain)

Take a url and the proxy domain (scheme, host and port) and return whether the url points to a valid proxy page.

# File lib/pretty_proxy.rb, line 433
def point_to_a_proxy_page?(hyperlink, proxy_domain)
  Utils.same_domain?(hyperlink, proxy_domain) &&
    valid_path_for_proxy?(hyperlink.path)
end

proxify_html(html, proxy_url, mime_type)

Take an (X)HTML document, add a base tag (if there is none) and apply proxify_hyperlink to the ‘href’ attribute of each ‘a’ element. If the page already has a base tag, it is left unchanged. If a valid mime_type is passed as argument, but the html argument can’t be parsed according to this mime type, the method simply returns the first argument unchanged.

@param html [String] An (X)HTML document.

@param proxy_url [String, URI::HTTP, URI::HTTPS] The url where the proxified version of the page will be displayed.

@param mime_type [String] A string containing ‘text/html’ or 'application/xhtml+xml' (case-insensitive; any characters before or after the type are ignored). Defines whether the content will be parsed as xml or html. See this link for more info: http://www.w3.org/TR/xhtml-media-types/. An exception is raised if an invalid value is provided.

@return [String] A copy of the document with the changes applied, or the original string if the document can't be parsed.

@raise PrettyProxy::ProxyError

# File lib/pretty_proxy.rb, line 211
def proxify_html(html, proxy_url, mime_type)
  parsed_html = Utils.parse_html_or_xhtml(html, mime_type)

  if parsed_html.nil?
    return html
  end

  # This isn't in conformance with the following document
  # http://www.w3.org/TR/html5/infrastructure.html#base-urls
  # but support to frames is not a priority
  document_original_url = unproxify_url(proxy_url)
  # in theory base must have a href... but to avoid an exception by bad html
  base_tag = parsed_html.at_css('base[href]')
  base_url = nil
  if base_tag
    base_url = Addressable::URI.parse(document_original_url)
                               .join(base_tag['href']).to_s
  else
    base_url = document_original_url
  end

  # href isn't a mandatory attribute of an anchor element
  parsed_html.css('a[href]').each do | hyperlink |
    begin
      absolute_hyperlink = Addressable::URI.parse(base_url)
                                           .join(hyperlink['href']).to_s
      hyperlink['href'] = proxify_hyperlink(absolute_hyperlink, proxy_url)
    rescue => e
      # When @ignore_html_errors is set we swallow any exception derived
      # from StandardError here. This is a little risky, but the link in
      # the href can be wrong in many ways and still be accepted by nokogiri.
      # So, to keep the code simple, we ignore the links we can't proxify.
      raise e unless @ignore_html_errors
    end
  end

  unless base_tag
    is_XML = %r{application/xhtml\+xml}.match(mime_type)
    base_tag = "<base href='#{document_original_url}' #{is_XML ? '/' : ''}>"
    parsed_html.at_css('head').first_element_child
               .add_previous_sibling(base_tag)
  end

  parsed_html.to_s
end
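The base url logic above can be illustrated with the standard library alone. The sketch below uses URI.join as a stand-in for the Addressable::URI#join calls in the method, and the urls are assumed example values, not taken from the gem's specs.

```ruby
require 'uri'

# Assumed example values.
document_url = 'http://example.com/blog/2013/post.html'
base_href    = '/archive/'         # value of <base href="..."> in the page
link_href    = 'older/first.html'  # value of an <a href="..."> in the page

# With a base tag: the base href is resolved against the document url first,
# and every hyperlink is then resolved against the resulting base url.
base_url  = URI.join(document_url, base_href).to_s
with_base = URI.join(base_url, link_href).to_s

# Without a base tag the document url itself is the base url.
without_base = URI.join(document_url, link_href).to_s

puts with_base
puts without_base
```

The same relative href resolves to two different absolute urls, which is why the page url alone is not enough to resolve relative links.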

proxy_path=(proxy_path)

# File lib/pretty_proxy.rb, line 134
def proxy_path=(proxy_path)
  Utils.validate_proxy_path(proxy_path)
  @proxy_path = proxy_path
end

rewrite_env(env)

Modify a Rack environment hash of a request to the proxy version of a page into a request to the original page. As in Rack::Proxy, it is used by call to request the original page before calling rewrite_response on the response. If you want to use your own rewrite rules, it may be wiser to subclass Rack::Proxy instead of subclassing this class. The purpose of this class is mainly to implement and enforce these rules for you.

@param env [Hash{String => String}] A Rack environment hash. (see: {http://rack.rubyforge.org/doc/SPEC.html})

@return [Hash{String => String}] An unproxified copy of the argument.

@raise PrettyProxy::ProxyError

# File lib/pretty_proxy.rb, line 267
def rewrite_env(env)
  env = env.clone
  url_requested_to_proxy = Rack::Request.new(env).url
  # Addressable::URI#port is nil when the url has no explicit port, which is why inferred_port is used below
  unproxified_url = Addressable::URI.parse(unproxify_url(url_requested_to_proxy))

  if env['HTTP_HOST']
    env['HTTP_HOST'] = unproxified_url.host
  end
  env['SERVER_NAME'] = unproxified_url.host
  env['SERVER_PORT'] = unproxified_url.inferred_port.to_s

  if env['SCRIPT_NAME'].empty? && !env['PATH_INFO'].empty?
    env['PATH_INFO'] = unproxified_url.path
  end
  if !env['SCRIPT_NAME'].empty? && env['PATH_INFO'].empty?
    env['SCRIPT_NAME'] = unproxified_url.path
  end
  # Seriously, i don't know how to split again the unproxified url, so PATH_INFO gonna have the full path
  if (!env['SCRIPT_NAME'].empty? && !env['PATH_INFO'].empty?) ||
      (env['SCRIPT_NAME'].empty? && env['PATH_INFO'].empty?)
    env['PATH_INFO'] = unproxified_url.path
    env['SCRIPT_NAME'] = ''
  end

  env['REQUEST_PATH'] = unproxified_url.path
  env['REQUEST_URI'] = unproxified_url.path

  env
end

rewrite_response(triplet, requested_to_proxy_env, rewritten_env)

Mainly applies proxify_html to the body of the response if it is HTML. Raises an error if the ‘content-encoding’ is anything other than deflate, gzip or identity. Updates the ‘content-length’ header to the bytesize of the new body. Removes the ‘transfer-encoding’ header if it is chunked, and behaves as if the response were not chunked. This method is inherited from Rack::Proxy, but the original takes only the first parameter (the triplet). This version takes the Rack env requested to the proxy and the rewritten Rack env as second and third parameters, respectively.

@param triplet [Array<(Integer, Hash{String => String}, each)>] A Rack response (see {http://rack.rubyforge.org/doc/SPEC.html}) for the request to the original site.

@param requested_to_proxy_env [Hash{String => String}] A Rack environment hash: the version requested to the proxy.

@param rewritten_env [Hash{String => String}] A Rack environment hash: the version rewritten by the proxy to point to the original page.

@return [Array<(Integer, Hash{String => String}, each)>] An unproxified copy of the first argument.

@raise PrettyProxy::ProxyError

# File lib/pretty_proxy.rb, line 316
def rewrite_response(triplet, requested_to_proxy_env, rewritten_env)
  status, headers, body = triplet
  content_type = headers['content-type']
  return triplet unless 200 == status && (%r{text/html} =~ content_type ||
                        %r{application/xhtml\+xml} =~ content_type)


  # the #each method of body can't be called twice, but we need to call it here and it is called
  # after this method return, so we fake the body with a array of one string
  # we can't return a string (even it responds to #each) see: http://rack.rubyforge.org/doc/SPEC.html (section 'The Body')
  page = ''
  body.each do | chunk |
    page << chunk
  end

  case headers['content-encoding']
  when 'gzip' then page = Zlib::GzipReader.new(StringIO.new(page)).read
  when 'deflate' then page = Zlib::Inflate.inflate(page)
  when 'identity' then page = page
  when nil then page = page
  else
    fail ProxyError, 'unknown content-encoding, only encodings known are gzip, deflate and identity'
  end

  request_to_proxy = Rack::Request.new(requested_to_proxy_env)
  page = proxify_html(page, request_to_proxy.url, content_type)
  status, headers, page = sugared_rewrite_response([status, headers, page],
                                                    requested_to_proxy_env,
                                                    rewritten_env)

  case headers['content-encoding']
  when 'gzip'
    page_ = ''
    gzip_stream = Zlib::GzipWriter.new(StringIO.new(page_))
    gzip_stream.write page
    gzip_stream.close
    page = page_
  when 'deflate' then page = Zlib::Deflate.deflate(page)
  end

  headers['content-length'] = page.bytesize.to_s if headers['content-length']

  # TODO: find a way to make the code work with chunked encoding
  if 'chunked' == headers['transfer-encoding']
    headers.delete('transfer-encoding') 
    headers['content-length'] = page.bytesize.to_s
  end

  [status, headers, [page]]
end
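The decompress, rewrite, recompress cycle performed by this method can be sketched with Zlib alone. The headers and body below are made-up stand-ins for a response from the original site, and the gsub stands in for proxify_html and sugared_rewrite_response; only the gzip case is shown.

```ruby
require 'zlib'

# Stand-in for a gzip-encoded response from the original site.
headers = { 'content-encoding' => 'gzip' }
body    = [Zlib.gzip('<html><body>Magic</body></html>')]
headers['content-length'] = body.first.bytesize.to_s

# 1. Concatenate the chunks yielded by body#each (it can be called only once).
page = String.new
body.each { |chunk| page << chunk }

# 2. Decompress according to the content-encoding header.
page = Zlib.gunzip(page)

# 3. Rewrite the decompressed page.
page = page.gsub('Magic', 'Yu-Gi-Oh')

# 4. Compress again and refresh content-length with the new bytesize.
page = Zlib.gzip(page)
headers['content-length'] = page.bytesize.to_s

# 5. Wrap the string in an array so the body responds to #each.
response_body = [page]
```

The content-length has to be recomputed after step 4 because the compressed size changes whenever the page content changes.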

same_domain_as_original?(uri)

Check whether the scheme, host, and port of the argument are equal to those of the original_domain.

# File lib/pretty_proxy.rb, line 403
def same_domain_as_original?(uri)
  Utils.same_domain?(@original_domain, uri)
end

sugared_rewrite_response(triplet, requested_to_proxy_env, rewritten_env)

The simplest way to use this class is to subclass it and redefine this method.

@abstract This method is called only for (X)HTML responses, after they are decompressed and the hyperlinks proxified, and before they are compressed again and the new content-length is calculated.

@note The body of the triplet is a String and not an object that responds to each; the same has to be true of the return value. Return a modified clone of the response; don't change the argument.

@param triplet [Array<(Integer, Hash{String => String}, String)>] Not a valid Rack response: the third element is a string with the response body.

@param requested_to_proxy_env [Hash{String => String}] A Rack environment hash: the version requested to the proxy.

@param rewritten_env [Hash{String => String}] A Rack environment hash: the version rewritten by the proxy to point to the original page.

@return [Array<(Integer, Hash{String => String}, String)>] An unproxified copy of the first argument.

# File lib/pretty_proxy.rb, line 383
def sugared_rewrite_response(triplet, requested_to_proxy_env, rewritten_env)
  triplet
end
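A sketch of an override that follows the note above (return a modified clone, leave the argument untouched). It is written as a plain method so it runs without the gem; in real use the body would live in a subclass of PrettyProxy. The 'x-rewritten' header and the inserted paragraph are hypothetical.

```ruby
def sugared_rewrite_response(triplet, _requested_to_proxy_env, _rewritten_env)
  status, headers, page = triplet
  # Build new objects instead of mutating the ones received.
  new_headers = headers.merge('x-rewritten' => 'yes') # hypothetical header
  new_page    = page.sub('</body>', '<p>Served through a proxy</p></body>')
  # The third element stays a String, as the note requires.
  [status, new_headers, new_page]
end

original  = [200, { 'content-type' => 'text/html' }, '<html><body>Hi</body></html>']
rewritten = sugared_rewrite_response(original, {}, {})
```

Hash#merge and String#sub both return new objects, so the triplet passed in keeps its original headers and body.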

unproxify_url(url)

Take a proxy url and return the original URL behind the proxy. The query and fragment, if any, are preserved. For the rewriting of a request, see rewrite_env.

@param url [String, URI::HTTP, URI::HTTPS] A URL.

@return [String] The unproxified URI as a string.

@raise PrettyProxy::ProxyError if the url isn't inside the proxy control

@raise ArgumentError if the url isn't a valid URI

# File lib/pretty_proxy.rb, line 154
def unproxify_url(url)
  url = Addressable::URI.parse(url.clone)
  
  unless valid_path_for_proxy? url.path
    fail ProxyError, "'#{url.to_s}' isn't inside the proxy control, it can't be unproxified"
  end

  url.site = @original_domain.site
  url.path = url.path.slice((@proxy_path.size-1)..-1)

  url.to_s
rescue Addressable::URI::InvalidURIError
  raise ArgumentError, "the url argument isn't a valid uri"
end
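The transformation itself can be sketched with the standard URI class. `unproxify_sketch` is a hypothetical stand-in, not the gem's method: it takes the proxy path and original domain as arguments and skips the valid_path_for_proxy? check that the real method performs. The urls reuse the values from the example at the top of this page.

```ruby
require 'uri'

# Hypothetical stand-in for unproxify_url using only the standard library.
def unproxify_sketch(proxy_url, proxy_path, original_domain)
  proxied  = URI.parse(proxy_url)
  original = URI.parse(original_domain)

  # proxy_path ends with a slash; slicing from its last character keeps
  # that slash as the leading slash of the original path.
  original.path     = proxied.path[(proxy_path.size - 1)..-1]
  original.query    = proxied.query
  original.fragment = proxied.fragment
  original.to_s
end

unproxified = unproxify_sketch('http://localhost:9292/proxy/sets.html?lang=en#top',
                               '/proxy/', 'http://magiccards.info')
puts unproxified
```

Only the scheme, host, port and the proxy prefix of the path change; the rest of the path, the query and the fragment are carried over.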

valid_path_for_proxy?(absolute_path)

Check whether the absolute path begins with the proxy_path and is followed by an element of original_paths.

# File lib/pretty_proxy.rb, line 415
def valid_path_for_proxy?(absolute_path)
  return false unless absolute_path.start_with?(@proxy_path)

  path_without_proxy_prefix = absolute_path[(@proxy_path.size-1)..-1]

  @original_paths.any? do | original_path |
    # if we don't test this '/about' and '/about_us' will match
    if original_path.end_with? '/'
      path_without_proxy_prefix.start_with? original_path
    else
      path_without_proxy_prefix == original_path ||
        path_without_proxy_prefix.start_with?("#{original_path}/")
    end
  end
end
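The matching rule, including the '/about' versus '/about_us' case mentioned in the comment, can be tried in isolation. `valid_path?` below is a hypothetical standalone copy of the check, with the instance state (@proxy_path, @original_paths) passed in as arguments; the paths are made-up values.

```ruby
def valid_path?(absolute_path, proxy_path, original_paths)
  return false unless absolute_path.start_with?(proxy_path)

  # Drop the proxy prefix but keep its trailing slash as the leading slash.
  rest = absolute_path[(proxy_path.size - 1)..-1]

  original_paths.any? do |original_path|
    if original_path.end_with?('/')
      rest.start_with?(original_path)
    else
      # An exact match, or a match followed by a path separator.
      rest == original_path || rest.start_with?("#{original_path}/")
    end
  end
end

paths = ['/about', '/cards/']

exact     = valid_path?('/proxy/about', '/proxy/', paths)        # equals '/about'
nested    = valid_path?('/proxy/about/team', '/proxy/', paths)   # below '/about'
lookalike = valid_path?('/proxy/about_us', '/proxy/', paths)     # only shares a prefix
directory = valid_path?('/proxy/cards/x.html', '/proxy/', paths) # below '/cards/'
outside   = valid_path?('/other/about', '/proxy/', paths)        # not under the proxy path
```

Only the lookalike and the path outside the proxy prefix are rejected.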

Private Instance Methods

proxify_uri(absolute_uri, proxy_site)

Private API. Don’t use this method.

# File lib/pretty_proxy.rb, line 439
def proxify_uri(absolute_uri, proxy_site)
  uri = absolute_uri.clone

  uri.site = proxy_site.site
  uri.path = @proxy_path[0..-2] + uri.path

  uri
end