class MlmmjArchiver::Archiver
Archiver class. Point it to a target directory under which you want to place your web archive, add some MLs to process, and start the process via archive!. You have some influence over the (temporary) MHonArc RC file used by passing arguments to ::new.
Note that archiving for the web is a two-step process. First, the mails in mlmmj’s archive folder need to be split up into a directory structure that allows processing them month by month instead of all at once, because this gives the web archive a clearer overview. In the second step, all these month directories are passed to mhonarc, which converts them to HTML and stores them in the final directory.
Constants
- ARCHIVE_DIR
Path relative to ML root containing the mails.
- CONTROL_FILE
Path relative to ML root containing the file that requests the web archiving.
- MHONARC
Path to the mhonarc executable.
- MRC_DEFAULTS
Default values for the MHonArc RC file.
- MRC_TEMPLATE
Template for generating the temporary MHonArc RC file.
Public Class Methods
Create a new Archiver that stores its HTML mails below the given target directory. rc_args allows customizing the MHonArc RC file used. It is a hash that takes the following keys (default values in parentheses):
- header (“<p>ML archive</p>”)
-
HTML header to prepend to every page. $IDXTITLE$ is replaced by the title of the respective index.
- tlevels (8)
-
Number of levels to nest threads before flattening.
- archiveadmin (postmaster@example.org)
-
E-Mail address of the archive administrator.
- checknoarchive (true)
-
If set, adds <CHECKNOARCHIVE> to the rc file. Otherwise adds <NOCHECKNOARCHIVE>.
- searchtarget (nil)
-
If this is set, displays a link called “search” next to the index links that links to the location specified here.
- stylefile (“/archive.css”)
-
CSS style file to reference from the outputted HTML pages.
- mhonarc (“/usr/bin/mhonarc”)
-
Path to the mhonarc executable to create the archive.
- cachedir (nil)
-
Path to a directory where the sorted mails are stored. Setting this to permanent storage will speed up the archiving process on large MLs. If it is not set, a temporary directory is created and removed again at exit.
# File lib/mlmmj-archiver/archiver.rb, line 77
def initialize(target, rc_args = {})
  @target_dir     = Pathname.new(target).expand_path
  @mailinglists   = []
  @mutex          = Mutex.new
  @rc_args        = MRC_DEFAULTS.merge(rc_args)
  @debug          = false
  @inotify_thread = nil
  @mhonarc        = rc_args[:mhonarc] || MHONARC

  if rc_args[:cachedir]
    @sorted_target = Pathname.new(rc_args[:cachedir]).expand_path
  else
    @sorted_target = Pathname.new(Dir.mktmpdir)
    at_exit{FileUtils.rm_rf(@sorted_target)}
  end
end
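The effect of MRC_DEFAULTS.merge(rc_args) can be illustrated in isolation. The following is a minimal sketch, assuming a defaults hash that mirrors the values documented above; EXAMPLE_DEFAULTS and effective_rc_args are illustrative names, not part of the library, and the real MRC_DEFAULTS may contain different entries.

```ruby
# Hypothetical stand-in for MRC_DEFAULTS, built from the documented defaults.
EXAMPLE_DEFAULTS = {
  header: "<p>ML archive</p>",
  tlevels: 8,
  archiveadmin: "postmaster@example.org",
  checknoarchive: true,
  searchtarget: nil,
  stylefile: "/archive.css"
}.freeze

# Hash#merge returns a new hash; keys given by the caller win over defaults.
def effective_rc_args(rc_args = {})
  EXAMPLE_DEFAULTS.merge(rc_args)
end

args = effective_rc_args(tlevels: 4, searchtarget: "/search")
puts args[:tlevels]   # caller-supplied value
puts args[:stylefile] # untouched default
```

Because merge does not modify its receiver, the defaults stay intact for any further instance.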
Public Instance Methods
Like add_ml, but returns self for method chaining.
# File lib/mlmmj-archiver/archiver.rb, line 114
def <<(path)
  add_ml(path)
  self
end
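The difference between the two methods is only the return value. The sketch below uses a reduced stand-in class (ExampleCollector, a hypothetical name) and made-up list paths to show why returning self makes chaining possible.

```ruby
require "pathname"

# Minimal stand-in for the archiver, reduced to the two methods discussed here.
class ExampleCollector
  attr_reader :mailinglists

  def initialize
    @mailinglists = []
  end

  # Returns the result of Array#push, i.e. the list array itself.
  def add_ml(path)
    @mailinglists.push(Pathname.new(path).expand_path)
  end

  # Returns self, so several calls can be chained.
  def <<(path)
    add_ml(path)
    self
  end
end

collector = ExampleCollector.new
# The paths are hypothetical examples.
collector << "/var/spool/mlmmj/list-a" << "/var/spool/mlmmj/list-b"
puts collector.mailinglists.size
```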
Add an mlmmj ML directory to process.
# File lib/mlmmj-archiver/archiver.rb, line 106
def add_ml(path)
  dir = Pathname.new(path).expand_path
  debug("Adding ML directory: #{dir}")
  @mailinglists.push(dir)
end
Process all the mails in all the directories. MLs that do not contain the CONTROL_FILE are skipped.
# File lib/mlmmj-archiver/archiver.rb, line 169
def archive!
  @mutex.synchronize do
    rcpath = generate_rcfile

    @mailinglists.each do |path|
      control_file = path + CONTROL_FILE
      next unless control_file.file?

      process_ml(@sorted_target + path.basename, @target_dir + path.basename, rcpath)
    end
  end
end
Enable/disable debugging output.
# File lib/mlmmj-archiver/archiver.rb, line 96
def debug_mode=(val)
  @debug = val
end
True if debugging output is enabled, see debug_mode=.
# File lib/mlmmj-archiver/archiver.rb, line 101
def debug_mode?
  @debug
end
Iterates over all mailing lists and copies new messages into the intermediate month directory structure.
# File lib/mlmmj-archiver/archiver.rb, line 157
def preprocess_mlmmj_mails!
  @sorted_target.mkpath unless @sorted_target.directory?

  @mutex.synchronize do
    @mailinglists.each do |path|
      hsh = collect_messages(path + ARCHIVE_DIR)
      split_messages_into_month_dirs(hsh, @sorted_target + path.basename) # path.basename is the ML name
    end
  end
end
Search the given mailing list for a specific search term. The return value is an array of paths relative to the HTML directory of the given ML. query may be a regular expression or simply a string to check for; strings are compared case-insensitively.
# File lib/mlmmj-archiver/archiver.rb, line 186
def search(mlname, query)
  html_dir = @target_dir + mlname
  results = []

  html_dir.find do |path|
    next unless path.file?
    next unless path.basename.to_s =~ /^\d+\.html$/

    # Check if the file content matches
    content = File.read(path)
    if query.kind_of?(Regexp)
      result = content =~ query
    else
      result = content.downcase.include?(query.downcase)
    end

    # If it did, remember it for returning
    results << path.relative_path_from(html_dir) if result
  end

  results
end
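The matching rules can be tried out without a real archive. The following standalone sketch re-implements the same idea under a hypothetical name (search_html) against a throwaway directory; the file names and contents are invented for the demonstration, and the results are sorted only to make the output deterministic.

```ruby
require "pathname"
require "tmpdir"

# Returns paths (relative to html_dir) of numbered HTML files matching query.
# Strings are matched case-insensitively; Regexps are applied as given.
def search_html(html_dir, query)
  html_dir = Pathname.new(html_dir)
  results = []

  html_dir.find do |path|
    next unless path.file?
    next unless path.basename.to_s =~ /\A\d+\.html\z/

    content = File.read(path)
    matched = if query.is_a?(Regexp)
                content =~ query
              else
                content.downcase.include?(query.downcase)
              end

    results << path.relative_path_from(html_dir) if matched
  end

  results.sort
end

hits_string = nil
hits_regexp = nil
Dir.mktmpdir do |dir|
  root = Pathname.new(dir)
  (root + "2013/01").mkpath
  (root + "2013/01/1.html").write("<p>Hello World</p>")
  (root + "2013/01/2.html").write("<p>Nothing here</p>")
  (root + "2013/01/index.html").write("<p>hello index</p>") # skipped: name is not numeric

  hits_string = search_html(root, "hello").map(&:to_s)
  hits_regexp = search_html(root, /Noth\w+/).map(&:to_s)
end

p hits_string
p hits_regexp
```

Note that a Regexp query is case-sensitive unless it carries the i flag, whereas a String query never is.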
Terminate the watching thread started by watch_mlmmj_mails!.
# File lib/mlmmj-archiver/archiver.rb, line 151
def stop_watching_mlmmj_mails!
  @inotify_thread.terminate
end
The more elegant variant of preprocess_mlmmj_mails!. Instead of polling all mails and testing whether they are there, use inotify to have Linux notify us when a new file is added to the ML directory. For this method to work, rb-inotify must be available on your system (otherwise you get a NotImplementedError).
# File lib/mlmmj-archiver/archiver.rb, line 124
def watch_mlmmj_mails!
  raise(NotImplementedError, "This is only possible with rb-inotify!") unless defined?(INotify)

  @inotifier = INotify::Notifier.new
  @mailinglists.each do |path|
    archive_dir = path + ARCHIVE_DIR

    @inotifier.watch(archive_dir.to_s, :create) do |event|
      next unless File.file?(event.absolute_name)
      next unless event.name =~ /^\d+$/

      debug "Got a new mail: #{event.name}"
      sleep 2 # Wait for the file to be fully written

      @mutex.synchronize do
        mail = Mail.read(event.absolute_name)
        FileUtils.cp(event.absolute_name, @sorted_target + path.basename + mail.date.year.to_s + mail.date.month.to_s)
      end
    end
  end

  debug "Watching MLs via inotify."
  @inotify_thread = Thread.new{@inotifier.run}
end
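The copy step in the watcher writes to a year/month directory below the sorted target. FileUtils.cp treats a destination that is not an existing directory as a file name, so whether the copy lands inside a month directory depends on that directory already existing, for instance from an earlier preprocess_mlmmj_mails! run. The sketch below shows one way to make such a copy independent of that; copy_into_month_dir is a hypothetical helper, not part of the library.

```ruby
require "fileutils"
require "pathname"
require "tmpdir"

# Copies mail_path into sorted_root/<year>/<month>/, creating the
# directory first so that FileUtils.cp always sees a directory target.
def copy_into_month_dir(mail_path, sorted_root, year, month)
  month_dir = Pathname.new(sorted_root) + year.to_s + month.to_s
  month_dir.mkpath
  FileUtils.cp(mail_path.to_s, month_dir.to_s)
  month_dir + File.basename(mail_path.to_s)
end

copied_ok = false
Dir.mktmpdir do |dir|
  root = Pathname.new(dir)
  mail = root + "42" # invented message file for the demonstration
  mail.write("Subject: test\n\nbody\n")

  dest = copy_into_month_dir(mail, root + "sorted", 2013, 7)
  copied_ok = dest.file? && dest.parent.directory?
  puts dest.relative_path_from(root)
end
```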
Private Instance Methods
Collect the mails in the given directory in a nested hash like this:
{year1 => {month1 => [...], month2 => [...]}, year2 => {...}}
# File lib/mlmmj-archiver/archiver.rb, line 264
def collect_messages(mail_dir)
  hsh = Hash.new{|hsh, k| hsh[k] = Hash.new{|hsh2, k2| hsh2[k2] = []}}
  debug "Collecting messages in #{mail_dir}"

  mail_dir.each_child do |path|
    next unless path.file?

    mail = Mail.read(path)
    hsh[mail.date.year][mail.date.month] << path
  end

  hsh
end
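The nested hash relies on Hash.new with a default block at both levels. A small sketch of that pattern on its own, with strings standing in for the message paths (new_message_index is an illustrative name):

```ruby
# Two-level auto-vivifying hash: a missing year gets a month hash,
# a missing month gets an empty array.
def new_message_index
  Hash.new { |years, year| years[year] = Hash.new { |months, month| months[month] = [] } }
end

index = new_message_index
index[2013][1]  << "msg1"
index[2013][2]  << "msg1"
index[2013][2]  << "msg2"
index[2014][12] << "msg9"

p index
```

One property worth keeping in mind: merely reading a missing key creates it, so use key? when checking for a year or month without wanting to add it.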
Prints str to stdout via puts if debug_mode? is true.
# File lib/mlmmj-archiver/archiver.rb, line 322
def debug(str)
  puts str if debug_mode?
end
- header (“<p>ML archive</p>”)
-
HTML header to prepend to every page. $IDXTITLE$ is replaced by the title of the respective index.
- tlevels (8)
-
Number of levels to nest threads before flattening.
- archiveadmin (postmaster@example.org)
-
E-Mail address of the archive administrator.
- checknoarchive (true)
-
If set, adds <CHECKNOARCHIVE> to the rc file. Otherwise adds <NOCHECKNOARCHIVE>.
- searchtarget (“/search”)
-
Target for the “search” link.
- stylefile (“/archive.css”)
-
CSS style file to reference from the outputted HTML pages.

Generate an RC file for MHonArc and return the path to it.
# File lib/mlmmj-archiver/archiver.rb, line 225
def generate_rcfile
  tempfile = Tempfile.new("archive-mhonarc")
  rcpath = tempfile.path
  at_exit{File.delete(rcpath)}

  debug "Generating MhonArc RC file at #{rcpath}"

  header         = @rc_args[:header]
  tlevels        = @rc_args[:tlevels]
  archiveadmin   = @rc_args[:archiveadmin]
  checknoarchive = @rc_args[:checknoarchive] ? "<CHECKNOARCHIVE>" : "<CHECKNOARCHIVE>\n<NOCHECKNOARCHIVE>"
  searchtarget   = @rc_args[:searchtarget]
  stylefile      = @rc_args[:stylefile]
  mrc = MRC_TEMPLATE.result(binding)

  tempfile.write(mrc)
  rcpath
end
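The call MRC_TEMPLATE.result(binding) suggests that the template is an ERB object which reads the local variables of the method. The sketch below illustrates that mechanism under this assumption; EXAMPLE_TEMPLATE, its placeholder text and write_example_rcfile are invented for the example and do not reproduce the real template. Unlike the listing above, the sketch flushes the file and returns the Tempfile object itself, which is a design choice of the sketch rather than a description of the library.

```ruby
require "erb"
require "tempfile"

# Hypothetical template; the real MRC_TEMPLATE is not shown in this document.
EXAMPLE_TEMPLATE = ERB.new(<<~TEMPLATE)
  admin: <%= archiveadmin %>
  stylesheet: <%= stylefile %>
  <%= checknoarchive %>
TEMPLATE

# Renders the template with local variables exposed through `binding`
# and writes the result to a temporary file. Returns the Tempfile so the
# caller keeps a reference to it while the file is needed.
def write_example_rcfile(rc_args)
  archiveadmin   = rc_args.fetch(:archiveadmin)
  stylefile      = rc_args.fetch(:stylefile)
  checknoarchive = rc_args.fetch(:checknoarchive) ? "<CHECKNOARCHIVE>" : "<NOCHECKNOARCHIVE>"

  tempfile = Tempfile.new("example-rc")
  tempfile.write(EXAMPLE_TEMPLATE.result(binding))
  tempfile.flush # make sure the content is on disk before another process reads it
  tempfile
end

rcfile = write_example_rcfile(
  archiveadmin: "postmaster@example.org",
  stylefile: "/archive.css",
  checknoarchive: false
)
rendered = File.read(rcfile.path)
puts rendered
```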
Run mhonarc over the source directory and place the results in rel_target, which is a path relative to the target passed to ::new. rcpath is the path to an MHonArc RC file to use.
# File lib/mlmmj-archiver/archiver.rb, line 312
def mhonarc(source, rel_target, rcpath)
  target = @target_dir + rel_target
  target.mkpath unless target.directory?

  ary = [@mhonarc.to_s, "-rcfile", rcpath.to_s, "-outdir", target.to_s, "-add", source.to_s]
  debug "Executing: #{ary.inspect}"
  system(*ary)
end
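The command is passed to system as separate arguments rather than as one string. A short sketch of how that argument vector is assembled; mhonarc_command is a hypothetical helper and all paths are made up, while the options are the ones used in the listing above.

```ruby
# Builds the argument vector in the same shape as the method above.
def mhonarc_command(executable, rcpath, outdir, source)
  [executable.to_s, "-rcfile", rcpath.to_s, "-outdir", outdir.to_s, "-add", source.to_s]
end

# All paths below are invented for illustration.
cmd = mhonarc_command(
  "/usr/bin/mhonarc",
  "/tmp/rc file",
  "/srv/www/archive/list/2013/01",
  "/var/cache/sorted/list/2013/1"
)
p cmd
# system(*cmd) would start the executable directly, without a shell,
# so the space in "/tmp/rc file" needs no quoting or escaping.
```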
Process all mails in sorted_mail_dir and output an HTML directory structure in archive_dir. rcpath is the path to an MHonArc RC file to use.
# File lib/mlmmj-archiver/archiver.rb, line 248
def process_ml(sorted_mail_dir, archive_dir, rcpath)
  debug "Processing sorted ML directory #{sorted_mail_dir} ===> #{archive_dir}"

  # Create the target directory
  archive_dir.mkpath unless archive_dir.directory?

  # Let mhonarc process the messages
  sorted_mail_dir.each_child do |yeardir|
    yeardir.each_child do |monthdir|
      mhonarc(monthdir, archive_dir + sprintf("%04d/%02d", yeardir.basename.to_s.to_i, monthdir.basename.to_s.to_i), rcpath)
    end
  end
end
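The sorted directories use unpadded month names, while the HTML output path is zero-padded through sprintf. A brief sketch of that mapping, with padded_month_path as an illustrative helper name:

```ruby
require "pathname"

# Maps unpadded year/month directory names to a zero-padded relative path.
def padded_month_path(year_name, month_name)
  Pathname.new(format("%04d/%02d", year_name.to_s.to_i, month_name.to_s.to_i))
end

puts padded_month_path("2013", "1")  # 2013/01
puts padded_month_path("2013", "12") # 2013/12
```

Zero-padding keeps the month directories in chronological order when they are listed alphabetically.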
Takes the result of collect_messages and writes the messages out to a directory structure under target like this:
2013/
  1/
    msg1
  2/
    msg1
    msg2
...
Already existing messages will not be copied again.
# File lib/mlmmj-archiver/archiver.rb, line 289
def split_messages_into_month_dirs(hsh, target)
  debug "Splitting into year-month directories under #{target}"
  target.mkpath unless target.directory?

  hsh.each_pair do |year, months|
    year_dir = target + year.to_s
    year_dir.mkdir unless year_dir.directory?

    months.each do |month, messages|
      month_dir = year_dir + month.to_s
      month_dir.mkdir unless month_dir.directory?

      messages.each do |msgpath|
        FileUtils.cp(msgpath, month_dir) unless month_dir.join(msgpath.basename).file?
      end
    end
  end
end
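The skip-if-present rule is what makes repeated runs cheap. The following standalone sketch reproduces the idea with a throwaway directory and a hand-built index; split_into_month_dirs and the sample files are invented for the demonstration, and the helper additionally counts the copies so the effect of a second run becomes visible.

```ruby
require "fileutils"
require "pathname"
require "tmpdir"

# Copies every message in index ({year => {month => [paths]}}) to
# target/<year>/<month>/, skipping files that already exist there.
# Returns the number of files actually copied.
def split_into_month_dirs(index, target)
  copied = 0
  index.each_pair do |year, months|
    months.each_pair do |month, messages|
      month_dir = target + year.to_s + month.to_s
      month_dir.mkpath
      messages.each do |msgpath|
        next if (month_dir + msgpath.basename).file?

        FileUtils.cp(msgpath.to_s, month_dir.to_s)
        copied += 1
      end
    end
  end
  copied
end

first_run = second_run = nil
layout = nil
Dir.mktmpdir do |dir|
  root = Pathname.new(dir)
  source = root + "archive"
  source.mkpath
  %w[1 2 3].each { |name| (source + name).write("mail #{name}") }

  index = { 2013 => { 1 => [source + "1"], 2 => [source + "2", source + "3"] } }
  target = root + "sorted"

  first_run  = split_into_month_dirs(index, target)
  second_run = split_into_month_dirs(index, target) # nothing left to copy
  layout = Dir.glob(File.join(target.to_s, "*", "*", "*"))
              .map { |p| Pathname.new(p).relative_path_from(target).to_s }
              .sort
end

p first_run, second_run, layout
```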