bio-lazyblastxml

Provides a libxml-based lazy parser for reading through large blast xml files with a small memory footprint.

Requirements

If you’re on ubuntu, libxml is most easily installed with:

sudo apt-get install libxml-ruby

And then install the gem with

gem install bio-lazyblastxml
# If you need to be root to install gems, try:
sudo gem install bio-lazyblastxml

Overview

The parsers uses a LibXML::Reader instance to read the XML file one line at a time, keeping very little in memory. You can think of the parser as having a very short memory, only able to recall the object that it happens to be looking at right now. The parser only runs through the document once unless Bio::LazyBlast#rewind is called.

Example Usage

Each report is an enumerable that yields iterations. Each iteration is an enumerable that yields hits (if there are any hits).

require 'bio-lazyblastxml'

# Generate your new report object
report = Bio::LazyBlast::Report.new('my_huge_blastfile.xml')

# How many hits does each query have?
report.each_iteration do |iteration|
  puts [iteration.query_def, iteration.count].join("\t")
end

Each hit is an enumerable that yields hsps:

require 'bio-lazyblastxml'

# Generate your new report object
report = Bio::LazyBlast::Report.new('my_huge_blastfile.xml')

report.each_iteration do |iteration|
  iteration.each_hit do |hit|
    # Sum up the lengths of all the hsps
    hsp_length_sum = hit.inject(0){|count,hsp| count += hsp.align_len}
    puts [iteration.query_def, hit.definition, hsp_length_sum].join("\t")
  end
end

Contributing to bio-lazyblastxml

Copyright © 2011 robsyme. See LICENSE.txt for further details.