class Bio::Big::OrfEmitter
Public Class Methods
6-frame ORF emitter for (growing) sequences from the emit object. Type can be a symbol or a function. Symbols are

:stopstop    All sequences from STOP to STOP codon
:startstop   All sequences from START to STOP codon

Size control is in nucleotides.
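The two symbols can be illustrated with a toy, single-frame sketch. This is not the bigbio implementation: the helper names (codons, stopstop_orfs, startstop_orfs) are hypothetical, the standard stop codons TAA/TAG/TGA and start codon ATG are assumed, and the delimiting STOP codons are left out of the returned strings for simplicity.

```ruby
# Toy illustration only; this is not the bigbio implementation.
STOP_CODONS = %w[TAA TAG TGA].freeze
START_CODON = "ATG".freeze

# Split a nucleotide string into whole codons, starting at offset (0..2).
def codons(seq, offset = 0)
  seq[offset..-1].to_s.upcase.scan(/.{3}/)
end

# Codon runs that lie between two consecutive STOP codons.
# Runs touching either end of the sequence are skipped, and the
# delimiting STOP codons are left out of the result (an assumption).
def stopstop_orfs(seq, offset = 0, min_size = 0)
  list = codons(seq, offset)
  stops = list.each_index.select { |i| STOP_CODONS.include?(list[i]) }
  orfs = stops.each_cons(2).map { |a, b| list[(a + 1)...b].join }
  orfs.reject { |orf| orf.empty? || orf.size < min_size }
end

# Same runs, trimmed to begin at the first START codon; runs without
# a START codon are dropped.
def startstop_orfs(seq, offset = 0, min_size = 0)
  trimmed = stopstop_orfs(seq, offset).map do |orf|
    list = orf.scan(/.{3}/)
    first = list.index(START_CODON)
    first ? list[first..-1].join : nil
  end
  trimmed.compact.reject { |orf| orf.size < min_size }
end

seq = "TAAATGGCCTGACCCTAG"
p stopstop_orfs(seq)        # STOP-to-STOP runs at frame offset 0
p startstop_orfs(seq)       # only the runs that contain ATG
p stopstop_orfs(seq, 0, 6)  # min_size is counted in nucleotides
```

The size filter compares string lengths, which matches the note that size control is in nucleotides rather than codons or amino acids.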
The difference with most other getorf implementations, including EMBOSS, is that:

1) ORFs get emitted during the reading of large continuous sequences, e.g. chromosomes.
2) This allows processing in parallel to IO, even on a single CPU
3) ORFs come with splitting CODONs
4) Bordering ORFs are not included (by default), which is not easy to achieve with EMBOSS getorf
I have carefully designed this code so that it is easy to reason about the steps and to prove them correct. It is easy to understand, and therefore to parallelize correctly. Some features are:

5) Emit size does not matter for correctness
6) Reverse strands are positioned according to GFF3 on the parent contig
# File lib/bigbio/db/emitters/orf_emitter.rb, line 235
def initialize emit, type, min_size=30, max_size=nil
  @em = emit
  @type = type
  @min_size = min_size
  @max_size = max_size
end
Public Instance Methods
Concatenates sequences from the emitter and yields the contained ORFs for every resulting frame (-3..-1, 1..3). Note that for the reverse frames the resulting sequence is complemented! Translate these sequences in a forward frame only.

First the :head part gets emitted, then the :mid parts, closed by the :tail part.
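The :head, :mid, :tail ordering can be sketched with a stand-in emitter. ChunkEmitter below is hypothetical and not part of bigbio; it only mimics the four block arguments (part, index, tag, seq) that emit_seq receives from its emitter. It assumes that each yield carries one chunk, that index is a record counter, and that the last chunk is always tagged :tail.

```ruby
# Hypothetical stand-in for the `emit` constructor argument; not part of bigbio.
class ChunkEmitter
  def initialize(tag, seq, chunk_size)
    @tag = tag
    @seq = seq
    @chunk_size = chunk_size
  end

  # Yields part, index, tag, seq for each chunk of the sequence.
  # The last chunk is tagged :tail, the first (if not last) :head,
  # and everything in between :mid.
  def emit_seq
    chunks = @seq.scan(/.{1,#{@chunk_size}}/)
    last = chunks.size - 1
    chunks.each_with_index do |chunk, i|
      part = if i == last
               :tail
             elsif i.zero?
               :head
             else
               :mid
             end
      yield part, 0, @tag, chunk # index 0: a single record (assumption)
    end
  end
end

ChunkEmitter.new("chr1", "ACGTACGTAC", 4).emit_seq do |part, index, tag, seq|
  p [part, index, tag, seq]
end
```

Any object with a compatible emit_seq method could play this role, which is what makes it possible to process ORFs while the sequence is still being read.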
# File lib/bigbio/db/emitters/orf_emitter.rb, line 249
def emit_seq
  @em.emit_seq do | part, index, tag, seq |
    # p [part, seq]
    # case part do
    #   when :head
    #   when :mid
    #   when :tail
    # end
    emit_forward(part, index, tag, seq) { |*x| yield(*x) }
    emit_reverse(part, index, tag, seq) { |*x| yield(*x) }
  end
end
Private Instance Methods
# File lib/bigbio/db/emitters/orf_emitter.rb, line 264
def emit_forward(part, index, tag, seq)
  # Yield frame 1..3
  (1..3).each do | frame |
    fr = ShortFrameState.new seq[frame-1..-1],0,0
    orfs = fr.get_stopstop_orfs
    orfs.each do | orf |
      yield frame, index, tag, orf.track_ntseq_pos, orf.to_seq
    end
  end
end
# File lib/bigbio/db/emitters/orf_emitter.rb, line 275
def emit_reverse(part, index, tag, seq)
  # Yield frame -1..-3
  ntseq = Bio::Sequence::NA.new(seq)
  rev_seq = ntseq.complement
  (1..3).each do | frame |
    fr = ShortReversedFrameState.new rev_seq[0..rev_seq.size-frame],0,0
    orfs = fr.get_stopstop_orfs
    orfs.each do | orf |
      yield(-frame,index,tag,orf.track_ntseq_pos,orf.to_seq)
    end
  end
end
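The frame slicing used by the two private methods can be reproduced on plain strings. The sketch below only mirrors the slice expressions seq[frame-1..-1] and rev_seq[0..rev_seq.size-frame] from the code above; the helper names are hypothetical, and the argument of reverse_frame_slices is an arbitrary string standing in for the result of Bio::Sequence::NA#complement.

```ruby
# Substring handed to each forward frame: frame n skips the first n-1 bases.
def forward_frame_slices(seq)
  (1..3).map { |frame| [frame, seq[frame - 1..-1]] }.to_h
end

# Substring handed to each reverse frame: frame -n drops the last n-1 bases
# of the already complemented sequence.
def reverse_frame_slices(rev_seq)
  (1..3).map { |frame| [-frame, rev_seq[0..rev_seq.size - frame]] }.to_h
end

p forward_frame_slices("ATGGCCTGA")
p reverse_frame_slices("TCAGGCCAT") # stand-in for a complemented sequence
```

Forward frames are shifted at the start of the sequence and reverse frames at its end, so each of the six frames sees a different codon phase of the same stretch of DNA.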