module Bio::BioAlignment::MaskIslands

Public Instance Methods

mark_islands() click to toggle source

Drop all 'islands' in a sequence with low consensus, i.e. islands that show a gap larger than 'min_gap_size' (default 3) on both sides, and are shorter than 'max_island_size' (default 30). An island larger than 30 elements is arguably no longer an island, and low consensus stretches may be loops - it is up to the alignment procedure to get that right. We also allow for micro deletions inside an alignment (1 or 2 elements). The island consensus is calculated by column. If more than 50% of the island shows consensus, the island is retained. Consensus for each element is defined as the number of matches in the column (default 1).

# File lib/bio-alignment/edit/mask_islands.rb, line 26
def mark_islands
  logger = Bio::Log::LoggerPlus['bio-alignment']
  count_marked_islands = 0
  count_marked_elements = 0

  # Traverse each row in the alignment
  marked_aln = mark_row_elements { |row,rownum|
    # for each element create a state object, and find unique elements (i.e. consensus) across a column
    row.each_with_index do |e,colnum|
      e.state = IslandElementState.new
      column = columns[colnum]
      e.state.unique = (column.count{|e2| !e2.gap? and e2 == e } == 1)
      # p [e,e.state,e.state.unique]
    end
    # at this stage all elements of the row have been set to unique,
    # which are unique. Now group elements into islands (split on gap)
    # and mask
    gap = []
    island = []
    in_island = true
    row.each do |e|
      if not in_island
        if e.gap?
          gap << e
        else
          island << e
          in_island = true
          gap = []
        end
      else # in_island
        if not e.gap?
          island << e
          gap = []
        else
          gap << e
          if gap.length > 2
            in_island = false
            ci, ce = mark_island(island)
            count_marked_islands += ci
            count_marked_elements += ce
            # print_island(island)
            island = []
          end
        end
      end
    end
    if in_island
      ci, ce = mark_island(island)
      count_marked_islands += ci
      count_marked_elements += ce
      # print_island(island) if island.length > 0
    end
    row # always return the row to mark_row_elements
  }
  logger.info("#{count_marked_islands} islands marked (#{count_marked_elements} elements)")
  return marked_aln
end

Private Instance Methods

mark_island(island) click to toggle source

Check a list of elements that form an island. First count the number of elements marked as being unique. If the island is more than 50% unique (i.e. less than 50% consensus with the rest if the alignment) all island elements are marked for masking. Returns the number of islands and elements marked as a tuple

# File lib/bio-alignment/edit/mask_islands.rb, line 91
def mark_island island
  return 0,0 if island.length < 2
  unique = 0
  count_islands = 0
  count_elements = 0
  island.each do |e|
    unique += 1 if e.state.unique == true
  end
  consensus = 1.0 - unique.to_f / island.length
  # p unique, consensus
  if consensus < 0.5
    island.each do |e|
      e.state.mask!
    end
    count_islands += 1
    count_elements += island.size
  end
  return count_islands, count_elements
end
print_island(island) click to toggle source