class SequenceServer::Doctor
Doctor detects inconsistencies likely to cause problems with SequenceServer operation.
Constants
- AVOID_ID_REGEX
- ERROR_NUMERIC_IDS
- ERROR_PARSE_SEQIDS
- ERROR_PROBLEMATIC_IDS
Attributes
Public Class Methods
Retrieve sequence ids (specified by %i) from all databases. Using accession numbers is problematic for several reasons.
# File lib/sequenceserver/doctor.rb, line 31
def all_sequence_ids(ignore)
  Database.map do |db|
    next if ignore.include? db

    out = `blastdbcmd -entry all -db #{db.name} -outfmt "%i" 2> /dev/null`
    { db: db, seqids: out.to_s.split }
  end.compact
end
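The shape of each returned record can be sketched without running blastdbcmd. The example below is only an illustration: the output string, the sequence ids, and the database names are made up, and plain strings stand in for database objects.

```ruby
# Simulated blastdbcmd output (hypothetical ids, one per line).
# In the method above this string comes from the backtick call.
simulated_output = "SI2.2.0_06267\nSI2.2.0_13722\n"

# String#split with no arguments splits on whitespace, so each id
# becomes one array element.
record = { db: 'example_db', seqids: simulated_output.to_s.split }

# An empty output string yields an empty list of ids.
empty_record = { db: 'empty_db', seqids: ''.to_s.split }

p record
p empty_record
```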
Pretty print database list.
# File lib/sequenceserver/doctor.rb, line 54
def bullet_list(values)
  list = ''
  values.each do |value|
    list << " - #{value}\n"
  end
  list
end
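To show the resulting format, the sketch below copies the method body into a standalone method and calls it with two made-up database names. Strings stand in for database objects, which is an assumption, and `+''` is used instead of `''` only so the copy also runs when string literals are frozen.

```ruby
# Standalone copy of the method body shown above, for illustration only.
def bullet_list(values)
  list = +'' # unfrozen empty string
  values.each do |value|
    list << " - #{value}\n"
  end
  list
end

# The database names are hypothetical.
listing = bullet_list(['proteins.fa', 'transcripts.fa'])
puts listing
```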
FASTA files formatted without the -parse_seqids option do not support fetching sequence ids with blastdbcmd's ‘%i’ identifier. In such cases blastdbcmd returns an array of ‘N/A’ values, which is what this method checks for.
# File lib/sequenceserver/doctor.rb, line 47
def inspect_parse_seqids(seqids)
  seqids.map do |sq|
    sq[:db] if sq[:seqids].include? 'N/A'
  end.compact
end
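A minimal sketch of the ‘N/A’ check, using a standalone copy of the method and hypothetical records in the shape produced by all_sequence_ids. The database names and ids are invented, and strings stand in for database objects.

```ruby
# Standalone copy of the class method shown above.
def inspect_parse_seqids(seqids)
  seqids.map do |sq|
    sq[:db] if sq[:seqids].include? 'N/A'
  end.compact
end

# Hypothetical records; the second one mimics a database formatted
# without -parse_seqids, where every id comes back as 'N/A'.
records = [
  { db: 'with_parse_seqids.fa', seqids: %w[SI2.2.0_06267 SI2.2.0_13722] },
  { db: 'without_parse_seqids.fa', seqids: %w[N/A N/A N/A] }
]

flagged = inspect_parse_seqids(records)
p flagged
```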
Returns an array of the database objects that have at least one sequence id satisfying the block passed to the method.
# File lib/sequenceserver/doctor.rb, line 23
def inspect_seqids(seqids, &block)
  seqids.map do |sq|
    sq[:db] unless sq[:seqids].select(&block).empty?
  end.compact
end
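The block acts as a per-id predicate. The sketch below uses a standalone copy of the method with an arbitrary example predicate (ids longer than ten characters). The records and the predicate are assumptions chosen only to show which databases are returned.

```ruby
# Standalone copy of the class method shown above.
def inspect_seqids(seqids, &block)
  seqids.map do |sq|
    sq[:db] unless sq[:seqids].select(&block).empty?
  end.compact
end

# Hypothetical records; strings stand in for database objects.
records = [
  { db: 'short_ids.fa', seqids: %w[a1 b2] },
  { db: 'long_ids.fa', seqids: %w[a1 sequence_with_long_name] }
]

# A database is returned if at least one of its ids satisfies the block.
long_id_dbs = inspect_seqids(records) { |id| id.length > 10 }
p long_id_dbs
```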
# File lib/sequenceserver/doctor.rb, line 98
def initialize
  @ignore = []
  @all_seqids = Doctor.all_sequence_ids(@ignore)
end
Print diagnostic error messages according to the type of error.
# File lib/sequenceserver/doctor.rb, line 64
def show_message(error, values)
  return if values.empty?

  case error
  when ERROR_PARSE_SEQIDS
    puts <<~MSG
      *** Doctor has found improperly formatted database:
      #{bullet_list(values)}
      Please reformat your databases with -parse_seqids switch (or use
      sequenceserver -m) for using SequenceServer as the current format
      may cause problems.

      These databases are ignored in further checks.
    MSG

  when ERROR_NUMERIC_IDS
    puts <<~MSG
      *** Doctor has found databases with numeric sequence ids:
      #{bullet_list(values)}
      Note that this may cause problems with sequence retrieval.
    MSG

  when ERROR_PROBLEMATIC_IDS
    puts <<~MSG
      *** Doctor has found databases with problematic sequence ids:
      #{bullet_list(values)}
      This causes some sequence to contain extraneous words like `gnl|`
      appended to their id string.
    MSG
  end
end
Public Instance Methods
Warn users about sequence identifiers of the format abc|def, because BLAST+ then prepends gnl (for general) to the database identifiers. Only two identifier types need to be excluded when searching for this format: bbs|number and gi|number. Note that while sequence ids could have been arbitrary, using -parse_seqids reduces our search space substantially.
# File lib/sequenceserver/doctor.rb, line 147
def check_id_format
  selector = proc { |id| id.match(AVOID_ID_REGEX) }

  Doctor.show_message(ERROR_PROBLEMATIC_IDS,
                      Doctor.inspect_seqids(@all_seqids, &selector))
end
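The value of AVOID_ID_REGEX is not shown on this page. The sketch below uses a hypothetical pattern, written only to illustrate the rule described above (flag ids of the form abc|def, but not gi|number or bbs|number). It should not be read as the library's actual constant, and the sample ids are made up.

```ruby
# HYPOTHETICAL pattern, not the real AVOID_ID_REGEX: matches ids that start
# with word characters followed by '|' and more word characters, unless the
# id starts with 'gi|' or 'bbs|'.
HYPOTHETICAL_AVOID_ID_REGEX = /\A(?!gi\||bbs\|)\w+\|\w+/

selector = proc { |id| id.match(HYPOTHETICAL_AVOID_ID_REGEX) }

# Made-up sample ids.
ids = ['abc|def', 'gi|12345', 'bbs|123', 'SI2.2.0_06267']
problematic = ids.select(&selector)
p problematic
```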
Check for the presence of numeric sequence ids within a database.
# File lib/sequenceserver/doctor.rb, line 133
def check_numeric_ids
  selector = proc { |id| !id.to_i.zero? }

  Doctor.show_message(ERROR_NUMERIC_IDS,
                      Doctor.inspect_seqids(@all_seqids, &selector))
end
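The selector relies on String#to_i, which parses leading digits and returns 0 when there are none. The sketch below applies the same selector to a few made-up ids to show which ones it flags. Note that under this test an id such as 9abc is flagged, while the id 0 is not.

```ruby
# Same selector as in check_numeric_ids.
selector = proc { |id| !id.to_i.zero? }

# Made-up sample ids.
ids = %w[12345 seq1 9abc 0 abc]

# to_i results: 12345, 0, 9, 0, 0
flagged = ids.select(&selector)
p flagged
```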
Obtain files that aren’t formatted with -parse_seqids and add them to the ignore list.
# File lib/sequenceserver/doctor.rb, line 125
def check_parse_seqids
  without_parse_seqids = Doctor.inspect_parse_seqids(@all_seqids)
  Doctor.show_message(ERROR_PARSE_SEQIDS, without_parse_seqids)

  @ignore.concat(without_parse_seqids)
end
# File lib/sequenceserver/doctor.rb, line 105
def diagnose
  puts "\n1/3 Inspecting databases for proper -parse_seqids formatting.."
  check_parse_seqids
  remove_invalid_databases

  puts "\n2/3 Inspecting databases for numeric sequence ids.."
  check_numeric_ids

  puts "\n3/3 Inspecting databases for problematic sequence ids.."
  check_id_format
end
Remove entries that are in the ignore list or not formatted with the -parse_seqids option.
# File lib/sequenceserver/doctor.rb, line 119
def remove_invalid_databases
  @all_seqids.delete_if { |sq| @ignore.include? sq[:db] }
end
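The effect of the delete_if call can be sketched on local variables in place of the instance variables. The records and the ignore list below are hypothetical, and strings stand in for database objects.

```ruby
# Hypothetical records in the shape produced by all_sequence_ids.
all_seqids = [
  { db: 'good.fa', seqids: %w[id1 id2] },
  { db: 'bad.fa', seqids: %w[N/A N/A] }
]

# Hypothetical ignore list, as check_parse_seqids would have filled it.
ignore = ['bad.fa']

# Same call as in remove_invalid_databases; it mutates the array in place.
all_seqids.delete_if { |sq| ignore.include? sq[:db] }

remaining = all_seqids.map { |sq| sq[:db] }
p remaining
```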