class DataIO

Class containing methods for reading and writing FASTQ data files.

Public Class Methods

new(samples, fastq_files, compress, output_dir)

Internal: Constructor method for DataIO objects.

samples - Array with Sample objects, each consisting of id, index1 and index2.

fastq_files - Array of Strings with FASTQ file names of multiplexed data.

compress - Symbol indicating if output data should be compressed with either gzip or bzip2.

output_dir - String with path of output directory.

Returns DataIO object.

# File lib/data_io.rb, line 38
def initialize(samples, fastq_files, compress, output_dir)
  @samples         = samples
  @compress        = compress
  @output_dir      = output_dir
  @suffix1         = extract_suffix(fastq_files, '_R1_')
  @suffix2         = extract_suffix(fastq_files, '_R2_')
  @input_files     = identify_input_files(fastq_files)
  @undetermined    = @samples.size
  @output_file_ios = nil
end

Public Instance Methods

[](key)

Internal: Getter method that returns a tuple of file handles from @output_file_ios when given a sample index key.

key - Sample index Integer key used for lookup.

Returns Array with a tuple of IO objects.

# File lib/data_io.rb, line 107
def [](key)
  @output_file_ios[key]
end

each() { |entries| ... }

Internal: Method that reads a Seq entry from each of the file handles in the @input_file_ios Array. Iteration stops when no more Seq entries are found.

Yields an Array with 4 Seq objects.

Returns nothing.

# File lib/data_io.rb, line 89
def each
  loop do
    entries = @input_file_ios.each_with_object([]) do |e, a|
      a << e.next_entry
    end

    break if entries.compact.size != 4

    yield entries
  end
end
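
The lock-step iteration described above can be illustrated in isolation. The following is a minimal sketch, not the library's code: `ArrayReader` and `each_entry_set` are hypothetical stand-ins for the FASTQ file handles and for `each`, and the only assumption carried over is that a reader's `next_entry` returns nil once it is exhausted.

```ruby
# Hypothetical stand-in for a FASTQ file handle: next_entry returns the
# next record, or nil when no records are left.
class ArrayReader
  def initialize(entries)
    @entries = entries.dup
  end

  def next_entry
    @entries.shift
  end
end

# Read one entry from every reader per iteration and stop as soon as any
# reader runs dry, mirroring the `entries.compact.size != 4` check.
def each_entry_set(readers)
  loop do
    entries = readers.map(&:next_entry)
    break if entries.compact.size != readers.size

    yield entries
  end
end

readers = [
  ArrayReader.new(%w[i1_a i1_b]),
  ArrayReader.new(%w[i2_a i2_b]),
  ArrayReader.new(%w[r1_a r1_b]),
  ArrayReader.new(%w[r2_a]) # deliberately one entry short
]

sets = []
each_entry_set(readers) { |entries| sets << entries }
p sets
```

Note the consequence of this stop condition: when one file holds fewer entries than the others, iteration ends quietly at the first incomplete set instead of raising an error.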

open_input_files() { |self| ... }

Internal: Method that opens the @input_files for reading and closes them again when the block returns.

Yields the DataIO object itself, with the input file handles stored in @input_file_ios.

Returns the value of the block.

# File lib/data_io.rb, line 54
def open_input_files
  @input_file_ios = []

  @input_files.each do |input_file|
    @input_file_ios << BioPieces::Fastq.open(input_file)
  end

  yield self
ensure
  close_input_files
end
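
The open, yield, ensure-close shape used here can be tried out without any FASTQ files. This is a minimal sketch under stated assumptions: `with_handles` is a hypothetical helper, and `StringIO` objects stand in for the handles that `BioPieces::Fastq.open` would return.

```ruby
require 'stringio'

# Hypothetical helper: open every source, yield the handles, and close
# them in an ensure clause so they are released even if the block raises.
def with_handles(texts)
  handles = texts.map { |text| StringIO.new(text) }
  yield handles
ensure
  # handles is nil if opening itself failed, hence the safe navigation.
  handles&.each(&:close)
end

seen = nil
first_lines = with_handles(["@read1\n", "@read2\n"]) do |handles|
  seen = handles
  handles.map { |handle| handle.gets.chomp }
end

closed_after_success = seen.all?(&:closed?)

begin
  with_handles(["@read3\n"]) do |handles|
    seen = handles
    raise 'simulated failure'
  end
rescue RuntimeError
  closed_after_failure = seen.all?(&:closed?)
end

p first_lines, closed_after_success, closed_after_failure
```

Because the block call is the last expression in the method body, the method returns whatever the block returns, and the ensure clause does not alter that value.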
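
open_output_files() { |self| ... }

Internal: Method that opens the output files for writing and closes them again when the block returns.

Yields the DataIO object itself. The output file handles are stored in a Hash with an incrementing index as keys and a tuple of file handles as values, and can be looked up through #[].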

# File lib/data_io.rb, line 70
def open_output_files
  @output_file_ios = {}
  comp             = @compress

  @output_file_ios.merge!(open_output_files_samples(comp))
  @output_file_ios.merge!(open_output_files_undet(comp))

  yield self
ensure
  close_output_files
end

Private Instance Methods

append_suffix(slr)

Internal: Method that appends a file suffix to a given Sample, Lane, Region information String based on the @compress setting. The file suffix can be either “.fastq.gz”, “.fastq.bz2”, or “.fastq”.

slr - String Sample, Lane, Region information.

Examples

append_suffix("_S1_L001_R1_001")
# => "_S1_L001_R1_001.fastq.gz"

Returns String with SLR info and file suffix.

# File lib/data_io.rb, line 155
def append_suffix(slr)
  case @compress
  when /gzip/
    slr << '.fastq.gz'
  when /bzip2/
    slr << '.fastq.bz2'
  else
    slr << '.fastq'
  end

  slr
end
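
The mapping from compression setting to file suffix can be sketched as a standalone function. `fastq_suffix` is a hypothetical name, not part of DataIO. Unlike `append_suffix`, which appends to its argument in place with `<<`, this variant returns a new String and leaves the input untouched.

```ruby
# Hypothetical, non-mutating variant of the suffix selection.
# compress may be a Symbol, a String, or nil.
def fastq_suffix(slr, compress)
  extension = case compress.to_s
              when /gzip/  then '.fastq.gz'
              when /bzip2/ then '.fastq.bz2'
              else              '.fastq'
              end

  slr + extension
end

slr = '_S1_L001_R1_001'

gz    = fastq_suffix(slr, :gzip)
bz2   = fastq_suffix(slr, :bzip2)
plain = fastq_suffix(slr, nil)

puts gz, bz2, plain
```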

close_input_files()

Internal: Method that closes open input files.

Returns nothing.

# File lib/data_io.rb, line 233
def close_input_files
  @input_file_ios.map(&:close)
end

close_output_files()

Internal: Method that closes the file handles stored in @output_file_ios.

Returns nothing.

# File lib/data_io.rb, line 240
def close_output_files
  @output_file_ios.each_value { |value| value.map(&:close) }
end

extract_suffix(files, pattern)

Internal: Method that extracts the Sample, Lane, Region information from given files.

files - Array with FASTQ file names as Strings.

pattern - String with pattern to use for matching file names.

Examples

extract_suffix(["Sample1_S1_L001_R1_001.fastq.gz"], "_R1_")
# => "_S1_L001_R1_001.fastq.gz" (with gzip compression)

Returns String with SLR info and file suffix. Raises DataIOError unless pattern matches exactly 1 file. Raises DataIOError unless SLR info can be parsed.

# File lib/data_io.rb, line 127
def extract_suffix(files, pattern)
  hits = files.grep(Regexp.new(pattern))

  unless hits.size == 1
    fail DataIOError, "Expecting exactly 1 hit but got: #{hits.size}"
  end

  if hits.first =~ /.+(_S\d_L\d{3}_R[12]_\d{3}).+$/
    slr = Regexp.last_match(1)
  else
    fail DataIOError, "Unable to parse file SLR from: #{hits.first}"
  end

  append_suffix(slr)
end
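
To see what the regular expression in `extract_suffix` accepts, it can be exercised on its own. This is a minimal sketch: the pattern is copied from the source above, while `SLR_PATTERN`, `slr_from`, and the file names are hypothetical and serve only as illustration.

```ruby
# Pattern copied from extract_suffix; the constant name is hypothetical.
SLR_PATTERN = /.+(_S\d_L\d{3}_R[12]_\d{3}).+$/

# Hypothetical helper: return the captured SLR String, or nil on no match.
def slr_from(file_name)
  match = SLR_PATTERN.match(file_name)
  match && match[1]
end

read_file  = slr_from('Sample1_S1_L001_R1_001.fastq.gz')
index_file = slr_from('Sample1_S1_L001_I1_001.fastq.gz')
two_digit  = slr_from('Sample10_S10_L001_R1_001.fastq.gz')

p read_file, index_file, two_digit
```

As written, the pattern only matches read files (`R1` or `R2`), and `\d` after `_S` matches a single digit, so a file name carrying a two-digit sample number such as `_S10_` does not match and would trigger the "Unable to parse file SLR" error.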

identify_input_files(fastq_files)

Internal: Method that identifies the different input files from a given Array of FASTQ files. The forward index file contains an I1, the reverse index file contains an I2, the forward read file contains an R1, and the reverse read file contains an R2.

fastq_files - Array with FASTQ files (Strings).

Returns an Array with input files (Strings). Raises unless 4 input_files are found.

# File lib/data_io.rb, line 177
def identify_input_files(fastq_files)
  input_files = []

  input_files << fastq_files.grep(/_I1_/).first
  input_files << fastq_files.grep(/_I2_/).first
  input_files << fastq_files.grep(/_R1_/).first
  input_files << fastq_files.grep(/_R2_/).first

  unless input_files.compact.size == 4
    fail DataIOError, 'Expecting exactly 4 input_files but got: ' \
                      "#{input_files.compact.size}"
  end

  input_files
end
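
The selection logic can be sketched independently of the class. This is an illustration under stated assumptions: `pick_input_files`, `TAGS`, and the file names are hypothetical, and `ArgumentError` stands in for `DataIOError`, which is not defined in this listing. The result is always ordered I1, I2, R1, R2, whatever the order of the input Array.

```ruby
# Tags in the order the input files are expected downstream.
TAGS = %w[_I1_ _I2_ _R1_ _R2_].freeze

# Hypothetical standalone version: pick the first file name containing each
# tag and report which tags could not be found.
def pick_input_files(fastq_files)
  picked  = TAGS.map { |tag| fastq_files.find { |file| file.include?(tag) } }
  missing = TAGS.zip(picked).select { |_tag, file| file.nil? }.map(&:first)

  unless missing.empty?
    raise ArgumentError, "Missing input files for: #{missing.join(', ')}"
  end

  picked
end

files = %w[
  Test_S1_L001_R2_001.fastq.gz
  Test_S1_L001_I1_001.fastq.gz
  Test_S1_L001_R1_001.fastq.gz
  Test_S1_L001_I2_001.fastq.gz
]

ordered = pick_input_files(files)
puts ordered

begin
  pick_input_files(files.first(3))
rescue ArgumentError => e
  error_message = e.message
end
```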

open_output_files_samples(comp)

Internal: Method that opens the sample output files for writing.

comp - Symbol with type of output compression.

Returns a Hash with an incrementing index as keys, and a tuple of file handles as values.

# File lib/data_io.rb, line 199
def open_output_files_samples(comp)
  output_file_ios = {}

  @samples.each_with_index do |sample, i|
    file_forward = File.join(@output_dir, "#{sample.id}#{@suffix1}")
    file_reverse = File.join(@output_dir, "#{sample.id}#{@suffix2}")
    io_forward   = BioPieces::Fastq.open(file_forward, 'w', compress: comp)
    io_reverse   = BioPieces::Fastq.open(file_reverse, 'w', compress: comp)
    output_file_ios[i] = [io_forward, io_reverse]
  end

  output_file_ios
end

open_output_files_undet(comp)

Internal: Method that opens the undetermined output files for writing.

comp - Symbol with type of output compression.

Returns a Hash with the @undetermined index (the number of samples) as key, and a tuple of file handles as value.

# File lib/data_io.rb, line 219
def open_output_files_undet(comp)
  output_file_ios    = {}
  file_forward = File.join(@output_dir, "Undetermined#{@suffix1}")
  file_reverse = File.join(@output_dir, "Undetermined#{@suffix2}")
  io_forward   = BioPieces::Fastq.open(file_forward, 'w', compress: comp)
  io_reverse   = BioPieces::Fastq.open(file_reverse, 'w', compress: comp)
  output_file_ios[@undetermined] = [io_forward, io_reverse]

  output_file_ios
end
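
The key layout that results from merging the sample and undetermined Hashes can be shown without opening any files. This is a minimal sketch: `DemoSample`, `output_paths`, the sample ids, and the `out` directory are hypothetical, and the function builds path Strings where the real methods open file handles.

```ruby
# Hypothetical stand-in for the Sample objects passed to DataIO.
DemoSample = Struct.new(:id, :index1, :index2)

# Build the forward/reverse output paths per key: samples occupy keys
# 0..n-1 and the undetermined pair takes key n (the number of samples).
def output_paths(samples, output_dir, suffix1, suffix2)
  paths = {}

  samples.each_with_index do |sample, i|
    paths[i] = [File.join(output_dir, "#{sample.id}#{suffix1}"),
                File.join(output_dir, "#{sample.id}#{suffix2}")]
  end

  paths[samples.size] = [File.join(output_dir, "Undetermined#{suffix1}"),
                         File.join(output_dir, "Undetermined#{suffix2}")]

  paths
end

samples = [DemoSample.new('A', 'ATCG', 'GCTA'),
           DemoSample.new('B', 'TTAA', 'CCGG')]

paths = output_paths(samples, 'out',
                     '_S1_L001_R1_001.fastq.gz',
                     '_S1_L001_R2_001.fastq.gz')

paths.each { |key, pair| puts "#{key}: #{pair.join(', ')}" }
```

Using the sample count as the undetermined key keeps every key an Integer, so reads whose indexes match no sample can be routed through the same `[]` lookup as matched reads.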