class SimpleXlsxReader::Document::Mapper

For internal use; translates source xml to Sheet objects.

Constants

DATE_SYSTEM_1900
DATE_SYSTEM_1904
NumFmtMap

Map of non-custom numFmtId to casting symbol

Public Class Methods

cast(value, type, style, options = {}) click to toggle source

The heart of typecasting. The ruby type is determined either explicitly from the cell xml or implicitly from the cell style, and this method expects that work to have been done already. This, then, takes the type we determined it to be and casts the cell value to that type.

types:

  • s: shared string (see shared_string)

  • n: number (cast to a float)

  • b: boolean

  • str: string

  • inlineStr: string

  • ruby symbol: for when type has been determined by style

options:

# File lib/simple_xlsx_reader.rb, line 386
def self.cast(value, type, style, options = {})
  return nil if value.nil? || value.empty?

  # Sometimes the type is dictated by the style alone
  if type.nil? ||
    (type == 'n' && [:date, :time, :date_time].include?(style))
    type = style
  end

  casted = case type

  ##
  # There are few built-in types
  ##

  when 's' # shared string
    options[:shared_strings][value.to_i]
  when 'n' # number
    value.to_f
  when 'b'
    value.to_i == 1
  when 'str'
    value
  when 'inlineStr'
    value

  ##
  # Type can also be determined by a style,
  # detected earlier and cast here by its standardized symbol
  ##

  when :string, :unsupported
    value
  when :fixnum
    value.to_i
  when :float
    value.to_f
  when :percentage
    value.to_f / 100
  # the trickiest. note that  all these formats can vary on
  # whether they actually contain a date, time, or datetime.
  when :date, :time, :date_time
    value = Float(value)
    days_since_date_system_start = value.to_i
    fraction_of_24 = value - days_since_date_system_start

    # http://stackoverflow.com/questions/10559767/how-to-convert-ms-excel-date-from-float-to-date-format-in-ruby
    date = options.fetch(:base_date, DATE_SYSTEM_1900) + days_since_date_system_start

    if fraction_of_24 > 0 # there is a time associated
      seconds = (fraction_of_24 * 86400).round
      return Time.utc(date.year, date.month, date.day) + seconds
    else
      return date
    end
  when :bignum
    if defined?(BigDecimal)
      BigDecimal.new(value)
    else
      value.to_f
    end

  ##
  # Beats me
  ##

  else
    value
  end

  if options[:url]
    Hyperlink.new(options[:url], casted)
  else
    casted
  end
end

Public Instance Methods

base_date() click to toggle source
Returns the base_date from which to calculate dates.

Defaults to 1900 (minus two days due to excel quirk), but use 1904 if it's set in the Workbook's workbookPr. msdn.microsoft.com/en-us/library/ff530155(v=office.12).aspx

# File lib/simple_xlsx_reader.rb, line 467
def base_date
  @base_date ||=
    begin
      return DATE_SYSTEM_1900 if xml.workbook == nil
      xml.workbook.xpath("//workbook/workbookPr[@date1904]").each do |workbookPr|
        return DATE_SYSTEM_1904 if workbookPr["date1904"] =~ /true|1/i
      end
      DATE_SYSTEM_1900
    end
end
column_letter_to_number(column_letter) click to toggle source

formula fits an exponential factorial function of the form: 'A' = 1 'B' = 2 'Z' = 26 'AA' = 26 * 1 + 1 'AZ' = 26 * 1 + 26 'BA' = 26 * 2 + 1 'ZA' = 26 * 26 + 1 'ZZ' = 26 * 26 + 26 'AAA' = 26 * 26 * 1 + 26 * 1 + 1 'AAZ' = 26 * 26 * 1 + 26 * 1 + 26 'ABA' = 26 * 26 * 1 + 26 * 2 + 1 'BZA' = 26 * 26 * 2 + 26 * 26 + 1

# File lib/simple_xlsx_reader.rb, line 285
def column_letter_to_number(column_letter)
  pow = column_letter.length - 1
  result = 0
  column_letter.each_byte do |b|
    result += 26**pow * (b - 64)
    pow -= 1
  end
  result
end
custom_style_types() click to toggle source

Map of (numFmtId >= 164) (custom styles) to our best guess at the type ex. {164 => :date_time}

# File lib/simple_xlsx_reader.rb, line 341
def custom_style_types
  @custom_style_types ||=
    xml.styles.xpath('/styleSheet/numFmts/numFmt').
    inject({}) do |acc, xstyle|

    acc[xstyle.attributes['numFmtId'].value.to_i] =
      determine_custom_style_type(xstyle.attributes['formatCode'].value)

    acc
  end
end
determine_custom_style_type(string) click to toggle source

This is the least deterministic part of reading xlsx files. Due to custom styles, you can't know for sure when a date is a date other than looking at its format and gessing. It's not impossible to guess right, though.

stackoverflow.com/questions/4948998/determining-if-an-xlsx-cell-is-date-formatted-for-excel-2007-spreadsheets

# File lib/simple_xlsx_reader.rb, line 359
def determine_custom_style_type(string)
  return :float if string[0] == '_'
  return :float if string[0] == ' 0'

  # Looks for one of ymdhis outside of meta-stuff like [Red]
  return :date_time if string =~ /(^|\])[^\[]*[ymdhis]/i

  return :unsupported
end
last_cell_label(xsheet) click to toggle source

Returns the last column name, ex. 'E'

Note that excel writes a '/worksheet/dimension' node we can get the last cell from, but some libs (ex. simple_xlsx_writer) don't record this. In that case, we assume the data is of uniform column length and check the column name of the last header row. Obviously this isn't the most robust strategy, but it likely fits 99% of use cases considering it's not a problem with actual excel docs.

# File lib/simple_xlsx_reader.rb, line 255
def last_cell_label(xsheet)
  dimension = xsheet.at_xpath('/worksheet/dimension')
  if dimension
    col = dimension.attributes['ref'].value.match(/:([A-Z]+[0-9]+)/)
    col ? col.captures.first : 'A1'
  else
    last = xsheet.at_xpath("/worksheet/sheetData/row[last()]/c[last()]")
    last ? last.attributes['r'].value.match(/([A-Z]+[0-9]+)/).captures.first : 'A1'
  end
end
load_sheets() click to toggle source
# File lib/simple_xlsx_reader.rb, line 159
def load_sheets
  sheet_toc.each_with_index.map do |(sheet_name, _sheet_number), i|
    parse_sheet(sheet_name, xml.sheets[i], xml.sheet_rels[i])  # sheet_number is *not* the index into xml.sheets
  end
end
num_fmt_id(xstyle) click to toggle source

returns the numFmtId value if it's available

# File lib/simple_xlsx_reader.rb, line 316
def num_fmt_id(xstyle)
  if xstyle.attributes['numFmtId']
    xstyle.attributes['numFmtId'].value
  else
    nil
  end
end
parse_sheet(sheet_name, xsheet, xrels) click to toggle source
# File lib/simple_xlsx_reader.rb, line 177
def parse_sheet(sheet_name, xsheet, xrels)
  sheet = Sheet.new(sheet_name)
  sheet_width, sheet_height = *sheet_dimensions(xsheet)
  cells_w_links = xsheet.xpath('//hyperlinks/hyperlink').inject({}) {|acc, e| acc[e.attr(:ref)] = e.attr(:id); acc}

  sheet.rows = Array.new(sheet_height) { Array.new(sheet_width) }
  xsheet.xpath("/worksheet/sheetData/row/c").each do |xcell|
    column, row = *xcell.attr('r').match(/([A-Z]+)([0-9]+)/).captures
    col_idx = column_letter_to_number(column) - 1
    row_idx = row.to_i - 1

    type  = xcell.attributes['t'] &&
            xcell.attributes['t'].value
    style = xcell.attributes['s'] &&
            style_types[xcell.attributes['s'].value.to_i]

    # This is the main performance bottleneck. Using just 'xcell.text'
    # would be ideal, and makes parsing super-fast. However, there's
    # other junk in the cell, formula references in particular,
    # so we really do have to look for specific value nodes.
    # Maybe there is a really clever way to use xcell.text and parse out
    # the correct value, but I can't think of one, or an alternative
    # strategy.
    #
    # And yes, this really is faster than using xcell.at_xpath(...),
    # by about 60%. Odd.
    xvalue = type == 'inlineStr' ?
      (xis = xcell.children.find {|c| c.name == 'is'}) && xis.children.find {|c| c.name == 't'} :
      xcell.children.find {|c| c.name == 'f' && c.text.start_with?('HYPERLINK(') || c.name == 'v'}

    if xvalue
      value = xvalue.text.strip

      if rel_id = cells_w_links[xcell.attr('r')] # a hyperlink made via GUI
        url = xrels.at_xpath(%(//*[@Id="#{rel_id}"])).attr('Target')
      elsif xvalue.name == 'f' # only time we have a function is if it's a hyperlink
        url = value.slice(/HYPERLINK\("(.*?)"/, 1)
      end
    end

    cell = begin
      self.class.cast(value, type, style,
                      :url => url,
                      :shared_strings => shared_strings,
                      :base_date => base_date)
    rescue => e
      if !SimpleXlsxReader.configuration.catch_cell_load_errors
        error = CellLoadError.new(
          "Row #{row_idx}, Col #{col_idx}: #{e.message}")
        error.set_backtrace(e.backtrace)
        raise error
      else
        sheet.load_errors[[row_idx, col_idx]] = e.message

        xcell.text.strip
      end
    end

    # This shouldn't be necessary, but just in case, we'll create
    # the row so we don't blow up. This means any null rows in between
    # will be null instead of [null, null, ...]
    sheet.rows[row_idx] ||= Array.new(sheet_width)

    sheet.rows[row_idx][col_idx] = cell
  end

  sheet
end
shared_strings() click to toggle source

For performance reasons, excel uses an optional SpreadsheetML feature that puts all strings in a separate xml file, and then references them by their index in that file.

msdn.microsoft.com/en-us/library/office/gg278314.aspx

# File lib/simple_xlsx_reader.rb, line 519
def shared_strings
  @shared_strings ||= begin
    if xml.shared_strings
      xml.shared_strings.xpath('/sst/si').map do |xsst|
        # a shared string can be a single value...
        sst = xsst.at_xpath('t/text()')
        sst = sst.text if sst
        # ... or a composite of seperately styled words/characters
        sst ||= xsst.xpath('r/t/text()').map(&:text).join
      end
    else
      []
    end
  end
end
sheet_dimensions(xsheet) click to toggle source

Returns dimensions (1-indexed)

# File lib/simple_xlsx_reader.rb, line 267
def sheet_dimensions(xsheet)
  column, row = *last_cell_label(xsheet).match(/([A-Z]+)([0-9]+)/).captures
  [column_letter_to_number(column), row.to_i]
end
sheet_toc() click to toggle source

Table of contents for the sheets, ex. {'Authors' => 0, …}

# File lib/simple_xlsx_reader.rb, line 166
def sheet_toc
  xml.workbook.xpath('/workbook/sheets/sheet').
    inject({}) do |acc, sheet|

    acc[sheet.attributes['name'].value] =
      sheet.attributes['sheetId'].value.to_i - 1 # keep things 0-indexed

    acc
  end
end
style_type_by_num_fmt_id(id) click to toggle source

Finds the type we think a style is; For example, fmtId 14 is a date style, so this would return :date.

Note, custom styles usually (are supposed to?) have a numFmtId >= 164, but in practice can sometimes be simply out of the usual “Any Language” id range that goes up to 49. For example, I have seen a numFmtId of 59 specified as a date. In Thai, 59 is a number format, so this seems like a bad idea, but we try to be flexible and just go with it.

# File lib/simple_xlsx_reader.rb, line 332
def style_type_by_num_fmt_id(id)
  return nil if id.nil?

  id = id.to_i
  NumFmtMap[id] || custom_style_types[id]
end
style_types() click to toggle source

Excel doesn't record types for some cells, only its display style, so we have to back out the type from that style.

Some of these styles can be determined from a known set (see NumFmtMap), while others are 'custom' and we have to make a best guess.

This is the array of types corresponding to the styles a spreadsheet uses, and includes both the known style types and the custom styles.

Note that the xml sheet cells that use this don't reference the numFmtId, but instead the array index of a style in the stored list of only the styles used in the spreadsheet (which can be either known or custom). Hence this style types array, rather than a map of numFmtId to type.

# File lib/simple_xlsx_reader.rb, line 309
def style_types
  @style_types ||=
      xml.styles.xpath('/styleSheet/cellXfs/xf').map {|xstyle|
        style_type_by_num_fmt_id(num_fmt_id(xstyle))}
end