Better CSV processingnwith Ruby 2.6

: author

Kouhei Sutou/Kazuma Furuhashi

: institution

ClearCode Inc./Speee, Inc.

: content-source

RubyKaigi 2019

: date

2019-04-19

: start-time

2019-04-19T11:20:00+09:00

: end-time

2019-04-19T12:00:00+09:00

: theme

.

Ad: Silver sponsor

# img
# src = images/clear-code-rubykaigi-2019-silver-sponsor.png
# relative_height = 100
# reflect_ratio = 0.1

Slide properties

: enable-title-on-image

false

Ad: Cafe sponsor

# img
# src = images/speee-rubykaigi-2019-cafe-sponsor.png
# relative_height = 100
# reflect_ratio = 0.1

Slide properties

: enable-title-on-image

false

Kouhei Sutou

* The president of ClearCode Inc.\n
  (('note:クリアコードの社長'))
* A new maintainer of the (({csv})) library\n
  (('note:(({csv}))ライブラリーの新メンテナー'))
* The founder of ((<Red Data Tools|URL:https://red-data-tools.github.io/>)) project\n
  (('note:Red Data Toolsプロジェクトの立ち上げ人'))
  * Provides data processing tools for Ruby\n
    (('note:Ruby用のデータ処理ツールを提供するプロジェクト'))

Kazuma Furuhashi

* A member of Asakusa.rb / Red Data Tools\n
  (('note:Asakusa.rb/Red Data Toolsメンバー'))
* Worikng at Speee Inc.\n
  (('note:Speeeで働いている'))

csv in Ruby 2.6 (1)n(('note:Ruby 2.6のcsv(1)'))

Faster CSV parsingn (('note:CSVパースの高速化'))

Unquoted CSVn(('note:クォートなしのCSV'))

  AAAAA,AAAAA,AAAAA
  ...

# RT

2.5, 2.6, Faster?

432.0i/s, 764.9i/s, 1.77x

Quoted CSVn(('note:クォートありのCSV'))

  "AAAAA","AAAAA","AAAAA"
  ...

# RT

2.5, 2.6, Faster?

274.1i/s, 534.5i/s, 1.95x

Quoted separator CSV (1)n(('note:区切り文字をクォートしているCSV(1)'))

  ",AAAAA",",AAAAA",",AAAAA"
  ...

# RT

2.5, 2.6, Faster?

211.0i/s, 330.0/s, 1.56x

Quoted separator CSV (2)n(('note:区切り文字をクォートしているCSV(2)'))

  "AAAAA\r\n","AAAAA\r\n","AAAAA\r\n"
  ...

# RT

2.5, 2.6, Faster?

118.7i/s, 325.6/s, 2.74x

Quoted CSVsn(('note:クォートありのCSV'))

# RT

, 2.5, 2.6

Just quoted, 274.1i/s, 554.5i/s
Include sep1, 211.0i/s, 330.0i/s
Include sep2, 118.0i/s, 325.6i/s
(Note), (Slow down), (Still fast)

(('note:Note: “Just quoted” on 2.6 is optimized'))

Multibyte CSVn(('note:マルチバイトのCSV'))

  あああああ,あああああ,あああああ
  ...

# RT

2.5, 2.6, Faster?

371.2i/s, 626.6i/s, 1.69x

csv in Ruby 2.6 (2)n(('note:Ruby 2.6のcsv(2)'))

Faster CSV writingn (('note:CSV書き出しの高速化'))

(({CSV.generate_line}))

  # rouge ruby
  fields = ["AAAAA"] * n_columns
  n_rows.times do
    CSV.generate_line(fields)
  end

# RT

2.5, 2.6, Faster?

284.4i/s, 684.2i/s, 2.41x

(({CSV#<<}))

  # rouge ruby

  output = StringIO.new
  csv = CSV.new(output)
  n_rows.times {csv << fields}

# RT

2.5, 2.6, Faster?

2891.4i/s, 4824.1i/s, 1.67x

(({CSV.generate_line})) vs. (({CSV#<<}))

# RT

, 2.5, 2.6

(({generate_\nline})), 284.4i/s, 684.2i/s
(({<<})), 2891.4i/s, 4824.1i/s

(('tag:center')) ((*Use (({<<})) for multiple writes*))n (('note:((*複数行書き出すときは(({<<}))を使うこと*))'))

csv in Ruby 2.6 (3)n(('note:Ruby 2.6のcsv(3)'))

New CSV parsern (('tag:small:for'))n further improvementsn (('note:さらなる改良のための新しいCSVパーサー'))

Benchmark with KEN_ALL.CSVn(('note:KEN_ALL.CSVでのベンチマーク'))

01101,"060  ","0600000","ホッカイドウ","サッポロシチュウオウク",...
...(124257 lines)...
47382,"90718","9071801","オキナワケン","ヤエヤマグンヨナグニチョウ",...

(('tag:center')) Zip code data in Japann (('note:日本の郵便番号データ'))

(('tag:center')) (('tag:small')) ((<URL:www.post.japanpost.jp/zipcode/download.html>))

KEN_ALL.CSV statisticsn(('note:KEN_ALL.CSVの統計情報'))

# RT

Size(('note:(サイズ)')), 11.7MiB
(('#')) of columns(('note:(列数)')), 15
(('#')) of rows(('note:(行数)')), 124259
Encoding(('note:(エンコーディング)')), CP932

Parsing KEN_ALL.CSVn(('note:KEN_ALL.CSVのパース'))

  # rouge ruby
  CSV.foreach("KEN_ALL.CSV",
              "r:cp932") do |row|
  end

# RT

2.5, 2.6, Faster?

1.17s, 0.79s, 1.48x

Fastest parsing in pure Rubyn(('note:Ruby実装での最速のパース方法'))

# rouge ruby
input.each_line(chomp: true) do |line|
  line.split(",", -1) do |column|
  end
end

(('tag:center')) Limitation: No quoten (('note:制限:クォートがないこと'))

KEN_ALL.CSV without quoten(('note:クォートなしのKEN_ALL.CSV'))

01101,060  ,0600000,ホッカイドウ,サッポロシチュウオウク,...
...(124257 lines)...
47382,90718,9071801,オキナワケン,ヤエヤマグンヨナグニチョウ,...

Optimized no quote CSV parsingn(('note:最適化したクォートなしCSVのパース方法'))

  # rouge ruby
  CSV.foreach("KEN_ALL_NO_QUOTE.CSV",
              "r:cp932",
              quote_char: nil) {|row|}

# RT

split, 2.6, Faster?

0.32s, 0.37s, 0.86x\n(('note:(almost the same/同等)'))

Summary: Performancen(('note:まとめ:性能'))

* Parsing: 1.5x-3x faster\n
  (('note:パース:1.5x-3x高速'))
  * Max to the "split" level by using an option\n
    (('note:オプションを指定すると最大で「split」レベルまで高速化可能'))
* Writing: 1.5x-2.5x faster\n
  (('note:書き出し:1.5x-2.5x高速'))
  * Use (({CSV#<<})) than (({CSV.generate_line}))\n
    (('note:(({CSV.generate_line}))よりも(({CSV#<<}))を使うこと'))

How to improve performance (1)n(('note:速度改善方法(1)'))

Complex quoten (('note:複雑なクォート'))

Complex quoten(('note:複雑なクォート'))

"AA""AAA"
"AA,AAA"
"AA\rAAA"
"AA\nAAA"

Use (({StringScanner}))n(('note:(({StringScanner}))を使う'))

* (({String#split})) is very fast\n
  (('note:(({String#split}))は高速'))
* (({String#split})) is naive for complex quote\n
  (('note:(({String#split}))は複雑なクォートを処理するには単純過ぎる'))

2.5 uses (({String#split}))

# rouge ruby

in_extended_column = false # "...\n..." case
@input.each_line do |line|
  line.split(",", -1).each do |part|
    if in_extended_column
      # ...
    elsif part.start_with?('"')
      if part.end_with?('"')
        row << pars.gsub('""', '"') # "...""..." case
      else
        in_extended_column = true
      end
    # ...

(({split})): Complex quote

# RT

Just quoted, 274.1i/s
Include sep1, 211.0i/s
Include sep2, 118.0i/s

(('tag:center')) Slow downn (('note:遅くなる'))

2.6 uses (({StringScanner}))

# rouge ruby

row = []
until @scanner.eos?
  value = parse_column_value
  if @scanner.scan(/,/)
    row << value
  elsif @scanner.scan(/\n/)
    row << value
    yield(row)
    row = []
  end
end

(({parse_column_value}))

# rouge ruby

def parse_column_value
  parse_unquoted_column_value ||
    parse_quoted_column_value
end

(('tag:center')) Compositable componentsn (('note:部品を組み合わせられる'))

(({parse_unquoted_column_value}))

# rouge ruby

def parse_unquoted_column_value
  @scanner.scan(/[^,"\r\n]+/)
end

(({parse_quoted_column_value}))

# rouge ruby

def parse_quoted_column_value
  # Not quoted
  return nil unless @scanner.scan(/"/)
  # Quoted case ...
end

Parse methods can be compositedn(('note:パースメソッドを組み合わせられる'))

# rouge ruby

def parse_column_value
  parse_unquoted_column_value ||
    parse_quoted_column_value
end

(('tag:center')) Easy to maintainn (('note:メンテナンスしやすい'))

Point (1)n(('note:ポイント(1)'))

* Use (({StringScanner})) for complex case\n
  (('note:複雑なケースには(({StringScanner}))を使う'))
* (({StringScanner})) for complex case:\n
  (('note:複雑なケースに(({StringScanner}))を使うと:'))
  * Easy to maintain\n
    (('note:メンテナンスしやすい'))
  * No performance regression\n
    (('note:性能が劣化しない'))

(({StringScanner})): Complex quote

# RT

Just quoted, 554.5i/s
Include sep1, 330.0i/s
Include sep2, 325.6i/s

(('tag:center')) No slow down…?n (('note:遅くなっていない。。。?'))

How to improve performance (2)n(('note:速度改善方法(2)'))

Simple casen (('note:単純なケース'))

Simple casen(('note:単純なケース'))

AAAAA
"AAAAA"

Use (({String#split}))n(('note:(({String#split}))を使う'))

(({StringScanner})) isn slown for simple casen (('note:(({StringScanner}))は単純なケースでは遅い'))

Fallback to (({StringScanner})) impl.n(('note:(({StringScanner}))実装にフォールバック'))

# rouge ruby

def parse_by_strip(&block)
  @input.each_line do |line|
    if complex?(line)
      return parse_by_string_scanner(&block)
    else
      yield(line.split(","))
    end
  end
end

Quoted CSVsn(('note:クォートありのCSV'))

# RT

, (({StringScanner})), (({split})) + (({StringScanner}))

Just quoted, 311.7i/s, (('tag:right'))((*523.4i/s*))
Include sep1, 312.9i/s, 309.8i/s
Include sep2, 311.3i/s, 312.6i/s

Point (2)n(('note:ポイント(2)'))

* First try optimized version\n
  (('note:まず最適化バージョンを試す'))
* Fallback to robust version\n
  when complexity is detected\n
  (('note:複雑だとわかったらちゃんとしたバージョンにフォールバック'))

How to improve performance (3)n(('note:速度改善方法(3)'))

(({loop do}))n ↓n (({while true}))

(({loop})) vs. (({while}))

# RT

How, Throughput

(({loop})), 377i/s
(({while})), 401i/s

Point (3)n(('note:ポイント(3)'))

* (({while})) doesn't create a new scope\n
  (('note:(({while}))は新しいスコープを作らない'))
* Normally, you can use (({loop}))\n
  (('note:ふつうは(({loop}))でよい'))
  * Normally, (({loop})) isn't a bottle neck\n
    (('note:ふつうは(({loop}))がボトルネックにはならない'))

How to improve performance (4)n(('note:速度改善方法(4)'))

Lazyn (('note:遅延'))

(({CSV})) object is parser and writern(('note:(({CSV}))オブジェクトは読み書きできる'))

* 2.5: Always initializes everything\n
  (('note:2.5:常にすべてを初期化'))
* 2.6: Initializes when it's needed\n
  (('note:2.6:必要になったら初期化'))

Write performance

# RT

, 2.5, 2.6, Faster?

(({generate_\nline})), 284.4i/s, 684.2i/s, 2.41x
(({<<})), 2891.4i/s, 4824.1i/s, 1.67x

How to initialize lazilyn(('note:初期化を遅延する方法'))

# rouge ruby
def parser
  @parser ||= Parser.new(...)
end

def writer
  @writer ||= Writer.new(...)
end

Point (4)n(('note:ポイント(4)'))

* Do only needed things\n
  (('note:必要なことだけする'))
* One class for one feature\n
  (('note:機能ごとにクラスを分ける'))

New features by new parsern(('note:新しいパーサーによる新機能'))

* Add support for \" escape\n
  (('note:\\"でのエスケープをサポート'))
* Add (({strip:})) option\n
  (('note:(({strip:}))オプションを追加'))

" escapen(('note:\“でのエスケープ'))

# rouge ruby

CSV.parse(%Q["a""bc","a\\"bc"],
          liberal_parsing: {backslash_quote: true})
# [["a\\"bc", "a\\"bc"]]

(({strip:}))n(('note:(({strip:}))'))

# rouge ruby

CSV.parse(%Q[ abc  ,  " abc"], strip: true)
# [["abc", " abc"]]
CSV.parse(%Q[abca,abc], strip: "a")
# [["bc", "bc"]]

csv in Ruby 2.6 (4)n(('note:Ruby 2.6のcsv(4)'))

Keep backward compatibilityn (('note:互換性を維持'))

How to keep backward compat.n(('note:互換性を維持する方法'))

* Reconstruct test suites\n
  (('note:テストを整理'))
* Add benchmark suites\n
  (('note:ベンチマークを追加'))

Testn(('note:テスト'))

* Important to detect incompat.\n
  (('note:非互換の検出のために重要'))
* Must be easy to maintain\n
  (('note:メンテナンスしやすくしておくべき'))
  * To keep developing\n
    (('note:継続的な開発するため'))

Easy to maintenancen(('note:メンテナンスしやすい状態'))

* Easy to understand each test\n
  (('note:各テストを理解しやすい'))
* Easy to run each test\n
  (('note:各テストを個別に実行しやすい'))
  * Focusing a failed case is easy to debug\n
    (('note:失敗したケースに集中できるとデバッグしやすい'))

Benchmarkn(('note:ベンチマーク'))

* Important to detect\n
  performance regression ((*bugs*))\n
  (('note:性能劣化((*バグ*))を検出するために重要'))

benchmark_driver gem

Fully-featured benchmark driver for Ruby 3x3n (('note:Ruby 3x3のために必要な機能が揃っているベンチマークツール'))

benchmark_driver gem in csv

* YAML input is easy to use\n
  (('note:YAMLによる入力が便利'))
* Can compare multiple gem versions\n
  (('note:複数のgemのバージョンで比較可能'))
  * To detect performance regression\n
    (('note:性能劣化を検出するため'))

Benchmark for each gem versionn(('note:gemのバージョン毎のベンチマーク'))

# img
# src = images/benchmark-driver.svg
# relative_height = 100
# reflect_ratio = 0.1

Slide properties

: enable-title-on-image

false

csv/benchmark/

* convert_nil.yaml
* parse{,_liberal_parsing}.yaml
* parse_{quote_char_nil,strip}.yaml
* read.yaml, shift.yaml, write.yaml

Benchmark as a starting pointn(('note:出発点としてのベンチマーク'))

* Join csv developing!\n
  (('note:csvの開発に参加しよう!'))
* Adding a new benchmark is a good start\n
  (('note:ベンチマークの追加から始めるのはどう?'))
  * We'll focus on improving performance for benchmark cases\n
    (('note:ベンチマークが整備されているケースの性能改善に注力するよ'))

How to use improved csv?n(('note:改良されたcsvを使う方法'))

(({gem install csv}))

csv in Ruby 2.5n(('note:Ruby 2.5のcsv'))

Default gemifiedn (('note:デフォルトgem化'))

Default gemn(('note:デフォルトgem'))

* Can use it just by (({require}))\n
  (('note:(({require}))するだけで使える'))
* Can use it without entry in (({Gemfile}))\n
  (('note:(But you use it bundled in your Ruby)'))\n
  (('note:(({Gemfile}))に書かなくても使えるけど古い'))
* Can upgrade it by (({gem}))\n
  (('note:(({gem}))でアップグレードできる'))

How to use improved csv?n(('note:改良されたcsvを使う方法'))

(({gem install csv}))

Futuren(('note:今後の話'))

Fastern (('note:さらに速く'))

Improve (({String#split}))n(('note:(({String#split}))を改良'))

Accept (({“ ”}))n as normal separatorn (('note:(({“ ”}))をただの区切り文字として扱う'))

(({split(“ ”)})) works like awkn(('note:(({split(“ ”)}))はawkのように動く'))

# rouge ruby
" a  b c".split(" ", -1)
# => ["a", "b", "c"]
" a  b c".split(/ /, -1)
# => ["", "a", "", "b", "c"]

(({String#split})) in csvn(('note:csvでの(({String#split}))'))

# rouge ruby
if @column_separator == " "
  line.split(/ /, -1)
else
  line.split(@column_separator, -1)
end

(({split(string)})) vs. (({split(regexp)}))

# RT

(({regexp})), (({string})), Faster?

344448i/s, 3161117i/s, 9.18x

(('tag:center')) See also ((<“[Feauture:15771]”|URL:bugs.ruby-lang.org/issues/15771>))

Improve (({StringScanner#scan}))n(('note:(({StringScanner#scan}))を改良'))

Accept (({String}))n as patternn (('note:(({String}))もパターンとして使えるようにする'))

(({scan(string)})) vs. (({scan(regexp)}))

# RT

(({regexp})), (({string})), Faster?

14712660i/s, 18421631i/s, 1.25x

(('tag:center')) See also ((<“ruby/strscan#4”|URL:github.com/ruby/strscan/pull/4>))

Faster KEN_ALL.CSV parsing (1)n(('note:より速いKEN_ALL.CSVのパース(1)'))

# RT

, Elapsed

csv, 0.791s
FastestCSV, 0.141s

Faster KEN_ALL.CSV parsing (2)n(('note:より速いKEN_ALL.CSVのパース(2)'))

# RT

, Encoding, Elapsed

csv, CP932, 0.791s
FastestCSV, CP932, 0.141s
csv, UTF-8, 1.345s
FastestCSV, UTF-8, 0.713s

Faster KEN_ALL.CSV parsing (3)n(('note:より速いKEN_ALL.CSVのパース(3)'))

# RT

, Encoding, Elapsed

FastestCSV, UTF-8, 0.713s
Python, UTF-8, 0.208s
Apache Arrow, UTF-8, 0.145s

Further workn(('note:今後の改善案'))

* Improve transcoding performance of Ruby\n
  (('note:Rubyのエンコーディング変換処理の高速化'))
* Improve simple case parse performance\n
  by implementing parser in C\n
  (('note:シンプルなケース用のパーサーをCで実装して高速化'))
* Improve perf. of REXML as well as csv\n
  (('note:csvのようにREXMLも高速化'))

Join us!n(('note:一緒に開発しようぜ!'))

* Red Data Tools:\n
  ((<URL:https://red-data-tools.github.io/>))
* RubyData Workshop: Today 14:20-15:30
* Code Party: Today 19:00-21:00
* After Hack: Sun. 10:30-17:30

Join us!!n(('note:一緒に開発しようぜ!!'))

* OSS Gate: ((<URL:https://oss-gate.github.io/>))
  * provides a "gate" to join OSS development\n
    (('note:OSSの開発に参加する「入り口」を提供する取り組み'))
  * Both ClearCode and Speee are one of ((<sponsors|URL:https://oss-gate.github.io/sponsors/>))\n
    (('note:クリアコードもSpeeeもスポンサー'))
* OSS Gate Fukuoka:\n
  ((<URL:https://oss-gate-fukuoka.connpass.com/>))