Better CSV processingnwith Ruby 2.6¶ ↑
: author
Kouhei Sutou/Kazuma Furuhashi
: institution
ClearCode Inc./Speee, Inc.
: content-source
RubyKaigi 2019
: date
2019-04-19
: start-time
2019-04-19T11:20:00+09:00
: end-time
2019-04-19T12:00:00+09:00
: theme
.
Ad: Silver sponsor¶ ↑
# img # src = images/clear-code-rubykaigi-2019-silver-sponsor.png # relative_height = 100 # reflect_ratio = 0.1
Slide properties¶ ↑
: enable-title-on-image
false
Ad: Cafe sponsor¶ ↑
# img # src = images/speee-rubykaigi-2019-cafe-sponsor.png # relative_height = 100 # reflect_ratio = 0.1
Slide properties¶ ↑
: enable-title-on-image
false
Kouhei Sutou¶ ↑
* The president of ClearCode Inc.\n (('note:クリアコードの社長')) * A new maintainer of the (({csv})) library\n (('note:(({csv}))ライブラリーの新メンテナー')) * The founder of ((<Red Data Tools|URL:https://red-data-tools.github.io/>)) project\n (('note:Red Data Toolsプロジェクトの立ち上げ人')) * Provides data processing tools for Ruby\n (('note:Ruby用のデータ処理ツールを提供するプロジェクト'))
Kazuma Furuhashi¶ ↑
* A member of Asakusa.rb / Red Data Tools\n (('note:Asakusa.rb/Red Data Toolsメンバー')) * Worikng at Speee Inc.\n (('note:Speeeで働いている'))
csv in Ruby 2.6 (1)n(('note:Ruby 2.6のcsv(1)'))¶ ↑
Faster CSV parsingn (('note:CSVパースの高速化'))
Unquoted CSVn(('note:クォートなしのCSV'))¶ ↑
AAAAA,AAAAA,AAAAA ... # RT 2.5, 2.6, Faster? 432.0i/s, 764.9i/s, 1.77x
Quoted CSVn(('note:クォートありのCSV'))¶ ↑
"AAAAA","AAAAA","AAAAA" ... # RT 2.5, 2.6, Faster? 274.1i/s, 534.5i/s, 1.95x
Quoted separator CSV (1)n(('note:区切り文字をクォートしているCSV(1)'))¶ ↑
",AAAAA",",AAAAA",",AAAAA" ... # RT 2.5, 2.6, Faster? 211.0i/s, 330.0/s, 1.56x
Quoted separator CSV (2)n(('note:区切り文字をクォートしているCSV(2)'))¶ ↑
"AAAAA\r\n","AAAAA\r\n","AAAAA\r\n" ... # RT 2.5, 2.6, Faster? 118.7i/s, 325.6/s, 2.74x
Quoted CSVsn(('note:クォートありのCSV'))¶ ↑
# RT , 2.5, 2.6 Just quoted, 274.1i/s, 554.5i/s Include sep1, 211.0i/s, 330.0i/s Include sep2, 118.0i/s, 325.6i/s (Note), (Slow down), (Still fast)
(('note:Note: “Just quoted” on 2.6 is optimized'))
Multibyte CSVn(('note:マルチバイトのCSV'))¶ ↑
あああああ,あああああ,あああああ ... # RT 2.5, 2.6, Faster? 371.2i/s, 626.6i/s, 1.69x
csv in Ruby 2.6 (2)n(('note:Ruby 2.6のcsv(2)'))¶ ↑
Faster CSV writingn (('note:CSV書き出しの高速化'))
(({CSV.generate_line}))¶ ↑
# rouge ruby fields = ["AAAAA"] * n_columns n_rows.times do CSV.generate_line(fields) end # RT 2.5, 2.6, Faster? 284.4i/s, 684.2i/s, 2.41x
(({CSV#<<}))¶ ↑
# rouge ruby output = StringIO.new csv = CSV.new(output) n_rows.times {csv << fields} # RT 2.5, 2.6, Faster? 2891.4i/s, 4824.1i/s, 1.67x
(({CSV.generate_line})) vs. (({CSV#<<}))¶ ↑
# RT , 2.5, 2.6 (({generate_\nline})), 284.4i/s, 684.2i/s (({<<})), 2891.4i/s, 4824.1i/s
(('tag:center')) ((*Use (({<<})) for multiple writes*))n (('note:((*複数行書き出すときは(({<<}))を使うこと*))'))
csv in Ruby 2.6 (3)n(('note:Ruby 2.6のcsv(3)'))¶ ↑
New CSV parsern (('tag:small:for'))n further improvementsn (('note:さらなる改良のための新しいCSVパーサー'))
Benchmark with KEN_ALL.CSVn(('note:KEN_ALL.CSVでのベンチマーク'))¶ ↑
01101,"060 ","0600000","ホッカイドウ","サッポロシチュウオウク",... ...(124257 lines)... 47382,"90718","9071801","オキナワケン","ヤエヤマグンヨナグニチョウ",...
(('tag:center')) Zip code data in Japann (('note:日本の郵便番号データ'))
(('tag:center')) (('tag:small')) ((<URL:www.post.japanpost.jp/zipcode/download.html>))
KEN_ALL.CSV statisticsn(('note:KEN_ALL.CSVの統計情報'))¶ ↑
# RT Size(('note:(サイズ)')), 11.7MiB (('#')) of columns(('note:(列数)')), 15 (('#')) of rows(('note:(行数)')), 124259 Encoding(('note:(エンコーディング)')), CP932
Parsing KEN_ALL.CSVn(('note:KEN_ALL.CSVのパース'))¶ ↑
# rouge ruby CSV.foreach("KEN_ALL.CSV", "r:cp932") do |row| end # RT 2.5, 2.6, Faster? 1.17s, 0.79s, 1.48x
Fastest parsing in pure Rubyn(('note:Ruby実装での最速のパース方法'))¶ ↑
# rouge ruby input.each_line(chomp: true) do |line| line.split(",", -1) do |column| end end
(('tag:center')) Limitation: No quoten (('note:制限:クォートがないこと'))
KEN_ALL.CSV without quoten(('note:クォートなしのKEN_ALL.CSV'))¶ ↑
01101,060 ,0600000,ホッカイドウ,サッポロシチュウオウク,... ...(124257 lines)... 47382,90718,9071801,オキナワケン,ヤエヤマグンヨナグニチョウ,...
Optimized no quote CSV parsingn(('note:最適化したクォートなしCSVのパース方法'))¶ ↑
# rouge ruby CSV.foreach("KEN_ALL_NO_QUOTE.CSV", "r:cp932", quote_char: nil) {|row|} # RT split, 2.6, Faster? 0.32s, 0.37s, 0.86x\n(('note:(almost the same/同等)'))
Summary: Performancen(('note:まとめ:性能'))¶ ↑
* Parsing: 1.5x-3x faster\n (('note:パース:1.5x-3x高速')) * Max to the "split" level by using an option\n (('note:オプションを指定すると最大で「split」レベルまで高速化可能')) * Writing: 1.5x-2.5x faster\n (('note:書き出し:1.5x-2.5x高速')) * Use (({CSV#<<})) than (({CSV.generate_line}))\n (('note:(({CSV.generate_line}))よりも(({CSV#<<}))を使うこと'))
How to improve performance (1)n(('note:速度改善方法(1)'))¶ ↑
Complex quoten (('note:複雑なクォート'))
Complex quoten(('note:複雑なクォート'))¶ ↑
"AA""AAA" "AA,AAA" "AA\rAAA" "AA\nAAA"
Use (({StringScanner}))n(('note:(({StringScanner}))を使う'))¶ ↑
* (({String#split})) is very fast\n (('note:(({String#split}))は高速')) * (({String#split})) is naive for complex quote\n (('note:(({String#split}))は複雑なクォートを処理するには単純過ぎる'))
2.5 uses (({String#split}))¶ ↑
# rouge ruby in_extended_column = false # "...\n..." case @input.each_line do |line| line.split(",", -1).each do |part| if in_extended_column # ... elsif part.start_with?('"') if part.end_with?('"') row << pars.gsub('""', '"') # "...""..." case else in_extended_column = true end # ...
(({split})): Complex quote¶ ↑
# RT Just quoted, 274.1i/s Include sep1, 211.0i/s Include sep2, 118.0i/s
(('tag:center')) Slow downn (('note:遅くなる'))
2.6 uses (({StringScanner}))¶ ↑
# rouge ruby row = [] until @scanner.eos? value = parse_column_value if @scanner.scan(/,/) row << value elsif @scanner.scan(/\n/) row << value yield(row) row = [] end end
(({parse_column_value}))¶ ↑
# rouge ruby def parse_column_value parse_unquoted_column_value || parse_quoted_column_value end
(('tag:center')) Compositable componentsn (('note:部品を組み合わせられる'))
(({parse_unquoted_column_value}))¶ ↑
# rouge ruby def parse_unquoted_column_value @scanner.scan(/[^,"\r\n]+/) end
(({parse_quoted_column_value}))¶ ↑
# rouge ruby def parse_quoted_column_value # Not quoted return nil unless @scanner.scan(/"/) # Quoted case ... end
Parse methods can be compositedn(('note:パースメソッドを組み合わせられる'))¶ ↑
# rouge ruby def parse_column_value parse_unquoted_column_value || parse_quoted_column_value end
(('tag:center')) Easy to maintainn (('note:メンテナンスしやすい'))
Point (1)n(('note:ポイント(1)'))¶ ↑
* Use (({StringScanner})) for complex case\n (('note:複雑なケースには(({StringScanner}))を使う')) * (({StringScanner})) for complex case:\n (('note:複雑なケースに(({StringScanner}))を使うと:')) * Easy to maintain\n (('note:メンテナンスしやすい')) * No performance regression\n (('note:性能が劣化しない'))
(({StringScanner})): Complex quote¶ ↑
# RT Just quoted, 554.5i/s Include sep1, 330.0i/s Include sep2, 325.6i/s
(('tag:center')) No slow down…?n (('note:遅くなっていない。。。?'))
How to improve performance (2)n(('note:速度改善方法(2)'))¶ ↑
Simple casen (('note:単純なケース'))
Simple casen(('note:単純なケース'))¶ ↑
AAAAA "AAAAA"
Use (({String#split}))n(('note:(({String#split}))を使う'))¶ ↑
(({StringScanner})) isn slown for simple casen (('note:(({StringScanner}))は単純なケースでは遅い'))
Fallback to (({StringScanner})) impl.n(('note:(({StringScanner}))実装にフォールバック'))¶ ↑
# rouge ruby def parse_by_strip(&block) @input.each_line do |line| if complex?(line) return parse_by_string_scanner(&block) else yield(line.split(",")) end end end
Quoted CSVsn(('note:クォートありのCSV'))¶ ↑
# RT , (({StringScanner})), (({split})) + (({StringScanner})) Just quoted, 311.7i/s, (('tag:right'))((*523.4i/s*)) Include sep1, 312.9i/s, 309.8i/s Include sep2, 311.3i/s, 312.6i/s
Point (2)n(('note:ポイント(2)'))¶ ↑
* First try optimized version\n (('note:まず最適化バージョンを試す')) * Fallback to robust version\n when complexity is detected\n (('note:複雑だとわかったらちゃんとしたバージョンにフォールバック'))
How to improve performance (3)n(('note:速度改善方法(3)'))¶ ↑
(({loop do}))n ↓n (({while true}))
(({loop})) vs. (({while}))¶ ↑
# RT How, Throughput (({loop})), 377i/s (({while})), 401i/s
Point (3)n(('note:ポイント(3)'))¶ ↑
* (({while})) doesn't create a new scope\n (('note:(({while}))は新しいスコープを作らない')) * Normally, you can use (({loop}))\n (('note:ふつうは(({loop}))でよい')) * Normally, (({loop})) isn't a bottle neck\n (('note:ふつうは(({loop}))がボトルネックにはならない'))
How to improve performance (4)n(('note:速度改善方法(4)'))¶ ↑
Lazyn (('note:遅延'))
(({CSV})) object is parser and writern(('note:(({CSV}))オブジェクトは読み書きできる'))¶ ↑
* 2.5: Always initializes everything\n (('note:2.5:常にすべてを初期化')) * 2.6: Initializes when it's needed\n (('note:2.6:必要になったら初期化'))
Write performance¶ ↑
# RT , 2.5, 2.6, Faster? (({generate_\nline})), 284.4i/s, 684.2i/s, 2.41x (({<<})), 2891.4i/s, 4824.1i/s, 1.67x
How to initialize lazilyn(('note:初期化を遅延する方法'))¶ ↑
# rouge ruby def parser @parser ||= Parser.new(...) end def writer @writer ||= Writer.new(...) end
Point (4)n(('note:ポイント(4)'))¶ ↑
* Do only needed things\n (('note:必要なことだけする')) * One class for one feature\n (('note:機能ごとにクラスを分ける'))
New features by new parsern(('note:新しいパーサーによる新機能'))¶ ↑
* Add support for \" escape\n (('note:\\"でのエスケープをサポート')) * Add (({strip:})) option\n (('note:(({strip:}))オプションを追加'))
" escapen(('note:\“でのエスケープ'))¶ ↑
# rouge ruby CSV.parse(%Q["a""bc","a\\"bc"], liberal_parsing: {backslash_quote: true}) # [["a\\"bc", "a\\"bc"]]
(({strip:}))n(('note:(({strip:}))'))¶ ↑
# rouge ruby CSV.parse(%Q[ abc , " abc"], strip: true) # [["abc", " abc"]] CSV.parse(%Q[abca,abc], strip: "a") # [["bc", "bc"]]
csv in Ruby 2.6 (4)n(('note:Ruby 2.6のcsv(4)'))¶ ↑
Keep backward compatibilityn (('note:互換性を維持'))
How to keep backward compat.n(('note:互換性を維持する方法'))¶ ↑
* Reconstruct test suites\n (('note:テストを整理')) * Add benchmark suites\n (('note:ベンチマークを追加'))
Testn(('note:テスト'))¶ ↑
* Important to detect incompat.\n (('note:非互換の検出のために重要')) * Must be easy to maintain\n (('note:メンテナンスしやすくしておくべき')) * To keep developing\n (('note:継続的な開発するため'))
Easy to maintenancen(('note:メンテナンスしやすい状態'))¶ ↑
* Easy to understand each test\n (('note:各テストを理解しやすい')) * Easy to run each test\n (('note:各テストを個別に実行しやすい')) * Focusing a failed case is easy to debug\n (('note:失敗したケースに集中できるとデバッグしやすい'))
Benchmarkn(('note:ベンチマーク'))¶ ↑
* Important to detect\n performance regression ((*bugs*))\n (('note:性能劣化((*バグ*))を検出するために重要'))
benchmark_driver gem¶ ↑
Fully-featured benchmark driver for Ruby 3x3n (('note:Ruby 3x3のために必要な機能が揃っているベンチマークツール'))
benchmark_driver gem in csv¶ ↑
* YAML input is easy to use\n (('note:YAMLによる入力が便利')) * Can compare multiple gem versions\n (('note:複数のgemのバージョンで比較可能')) * To detect performance regression\n (('note:性能劣化を検出するため'))
Benchmark for each gem versionn(('note:gemのバージョン毎のベンチマーク'))¶ ↑
# img # src = images/benchmark-driver.svg # relative_height = 100 # reflect_ratio = 0.1
Slide properties¶ ↑
: enable-title-on-image
false
csv/benchmark/¶ ↑
* convert_nil.yaml * parse{,_liberal_parsing}.yaml * parse_{quote_char_nil,strip}.yaml * read.yaml, shift.yaml, write.yaml
Benchmark as a starting pointn(('note:出発点としてのベンチマーク'))¶ ↑
* Join csv developing!\n (('note:csvの開発に参加しよう!')) * Adding a new benchmark is a good start\n (('note:ベンチマークの追加から始めるのはどう?')) * We'll focus on improving performance for benchmark cases\n (('note:ベンチマークが整備されているケースの性能改善に注力するよ'))
How to use improved csv?n(('note:改良されたcsvを使う方法'))¶ ↑
(({gem install csv}))
csv in Ruby 2.5n(('note:Ruby 2.5のcsv'))¶ ↑
Default gemifiedn (('note:デフォルトgem化'))
Default gemn(('note:デフォルトgem'))¶ ↑
* Can use it just by (({require}))\n (('note:(({require}))するだけで使える')) * Can use it without entry in (({Gemfile}))\n (('note:(But you use it bundled in your Ruby)'))\n (('note:(({Gemfile}))に書かなくても使えるけど古い')) * Can upgrade it by (({gem}))\n (('note:(({gem}))でアップグレードできる'))
How to use improved csv?n(('note:改良されたcsvを使う方法'))¶ ↑
(({gem install csv}))
Futuren(('note:今後の話'))¶ ↑
Fastern (('note:さらに速く'))
Improve (({String#split}))n(('note:(({String#split}))を改良'))¶ ↑
Accept (({“ ”}))n as normal separatorn (('note:(({“ ”}))をただの区切り文字として扱う'))
(({split(“ ”)})) works like awkn(('note:(({split(“ ”)}))はawkのように動く'))¶ ↑
# rouge ruby " a b c".split(" ", -1) # => ["a", "b", "c"] " a b c".split(/ /, -1) # => ["", "a", "", "b", "c"]
(({String#split})) in csvn(('note:csvでの(({String#split}))'))¶ ↑
# rouge ruby if @column_separator == " " line.split(/ /, -1) else line.split(@column_separator, -1) end
(({split(string)})) vs. (({split(regexp)}))¶ ↑
# RT (({regexp})), (({string})), Faster? 344448i/s, 3161117i/s, 9.18x
(('tag:center')) See also ((<“[Feauture:15771]”|URL:bugs.ruby-lang.org/issues/15771>))
Improve (({StringScanner#scan}))n(('note:(({StringScanner#scan}))を改良'))¶ ↑
Accept (({String}))n as patternn (('note:(({String}))もパターンとして使えるようにする'))
(({scan(string)})) vs. (({scan(regexp)}))¶ ↑
# RT (({regexp})), (({string})), Faster? 14712660i/s, 18421631i/s, 1.25x
(('tag:center')) See also ((<“ruby/strscan#4”|URL:github.com/ruby/strscan/pull/4>))
Faster KEN_ALL.CSV parsing (1)n(('note:より速いKEN_ALL.CSVのパース(1)'))¶ ↑
# RT , Elapsed csv, 0.791s FastestCSV, 0.141s
Faster KEN_ALL.CSV parsing (2)n(('note:より速いKEN_ALL.CSVのパース(2)'))¶ ↑
# RT , Encoding, Elapsed csv, CP932, 0.791s FastestCSV, CP932, 0.141s csv, UTF-8, 1.345s FastestCSV, UTF-8, 0.713s
Faster KEN_ALL.CSV parsing (3)n(('note:より速いKEN_ALL.CSVのパース(3)'))¶ ↑
# RT , Encoding, Elapsed FastestCSV, UTF-8, 0.713s Python, UTF-8, 0.208s Apache Arrow, UTF-8, 0.145s
Further workn(('note:今後の改善案'))¶ ↑
* Improve transcoding performance of Ruby\n (('note:Rubyのエンコーディング変換処理の高速化')) * Improve simple case parse performance\n by implementing parser in C\n (('note:シンプルなケース用のパーサーをCで実装して高速化')) * Improve perf. of REXML as well as csv\n (('note:csvのようにREXMLも高速化'))
Join us!n(('note:一緒に開発しようぜ!'))¶ ↑
* Red Data Tools:\n ((<URL:https://red-data-tools.github.io/>)) * RubyData Workshop: Today 14:20-15:30 * Code Party: Today 19:00-21:00 * After Hack: Sun. 10:30-17:30
Join us!!n(('note:一緒に開発しようぜ!!'))¶ ↑
* OSS Gate: ((<URL:https://oss-gate.github.io/>)) * provides a "gate" to join OSS development\n (('note:OSSの開発に参加する「入り口」を提供する取り組み')) * Both ClearCode and Speee are one of ((<sponsors|URL:https://oss-gate.github.io/sponsors/>))\n (('note:クリアコードもSpeeeもスポンサー')) * OSS Gate Fukuoka:\n ((<URL:https://oss-gate-fukuoka.connpass.com/>))