Red Arrow¶ ↑
: subtitle
((*Ruby*)) and ((*Apache Arrow*))
: author
Sutou Kouhei
: institution
ClearCode Inc.
: content-source
RubyKaigi Takeout 2021
: date
2021-09-11
: start-time
2021-09-11T13:30:00+09:00
: end-time
2021-09-11T13:55:00+09:00
: theme
.
Sutou Kouhei\nA president Rubyist¶ ↑
The president of ClearCode Inc.\n (('note:クリアコードの社長'))
# img # src = images/clear-code-rubykaigi-takeout-2021-gold-sponsor.png # relative_height = 100 # reflect_ratio = 0.1
Sutou Kouhei\nAn Apache Arrow contributor¶ ↑
* A member of PMC of Apache Arrow\n (('note:PMC: Project Management Committee'))\n (('note:Apache Arrowのプロジェクト管理委員会メンバー')) * #2 commits(('note:(コミット数2位)')) # img # src = images/apache-arrow-commits-kou.png # relative_height = 120 # reflect_ratio = 0.1
Sutou Kouhei\nThe pioneer in Ruby and Arrow¶ ↑
* The author of Red Arrow\n (('note:Red Arrowの作者')) * Red Arrow: * The official Apache Arrow library for Ruby\n (('note:公式のRuby用のApache Arrowライブラリー')) * GObject Introspection based bindings\n (('note:GObject Introspectionベースのバインディング')) * Apache Arrow GLib is developed for Red Arrow\n (('note:Red ArrowのためにApache Arrow GLibも開発'))
GObject Introspection?¶ ↑
(('tag:center')) (('tag:margin-bottom * -0.3')) A way to implement bindings\n (('note:バインディングの実装方法の1つ'))
# img # src = https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2016/how-to-create-bindings-2016.pdf # relative_height = 90
(('tag:center')) (('note:((<URL:rubykaigi.org/2016/presentations/ktou.html>))'))
Why do I work on Red Arrow?\n(('note:なぜRed Arrowの開発をしているか'))¶ ↑
* To use Ruby for data processing!\n (('note:データ処理でRubyを使いたい!')) * At least a part of data processing\n (('note:データ処理の全部と言わず一部だけでも')) * Results of my 5 years of work:\n (('note:私のここ5年の仕事の成果')) * We can use Ruby for some data processing!\n (('note:いくつかのデータ処理でRubyを使える!'))
Goal of this talk\n(('note:このトークのゴール'))¶ ↑
* You want to use Ruby\n for some data processing\n (('note:いくつかのデータ処理でRubyを使いたくなる')) * You join Red Data Tools project\n (('note:Red Data Toolsプロジェクトに参加する'))
Red Data Tools project?¶ ↑
# blockquote Red Data Tools is a project that provides data processing tools for Ruby
(('note:Red Data ToolsはRuby用のデータ処理ツールを提供するプロジェクト'))
(('note:((<URL:red-data-tools.github.io/>))'))
Data processing?¶ ↑
… how?
0. Why do you want to process data?\n(('note:0. データ処理の目的を明らかにする'))¶ ↑
* What problem do you want to resolve?\n (('note:どんな問題を解決したい?')) * What data is needed for it?\n (('note:そのためにはどんなデータが必要?')) * ...
No Red Arrow support in this area\n (('note:このあたりにはRed Arrowを使えない'))
1. Collect data\n(('note:1. データ収集'))¶ ↑
* Where is the data?\n (('note:データはどこにある?')) * Where is the collected data stored?\n (('note:集めたデータはどこに保存する?')) * ...
Some Red Arrow support in this area\n (('note:このあたりでは少しRed Arrowを使える'))
Common dataset\n(('note:よく使われるデータセット'))¶ ↑
# rouge ruby
require "datasets"
Datasets::Iris.new
Datasets::PostalCodeJapan.new
Datasets::Wikipedia.new
(('note:((<Red Datasets|URL:github.com/red-data-tools/red-datasets>))'))
Output: Local file\n(('note:出力先:ローカルファイル'))¶ ↑
# rouge ruby
require "datasets-arrow"
dataset = Datasets::PostalCodeJapan.new
dataset.to_arrow.save("codes.csv")
dataset.to_arrow.save("codes.arrow")
(('note:((<Red Datasets Arrow|URL:github.com/red-data-tools/red-datasets-arrow>))'))
(({#save}))¶ ↑
* General serialize API for table data\n (('note:テーブルデータ用の汎用シリアライズAPI')) * Serialize as the specified format\n (('note:指定したフォーマットにシリアライズ')) * If you use Red Arrow object for in-memory table data, you can serialize to many formats! Cool!\n (('note:メモリー上のテーブルデータをRed Arrowオブジェクトにするといろんなフォーマットにシリアライズできる!かっこいい!')) * Extensible!\n (('note:拡張可能!'))
(({#save})): Implementation¶ ↑
# rouge ruby
module Arrow
  class Table
    def save(output)
      saver = TableSaver.new(self, output)
      saver.save
    end
  end
end
(({#save})): Implementation¶ ↑
# rouge ruby
class Arrow::TableSaver
  def save
    format = detect_format(@output)
    __send__("save_as_#{format}")
  end

  def save_as_csv
  end
end
(({#save})): Extend by Red Parquet¶ ↑
# rouge ruby
module Parquet::ArrowTableSavable
  def save_as_parquet
  end
  Arrow::TableSaver.include(self)
end
(('note:Red Parquet is a subproject of Red Arrow'))\n (('note:Red ParquetはRed Arrowのサブプロジェクト'))
(({#save})): Extended¶ ↑
# rouge ruby
require "datasets-arrow"
require "parquet"
dataset = Datasets::PostalCodeJapan.new
dataset.to_arrow.save("codes.parquet")
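The dispatch-and-extend pattern behind (({#save})) can be tried without Red Arrow installed. The following is only a minimal sketch: MiniSaver and TsvSavable are hypothetical names made up for illustration, the format is assumed to come from the file extension, and the methods return strings instead of writing files.

```ruby
# Hypothetical stand-in for Arrow::TableSaver; not Red Arrow's real code.
class MiniSaver
  def initialize(rows, output)
    @rows = rows
    @output = output
  end

  # Dispatch to save_as_<format>, where <format> comes from the extension.
  def save
    format = detect_format(@output)
    method_name = "save_as_#{format}"
    unless respond_to?(method_name, true)
      raise ArgumentError, "unsupported format: #{format}"
    end
    __send__(method_name)
  end

  private

  # Assumption for this sketch: the extension names the format.
  def detect_format(output)
    File.extname(output).delete_prefix(".")
  end

  def save_as_csv
    @rows.map { |row| row.join(",") }.join("\n")
  end
end

# An "extension library" adds a format only by defining one more method.
module TsvSavable
  def save_as_tsv
    @rows.map { |row| row.join("\t") }.join("\n")
  end

  MiniSaver.include(self)
end

rows = [["100-0001", "Tokyo"], ["530-0001", "Osaka"]]
csv_text = MiniSaver.new(rows, "codes.csv").save
tsv_text = MiniSaver.new(rows, "codes.tsv").save
puts csv_text
puts tsv_text
```

The saver class never has to know which formats exist. Including a module with a matching method name is enough, which is the same shape as the Red Parquet extension.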
Output: Online storage: Fluentd\n(('note:出力先:オンラインストレージ:Fluentd'))¶ ↑
* fluent-plugin-s3-arrow: * Collect data by Fluentd\n (('note:Fluentdでデータ収集')) * Format data as Apache Parquet by ((*Red Arrow*))\n (('note:((*Red Arrow*))でApache Parquet形式にデータを変換')) * Store data to Amazon S3 by fluent-plugin-s3\n (('note:fluent-plugin-s3でAmazon S3にデータを保存')) * By @kanga33 at Speee/Red Data Tools\n (('note:Speee/Red Data Toolsの香川さんが開発'))
(('note:((<URL:github.com/red-data-tools/fluent-plugin-s3-arrow/>))'))
Output: Online storage: Red Arrow\n(('note:出力先:オンラインストレージ:Red Arrow'))¶ ↑
# rouge ruby
require "datasets-arrow"
require "arrow-dataset"
dataset = Datasets::PostalCodeJapan.new
url = URI("s3://mybucket/codes.parquet")
dataset.to_arrow.save(url)
(('Implementing…'))\n (('note:実装中。。。'))
(({#save})): Implementing…¶ ↑
# rouge ruby
class Arrow::TableSaver
  def save
    if @output.is_a?(URI)
      __send__("save_to_uri")
    else
      __send__("save_to_file")
    end
  end
end
Collect data w/ Red Arrow: Wrap up\n(('note:Red Arrowでデータ収集:まとめ'))¶ ↑
* Usable as serializer for common formats\n (('note:よくあるフォーマットにシリアライズするツールとして使える')) * Usable as writer to common locations\n (('note:in the near future...'))\n (('note:近いうちによくある出力先に書き出すツールとして使える'))
2. Read data\n(('note:2. データ読み込み'))¶ ↑
* What format is used?\n (('note:どんなフォーマットで保存されている?')) * Where is the collected data?\n (('note:収集したデータはどこ?')) * How large is the collected data?\n (('note:データはどれくらい大きい?'))
Format\n(('note:フォーマット'))¶ ↑
# rouge ruby
require "arrow"
table = Arrow::Table.load("data.csv")
table = Arrow::Table.load("data.json")
table = Arrow::Table.load("data.arrow")
table = Arrow::Table.load("data.orc")
(({.load}))¶ ↑
* General deserialize API for table data\n (('note:テーブルデータ用の汎用デシリアライズAPI')) * Deserialize common formats\n (('note:よく使われているフォーマットからデシリアライズ')) * Extensible!\n (('note:拡張可能!'))
(({.load})): Implementation¶ ↑
# rouge ruby
module Arrow
  def Table.load(input)
    loader = TableLoader.new(self, input)
    loader.load
  end
end
(({.load})): Implementation¶ ↑
# rouge ruby
class Arrow::TableLoader
  def load
    format = detect_format(@input)
    __send__("load_as_#{format}")
  end

  def load_as_csv
  end
end
(({.load})): Extend by Red Parquet¶ ↑
# rouge ruby
module Parquet::ArrowTableLoadable
  def load_as_parquet
  end
  Arrow::TableLoader.include(self)
end
(('note:Red Parquet is a subproject of Red Arrow'))\n (('note:Red ParquetはRed Arrowのサブプロジェクト'))
(({.load})): Extended¶ ↑
# rouge ruby
require "parquet"
table = Arrow::Table.load("data.parquet")
(({.load})): More extensible¶ ↑
# rouge ruby
class Arrow::TableLoader
  def load
    if @input.is_a?(URI)
      __send__("load_from_uri")
    else
      __send__("load_from_file")
    end
  end
end
(({.load})): Extend by Red Arrow Dataset¶ ↑
# rouge ruby
module ArrowDataset::ArrowTableLoadable
  def load_from_uri
  end
  Arrow::TableLoader.include(self)
end
(('note:Red Arrow Dataset is a subproject of Red Arrow'))\n (('note:Red Arrow DatasetはRed Arrowのサブプロジェクト'))
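The URI-versus-path branching can also be exercised with only the standard library. This is a rough sketch under assumptions: MiniLoader and UriLoadable are hypothetical names, and the methods only report which path was taken instead of reading any data.

```ruby
require "uri"

# Hypothetical stand-in for Arrow::TableLoader; not Red Arrow's real code.
class MiniLoader
  def initialize(input)
    @input = input
  end

  # URI inputs and local paths take different code paths.
  def load
    if @input.is_a?(URI)
      load_from_uri
    else
      load_from_file
    end
  end

  private

  # Only reports the detected extension; no file is opened.
  def load_from_file
    [:file, File.extname(@input).delete_prefix(".")]
  end
end

# Hypothetical extension, playing the role of Red Arrow Dataset.
module UriLoadable
  # Only reports the scheme and extension; no network access happens.
  def load_from_uri
    [:uri, @input.scheme, File.extname(@input.path).delete_prefix(".")]
  end

  MiniLoader.include(self)
end

file_result = MiniLoader.new("data.csv").load
uri_result = MiniLoader.new(URI("s3://bucket/data.parquet")).load
p file_result
p uri_result
```

Objects built by URI(...) include the URI module, which is why the is_a?(URI) check separates them from plain String paths.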
Location: Online storage\n(('note:場所:オンラインストレージ'))¶ ↑
# rouge ruby
require "arrow-dataset"
url = URI("s3://bucket/path...")
table = Arrow::Table.load(url)
Location: RDBMS\n(('note:場所:RDBMS'))¶ ↑
# rouge ruby
require "arrow-activerecord"
User.all.to_arrow
(('note:((<Red Arrow Active Record|URL:github.com/red-data-tools/red-arrow-activerecord>))'))
Location: Network\n(('note:場所:ネットワーク'))¶ ↑
# rouge ruby
require "arrow-flight"
client = ArrowFlight::Client.new(url)
info = client.list_flights[0]
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all
(('note:((<Introducing Apache Arrow Flight: A Framework for Fast Data Transport|URL:arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/>))'))
Large data\n(('note:大規模データ'))¶ ↑
* Apache Arrow format * Designed for large data\n (('note:大規模データ用に設計されている')) * For large data\n (('note:大規模データ用に必要なもの')) * Fast load\n (('note:高速にロードできること')) * ...
Fast load: Benchmark\n(('note:高速ロード:ベンチマーク'))¶ ↑
# rouge ruby
require "csv"
require "datasets-arrow"
dataset = Datasets::PostalCodeJapan.new
table = dataset.to_arrow # 124271 records
n = 5
n.times do |i|
  table.save("codes.#{i}.csv")
  table.save("codes.#{i}.arrow")
  CSV.read("codes.#{i}.csv")
  Arrow::Table.load("codes.#{i}.csv")
  Arrow::Table.load("codes.#{i}.arrow")
  table = table.concatenate([table])
end
Fast load: Benchmark: All\n(('note:高速ロード:ベンチマーク:すべて'))¶ ↑
# charty
# backend = pyplot
# type = line
# x = N (times)
# y = Elapsed time (sec)
# color = Approach
# markers = true
# relative_height = 100
Approach,N (times),Elapsed time (sec)
Apache Arrow,1,0.000437
Apache Arrow,2,0.000421
Apache Arrow,3,0.000472
Apache Arrow,4,0.000573
Apache Arrow,5,0.000899
CSV: Red Arrow,1,0.012443
CSV: Red Arrow,2,0.021403
CSV: Red Arrow,3,0.040435
CSV: Red Arrow,4,0.074629
CSV: Red Arrow,5,0.138448
CSV: Ruby,1,0.828678
CSV: Ruby,2,1.840314
CSV: Ruby,3,3.797536
CSV: Ruby,4,8.205680
CSV: Ruby,5,19.850910
Slide properties¶ ↑
: enable-title-on-image
false
Fast load: Benchmark: Red Arrow\n(('note:高速ロード:ベンチマーク:Red Arrow'))¶ ↑
# charty
# backend = pyplot
# type = line
# x = N (times)
# y = Elapsed time (sec)
# color = Approach
# markers = true
# relative_height = 100
Approach,N (times),Elapsed time (sec)
Apache Arrow,1,0.000437
Apache Arrow,2,0.000421
Apache Arrow,3,0.000472
Apache Arrow,4,0.000573
Apache Arrow,5,0.000899
CSV: Red Arrow,1,0.012443
CSV: Red Arrow,2,0.021403
CSV: Red Arrow,3,0.040435
CSV: Red Arrow,4,0.074629
CSV: Red Arrow,5,0.138448
Slide properties¶ ↑
: enable-title-on-image
false
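To put the chart in numbers, the N = 5 measurements can be turned into ratios against the Apache Arrow format load. The figures are copied from the benchmark data. Only the arithmetic is new, and the variable names are illustrative.

```ruby
# Elapsed seconds at N = 5, copied from the benchmark data.
elapsed = {
  "Apache Arrow" => 0.000899,
  "CSV: Red Arrow" => 0.138448,
  "CSV: Ruby" => 19.850910,
}

# How many times slower each approach is than loading the Arrow format.
baseline = elapsed["Apache Arrow"]
ratios = elapsed.transform_values { |seconds| seconds / baseline }

ratios.each do |approach, ratio|
  puts format("%-15s %10.1fx", approach, ratio)
end
```

At this size, loading CSV with Red Arrow takes roughly 150 times as long as loading the Arrow format, and loading CSV with Ruby's CSV library takes roughly 22,000 times as long.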
How to implement fast load\n(('note:高速ロードの実装方法'))¶ ↑
# img # src = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/why-apache-arrow-format-is-fast.pdf # relative_height = 80
(('tag:center')) (('note:((<URL:slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/>))'))
Read data with Red Arrow: Wrap up\n(('note:Red Arrowでデータ読み込み:まとめ'))¶ ↑
* Easy to read common formats\n (('note:よくあるフォーマットのデータを簡単に読める')) * Easy to read from common locations\n (('note:よくある場所にあるデータを簡単に読める')) * Large data ready\n (('note:大規模データも扱える'))
3. Explore data\n(('note:3. データ探索'))¶ ↑
* Preprocess data(('note:(データを前処理)')) * Filter out needless data(('note:(不要なデータを除去)')) * ... * Summarize data and visualize it\n (('note:(データを要約して可視化)')) * ...
Red Arrow can be used for some operations\n (('note:いくつかの操作でRed Arrowを使える'))
Filter: Red Arrow\n(('note:絞り込み:Red Arrow'))¶ ↑
# rouge ruby
table = Datasets::PostalCodeJapan.new.to_arrow
table.n_rows # 124271
filtered_table = table.slice do |slicer|
  slicer.prefecture == "東京都" # Tokyo
end
filtered_table.n_rows # 3887
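One way to picture the slicer block is that it runs once against whole columns and produces a boolean mask, instead of running once per row. The sketch below illustrates that idea in plain Ruby. MiniColumn, MiniSlicer and MiniTable are hypothetical classes, not Red Arrow's implementation, and the sample data is made up.

```ruby
# Hypothetical miniature of a column-oriented table; not Red Arrow's API.
class MiniColumn
  def initialize(values)
    @values = values
  end

  # Comparing a column yields one boolean per row (a mask),
  # not a single true/false.
  def ==(other)
    @values.map { |value| value == other }
  end
end

class MiniSlicer
  def initialize(columns)
    @columns = columns
  end

  # slicer.prefecture resolves to the column of that name.
  def method_missing(name, *args)
    return super unless @columns.key?(name) && args.empty?
    MiniColumn.new(@columns[name])
  end

  def respond_to_missing?(name, include_private = false)
    @columns.key?(name) || super
  end
end

class MiniTable
  def initialize(columns)
    @columns = columns
  end

  def n_rows
    @columns.values.first.size
  end

  # The block runs once and returns a mask; rows are then kept by mask.
  def slice
    mask = yield(MiniSlicer.new(@columns))
    kept = @columns.transform_values do |values|
      values.each_with_index.select { |_, i| mask[i] }.map(&:first)
    end
    MiniTable.new(kept)
  end

  def to_h
    @columns
  end
end

# Made-up sample data for illustration.
columns = {
  prefecture: ["東京都", "大阪府", "東京都"],
  code: ["1000001", "5300001", "1000002"],
}
table = MiniTable.new(columns)
filtered = table.slice { |slicer| slicer.prefecture == "東京都" }
puts filtered.n_rows
p filtered.to_h[:code]
```

Because the comparison works on a whole column at a time, a real columnar library can run it in native code, so the filter does not execute a Ruby block for every row.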
Filter: Performance\n(('note:絞り込み:性能'))¶ ↑
# rouge ruby
dataset = Datasets::PostalCodeJapan.new
arrow_dataset = dataset.to_arrow
dataset.find_all do |row|
  row.prefecture == "東京都" # Tokyo
end # 1.256s
arrow_dataset.slice do |slicer|
  slicer.prefecture == "東京都" # Tokyo
end # 0.001s
Filter: Performance\n(('note:絞り込み:性能'))¶ ↑
# charty
# backend = pyplot
# type = bar
# x = Elapsed time (sec)
# y = Implementation
# relative_height = 100
Implementation,Elapsed time (sec)
Ruby,1.2567864
Arrow,0.001395
Slide properties¶ ↑
: enable-title-on-image
false
Apache Arrow data: Interchangeable\n(('note:Apache Arrow data:交換可能'))¶ ↑
* With low cost thanks to fast load\n (('note:高速ロードできるので低コスト')) * Apache Arrow data ready systems are increasing\n (('note:Apache Arrowデータを扱えるシステムは増加中')) * e.g. DuckDB: in-process SQL OLAP DBMS\n (('note:(SQLite like DBMS for OLAP)'))\n (('note:OLAP: OnLine Analytical Processing'))\n (('note:例:DuckDB:同一プロセス内で動くデータ分析用SQL DB管理システム'))
Filter: DuckDB\n(('note:絞り込み:DuckDB'))¶ ↑
# rouge ruby
require "arrow-duckdb"
codes = Datasets::PostalCodeJapan.new.to_arrow
db = DuckDB::Database.open
c = db.connect
c.register("codes", codes) do # Use codes without copy
  c.query("SELECT * FROM codes WHERE prefecture = ?",
          "東京都", # Tokyo
          output: :arrow) # Output as Apache Arrow data
    .to_table.n_rows # 3887
end
Summarize: Group + aggregation\n(('note:要約:グループ化して集計'))¶ ↑
# rouge ruby
iris = Datasets::Iris.new.to_arrow
iris.group(:label).count(:sepal_length)
#   count(sepal_length) label
# 0                  50 Iris-setosa
# 1                  50 Iris-versicolor
# 2                  50 Iris-virginica
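For readers new to group-and-count, the same summary can be written with plain Ruby collections. This is only a sketch of what the operation computes. The rows here are a tiny hypothetical sample, not the real 150-record Iris dataset.

```ruby
# Hypothetical sample rows; the real Iris dataset has 150 records.
rows = [
  { label: "Iris-setosa", sepal_length: 5.1 },
  { label: "Iris-setosa", sepal_length: 4.9 },
  { label: "Iris-versicolor", sepal_length: 7.0 },
  { label: "Iris-virginica", sepal_length: 6.3 },
]

# Group rows by label, then count the rows in each group.
counts = rows
  .group_by { |row| row[:label] }
  .transform_values(&:size)

counts.each { |label, count| puts "#{count}\t#{label}" }
```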
Visualize: Charty\n(('note:可視化:Charty'))¶ ↑
# rouge ruby
require "charty"
Charty.backends.use("pyplot")
Charty.scatter_plot(data: iris,
                    x: :sepal_length,
                    y: :sepal_width,
                    color: :label)
  .save("iris.png")
Visualize: Charty: Result\n(('note:可視化:Charty:結果'))¶ ↑
# img # src = images/iris.png # relative_height = 100
Slide properties¶ ↑
: enable-title-on-image
false
4. Use insight\n(('note:4. 知見を活用'))¶ ↑
* Write report\n(('note:(レポートにまとめたり)')) * Build a model\n(('note:(モデルを作ったり)')) * ...
No Red Arrow support in this area for now\n (('note:Can be used for passing data to other tools like DuckDB and Charty'))\n (('note:今のところこのあたりにはRed Arrowを使えない'))\n (('note:DuckDBやChartyにデータを渡すように他のツールにデータを渡すためには使える'))
Data processing and Red Arrow\n(('note:Red Arrowでデータ処理'))¶ ↑
* Red Arrow helps us in some areas\n (('note:いくつかの領域ではRed Arrowを使える')) * Collect, read and explore data\n (('note:データを収集して読み込んで探索するとか')) * Some tools can integrate with Red Arrow\n (('note:いくつかのツールはRed Arrowと連携できる')) * Fluentd, DuckDB, Charty, ...
Red Arrow and Ruby 3.0¶ ↑
* MemoryView support * Ractor support
MemoryView¶ ↑
# blockquote MemoryView provides the features to share multidimensional homogeneous arrays of fixed-size element on memory among extension libraries.
(('note:MemoryViewは多次元数値配列(数値はすべて同じ型)を共有する機能を提供します。'))
(('note:((<URL:tech.speee.jp/entry/2020/12/24/093131>)) (Japanese)'))
Numeric arrays in Red Arrow\n(('note:Red Arrow内の数値配列'))¶ ↑
* (({Arrow::NumericArray})) family * 1-dimensional numeric array\n (('note:1次元数値配列')) * (({Arrow::Tensor})) * Multidimensional homogeneous numeric arrays\n (('note:多次元数値配列'))
MemoryView: Red Arrow¶ ↑
* (({Arrow::NumericArray})) family * Export as MemoryView: Support\n (('note:MemoryViewとしてエクスポート:対応済み')) * Import from MemoryView: Not yet\n (('note:MemoryViewをインポート:未対応')) * (({Arrow::Tensor})) * Export/Import: Not yet\n (('note:エクスポート・インポート:未対応'))
(('note:Join Red Data Tools to work on this!'))\n (('note:対応を進めたい人はRed Data Toolsに来てね!'))
MemoryView: C++¶ ↑
* Some problems were found through this work\n (('note:Red Arrowの対応作業でいくつかの問題が見つかった')) * Can't use (({private})) as member name\n (('note:メンバー名に(({private}))を使えない')) * Can't assign to (({const})) variable with cast\n (('note:キャストしても(({const}))変数に代入できない')) * Ruby 3.1 will fix them\n (('note:Ruby 3.1では直っているはず'))
Ractor¶ ↑
# blockquote Ractor is designed to provide a parallel execution feature of Ruby without thread-safety concerns.
(('note:Ractorはスレッドセーフかどうかを気にせずに並列実行するための機能です。'))
(('note:((<URL:techlife.cookpad.com/entry/2020/12/26/131858>)) (Japanese)'))
Red Arrow and concurrency\n(('note:Red Arrowと並列性'))¶ ↑
* Red Arrow data are immutable\n (('note:Red Arrowデータは変更不可')) * Ractor can share frozen objects\n (('note:Ractorはfrozenなオブジェクトを共有可能'))
Ractor: Red Arrow¶ ↑
# rouge ruby
require "datasets-arrow"
table = Datasets::PostalCodeJapan.new.to_arrow
Ractor.make_shareable(table)
Ractor.new(table) do |t|
  t.slice do |slicer|
    slicer.prefecture == "東京都" # Tokyo
  end
end
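The sharing rule can be checked without Red Arrow by using a frozen Array in place of the table. This is a minimal sketch assuming Ruby 3.0 or later. The data is made up, and because newer Ruby versions provide Ractor#value in place of Ractor#take, the code picks whichever one exists.

```ruby
# Hypothetical data standing in for an immutable Red Arrow table.
prefectures = Ractor.make_shareable(["東京都", "大阪府", "東京都", "北海道"])

workers = 2.times.map do
  # Shareable (deeply frozen) objects are passed by reference, not copied.
  Ractor.new(prefectures) do |values|
    values.count { |value| value == "東京都" }
  end
end

# Use Ractor#value when this Ruby provides it, otherwise Ractor#take.
results = workers.map do |worker|
  worker.respond_to?(:value) ? worker.value : worker.take
end
p results
```

The block given to Ractor.new cannot refer to outer local variables, which is why the data is handed over as an argument.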
Ractor: Red Arrow: Benchmark¶ ↑
# rouge ruby
n_ractors = 4
n_jobs_per_ractor = 1000
n_jobs = n_ractors * n_jobs_per_ractor
n_jobs.times do
  table.slice {|s| s.prefecture == "東京都"}
end
n_ractors.times.collect do
  Ractor.new(table, n_jobs_per_ractor) do |t, n|
    n.times {t.slice {|s| s.prefecture == "東京都"}}
  end
end.each(&:take)
Ractor: Red Arrow: Benchmark¶ ↑
# charty
# backend = pyplot
# type = bar
# x = Elapsed time (sec)
# y = Approach
# relative_height = 100
Approach,Elapsed time (sec)
Sequential,4.573742
Ractor,1.454987
Slide properties¶ ↑
: enable-title-on-image
false
Wrap up\n(('note:まとめ'))¶ ↑
* Ruby can be used\n in some data processing work\n (('note:いくつかのデータ処理作業にRubyを使える')) * Red Arrow helps you!\n (('note:Red Arrowが有用なケースがあるはず!')) * Ruby 3.0 has useful features for data processing work\n (('note:Ruby 3.0にはデータ処理作業に有用な機能があるよ')) * Red Arrow has started supporting them\n (('note:Red Arrowはそれらのサポートを進めている'))
Goal of this talk\n(('note:このトークのゴール'))¶ ↑
* You want to use Ruby\n for some data processing\n (('note:いくつかのデータ処理でRubyを使いたくなる')) * You join Red Data Tools project\n (('note:あなたがRed Data Toolsプロジェクトに参加する'))
Future work\n(('note:今後の仕事'))¶ ↑
* Implement DataFusion bindings by adding C API to DataFusion\n (('note:DataFusionにC APIを追加してバインディングを実装')) * DataFusion: Apache Arrow native query execution framework written in Rust\n (('note:((<URL:https://github.com/apache/arrow-datafusion/>))'))\n (('note:DataFusion:Rust実装のApache Arrowベースのクエリー実行フレームワーク')) * Add Active Record like API to Red Arrow\n (('note:Red ArrowにActive Record風のAPIを追加')) * Improve MemoryView/Ractor support\n (('note:MemoryView/Ractorサポートを進める'))
Red Data Tools¶ ↑
(('tag:center')) (('tag:x-large')) Join us!
(('note:((<URL:gitter.im/red-data-tools/en>))'))
(('note:((<URL:gitter.im/red-data-tools/ja>))'))
OSS Gate on-boarding\n(('note:OSS Gateオンボーディング'))¶ ↑
* Supports accepting newcomers by OSS projects such as Ruby & Red Arrow\n (('note:RubyやRed ArrowといったOSSプロジェクトが新人を受け入れることを支援')) * Contact me!(('note:興味がある人は私に教えて!')) * (('tag:x-small'))OSS project members who want to accept newcomers\n (('note:新人を受け入れたいOSSプロジェクトのメンバー')) * (('tag:x-small'))Companies which want to support OSS Gate on-boarding\n (('note:OSS Gateオンボーディングを支援したい会社'))
(('note:((<URL:oss-gate.github.io/on-boarding/>))'))
ClearCode Inc.¶ ↑
* Recruitment: Developer to work on Red Arrow related business\n (('note:採用情報:Red Arrow関連のビジネスをする開発者')) * (('note:((<URL:https://www.clear-code.com/recruitment/>))')) * Business: Apache Arrow/Red Arrow related technical support/consulting:\n (('note:仕事:Apache Arrow/Red Arrow関連の技術サポート・コンサルティング')) * (('note:((<URL:https://www.clear-code.com/contact/>))'))