Red Arrow

: subtitle

((*Ruby*)) and ((*Apache Arrow*))

: author

Sutou Kouhei

: institution

ClearCode Inc.

: content-source

RubyKaigi Takeout 2021

: date

2021-09-11

: start-time

2021-09-11T13:30:00+09:00

: end-time

2021-09-11T13:55:00+09:00

: theme

.

Sutou Kouhei\nA president Rubyist

The president of ClearCode Inc.\n
(('note:クリアコードの社長'))

# img
# src = images/clear-code-rubykaigi-takeout-2021-gold-sponsor.png
# relative_height = 100
# reflect_ratio = 0.1

Sutou Kouhei\nAn Apache Arrow contributor

* A member of PMC of Apache Arrow\n
  (('note:PMC: Project Management Committee'))\n
  (('note:Apache Arrowのプロジェクト管理委員会メンバー'))
* #2 commits(('note:(コミット数2位)'))

# img
# src = images/apache-arrow-commits-kou.png
# relative_height = 120
# reflect_ratio = 0.1

Sutou Kouhei\nThe pioneer in Ruby and Arrow

* The author of Red Arrow\n
  (('note:Red Arrowの作者'))
* Red Arrow:
  * The official Apache Arrow library for Ruby\n
    (('note:公式のRuby用のApache Arrowライブラリー'))
  * GObject Introspection based bindings\n
    (('note:GObject Introspectionベースのバインディング'))
  * Apache Arrow GLib is developed for Red Arrow\n
    (('note:Red ArrowのためにApache Arrow GLibも開発'))

GObject Introspection?

(('tag:center')) (('tag:margin-bottom * -0.3')) A way to implement bindings\n
(('note:バインディングの実装方法の1つ'))

# img
# src = https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2016/how-to-create-bindings-2016.pdf
# relative_height = 90

(('tag:center')) (('note:((<URL:rubykaigi.org/2016/presentations/ktou.html>))'))

Why do I work on Red Arrow?\n(('note:なぜRed Arrowの開発をしているか'))

* To use Ruby for data processing!\n
  (('note:データ処理でRubyを使いたい!'))
  * At least a part of data processing\n
    (('note:データ処理の全部と言わず一部だけでも'))
* Results of my 5 years of work:\n
  (('note:私のここ5年の仕事の成果'))
  * We can use Ruby for some data processing!\n
    (('note:いくつかのデータ処理でRubyを使える!'))

Goal of this talk\n(('note:このトークのゴール'))

* You want to use Ruby\n
  for some data processing\n
  (('note:いくつかのデータ処理でRubyを使いたくなる'))
* You join Red Data Tools project\n
  (('note:Red Data Toolsプロジェクトに参加する'))

Red Data Tools project?

# blockquote

Red Data Tools is a project that provides data processing tools for Ruby

(('note:Red Data ToolsはRuby用のデータ処理ツールを提供するプロジェクト'))

(('note:((<URL:red-data-tools.github.io/>))'))

Data processing?

… how?

0. Why do you want to process data?\n(('note:0. データ処理の目的を明らかにする'))

* What problem do you want to resolve?\n
  (('note:どんな問題を解決したい?'))
* What data is needed for it?\n
  (('note:そのためにはどんなデータが必要?'))
* ...

No Red Arrow support in this area\n
(('note:このあたりにはRed Arrowを使えない'))

1. Collect data\n(('note:1. データ収集'))

* Where is the data?\n
  (('note:データはどこにある?'))
* Where is the collected data stored?\n
  (('note:集めたデータはどこに保存する?'))
* ...

Some Red Arrow support in this area\n
(('note:このあたりでは少しRed Arrowを使える'))

Common datasets\n(('note:よく使われるデータセット'))

# rouge ruby

require "datasets"
Datasets::Iris.new
Datasets::PostalCodeJapan.new
Datasets::Wikipedia.new

(('note:((<Red Datasets|URL:github.com/red-data-tools/red-datasets>))'))

Output: Local file\n(('note:出力先:ローカルファイル'))

# rouge ruby

require "datasets-arrow"
dataset = Datasets::PostalCodeJapan.new
dataset.to_arrow.save("codes.csv")
dataset.to_arrow.save("codes.arrow")

(('note:((<Red Datasets Arrow|URL:github.com/red-data-tools/red-datasets-arrow>))'))

(({#save}))

* General serialize API for table data\n
  (('note:テーブルデータ用の汎用シリアライズAPI'))
  * Serialize as the specified format\n
    (('note:指定したフォーマットにシリアライズ'))
  * If you use Red Arrow object for in-memory table data, you can serialize to many formats! Cool!\n
    (('note:メモリー上のテーブルデータをRed Arrowオブジェクトにするといろんなフォーマットにシリアライズできる!かっこいい!'))
* Extensible!\n
  (('note:拡張可能!'))

(({#save})): Implementation

# rouge ruby

module Arrow
  class Table
    def save(output)
      saver = TableSaver.new(self, output)
      saver.save
    end
  end
end

(({#save})): Implementation

# rouge ruby

class Arrow::TableSaver
  def save
    format = detect_format(@output)
    __send__("save_as_#{format}")
  end
  def save_as_csv
  end
end

(({#save})): Extend by Red Parquet

# rouge ruby

module Parquet::ArrowTableSavable
  def save_as_parquet
  end
  Arrow::TableSaver.include(self)
end

(('note:Red Parquet is a subproject of Red Arrow'))\n
(('note:Red ParquetはRed Arrowのサブプロジェクト'))
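
The extension mechanism on the preceding slides can be sketched in plain Ruby. This is only an illustration: `MiniSaver`, `MiniParquetSavable`, and the extension-based `detect_format` are hypothetical stand-ins, not the Red Arrow implementation, and the `save_as_*` methods return a string instead of writing a file.

```ruby
# Hypothetical stand-in for Arrow::TableSaver: dispatches on the file extension.
class MiniSaver
  def initialize(output)
    @output = output
  end

  def save
    format = detect_format(@output)
    __send__("save_as_#{format}")
  end

  private

  # Assumption: the format is simply the file extension without the dot.
  def detect_format(output)
    File.extname(output).delete_prefix(".")
  end

  def save_as_csv
    "csv:#{@output}"
  end
end

# A separate library adds a format just by including a module.
module MiniParquetSavable
  def save_as_parquet
    "parquet:#{@output}"
  end
  MiniSaver.include(self)
end

puts MiniSaver.new("codes.csv").save
puts MiniSaver.new("codes.parquet").save
```

Because `save` builds the method name from the detected format, a new format needs no change to `MiniSaver` itself.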

(({#save})): Extended

# rouge ruby

require "datasets-arrow"
require "parquet"
dataset = Datasets::PostalCodeJapan.new
dataset.to_arrow.save("codes.parquet")

Output: Online storage: Fluentd\n(('note:出力先:オンラインストレージ:Fluentd'))

* fluent-plugin-s3-arrow:
  * Collect data by Fluentd\n
    (('note:Fluentdでデータ収集'))
  * Format data as Apache Parquet by ((*Red Arrow*))\n
    (('note:((*Red Arrow*))でApache Parquet形式にデータを変換'))
  * Store data to Amazon S3 by fluent-plugin-s3\n
    (('note:fluent-plugin-s3でAmazon S3にデータを保存'))
  * By @kanga33 at Speee/Red Data Tools\n
    (('note:Speee/Red Data Toolsの香川さんが開発'))

(('note:((<URL:github.com/red-data-tools/fluent-plugin-s3-arrow/>))'))

Output: Online storage: Red Arrow\n(('note:出力先:オンラインストレージ:Red Arrow'))

# rouge ruby

require "datasets-arrow"
require "arrow-dataset"
dataset = Datasets::PostalCodeJapan.new
url = URI("s3://mybucket/codes.parquet")
dataset.to_arrow.save(url)

(('Implementing…'))\n
(('note:実装中。。。'))

(({#save})): Implementing…

# rouge ruby

class Arrow::TableSaver
  def save
    if @output.is_a?(URI)
      __send__("save_to_uri")
    else
      __send__("save_to_file")
    end
  end
end

Collect data w/ Red Arrow: Wrap up\n(('note:Red Arrowでデータ収集:まとめ'))

* Usable as serializer for common formats\n
  (('note:よくあるフォーマットにシリアライズするツールとして使える'))
* Usable as writer to common locations\n
  (('note:in the near future...'))\n
  (('note:近いうちによくある出力先に書き出すツールとして使える'))

2. Read data\n(('note:2. データ読み込み'))

* What format is used?\n
  (('note:どんなフォーマットで保存されている?'))
* Where is the collected data?\n
  (('note:収集したデータはどこ?'))
* How large is the collected data?\n
  (('note:データはどれくらい大きい?'))

Format\n(('note:フォーマット'))

# rouge ruby

require "arrow"
table = Arrow::Table.load("data.csv")
table = Arrow::Table.load("data.json")
table = Arrow::Table.load("data.arrow")
table = Arrow::Table.load("data.orc")

(({.load}))

* General deserialize API for table data\n
  (('note:テーブルデータ用の汎用デシリアライズAPI'))
  * Deserialize common formats\n
    (('note:よく使われているフォーマットからデシリアライズ'))
* Extensible!\n
  (('note:拡張可能!'))

(({.load})): Implementation

# rouge ruby

module Arrow
  def Table.load(input)
    loader = TableLoader.new(self, input)
    loader.load
  end
end

(({.load})): Implementation

# rouge ruby

class Arrow::TableLoader
  def load
    format = detect_format(@input)
    __send__("load_as_#{format}")
  end
  def load_as_csv
  end
end

(({.load})): Extend by Red Parquet

# rouge ruby

module Parquet::ArrowTableLoadable
  def load_as_parquet
  end
  Arrow::TableLoader.include(self)
end

(('note:Red Parquet is a subproject of Red Arrow'))\n
(('note:Red ParquetはRed Arrowのサブプロジェクト'))

(({.load})): Extended

# rouge ruby

require "parquet"
table = Arrow::Table.load("data.parquet")

(({.load})): More extensible

# rouge ruby

class Arrow::TableLoader
  def load
    if @input.is_a?(URI)
      __send__("load_from_uri")
    else
      __send__("load_from_file")
    end
  end
end

(({.load})): Extend by Red Arrow Dataset

# rouge ruby

module ArrowDataset::ArrowTableLoadable
  def load_from_uri
  end
  Arrow::TableLoader.include(self)
end

(('note:Red Arrow Dataset is a subproject of Red Arrow'))\n
(('note:Red Arrow DatasetはRed Arrowのサブプロジェクト'))
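
The URI-or-file dispatch can be sketched the same way in plain Ruby. `MiniLoader` and `MiniDatasetLoadable` are hypothetical names, and the loaders only return a description string; the real Red Arrow Dataset loader is not shown in these slides.

```ruby
require "uri"

# Hypothetical stand-in for Arrow::TableLoader.
class MiniLoader
  def initialize(input)
    @input = input
  end

  # URI objects go to the pluggable loader, everything else is a local path.
  def load
    if @input.is_a?(URI)
      load_from_uri
    else
      load_from_file
    end
  end

  private

  def load_from_file
    "file:#{@input}"
  end
end

# A separate library supplies load_from_uri by including a module.
module MiniDatasetLoadable
  def load_from_uri
    "#{@input.scheme}:#{@input.host}#{@input.path}"
  end
  MiniLoader.include(self)
end

puts MiniLoader.new("data.csv").load
puts MiniLoader.new(URI("s3://bucket/path/data.parquet")).load
```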

Location: Online storage\n(('note:場所:オンラインストレージ'))

# rouge ruby

require "arrow-dataset"
url = URI("s3://bucket/path...")
table = Arrow::Table.load(url)

Location: RDBMS\n(('note:場所:RDBMS'))

# rouge ruby

require "arrow-activerecord"
User.all.to_arrow

(('note:((<Red Arrow Active Record|URL:github.com/red-data-tools/red-arrow-activerecord>))'))

Location: Network\n(('note:場所:ネットワーク'))

# rouge ruby

require "arrow-flight"
client = ArrowFlight::Client.new(url)
info = client.list_flights[0]
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all

(('note:((<Introducing Apache Arrow Flight: A Framework for Fast Data Transport|URL:arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/>))'))

Large data\n(('note:大規模データ'))

* Apache Arrow format
  * Designed for large data\n
    (('note:大規模データ用に設計されている'))
* For large data\n
  (('note:大規模データ用に必要なもの'))
  * Fast load\n
    (('note:高速にロードできること'))
  * ...

Fast load: Benchmark\n(('note:高速ロード:ベンチマーク'))

# rouge ruby

require "csv"
require "datasets-arrow"
dataset = Datasets::PostalCodeJapan.new
table = dataset.to_arrow # 124271 records
n = 5
n.times do |i|
  table.save("codes.#{i}.csv")
  table.save("codes.#{i}.arrow")
  CSV.read("codes.#{i}.csv")
  Arrow::Table.load("codes.#{i}.csv")
  Arrow::Table.load("codes.#{i}.arrow")
  table = table.concatenate([table])
end
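
The benchmark code above shows only the operations being compared, not how they were timed. One way to time each step, as a minimal sketch using only core Ruby: `measure` is a hypothetical helper, and the in-memory rows are a stand-in for the postal code table.

```ruby
# Hypothetical helper: returns the wall-clock seconds spent in the block.
def measure
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Stand-in data; the real benchmark uses the postal code table.
rows = Array.new(1_000) { |i| [i, "row-#{i}"] }
5.times do |i|
  elapsed = measure { rows.map { |row| row.join(",") } }
  puts format("N=%d rows=%d elapsed=%.6f sec", i + 1, rows.size, elapsed)
  rows += rows # double the data, like table.concatenate([table])
end
```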

Fast load: Benchmark: All\n(('note:高速ロード:ベンチマーク:すべて'))

# charty
# backend = pyplot
# type = line
# x = N (times)
# y = Elapsed time (sec)
# color = Approach
# markers = true
# relative_height = 100
Approach,N (times),Elapsed time (sec)
Apache Arrow,1,0.000437
Apache Arrow,2,0.000421
Apache Arrow,3,0.000472
Apache Arrow,4,0.000573
Apache Arrow,5,0.000899
CSV: Red Arrow,1,0.012443
CSV: Red Arrow,2,0.021403
CSV: Red Arrow,3,0.040435
CSV: Red Arrow,4,0.074629
CSV: Red Arrow,5,0.138448
CSV: Ruby,1,0.828678
CSV: Ruby,2,1.840314
CSV: Ruby,3,3.797536
CSV: Ruby,4,8.205680
CSV: Ruby,5,19.850910

Slide properties

: enable-title-on-image

false

Fast load: Benchmark: Red Arrow\n(('note:高速ロード:ベンチマーク:Red Arrow'))

# charty
# backend = pyplot
# type = line
# x = N (times)
# y = Elapsed time (sec)
# color = Approach
# markers = true
# relative_height = 100
Approach,N (times),Elapsed time (sec)
Apache Arrow,1,0.000437
Apache Arrow,2,0.000421
Apache Arrow,3,0.000472
Apache Arrow,4,0.000573
Apache Arrow,5,0.000899
CSV: Red Arrow,1,0.012443
CSV: Red Arrow,2,0.021403
CSV: Red Arrow,3,0.040435
CSV: Red Arrow,4,0.074629
CSV: Red Arrow,5,0.138448

Slide properties

: enable-title-on-image

false

How to implement fast load\n(('note:高速ロードの実装方法'))

# img
# src = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/why-apache-arrow-format-is-fast.pdf
# relative_height = 80

(('tag:center')) (('note:((<URL:slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/>))'))

Read data with Red Arrow: Wrap up\n(('note:Red Arrowでデータ読み込み:まとめ'))

* Easy to read common formats\n
  (('note:よくあるフォーマットのデータを簡単に読める'))
* Easy to read from common locations\n
  (('note:よくある場所にあるデータを簡単に読める'))
* Large data ready\n
  (('note:大規模データも扱える'))

3. Explore data\n(('note:3. データ探索'))

* Preprocess data(('note:(データを前処理)'))
  * Filter out needless data(('note:(不要なデータを除去)'))
  * ...
* Summarize data and visualize them\n
  (('note:(データを要約して可視化)'))
* ...

Red Arrow can be used for some operations\n
(('note:いくつかの操作でRed Arrowを使える'))

Filter: Red Arrow\n(('note:絞り込み:Red Arrow'))

# rouge ruby

table = Datasets::PostalCodeJapan.new.to_arrow
table.n_rows # 124271
filtered_table = table.slice do |slicer|
  slicer.prefecture == "東京都" # Tokyo
end
filtered_table.n_rows # 3887
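
The slicer block style can be illustrated with a toy column-oriented table in plain Ruby. `MiniTable`, `Slicer`, and `ColumnCondition` are hypothetical classes written for this sketch; they mimic the shape of the API above but say nothing about how Red Arrow implements it internally.

```ruby
# Hypothetical column-oriented table: each column is one Ruby array.
class MiniTable
  attr_reader :columns

  def initialize(columns)
    @columns = columns
  end

  def n_rows
    @columns.values.first.size
  end

  # The block returns a boolean mask; keep only rows whose mask is true.
  def slice
    mask = yield(Slicer.new(self))
    filtered = @columns.transform_values do |values|
      values.select.with_index { |_, i| mask[i] }
    end
    MiniTable.new(filtered)
  end

  class Slicer
    def initialize(table)
      @table = table
    end

    # slicer.prefecture returns a condition object for that column.
    def method_missing(name, *args)
      values = @table.columns[name]
      return super unless values
      ColumnCondition.new(values)
    end

    def respond_to_missing?(name, include_private = false)
      @table.columns.key?(name) || super
    end
  end

  class ColumnCondition
    def initialize(values)
      @values = values
    end

    # Compares the whole column at once and returns a boolean mask.
    def ==(other)
      @values.map { |value| value == other }
    end
  end
end

table = MiniTable.new(prefecture: ["東京都", "大阪府", "東京都"],
                      code: ["1000001", "5300001", "1000002"])
tokyo = table.slice { |slicer| slicer.prefecture == "東京都" }
puts tokyo.n_rows
```

The block never sees individual rows: it describes a condition on a column, and the table applies the resulting mask to every column.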

Filter: Performance\n(('note:絞り込み:性能'))

# rouge ruby

dataset = Datasets::PostalCodeJapan.new
arrow_dataset = dataset.to_arrow
dataset.find_all do |row|
  row.prefecture == "東京都" # Tokyo
end # 1.256s
arrow_dataset.slice do |slicer|
  slicer.prefecture == "東京都" # Tokyo
end # 0.001s

Filter: Performance\n(('note:絞り込み:性能'))

# charty
# backend = pyplot
# type = bar
# x = Elapsed time (sec)
# y = Implementation
# relative_height = 100
Implementation,Elapsed time (sec)
Ruby,1.2567864
Arrow,0.001395

Slide properties

: enable-title-on-image

false

Apache Arrow data: Interchangeable\n(('note:Apache Arrow data:交換可能'))

* With low cost thanks to fast load\n
  (('note:高速ロードできるので低コスト'))
* The number of Apache Arrow data ready systems is increasing\n
  (('note:Apache Arrowデータを扱えるシステムは増加中'))
  * e.g. DuckDB: in-process SQL OLAP DBMS\n
    (('note:(SQLite like DBMS for OLAP)'))\n
    (('note:OLAP: OnLine Analytical Processing'))\n
    (('note:例:DuckDB:同一プロセス内で動くデータ分析用SQL DB管理システム'))

Filter: DuckDB\n(('note:絞り込み:DuckDB'))

# rouge ruby

require "arrow-duckdb"
codes = Datasets::PostalCodeJapan.new.to_arrow
db = DuckDB::Database.open
c = db.connect
c.register("codes", codes) do # Use codes without copy
  c.query("SELECT * FROM codes WHERE prefecture = ?",
          "東京都", # Tokyo
          output: :arrow) # Output as Apache Arrow data
   .to_table.n_rows # 3887
end

Summarize: Group + aggregation\n(('note:要約:グループ化して集計'))

# rouge ruby

iris = Datasets::Iris.new.to_arrow
iris.group(:label).count(:sepal_length)
#     count(sepal_length)     label
# 0                    50     Iris-setosa
# 1                    50     Iris-versicolor
# 2                    50     Iris-virginica
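
For comparison, the same group-then-count idea written with plain Ruby collections. This is a sketch on a few made-up rows, not the real Iris dataset and not the Red Arrow API.

```ruby
# Made-up rows for illustration; not the real Iris dataset.
rows = [
  {label: "Iris-setosa", sepal_length: 5.1},
  {label: "Iris-setosa", sepal_length: 4.9},
  {label: "Iris-versicolor", sepal_length: 7.0},
  {label: "Iris-virginica", sepal_length: 6.3},
  {label: "Iris-virginica", sepal_length: 5.8},
]

# Group rows by label, then count the rows in each group.
counts = rows.group_by { |row| row[:label] }.transform_values(&:size)
counts.each { |label, count| puts "#{count}\t#{label}" }
```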

Visualize: Charty\n(('note:可視化:Charty'))

# rouge ruby

require "charty"
Charty.backends.use("pyplot")
Charty.scatter_plot(data: iris,
                    x: :sepal_length,
                    y: :sepal_width,
                    color: :label)
      .save("iris.png")

Visualize: Charty: Result\n(('note:可視化:Charty:結果'))

# img
# src = images/iris.png
# relative_height = 100

Slide properties

: enable-title-on-image

false

4. Use insight\n(('note:4. 知見を活用'))

* Write report\n(('note:(レポートにまとめたり)'))
* Build a model\n(('note:(モデルを作ったり)'))
* ...

No Red Arrow support in this area for now\n
(('note:Can be used for passing data to other tools like DuckDB and Charty'))\n
(('note:今のところこのあたりにはRed Arrowを使えない'))\n
(('note:DuckDBやChartyにデータを渡すように他のツールにデータを渡すためには使える'))

Data processing and Red Arrow\n(('note:Red Arrowでデータ処理'))

* Red Arrow helps us in some areas\n
  (('note:いくつかの領域ではRed Arrowを使える'))
  * Collect, read and explore data\n
    (('note:データを収集して読み込んで探索するとか'))
* Some tools can integrate with Red Arrow\n
  (('note:いくつかのツールはRed Arrowと連携できる'))
  * Fluentd, DuckDB, Charty, ...

Red Arrow and Ruby 3.0

* MemoryView support
* Ractor support

MemoryView

# blockquote

MemoryView provides the features to share multidimensional homogeneous arrays of fixed-size element on memory among extension libraries.

(('note:MemoryViewは多次元数値配列(数値はすべて同じ型)を共有する機能を提供します。'))

(('note:((<URL:tech.speee.jp/entry/2020/12/24/093131>)) (Japanese)'))

Numeric arrays in Red Arrow\n(('note:Red Arrow内の数値配列'))

* (({Arrow::NumericArray})) family
  * 1-dimensional numeric array\n
    (('note:1次元数値配列'))
* (({Arrow::Tensor}))
  * Multidimensional homogeneous numeric arrays\n
    (('note:多次元数値配列'))

MemoryView: Red Arrow

* (({Arrow::NumericArray})) family
  * Export as MemoryView: Supported\n
    (('note:MemoryViewとしてエクスポート:対応済み'))
  * Import from MemoryView: Not yet\n
    (('note:MemoryViewをインポート:未対応'))
* (({Arrow::Tensor}))
  * Export/Import: Not yet\n
    (('note:エクスポート・インポート:未対応'))

(('note:Join Red Data Tools to work on this!'))\n
(('note:対応を進めたい人はRed Data Toolsに来てね!'))

MemoryView: C++

* This work uncovered some problems\n
  (('note:Red Arrowの対応作業でいくつかの問題が見つかった'))
  * Can't use (({private})) as member name\n
    (('note:メンバー名に(({private}))を使えない'))
  * Can't assign to (({const})) variable with cast\n
    (('note:キャストしても(({const}))変数に代入できない'))
* Ruby 3.1 will fix them\n
  (('note:Ruby 3.1では直っているはず'))

Ractor

# blockquote

Ractor is designed to provide a parallel execution feature of Ruby without thread-safety concerns.

(('note:Ractorはスレッドセーフかどうかを気にせずに並列実行するための機能です。'))

(('note:((<URL:techlife.cookpad.com/entry/2020/12/26/131858>)) (Japanese)'))

Red Arrow and concurrency\n(('note:Red Arrowと並列性'))

* Red Arrow data are immutable\n
  (('note:Red Arrowデータは変更不可'))
* Ractor can share frozen objects\n
  (('note:Ractorはfrozenなオブジェクトを共有可能'))

Ractor: Red Arrow

# rouge ruby

require "datasets-arrow"
table = Datasets::PostalCodeJapan.new.to_arrow
Ractor.make_shareable(table)
Ractor.new(table) do |t|
  t.slice do |slicer|
    slicer.prefecture == "東京都" # Tokyo
  end
end
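
The sharing pattern can also be tried without Red Arrow. In this minimal sketch a deep-frozen array of hashes stands in for the immutable table. It assumes a Ruby with Ractor (3.0 or later); newer Ruby versions provide `Ractor#value` instead of `Ractor#take`, so the sketch checks which one is available.

```ruby
# Stand-in for an immutable table: a plain array of hashes.
codes = [
  {prefecture: "東京都", code: "1000001"},
  {prefecture: "大阪府", code: "5300001"},
  {prefecture: "東京都", code: "1000002"},
]

# Deep-freezes the data so that it can be shared between Ractors.
Ractor.make_shareable(codes)

ractor = Ractor.new(codes) do |rows|
  rows.count { |row| row[:prefecture] == "東京都" }
end

# Fetch the block's result with whichever method this Ruby provides.
n_tokyo = ractor.respond_to?(:value) ? ractor.value : ractor.take
puts n_tokyo
```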

Ractor: Red Arrow: Benchmark

# rouge ruby

n_ractors = 4
n_jobs_per_ractor = 1000
n_jobs = n_ractors * n_jobs_per_ractor
n_jobs.times do
  table.slice {|s| s.prefecture == "東京都"}
end
n_ractors.times.collect do
  Ractor.new(table, n_jobs_per_ractor) do |t, n|
    n.times {t.slice {|s| s.prefecture == "東京都"}}
  end
end.each(&:take)

Ractor: Red Arrow: Benchmark

# charty
# backend = pyplot
# type = bar
# x = Elapsed time (sec)
# y = Approach
# relative_height = 100
Approach,Elapsed time (sec)
Sequential,4.573742
Ractor,1.454987

Slide properties

: enable-title-on-image

false

Wrap up\n(('note:まとめ'))

* Ruby can be used\n
  in some data processing work\n
  (('note:いくつかのデータ処理作業にRubyを使える'))
  * Red Arrow helps you!\n
    (('note:Red Arrowが有用なケースがあるはず!'))
* Ruby 3.0 has useful features for data processing work\n
  (('note:Ruby 3.0にはデータ処理作業に有用な機能があるよ'))
  * Red Arrow has started supporting them\n
    (('note:Red Arrowはそれらのサポートを進めている'))

Goal of this talk\n(('note:このトークのゴール'))

* You want to use Ruby\n
  for some data processing\n
  (('note:いくつかのデータ処理でRubyを使いたくなる'))
* You join Red Data Tools project\n
  (('note:あなたがRed Data Toolsプロジェクトに参加する'))

Future work\n(('note:今後の仕事'))

* Implement DataFusion bindings by adding C API to DataFusion\n
  (('note:DataFusionにC APIを追加してバインディングを実装'))
  * DataFusion: Apache Arrow native query execution framework written in Rust\n
    (('note:((<URL:https://github.com/apache/arrow-datafusion/>))'))\n
    (('note:DataFusion:Rust実装のApache Arrowベースのクエリー実行フレームワーク'))
* Add Active Record like API to Red Arrow\n
  (('note:Red ArrowにActive Record風のAPIを追加'))
* Improve MemoryView/Ractor support\n
  (('note:MemoryView/Ractorサポートを進める'))

Red Data Tools

(('tag:center')) (('tag:x-large')) Join us!

(('note:((<URL:gitter.im/red-data-tools/en>))'))

(('note:((<URL:gitter.im/red-data-tools/ja>))'))

OSS Gate on-boarding\n(('note:OSS Gateオンボーディング'))

* Helps OSS projects such as Ruby & Red Arrow accept newcomers\n
  (('note:RubyやRed ArrowといったOSSプロジェクトが新人を受け入れることを支援'))
* Contact me!(('note:興味がある人は私に教えて!'))
  * (('tag:x-small'))OSS project members who want to accept newcomers\n
    (('note:新人を受け入れたいOSSプロジェクトのメンバー'))
  * (('tag:x-small'))Companies which want to support OSS Gate on-boarding\n
    (('note:OSS Gateオンボーディングを支援したい会社'))

(('note:((<URL:oss-gate.github.io/on-boarding/>))'))

ClearCode Inc.

* Recruitment: Developer to work on Red Arrow related business\n
  (('note:採用情報:Red Arrow関連のビジネスをする開発者'))
  * (('note:((<URL:https://www.clear-code.com/recruitment/>))'))
* Business: Apache Arrow/Red Arrow related technical support/consulting:\n
  (('note:仕事:Apache Arrow/Red Arrow関連の技術サポート・コンサルティング'))
  * (('note:((<URL:https://www.clear-code.com/contact/>))'))