fplyr-package {fplyr} | R Documentation |
fplyr: Read, Process and Write
Description
This package provides a set of functions to quickly read files chunk by
chunk, apply a function to each chunk, and return the result. It is
especially useful when the files to be processed don't fit into the
RAM. Familiarity with the data.table
package is essential in order
to use fplyr
.
Definitions
- Chunked file:
A delimited file where many contiguous rows have the same value on the first field. See the example below.
- Block:
Any portion of the chunked file such that the first field does not change.
- Chunk:
Chunks are used internally; they consist of one or more block, but regular users should not be concerned with them, and can consider chunks and blocks as synonyms.
Main functions
The main functions are ffply
and flply
. The former writes the processed
data into a file, while the latter returns it as a list. The former is also much faster.
There is also fdply
, which returns a data.table
and is useful to only read
a certain number of chunks from the file (one by default). fmply
is useful
when the original file needs to be processed in many ways and each outcome must
be written to a different file.
Note
Throughout the documentation of this package, the word 'file' actually means 'chunked file.'
Examples
A chunked file may look as follows:
V1 | V2 | V3 | V4 |
ID01 | ABC | Berlin | 0.1 |
ID01 | DEF | London | 0.5 |
ID01 | GHI | Rome | 0.3 |
ID02 | ABC | Lisbon | 0.2 |
ID02 | DEF | Berlin | 0.6 |
ID02 | LMN | Prague | 0.8 |
ID02 | OPQ | Dublin | 0.7 |
ID03 | DEF | Lisbon | -0.1 |
ID03 | LMN | Berlin | 0.01 |
ID03 | XYZ | Prague | 0.2 |
The important thing is that the first field has some contiguous lines that take the same value. The values of the other fields are unimportant. This package is useful to process this kind of files, block by block.
Author(s)
Maintainer: Federico Marotta federicomarotta@mail.com (ORCID)
See Also
Useful links: