README


nanoarrow

Installation

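# Install the released version from CRAN:
install.packages("nanoarrow")

# Or install the development version from GitHub:
# install.packages("remotes")
remotes::install_github("apache/arrow-nanoarrow/r")

# Load the package to confirm that the installation worked:
library(nanoarrow)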

Example

The Arrow C Data and Arrow C Stream interfaces consist of three structures: the ArrowSchema, which represents the data type of an array; the ArrowArray, which represents the values of an array; and the ArrowArrayStream, which represents zero or more ArrowArrays with a common ArrowSchema. All three can be wrapped by R objects using the nanoarrow R package.

Schemas

Use infer_nanoarrow_schema() to get the ArrowSchema object that corresponds to a given R vector type; use as_nanoarrow_schema() to convert an object from some other data type representation (e.g., an arrow R package DataType like arrow::int32()); or use the na_XXX() functions (e.g., na_int64()) to construct a schema directly.

infer_nanoarrow_schema(1:5)
#> <nanoarrow_schema int32>
#>  $ format    : chr "i"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 2
#>  $ children  : list()
#>  $ dictionary: NULL
as_nanoarrow_schema(arrow::schema(col1 = arrow::float64()))
#> <nanoarrow_schema struct>
#>  $ format    : chr "+s"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 0
#>  $ children  :List of 1
#>   ..$ col1:<nanoarrow_schema double>
#>   .. ..$ format    : chr "g"
#>   .. ..$ name      : chr "col1"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>  $ dictionary: NULL
na_int64()
#> <nanoarrow_schema int64>
#>  $ format    : chr "l"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 2
#>  $ children  : list()
#>  $ dictionary: NULL
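
The na_XXX() constructors can also be combined to describe nested types. The following is a minimal sketch, assuming the na_struct(), na_double(), and na_string() constructors and the $ accessors on schema objects; the column names are made up for illustration:

```r
library(nanoarrow)

# Build a struct schema from two child schemas (hypothetical column names)
schema <- na_struct(list(col1 = na_double(), col2 = na_string()))

# Schema fields can be read with `$`, mirroring the printed structure
schema$format
#> [1] "+s"
names(schema$children)
#> [1] "col1" "col2"
schema$children$col1$format
#> [1] "g"
```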

Arrays

Use as_nanoarrow_array() to convert an R object to an ArrowArray object:

as_nanoarrow_array(1:5)
#> <nanoarrow_array int32[5]>
#>  $ length    : int 5
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 2
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   ..$ :<nanoarrow_buffer data<int32>[5][20 b]> `1 2 3 4 5`
#>  $ dictionary: NULL
#>  $ children  : list()
as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
#> <nanoarrow_array struct[2]>
#>  $ length    : int 2
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 1
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>  $ children  :List of 1
#>   ..$ col1:<nanoarrow_array double[2]>
#>   .. ..$ length    : int 2
#>   .. ..$ null_count: int 0
#>   .. ..$ offset    : int 0
#>   .. ..$ buffers   :List of 2
#>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
#>   .. ..$ dictionary: NULL
#>   .. ..$ children  : list()
#>  $ dictionary: NULL

You can use as.vector() or as.data.frame() to get the R representation of the object back:

array <- as_nanoarrow_array(data.frame(col1 = c(1.1, 2.2)))
as.data.frame(array)
#>   col1
#> 1  1.1
#> 2  2.2
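
For a non-struct array, as.vector() returns a plain R vector. A short sketch of the round trip, using only the functions shown above:

```r
library(nanoarrow)

# Convert an integer vector to an ArrowArray and back again
int_array <- as_nanoarrow_array(1:5)
as.vector(int_array)
#> [1] 1 2 3 4 5

# A data.frame survives the round trip through a struct array
df <- data.frame(col1 = c(1.1, 2.2))
df_back <- as.data.frame(as_nanoarrow_array(df))
```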

Even though at the C level the ArrowArray is distinct from the ArrowSchema, at the R level we attach a schema wherever possible. You can access the attached schema using infer_nanoarrow_schema():

infer_nanoarrow_schema(array)
#> <nanoarrow_schema struct>
#>  $ format    : chr "+s"
#>  $ name      : chr ""
#>  $ metadata  : list()
#>  $ flags     : int 0
#>  $ children  :List of 1
#>   ..$ col1:<nanoarrow_schema double>
#>   .. ..$ format    : chr "g"
#>   .. ..$ name      : chr "col1"
#>   .. ..$ metadata  : list()
#>   .. ..$ flags     : int 2
#>   .. ..$ children  : list()
#>   .. ..$ dictionary: NULL
#>  $ dictionary: NULL

Array Streams

The easiest way to create an ArrowArrayStream is from a list of arrays or objects that can be converted to an array using as_nanoarrow_array():

stream <- basic_array_stream(
  list(
    data.frame(col1 = c(1.1, 2.2)),
    data.frame(col1 = c(3.3, 4.4))
  )
)

You can pull batches from the stream one at a time using the $get_next() method. Once every batch has been consumed, $get_next() returns NULL.

stream$get_next()
#> <nanoarrow_array struct[2]>
#>  $ length    : int 2
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 1
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>  $ children  :List of 1
#>   ..$ col1:<nanoarrow_array double[2]>
#>   .. ..$ length    : int 2
#>   .. ..$ null_count: int 0
#>   .. ..$ offset    : int 0
#>   .. ..$ buffers   :List of 2
#>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `1.1 2.2`
#>   .. ..$ dictionary: NULL
#>   .. ..$ children  : list()
#>  $ dictionary: NULL
stream$get_next()
#> <nanoarrow_array struct[2]>
#>  $ length    : int 2
#>  $ null_count: int 0
#>  $ offset    : int 0
#>  $ buffers   :List of 1
#>   ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>  $ children  :List of 1
#>   ..$ col1:<nanoarrow_array double[2]>
#>   .. ..$ length    : int 2
#>   .. ..$ null_count: int 0
#>   .. ..$ offset    : int 0
#>   .. ..$ buffers   :List of 2
#>   .. .. ..$ :<nanoarrow_buffer validity<bool>[0][0 b]> ``
#>   .. .. ..$ :<nanoarrow_buffer data<double>[2][16 b]> `3.3 4.4`
#>   .. ..$ dictionary: NULL
#>   .. ..$ children  : list()
#>  $ dictionary: NULL
stream$get_next()
#> NULL
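
Because $get_next() returns NULL at the end of the stream, batches can be processed in a loop without knowing how many there are. A minimal sketch, assuming that the $length field of each batch is available via $ as in the printed output above; total_rows is an illustrative variable name:

```r
library(nanoarrow)

stream <- basic_array_stream(
  list(
    data.frame(col1 = c(1.1, 2.2)),
    data.frame(col1 = c(3.3, 4.4))
  )
)

# Pull batches until the stream signals that it is exhausted
total_rows <- 0
while (!is.null(batch <- stream$get_next())) {
  total_rows <- total_rows + batch$length
}

total_rows
#> [1] 4
```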

You can collect all of the batches into a single data.frame by calling as.data.frame() or as.vector():

stream <- basic_array_stream(
  list(
    data.frame(col1 = c(1.1, 2.2)),
    data.frame(col1 = c(3.3, 4.4))
  )
)

as.data.frame(stream)
#>   col1
#> 1  1.1
#> 2  2.2
#> 3  3.3
#> 4  4.4

After consuming a stream, you should call its $release() method as soon as you can. This lets the implementation of the stream release any resources (like open files) it may be holding in a more predictable way than waiting for the garbage collector to clean up the object.
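
One way to make sure a stream is released even if an error occurs part-way through is to register the release with on.exit(). This is a sketch rather than a prescribed pattern; first_batch_rows() is a hypothetical helper written for illustration:

```r
library(nanoarrow)

# Hypothetical helper: count the rows in the first batch, then release the stream
first_batch_rows <- function(stream) {
  # Runs when the function exits, whether normally or because of an error
  on.exit(stream$release(), add = TRUE)

  batch <- stream$get_next()
  if (is.null(batch)) 0 else batch$length
}

stream <- basic_array_stream(
  list(
    data.frame(col1 = c(1.1, 2.2)),
    data.frame(col1 = c(3.3, 4.4))
  )
)

n_rows <- first_batch_rows(stream)
n_rows
#> [1] 2
```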

Integration with the arrow package

The nanoarrow package implements as_nanoarrow_schema(), as_nanoarrow_array(), and as_nanoarrow_array_stream() for most arrow package types. Similarly, it implements arrow::as_arrow_array(), arrow::as_record_batch(), arrow::as_arrow_table(), arrow::as_record_batch_reader(), arrow::infer_type(), arrow::as_data_type(), and arrow::as_schema() for nanoarrow objects such that you can pass equivalent nanoarrow objects into many arrow functions and vice versa.
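
As an illustration of this two-way conversion, the sketch below passes an array in each direction. It assumes the arrow package is installed, so the calls are guarded with requireNamespace(); the variable names are illustrative:

```r
library(nanoarrow)

if (requireNamespace("arrow", quietly = TRUE)) {
  # nanoarrow -> arrow: a nanoarrow array becomes an arrow Array
  arrow_array <- arrow::as_arrow_array(as_nanoarrow_array(1:5))

  # arrow -> nanoarrow: an arrow Array becomes a nanoarrow array
  nano_array <- as_nanoarrow_array(arrow::Array$create(c(1.1, 2.2)))

  # Check that the values are preserved in both directions
  round_trip_ok <- identical(arrow_array$as_vector(), 1:5) &&
    identical(as.vector(nano_array), c(1.1, 2.2))
}
```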
