The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Allocation and length

In this vignette, we compare the computation time/memory usage of dense matrix and sparse Matrix.

Allocation and length

We begin with an analysis of the time/memory it takes to create these objects. In the atime code below, we allocate a vector for comparison, and we specify a result function which computes the length of the object x created by each expression. This means atime will save length as a function of data size N (in addition to time and memory).

library(Matrix)
N_seq <- as.integer(10^seq(1,7,by=0.25))
vec.mat.result <- atime::atime(
  N=N_seq,
  vector=numeric(N),
  matrix=matrix(0, N, N),
  Matrix=Matrix(0, N, N),
  result=function(x)data.frame(length=length(x)))
plot(vec.mat.result)
#> Warning in ggplot2::scale_y_log10("median line, min/max band"): log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.

plot of chunk unnamed-chunk-1

The plot above shows three panels, one for each unit.

Comparison with bench::press

An alternative method to compute asymptotic timings is via bench::press, which provides functionality for parameterized benchmarking (similar to atime_grid). Because atime() has special treatment of the N parameter, the code required for asymptotic measurement is relatively simple; compare the atime code above to the bench::press code below, which measures the same asymptotic quantities (seconds, kilobytes, length).

seconds.limit <- 0.01
done.vec <- NULL
measure.vars <- c("seconds","kilobytes","length")
press_result <- bench::press(
  N = N_seq,
  {
    exprs <- function(...){
      as.list(match.call()[-1])
    }
    elist <- exprs(
      vector=numeric(N),
      matrix=matrix(0, N, N),
      Matrix=Matrix(0, N, N))
    elist[names(done.vec)] <- NA #Don't run exprs which already exceeded limit.
    mark.args <- c(elist, list(iterations=10, check=FALSE))
    mark.result <- do.call(bench::mark, mark.args)
    ## Rename some columns for easier interpretation.
    desc.vec <- attr(mark.result$expression, "description")
    mark.result$description <- desc.vec
    mark.result$seconds <- as.numeric(mark.result$median)
    mark.result$kilobytes <- as.numeric(mark.result$mem_alloc/1024)
    ## Compute length column to measure in addition to time/memory.
    mark.result$length <- NA
    for(desc.i in seq_along(desc.vec)){
      description <- desc.vec[[desc.i]]
      result <- eval(elist[[description]])
      mark.result$length[desc.i] <- length(result)
    }
    ## Set NA time/memory/length for exprs which were not run.
    mark.result[desc.vec %in% names(done.vec), measure.vars] <- NA
    ## If expr went over time limit, indicate it is done.
    over.limit <- mark.result$seconds > seconds.limit
    over.desc <- desc.vec[is.finite(mark.result$seconds) & over.limit]
    done.vec[over.desc] <<- TRUE
    mark.result
  }
)
#> Running with:
#>           N
#>  1       10
#>  2       17
#>  3       31
#>  4       56
#>  5      100
#>  6      177
#>  7      316
#>  8      562
#>  9     1000
#> 10     1778
#> 11     3162
#> 12     5623
#> 13    10000
#> 14    17782
#> 15    31622
#> 16    56234
#> 17   100000
#> 18   177827
#> 19   316227
#> 20   562341
#> 21  1000000
#> 22  1778279
#> 23  3162277
#> 24  5623413
#> 25 10000000

The bench::press code above is relatively complicated, because it re-implements two functions that are provided by atime:

Below we visualize the results from bench::press,

library(data.table)
(press_long <- melt(
  data.table(press_result),
  measure.vars=measure.vars,
  id.vars=c("N","description"),
  na.rm=TRUE))
#>            N description variable        value
#>        <int>      <char>   <fctr>        <num>
#>   1:      10      vector  seconds 1.050000e-06
#>   2:      10      matrix  seconds 2.850000e-06
#>   3:      10      Matrix  seconds 4.426000e-04
#>   4:      17      vector  seconds 1.300000e-06
#>   5:      17      matrix  seconds 3.300000e-06
#>  ---                                          
#> 164: 1000000      Matrix   length 1.000000e+12
#> 165: 1778279      vector   length 1.778279e+06
#> 166: 1778279      Matrix   length 3.162276e+12
#> 167: 3162277      vector   length 3.162277e+06
#> 168: 3162277      Matrix   length 9.999996e+12
if(require(ggplot2)){
  gg <- ggplot()+
    ggtitle("bench::press results for comparison")+
    facet_grid(variable ~ ., labeller=label_both, scales="free")+
    geom_line(aes(
      N, value,
      color=description),
      data=press_long)+
    scale_x_log10(limits=c(NA, max(press_long$N*2)))+
    scale_y_log10("")
  if(requireNamespace("directlabels")){
    directlabels::direct.label(gg,"right.polygons")
  }else gg
}
#> Warning in scale_y_log10(""): log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.

plot of chunk unnamed-chunk-3

We can see that the plot from atime and bench::press are consistent.

Complexity class estimation with atime

Below we estimate the best asymptotic complexity classes:

vec.mat.best <- atime::references_best(vec.mat.result)
plot(vec.mat.best)
#> Warning in ggplot2::scale_y_log10(""): log-10 transformation introduced infinite values.

plot of chunk unnamed-chunk-4

The plot above shows that

Below we estimate the throughput for some given limits:

vec.mat.pred <- predict(
  vec.mat.best,
  seconds=vec.mat.result$seconds.limit,
  kilobytes=1000,
  length=1e6)
plot(vec.mat.pred)
#> Warning in ggplot2::scale_x_log10("N", breaks = meas[, 10^seq(ceiling(min(log10(N))), : log-10 transformation
#> introduced infinite values.
#> Warning in ggplot2::scale_y_log10("median line, min/max band"): log-10 transformation introduced infinite values.

plot of chunk unnamed-chunk-5

In the plot above we can see the throughput N for a given limit of kilobytes, length or seconds. Below we use Matrix as a reference, and compute the throughput ratio, Matrix to other.

library(data.table)
dcast(vec.mat.pred$prediction[
, ratio := N[expr.name=="Matrix"]/N, by=unit
], unit + unit.value ~ expr.name, value.var="ratio")
#> Key: <unit, unit.value>
#>         unit unit.value Matrix    matrix    vector
#>       <char>      <num>  <num>     <num>     <num>
#> 1: kilobytes      1e+03      1  357.6434 0.9996669
#> 2:    length      1e+06      1    1.0000 0.0010000
#> 3:   seconds      1e-02      1 1121.5711 0.7316179

From the table above (matrix column), we can see that the throughput of Matrix is 100-1000x larger than matrix, for the given limits.

Matrix Multiplication, 90% sparsity

First we show the difference between sparse and dense matrix multiplication, when the matrix has 90% zeros (10% non-zeros).

library(Matrix)
sparse.prop <- 0.9
dense.prop <- 1-sparse.prop
mult.result <- atime::atime(
  N=as.integer(10^seq(1,4,by=0.25)),
  setup={
    m <- matrix(0, N, N)
    set.seed(1)
    w <- rnorm(N)
    N.not.zero <- as.integer(dense.prop*N^2)
    m[sample(N^2, N.not.zero)] <- rnorm(N.not.zero)
    M <- Matrix(m)
  },
  sparse = M %*% w,
  dense = m %*% w,
  result=TRUE)
plot(mult.result)
#> Warning in ggplot2::scale_y_log10("median line, min/max band"): log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.

plot of chunk unnamed-chunk-7

Above we see that sparse is faster than dense, but by constant factors. Below we estimate the best asymptotic complexity classes:

mult.best <- atime::references_best(mult.result)
plot(mult.best)
#> Warning in ggplot2::scale_y_log10(""): log-10 transformation introduced infinite values.

plot of chunk unnamed-chunk-8

Above we see that both sparse and dense matrix multiplication are quadratic O(N^2) time (for a quadratic number of non-zero entries).

Finally, we verify below that both methods yield the same result:

library(data.table)
mult.compare <- dcast(
  mult.result$measurements, N ~ expr.name, value.var="result"
)[
, equal := paste(all.equal(as.numeric(dense[[1]]), as.numeric(sparse[[1]])))
, by=N
][]
tibble::tibble(mult.compare)
#> # A tibble: 13 × 4
#>        N dense             sparse         equal                             
#>    <int> <list>            <list>         <chr>                             
#>  1    10 <dbl [10 × 1]>    <dgeMatrx[,1]> TRUE                              
#>  2    17 <dbl [17 × 1]>    <dgeMatrx[,1]> TRUE                              
#>  3    31 <dbl [31 × 1]>    <dgeMatrx[,1]> TRUE                              
#>  4    56 <dbl [56 × 1]>    <dgeMatrx[,1]> TRUE                              
#>  5   100 <dbl [100 × 1]>   <dgeMatrx[,1]> TRUE                              
#>  6   177 <dbl [177 × 1]>   <dgeMatrx[,1]> TRUE                              
#>  7   316 <dbl [316 × 1]>   <dgeMatrx[,1]> TRUE                              
#>  8   562 <dbl [562 × 1]>   <dgeMatrx[,1]> TRUE                              
#>  9  1000 <dbl [1,000 × 1]> <dgeMatrx[,1]> TRUE                              
#> 10  1778 <dbl [1,778 × 1]> <dgeMatrx[,1]> TRUE                              
#> 11  3162 <dbl [3,162 × 1]> <dgeMatrx[,1]> TRUE                              
#> 12  5623 <NULL>            <dgeMatrx[,1]> Numeric: lengths (0, 5623) differ 
#> 13 10000 <NULL>            <dgeMatrx[,1]> Numeric: lengths (0, 10000) differ

Matrix multiplication, linear number of non-zeros

Next we show the difference between sparse and dense matrix multiplication, when the matrix has a linear number of non-zeros (asymptotically fewer than in the previous section).

library(Matrix)
mult.result <- atime::atime(
  N=as.integer(10^seq(1,4,by=0.25)),
  setup={
    m <- matrix(0, N, N)
    set.seed(1)
    w <- rnorm(N)
    N.not.zero <- N
    m[sample(N^2, N.not.zero)] <- rnorm(N.not.zero)
    M <- Matrix(m)
  },
  sparse = M %*% w,
  dense = m %*% w,
  result=TRUE)
plot(mult.result)
#> Warning in ggplot2::scale_y_log10("median line, min/max band"): log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.

plot of chunk unnamed-chunk-10

Above we see that sparse is asymptotically faster than dense (different asymptotic slopes). Below we estimate the best asymptotic complexity classes:

mult.best <- atime::references_best(mult.result)
plot(mult.best)
#> Warning in ggplot2::scale_y_log10(""): log-10 transformation introduced infinite values.

plot of chunk unnamed-chunk-11

Above we see that sparse is linear O(N) time whereas dense is quadratic O(N^2) time (for a linear number of non-zero entries).

Finally, we verify below that both methods yield the same result:

library(data.table)
mult.compare <- dcast(
  mult.result$measurements, N ~ expr.name, value.var="result"
)[
, equal := paste(all.equal(as.numeric(dense[[1]]), as.numeric(sparse[[1]])))
, by=N
][]
tibble::tibble(mult.compare)
#> # A tibble: 13 × 4
#>        N dense             sparse         equal                             
#>    <int> <list>            <list>         <chr>                             
#>  1    10 <dbl [10 × 1]>    <dgeMatrx[,1]> TRUE                              
#>  2    17 <dbl [17 × 1]>    <dgeMatrx[,1]> TRUE                              
#>  3    31 <dbl [31 × 1]>    <dgeMatrx[,1]> TRUE                              
#>  4    56 <dbl [56 × 1]>    <dgeMatrx[,1]> TRUE                              
#>  5   100 <dbl [100 × 1]>   <dgeMatrx[,1]> TRUE                              
#>  6   177 <dbl [177 × 1]>   <dgeMatrx[,1]> TRUE                              
#>  7   316 <dbl [316 × 1]>   <dgeMatrx[,1]> TRUE                              
#>  8   562 <dbl [562 × 1]>   <dgeMatrx[,1]> TRUE                              
#>  9  1000 <dbl [1,000 × 1]> <dgeMatrx[,1]> TRUE                              
#> 10  1778 <dbl [1,778 × 1]> <dgeMatrx[,1]> TRUE                              
#> 11  3162 <dbl [3,162 × 1]> <dgeMatrx[,1]> TRUE                              
#> 12  5623 <NULL>            <dgeMatrx[,1]> Numeric: lengths (0, 5623) differ 
#> 13 10000 <NULL>            <dgeMatrx[,1]> Numeric: lengths (0, 10000) differ

Matrix multiplication, linear and quadratic number of non-zeros

In this section we show how you can code both comparisons at the same time, without repetition. The trick is to first define a list of parameters to vary:

param.list <- list(
  non_zeros=c("N","N^2/10"),
  fun=c("matrix","Matrix")
)

After that we create a grid of expressions to evaluate, by expanding the parameter grid:

(expr.list <- atime::atime_grid(
  param.list,
  Mw=L[[fun]][[non_zeros]]%*%w,
  collapse="\n"))
#> $`Mw non_zeros=N\nfun=Matrix`
#> L[["Matrix"]][["N"]] %*% w
#> 
#> $`Mw non_zeros=N\nfun=matrix`
#> L[["matrix"]][["N"]] %*% w
#> 
#> $`Mw non_zeros=N^2/10\nfun=Matrix`
#> L[["Matrix"]][["N^2/10"]] %*% w
#> 
#> $`Mw non_zeros=N^2/10\nfun=matrix`
#> L[["matrix"]][["N^2/10"]] %*% w
#> 
#> attr(,"parameters")
#>                          expr.name expr.grid non_zeros    fun
#>                             <char>    <char>    <char> <char>
#> 1:      Mw non_zeros=N\nfun=Matrix        Mw         N Matrix
#> 2:      Mw non_zeros=N\nfun=matrix        Mw         N matrix
#> 3: Mw non_zeros=N^2/10\nfun=Matrix        Mw    N^2/10 Matrix
#> 4: Mw non_zeros=N^2/10\nfun=matrix        Mw    N^2/10 matrix

Finally we pass the list of expressions to atime, along with a setup argument which creates the required list L of input data, based on the parameters:

mult.result <- atime::atime(
  N=as.integer(10^seq(1,3.5,by=0.25)),
  setup={
    L <- list()
    set.seed(1)
    w <- rnorm(N)
    for(non_zeros in param.list$non_zeros){
      N.not.zero <- as.integer(eval(str2lang(non_zeros)))
      m <- matrix(0, N, N)
      m[sample(N^2, N.not.zero)] <- rnorm(N.not.zero)
      for(fun in param.list$fun){
        L[[fun]][[non_zeros]] <- get(fun)(as.numeric(m), N, N)
      }
    }
  },
  expr.list=expr.list)
plot(mult.result)
#> Warning in ggplot2::scale_y_log10("median line, min/max band"): log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.

plot of chunk unnamed-chunk-15

Below we estimate the best asymptotic complexity classes:

mult.best <- atime::references_best(mult.result)
plot(mult.best)
#> Warning in ggplot2::scale_y_log10(""): log-10 transformation introduced infinite values.

plot of chunk unnamed-chunk-16

Below we show an alternative visualization:

only.seconds <- mult.best
only.seconds$measurements <- mult.best$measurements[unit=="seconds"]
only.seconds$plot.references <- mult.best$plot.references[unit=="seconds"]
if(require(ggplot2)){
  plot(only.seconds)+
    facet_grid(non_zeros ~ fun, labeller=label_both)
}

plot of chunk unnamed-chunk-17

Conclusion

Overall in this vignette we have shown how atime can be used to demonstrate when sparse matrices can be used for efficient computations.

We also showed a comparison between atime and bench::press, which highlighted two areas where atime is more convenient (stopping after exceeding a time limit, and measuring quantities other than time/memory as a function of data size N).

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.