Analysis of the dependency heaviness of R packages

The R packages under analysis were retrieved from CRAN/Bioconductor on 2021-10-28. There are <%=n_cran%> packages from CRAN and <%=n_bioc%> packages from Bioconductor (bioc version 3.14).

Measures in the table
For a package denoted as P, its direct dependency packages are listed in the Depends, Imports, LinkingTo, Suggests and Enhances fields of its DESCRIPTION file. We define the following dependency categories for package P:

  • Parent packages: the packages listed in the Depends, Imports and LinkingTo fields (packages in the red box in the following diagram). They are also called the strong direct dependency packages of P. Parent packages must be installed when installing package P.
  • Strong dependency packages: all packages found by recursively looking up parent packages (package category A and B, plus packages in the red box). They are also called upstream packages. Note that strong dependency packages include parent packages. Strong dependency packages must be installed when installing package P.
  • All dependency packages: all packages found by recursively looking up parent packages, where at the level of package P the packages in Suggests and Enhances are also treated as parents (package category A, B, C and D, plus all packages listed in the box of package P). This simulates the number of strong dependency packages P would have if all its Suggests and Enhances packages were moved to Depends/Imports.
  • Child packages: the packages whose parent packages include package P (package category E).
  • Downstream packages: all packages found by recursively looking up child packages (package category E and F). Note that downstream packages include child packages.
  • Indirect downstream packages: downstream packages excluding child packages (package category F).
<%= paste(readLines(system.file("website", "dependency_diagram.svg", package = "pkgndep")), collapse = "\n") %>
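
The categories above can be sketched on a toy dependency graph. The following R snippet is only an illustration under stated assumptions: the package names (P, p1, s1, E1, ...) are made up, and the helper functions (closure, strong_deps, all_deps, children, downstream) are hypothetical and not part of the pkgndep API.

```r
# Toy data, entirely made up for illustration.
# `strong` holds the Depends/Imports/LinkingTo fields of each package,
# `weak` holds the Suggests/Enhances fields.
strong = list(
    P  = c("p1", "p2"),
    p1 = c("a1"),
    p2 = character(0),
    a1 = character(0),
    s1 = c("c1"),
    c1 = character(0),
    E1 = c("P"),
    F1 = c("E1")
)
weak = list(P = c("s1"))

# All packages reachable from `start` by recursively following parents
# (the packages in `start` themselves are included).
closure = function(start) {
    seen = character(0)
    queue = start
    while(length(queue) > 0) {
        cur = queue[1]
        queue = queue[-1]
        if(cur %in% seen) next
        seen = c(seen, cur)
        queue = c(queue, strong[[cur]])
    }
    seen
}

# Strong dependency packages: parents plus their recursive parents.
strong_deps = function(pkg) closure(strong[[pkg]])

# All dependency packages: Suggests/Enhances of `pkg` are treated as parents
# on the level of `pkg` only.
all_deps = function(pkg) closure(c(strong[[pkg]], weak[[pkg]]))

# Child packages: packages that list `pkg` as a parent.
children = function(pkg) {
    names(strong)[vapply(strong, function(deps) pkg %in% deps, logical(1))]
}

# Downstream packages: packages that have `pkg` among their strong dependencies.
downstream = function(pkg) {
    all_pkgs = names(strong)
    all_pkgs[vapply(all_pkgs, function(x) pkg %in% strong_deps(x), logical(1))]
}

strong_deps("P")                          # parents p1, p2 and the upstream a1
all_deps("P")                             # additionally s1 and its parent c1
children("P")                             # E1
setdiff(downstream("P"), children("P"))   # indirect downstream: F1
```

In this toy graph the strong dependencies of P are a subset of all its dependencies, and the indirect downstream packages are what remains of the downstream packages after removing the children.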

Next, several measures of heaviness are defined as follows:

  • Heaviness from a parent: If package A is a parent of package P (i.e. package P strongly and directly depends on A), the heaviness of A on P is calculated as $n_1 - n_2$ where $n_1$ is the number of strong dependencies of P and $n_2$ is the number of strong dependencies of P if A is moved to Suggests of P. In other words, the heaviness measures the number of additional packages that are required by P only because of A.
  • Since a package may have multiple parents, the maximum heaviness from parents or the total heaviness from parents is used to measure how heavy its parents are.
  • Heaviness of a package on all its child packages: Assume package P has $K_c$ child packages and the $k^{th}$ child is denoted as $A_k$. Denote $n_{1k}$ as the number of strong dependencies of $A_k$ and $n_{2k}$ as the number of strong dependencies of $A_k$ if P is moved to its Suggests. The heaviness of P on its child packages is calculated as $\frac{1}{K_c} \sum_{k=1}^{K_c}(n_{1k} - n_{2k})$. Here the heaviness measures the average number of additional packages P brings to its child packages.
  • Heaviness of a package on all its downstream packages: The definition is similar to the heaviness of a package on all its child packages. Assume package P has $K_d$ downstream packages and the $k^{th}$ downstream package is denoted as $B_k$. Denote $n_{1k}$ as the number of strong dependencies of package $B_k$. Since P can affect its downstream packages indirectly, we recalculate the global dependency relations for all packages after moving P to the Suggests of all its child packages. Then we denote $n_{2k}$ as the number of strong dependencies of $B_k$ in the modified dependency graph. The heaviness of P on its downstream packages is calculated as $\frac{1}{K_d} \sum_{k=1}^{K_d}(n_{1k} - n_{2k})$.
  • Heaviness of a package on all its indirect downstream packages: The calculation is the same as "heaviness on downstream packages" except now the child packages are excluded from downstream packages.
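
The heaviness measures above can be sketched in a few lines of R. This is a minimal illustration on a made-up graph, assuming that "moving a package to Suggests" simply removes it from the list of parents. All package names (H, h1, C1, D1, ...) and helper functions are hypothetical and not part of pkgndep.

```r
# Toy graph, made up for illustration: each entry lists the parents of a package.
graph = list(
    H  = c("h1", "h2"),
    h1 = character(0),
    h2 = character(0),
    L  = character(0),
    C1 = c("H"),
    C2 = c("H", "h1", "L"),
    D1 = c("C1")
)

# Strong dependencies: parents found recursively.
strong_deps = function(graph, pkg) {
    seen = character(0)
    queue = graph[[pkg]]
    while(length(queue) > 0) {
        cur = queue[1]
        queue = queue[-1]
        if(cur %in% seen) next
        seen = c(seen, cur)
        queue = c(queue, graph[[cur]])
    }
    seen
}

# Assumption: moving `parent` to Suggests of `pkg` drops it from pkg's parents.
move_to_suggests = function(graph, pkg, parent) {
    graph[[pkg]] = setdiff(graph[[pkg]], parent)
    graph
}

# Heaviness of `parent` on `pkg`: n1 - n2.
heaviness_from_parent = function(graph, pkg, parent) {
    n1 = length(strong_deps(graph, pkg))
    n2 = length(strong_deps(move_to_suggests(graph, pkg, parent), pkg))
    n1 - n2
}

children = function(graph, pkg) {
    names(graph)[vapply(graph, function(deps) pkg %in% deps, logical(1))]
}

# Average of n1k - n2k over all child packages.
heaviness_on_children = function(graph, pkg) {
    ch = children(graph, pkg)
    mean(vapply(ch, function(k) heaviness_from_parent(graph, k, pkg), numeric(1)))
}

# Average of n1k - n2k over downstream packages, where n2k is taken from the
# graph in which `pkg` has been moved to Suggests of all its children.
heaviness_on_downstream = function(graph, pkg, exclude_children = FALSE) {
    ch = children(graph, pkg)
    all_pkgs = names(graph)
    down = all_pkgs[vapply(all_pkgs, function(x) pkg %in% strong_deps(graph, x), logical(1))]
    modified = graph
    for(k in ch) modified = move_to_suggests(modified, k, pkg)
    if(exclude_children) down = setdiff(down, ch)
    mean(vapply(down, function(x) {
        length(strong_deps(graph, x)) - length(strong_deps(modified, x))
    }, numeric(1)))
}

heaviness_from_parent(graph, "C2", "H")    # 2: H and h2 are required only via H
heaviness_from_parent(graph, "C2", "h1")   # 0: h1 is still required through H
heaviness_on_children(graph, "H")          # (3 + 2) / 2 = 2.5
heaviness_on_downstream(graph, "H")        # (3 + 2 + 3) / 3
heaviness_on_downstream(graph, "H", exclude_children = TRUE)   # 3, from D1 only
```

Note that a parent can have a heaviness of zero, as h1 has on C2 here, when every package it brings is already required through another parent.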

Here the heaviness of a package on all its child packages is the more important measure for developers, since it tells them how many additional dependency packages they can expect to import when they add a new parent package to their packages.

All these measures tend to take high heaviness values when $K$ (i.e. the number of parents, children or downstream packages) is small. Packages with small $K$ are in general of less interest. What is more important is to see, e.g., which packages heavily affect a large number of children or downstream packages (i.e. with large $K$). Thus, the original definition of heaviness is adjusted to decrease the heaviness more for smaller $K$. A detailed explanation of the adjusted heaviness can be found in the tab "Heaviness analysis".

The previous definition of heaviness only measures the effect of a single package. Here we define another measure called "co-heaviness from parent packages" that measures the number of additional dependency packages brought in jointly by two parents. Denote two parent packages of P as A and B. Let $S_A$ be the set of dependency packages removed when only A is moved to Suggests of P, $S_B$ the set removed when only B is moved to Suggests of P, and $S_{AB}$ the set removed when A and B are moved to Suggests of P together. The co-heaviness of A and B on P is calculated as $\left| S_{AB} \setminus (S_A \cup S_B) \right|$, where $|X|$ is the number of elements in set $X$ and $X \setminus Y$ is the set of elements in $X$ but not in $Y$.
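
As a small illustration of the co-heaviness formula, the following R sketch computes $S_A$, $S_B$ and $S_{AB}$ on a made-up graph in which two parents share one upstream package. It assumes that moving a package to Suggests removes it from the list of parents. The package names and helper functions are hypothetical and not part of pkgndep.

```r
# Toy graph, made up for illustration: each entry lists the parents of a package.
graph = list(
    P      = c("A", "B"),
    A      = c("shared", "a1"),
    B      = c("shared", "b1"),
    shared = character(0),
    a1     = character(0),
    b1     = character(0)
)

# Strong dependencies: parents found recursively.
strong_deps = function(graph, pkg) {
    seen = character(0)
    queue = graph[[pkg]]
    while(length(queue) > 0) {
        cur = queue[1]
        queue = queue[-1]
        if(cur %in% seen) next
        seen = c(seen, cur)
        queue = c(queue, graph[[cur]])
    }
    seen
}

# Set of dependency packages of `pkg` that disappear when the packages in
# `moved` are moved to Suggests (assumed here to mean dropped from the parents).
reduced_set = function(graph, pkg, moved) {
    modified = graph
    modified[[pkg]] = setdiff(modified[[pkg]], moved)
    setdiff(strong_deps(graph, pkg), strong_deps(modified, pkg))
}

# Co-heaviness: |S_AB \ (S_A union S_B)|.
co_heaviness = function(graph, pkg, a, b) {
    s_a  = reduced_set(graph, pkg, a)
    s_b  = reduced_set(graph, pkg, b)
    s_ab = reduced_set(graph, pkg, c(a, b))
    length(setdiff(s_ab, union(s_a, s_b)))
}

reduced_set(graph, "P", "A")          # A and a1
reduced_set(graph, "P", "B")          # B and b1
reduced_set(graph, "P", c("A", "B"))  # all five strong dependencies of P
co_heaviness(graph, "P", "A", "B")    # 1: only "shared" needs both parents removed
```

The package "shared" is not counted in the heaviness of either A or B alone, because removing one parent still leaves it required through the other. It only shows up in the co-heaviness.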



Legends:

High heaviness Packages with adjusted heaviness on child packages higher than <%=CUTOFF$adjusted_heaviness_on_children[2]%>.

Median heaviness Packages with adjusted heaviness on child packages between <%=CUTOFF$adjusted_heaviness_on_children[1]%> and <%=CUTOFF$adjusted_heaviness_on_children[2]%>.

reducible Packages whose heaviness from parents could be reduced, i.e. only a limited number of functions are imported from the parent.

Columns:      Heaviness from parent packages      Heaviness on child/downstream packages


The full table of dependency heaviness analysis can be obtained by df = pkgndep::all_pkg_stat_snapshot().

<%
reducible_str = ifelse(only_reducible, 'on', '')
exclude_children_str = ifelse(exclude_children, 'on', '')
if(exclude_children) {
    col.names = c(qq("Package"), "Repository", qq("Number of strong dependency packages"), qq("Number of all dependency packages"), qq("Number of parent packages"), qq("Max heaviness from parent packages"), qq("Max co-heaviness from parent packages"), qq("Heaviness on child packages"), qq("Number of child packages"), qq("Heaviness on indirect downstream packages (excluding children)"), qq("Number of indirect downstream packages (excluding children)"))
} else {
    col.names = c(qq("Package"), "Repository", qq("Number of strong dependency packages"), qq("Number of all dependency packages"), qq("Number of parent packages"), qq("Max heaviness from parent packages"), qq("Max co-heaviness from parent packages"), qq("Heaviness on child packages"), qq("Number of child packages"), qq("Heaviness on downstream packages"), qq("Number of downstream packages"))
}
%>
<%= as.character(knitr::kable(df2, format = "html", row.names = FALSE, escape = FALSE, table.attr = "id='dependency-table' class='table table-striped'", col.names = col.names, align = c("l", rep("r", ncol(df2) - 1)))) %>
<% if(package == "") { %>
<% if(order_by == "adjusted_heaviness_on_children") order_by = "" %>
records per page, showing <%=ind[1]%> to <%=ind[length(ind)]%> of <%=nrow(df)%> packages.
<%
nr = nrow(df)
if(nr > records_per_page) {
%>
<%= page_select(page, ceiling(nr/records_per_page), qq("order_by=@{order_by}&reducible=@{reducible_str}&exclude_children=@{exclude_children_str}")) %>
<% } %>
<% } %>
