--- title: "eulerr under the hood" author: "Johan Larsson" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{eulerr under the hood} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: eulerr.bib --- ```{r setup, include = FALSE} knitr::opts_chunk$set( echo = FALSE, warning = FALSE, message = FALSE, fig.height = 3, fig.width = 3, collapse = TRUE, comment = "#>" ) library(grid) library(RConics) library(eulerr) library(lattice) pal <- c( "#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7" ) ellipse_arc <- function(saxes = c(1, 1), loc = c(0, 0), theta = 0, n = 200, rng = c(0, 2 * pi)) { b <- min(saxes[1], saxes[2]) a <- max(saxes[1], saxes[2]) d2 <- (a - b) * (a + b) if (length(rng) == 1) { phi <- rng - theta } else { phi <- seq(rng[1], rng[2], len = n) - theta } sp <- sin(phi) cp <- cos(phi) r <- a * b / sqrt((saxes[2] * cp)^2 + (saxes[1] * sp)^2) P <- matrix(nrow = n, ncol = 2) P[, 1] <- r * cp P[, 2] <- r * sp if (theta != 0) { P <- P %*% matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), byrow = TRUE, nrow = 2, ncol = 2 ) } P <- P + matrix(loc[1:2], nrow = nrow(P), ncol = 2, byrow = TRUE) P } set.seed(1) ``` ## Introduction **eulerr** relies on an extensive machinery to turn user input into a pretty Euler diagram. Little of this requires any tinkering from the user. To make that happen, however, **eulerr** needs to make several well-formed decisions about the design of the diagram on behalf of the user, which is not a trivial task. This document outlines the implementation of **eulerr** from input to output. It is designed to be an evolving documentation on the innards of the program. ## Input Euler diagrams present relationships between sets, wherefore the data must describe these relationships, either directly or indirectly. **eulerr** allows several alternatives for this data, namely, * intersections and relative complements $$ A \setminus B = 3 \quad B \setminus A = 2 \quad A \cap B=1, $$ * unions and identities $$ A=4 \quad B=3 \quad A \cap B=1, $$ * a matrix of binary (or boolean) indices, $$ \begin{bmatrix} \mathbf{A} & \mathbf{B}\\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 0 & 1 \\ 0 & 1, \end{bmatrix} $$ * a list of sample spaces $$ \begin{array}{l} A = \{a,\,b,\,c,\,d\}\\ B = \{a,\,e,\,f,\} \end{array}, $$ * or a two- or three-way table $A$ $A^\mathsf{c}$ ---------------- --- ---------------- $B$ 1 2 $A^\mathsf{c}$ 3 0 As an additional feature for the matrix form, the user may supply a factor variable with which to split the data set before fitting the diagram, which sometimes improves diagrams where the set relationships vary across categories. Whichever type of input is provided, **eulerr** will translate it to the first and second types, *intersections and relative complements* and *unions and identities*, which will be used in the steps to come. The Euler diagram is then fit in two steps: first, an initial layout is formed with circles using only the sets' pairwise relationships. Second, this layout is fine-tuned taking all $2^N-1$ intersections into consideration. ## Initial layout For our initial layout, we adopt a constrained version of multi-dimensional scaling (MDS) that has been adapted from **venn.js** [@Frederickson_2016], which in turn is a modification of an algorithm used in **venneuler** [@wilkinson_exact_2012]. In it, we consider only the pairwise intersections between sets, attempting to position their respective shapes so as to minimize the difference between the separation between their centers required to obtain an optimal overlap and the actual overlap of the shapes in the diagram. This problem is unfortunately intractable for ellipses, being that there is an infinite number of ways by which we can position two ellipses to obtain a given overlap. Thus, we restrict ourselves to circles in our initial layout, for which we can use the circle--circle overlap formula to numerically find the required distance, $d$, for each pairwise relationship. $$ \begin{aligned} O_{ij} = & r_i^2\arccos\left(\frac{d_{ij}^2 + r_i^2 - r_j^2}{2d_{ij}r_i}\right) + r_j^2\arccos\left(\frac{d_{ij}^2 + r_j^2 - r_i^2}{2d_{ij}r_j}\right) -\\ &\quad \frac{1}{2}\sqrt{(-d_{ij} + r_i + r_j)(d_{ij} + r_i - r_j)(d_{ij} - r_i + r_j)(d_{ij} + r_i + r_j)}, \end{aligned} $$ where $r_i$ and $r_j$ are the radii of the circles representing the $i$^th^ and $j$^th^ sets respectively, $O_{ij}$ their overlap, and $d_{ij}$ their separation. ```{r, fig.cap = "The circle--circle overlap is computed as a function of the discs' separation ($d_{ij}$), radii ($r_i,r_j$), and area of overlap ($O_{ij}$)."} c0 <- ellipseToConicMatrix(c(1, 1), c(0, 0), 0) c1 <- ellipseToConicMatrix(c(0.7, 0.7), c(1.2, 0), 0) pp <- intersectConicConic(c0, c1) theta0 <- atan2(pp[2, 1], pp[1, 1]) theta1 <- atan2(pp[2, 2], pp[1, 2]) phi0 <- atan2(pp[2, 1], pp[1, 1] - 1.2) phi1 <- atan2(pp[2, 2], pp[1, 2] - 1.2) seg <- rbind( ellipse_arc(c(0.7, 0.7), c(1.2, 0), 0, n = 50, c(-pi, phi0)), ellipse_arc(c(1, 1), c(0, 0), 0, n = 100, c(theta0, theta1)), ellipse_arc(c(0.7, 0.7), c(1.2, 0), 0, n = 50, c(phi1, pi)) ) ospot <- c(mean(seg[, 1]), mean(seg[, 2])) xyplot( 1 ~ 1, xlim = c(-1.1, 2), ylim = c(-1.5, 1.1), asp = "iso", xlab = NULL, ylab = NULL, scales = list(draw = FALSE), par.settings = list(axis.line = list(col = "transparent")), panel = function() { panel.polygon(seg, col = "steelblue1", alpha = 0.5, border = "transparent" ) grid::grid.circle(0, 0, r = 1, default.units = "native", gp = gpar(fill = "transparent") ) grid::grid.circle(1.2, 0, r = 0.7, default.units = "native", gp = gpar(fill = "transparent") ) pBrackets::grid.brackets(1.2, -1.2, 0, -1.2, h = 0.05) pBrackets::grid.brackets(-1, 0, 0, 0, h = 0.05) pBrackets::grid.brackets(1.2, 0, 1.9, 0, h = 0.05) panel.lines(c(0, 0), c(0, -1.2), lty = 2, col = "grey65") panel.lines(c(1.2, 1.2), c(0, -1.2), lty = 2, col = "grey65") panel.text(0.6, unit(-1.3, "native"), labels = expression(italic(d[ij])), pos = 1 ) panel.text(ospot[1], ospot[2], labels = expression(italic(O[ij]))) panel.text(-0.5, 0.1, labels = expression(italic(r[i])), pos = 3, default.units = "native" ) panel.text(1.55, 0.1, labels = expression(italic(r[j])), pos = 3, default.units = "native" ) panel.points(c(0, 1.2), c(0, 0), pch = 21, col = 1, cex = 1, fill = "white" ) } ) ``` Setting $r_i = \sqrt{F_i/\pi}$, where $F_i$ is the size of the $i$^th^ set, we are able to obtain $d$ numerically using the squared difference between $O$ and the desired overlap as loss function, $$ \mathcal{L}(d_{ij}) = \left(O_{ij} - (F_i \cap F_j) \right)^2, \quad \text{for } i < j \leq n, $$ which we optimize using `optimize()`^[According to the documentation, `optimize()` consists of a "combination of golden section search and successive parabolic interpolation." from **stats**.] For a two-set combination, this is all we need to plot an exact diagram, given that we now have the two circles' radii and separation and may place the circles arbitrarily as long as their separation, $d$, remains the same. This is not, however, the case with more than two sets. With three or more sets, the circles need to be arranged so that they interfere minimally with one another. In some cases, the set configuration allows this to be accomplished flawlessly, but often, compromises must me made. As is often the case in this context, this turns out to be another optimization problem. It can be tackled in many ways; **eulerr**'s approach is based on a method developed by @Frederickson_2015, which the author describes as constrained multi-dimensional scaling. The algorithm tries to position the circles so that the separation between each pair of circles matches the separation required from the separation equation. If the two sets are disjoint, however, the algorithm is indifferent to the relative locations of those circles as long as they do not intersect. The equivalent applies to subset sets: as long as the circle representing the smaller set remains within the larger circle, their locations are free to vary. In all other cases, the loss function is the residual sums of squares of the optimal separation of circles, $d$, that we found in the overlap equation, and the actual distance in the layout we are currently exploring. $$ \mathcal{L}(h,k) = \displaystyle \ell{\sum_{1\leq i