Gradient                 package:base                 R Documentation

Gradient Computation

Description:

     These are the facilities for computing gradients via run-time
     automatic differentiation.  For symbolic differentiation
     facilities, see 'deriv'.  For numerical differentiation, see
     'numericDeriv'.

     The present implementation of these facilities is preliminary.
     Gradient computation for some operations is planned but not yet
     implemented, and for some operations that are supported, gradient
     computations are much less efficient than is planned for a
     full-fledged version.  At present, there is no support for
     higher-order derivatives, or for differentiation of complex
     values.

Usage:

     with gradient (var=v.expr) expr
     with gradient (var) expr
     with gradient (var1=v1.expr,var2=v2.expr) expr
     with gradient (var1,var2) expr

     track gradient (var=v.expr) expr
     track gradient (var) expr
     track gradient (var1=v1.expr,var2=v2.expr) expr
     track gradient (var1,var2) expr

     back gradient (var=v.expr) expr
     back gradient (var) expr
     back gradient (var1=v1.expr,var2=v2.expr) expr
     back gradient (var1,var2) expr

     compute gradient (var1=v1.expr, var2=v2.expr) expr
         as g1.expr, g2.expr
     compute gradient (var1, var2) expr as g1.expr, g2.expr

     gradient_of (e)
     no_gradient (e)

Arguments:

var, var1, var2: (And so forth for var3, etc.)  Names (not strings) of
          variables with respect to which gradients will be tracked or
          computed.

    expr: An expression, often a compound expression, of a form such
          as '{expr1;expr2}', which is evaluated in a new environment
          containing the new variable or variables listed.

g1.expr, g2.expr: (And so forth for g3.expr, etc.)  Expressions giving
          the gradients of the specified 'expr' with respect to
          'var1', 'var2', etc.

       e: An expression whose gradient is computed (if possible), or
          is not computed.

Details:

     The 'with gradient', 'track gradient', and 'back gradient'
     constructs all ask that gradients of an expression with respect
     to certain variables be computed, with differences between them
     as described below.
     The 'gradient_of' and 'no_gradient' functions may be used within
     these constructs.  The 'compute gradient' construct specifies how
     gradients are computed; it is not needed for the many functions
     for which gradient computations are built in.

     Gradients can be computed for expressions that are not
     differentiable at some points, with the gradient returned at such
     points being arbitrary.

Asking for gradient information:

     The 'with gradient' construct evaluates 'expr' in a new
     environment (with the current environment as its parent)
     containing the variable 'var' (or variables 'var1', 'var2',
     etc.), with initial value 'v.expr' (or initial values 'v1.expr',
     'v2.expr', etc.).  The value of 'expr' is returned as that of
     'with gradient', with a '"gradient"' attribute attached
     containing the derivative of the value of 'expr' with respect to
     'var', or the derivatives with respect to 'var1', 'var2', etc.
     (Any existing '"gradient"' attribute is discarded.)

     During the evaluation of 'expr' within 'with gradient',
     assignments to local variables (in the new environment) will
     record the gradient of the assigned value with respect to 'var'
     (or 'var1', 'var2', etc.), and with respect to the variables
     listed in any enclosing 'with gradient', 'track gradient', or
     active 'back gradient' constructs, so that if the value of the
     variable is used to compute the final value of 'with gradient',
     or is the argument of a call of 'gradient_of', its gradient can
     be computed.  This includes assignments made to a for loop
     variable.  Note, however, that the gradients of attribute values
     are not recorded, nor are gradients recorded when non-local
     assignments are made with '<<-'.

     Tracking of gradients continues when a function is called with
     one or more arguments with tracked gradients.  This includes
     functions implementing S3 methods, but not S4 methods.  Tracking
     will also be performed when an expression is evaluated with
     'eval', if the environment used for the evaluation is one in
     which gradients are being tracked.
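     A minimal sketch of this tracking through local assignments (the
     variable names here are purely illustrative):

     a <- with gradient (x = 2) {
         y <- x^2       # gradient of y w.r.t. x is recorded here
         y + 3*x        # value is 10
     }
     attr(a,"gradient") # d/dx (x^2 + 3*x) at x=2, i.e. 7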
     Within 'expr', or a function called from within it, the gradient
     of an expression, 'e', with respect to the variable or variables
     of the innermost enclosing 'with gradient', 'track gradient', or
     active 'back gradient' construct can be found with
     'gradient_of(e)'.  The value of 'gradient_of' does not itself
     have any gradient information - hence
     'gradient_of(gradient_of(e))' will not produce a second
     derivative (it will always be zero).

     Tracking of gradients can be explicitly suppressed (to save time)
     with 'no_gradient(e)'.  Gradient tracking is automatically
     suppressed when evaluating expressions that are used as 'if' or
     'while' conditions, as indexes, or as expressions iterated over
     with 'for'.

     The 'track gradient' construct is like 'with gradient', except
     that the gradient is not attached to the final result.  It is
     therefore useful only in combination with calls of 'gradient_of',
     or if it is inside (in a dynamic sense) another 'track gradient',
     'with gradient', or active 'back gradient' construct.  When these
     gradient constructs are nested, derivatives with respect to an
     inner variable may determine derivatives with respect to an outer
     variable, by application of the chain rule (backpropagation).

     The 'back gradient' construct is like 'track gradient', except
     that it does not allow 'gradient_of' unless it is evaluated
     within a 'with gradient' or 'track gradient' construct, and hence
     only needs to track gradients if they will be backpropagated to
     give gradients for such a (dynamically) enclosing gradient
     construct.  It may therefore be included at little performance
     cost in a function that may either be called for only its value,
     or may be called for both its value and that value's gradient,
     depending on whether it is called from within a gradient
     construct.
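     This dual use can be sketched as follows (the function 'f' here
     is purely illustrative):

     f <- function (x)
         back gradient (y = 2*x)   # tracks gradients w.r.t. y only
             y^2 + y               # when a caller asks for gradients

     f(3)                      # called for its value only: 42
     with gradient (x=3) f(x)  # gradient d/dx (4*x^2 + 2*x) at x=3,
                               # i.e. 26, found by backpropagation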
     In forms of 'with gradient', 'track gradient', and 'back
     gradient' in which no expression follows a variable, the
     expression is assumed to be the same as the variable (evaluated
     in the current (not the new) environment).

     When a 'with gradient', 'track gradient', or 'back gradient'
     construct is exited, either normally or by an error exit, the
     variables defined within its environment are removed.  This
     prevents unnecessary duplication of objects, but does mean that
     functions that have been created with this environment as their
     enclosing environment will no longer have access to these
     variables.

Specifying how gradients should be computed:

     The gradient of an expression can be specified explicitly using
     the 'compute gradient' construct, as an alternative to simply
     letting the gradient be obtained automatically, or as a necessary
     measure if the expression contains built-in functions for which
     automatic differentiation has not yet been implemented.

     The 'expr' within 'compute gradient' is evaluated in a new
     environment in which one or more variables have been defined (two
     in the above templates).  The initial values of these variables
     are as specified, defaulting to the variable's value evaluated in
     the current environment.  Gradients with respect to these new
     variables are not tracked automatically, but are instead
     specified by the expressions after 'as'.  The chain rule is used
     to translate these gradients to gradients with respect to
     variables used to compute 'v1.expr', 'v2.expr', etc.

     The gradient computed by the expressions after 'as' should be in
     the same form as is returned by 'gradient_of' or attached as a
     'gradient' attribute by 'with gradient', as described in the
     value section below, except that Jacobian matrices that are
     square with zero values off the diagonal may be returned as a
     vector of just the diagonal elements.
     The 'dim' attribute (if present) of the Jacobian matrices
     returned is ignored, but the numeric values returned must
     nevertheless correspond to those for a matrix with the correct
     dimensions.

     A Jacobian matrix may be replaced by a function that computes the
     Jacobian matrix, or computes products of it with another matrix.
     Such a function must have 0, 1, or 2 arguments, which if present
     must be named 'right' or 'left'.  The function may be called with
     no arguments, with just a 'right' argument (if present), or with
     just a 'left' argument (if present); the function must be
     prepared for any of these possibilities.  If called with no
     arguments (testable with 'missing'), the function must return the
     Jacobian matrix (which may be just the diagonal part, as above).
     If called with a 'right' argument (a matrix, or a vector
     representing a matrix with one column), the function must return
     the product of the Jacobian matrix with the 'right' matrix.  If
     called with a 'left' argument (a matrix, or a vector representing
     a matrix with one row), the function must return the product of
     the 'left' matrix with the Jacobian matrix.  For these latter
     cases, it is possible that the function may be able to compute
     the product efficiently without ever creating the full Jacobian
     matrix.  The function will not be called with both 'right' and
     'left' arguments.  The function may be called more than once, and
     may be called after evaluation of the 'compute gradient'
     expression has completed.

     NOTE: At present, the function will always be called with no
     arguments, even if it has been written to handle 'right' or
     'left' arguments, so this feature is presently mostly pointless,
     but it will become useful in future.

     If computation of a gradient has not been requested, 'compute
     gradient' will evaluate only the value, skipping evaluation of
     the expressions after 'as'.
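     A sketch of this function form (the 'scale_by' function here is
     purely illustrative; at present only the no-argument case will
     actually be exercised):

     scale_by <- function (x, s)       # value is s*x; its Jacobian
         compute gradient (x)          # w.r.t. x is s times the
             { v <- s * x; v }         # identity matrix
         as
             function (right, left) {
                 if (!missing(right))     s * right           # J %*% right
                 else if (!missing(left)) left * s            # left %*% J
                 else                     rep (s, length(x))  # diagonal of J
             }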
     It is possible for the gradient expressions to be evaluated for
     some variables but not others (e.g., in the form shown above,
     'g2.expr' might be evaluated but 'g1.expr' not be evaluated).

Functions and operators that know how to compute gradients:

     Many functions and operators defined in the base and other
     packages know how to compute gradients.  Computation of gradients
     for built-in functions is skipped when it is known that the
     gradient will not be needed, so the overhead of the gradient
     facility when it is not being used is quite small.

     For those built-in functions that don't know how to compute
     gradients (but for which gradients exist mathematically),
     gradients will appear to be zero.  When a built-in function
     returns an 'NA' or 'NaN' value, the gradient will be regarded as
     zero (without an error or warning, and maybe not consistently at
     present).

     Gradients may be defined for real-valued random generation
     functions (e.g., 'rnorm').  The gradient for these functions
     indicates how a change in the distribution parameters would
     produce a change in the generated random value, if the state of
     the random number generator when calling the function were kept
     fixed.
     The following built-in functions and operators will compute
     gradients, with respect to all their real scalar, vector, matrix,
     or array arguments (unless noted), or when applicable, arguments
     that are lists of reals (or lists of lists of reals, etc.):

         if, ifelse, switch
         c, rbind, cbind, list, matrix, array, unlist, rep, rep.int, rep_len
         [    (for vector lists and numeric vectors, including matrices & arrays)
         [<-  (for vector lists and numeric vectors, including matrices & arrays)
         [[   (for vector lists and numeric vectors, including matrices & arrays)
         [[<- (for vector lists and numeric vectors, including matrices &
               arrays; vector as index (indexing at multiple levels) not
               supported yet)
         $    (for vector lists, not pairlists or environments)
         $<-  (for vector lists, not pairlists or environments)
         data.frame, as.data.frame, cbind.data.frame, rbind.data.frame
         methods for access and assignment to data frames with [, [[, and $
         +, -, *, /, ^  (+ and - may be unary)
         abs, sqrt, expm1, exp, log1p, log2, log10, log (one-argument form only)
         cos, sin, tan, acos, asin, atan, atan2
         cosh, sinh, tanh, acosh, asinh, atanh
         gamma, lgamma, digamma, trigamma, psigamma, beta, lbeta
         dbeta, pbeta (1st argument only), qbeta (1st argument only)
         dchisq (no ncp arg), pchisq (1st only, no ncp), qchisq (1st only, no ncp)
         dbinom, pbinom
         dcauchy, pcauchy, qcauchy, rcauchy
         dexp, pexp, qexp, rexp
         df (no ncp argument), pf (1st arg only, no ncp), qf (1st arg only, no ncp)
         dgamma, pgamma (1st and 3rd args only), qgamma (1st and 3rd args only)
           Note: 3rd argument of dgamma/pgamma/qgamma may be either rate or scale
         dgeom, pgeom
         dlogis, plogis, qlogis, rlogis
         dlnorm, plnorm, qlnorm, rlnorm
         dnbinom (3rd arg only), pnbinom (3rd arg only)
           Note: 3rd argument of dnbinom/pnbinom can be either prob or mu
         dnorm, pnorm, qnorm, rnorm
         dpois, ppois
         dt (no ncp arg), pt (1st arg only, no ncp), qt (1st arg only, no ncp)
         dunif, punif, qunif, runif
         dweibull, pweibull, qweibull, rweibull
         apply, lapply, sapply, vapply, replicate, simplify2array
         as.vector, as.list, as.double, as.numeric, as.real, as.matrix
         unclass, drop, structure, get_rm
         storage.mode<-, length<-
         invisible
         mean, sum, prod, min, max, pmin, pmax, cumsum, cumprod, cummax, cummin
         rowSums, rowMeans, .rowSums, .rowMeans, colSums, colMeans,
           .colSums, .colMeans
         t, %*%, crossprod, tcrossprod
         det, determinant
         solve (default method only, for matrices, not QR decompositions)
         diag (for creating diagonal matrix or extracting diagonal from matrix)
         diag<-

     The following replacement functions do not compute any gradient
     information, but do leave undisturbed any gradient information
     that is associated with the variable that they update:

         attr<-, attributes<-, names<-, dim<-, dimnames<-, class<-, oldClass<-

Value:

     When the gradient is with respect to only one variable, the value
     returned by 'gradient_of' or attached as a '"gradient"' attribute
     by 'with gradient' will be a scalar real if this variable has a
     scalar real value and the expression whose gradient is taken also
     has a scalar real value.  When the variable is scalar, but the
     expression is a real vector (or matrix, or array) of length
     greater than one, the gradient will be a vector of the same
     length.

     When the variable is a vector (or matrix, or array) of length
     greater than one, the gradient will be a matrix (referred to as
     the "Jacobian" matrix) whose number of columns equals the length
     of the variable's value, and whose number of rows equals the
     length of the expression's value.  Note that 'as.matrix' can be
     used to convert a numeric gradient value to always be a matrix,
     if that is desired.

     Note also that any 'dim' attribute for the expression or variable
     is ignored when creating the Jacobian matrix, which is always a
     matrix (not a higher-dimensional array), or a simple vector if
     the Jacobian matrix has only one row or one column.  This can of
     course be changed if desired (see the example below).
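     These shape rules can be seen in a small sketch:

     track gradient (v = c(1,2))
         gradient_of (c(v[1]^2, v[1]*v[2], v[2]^2))
     # a 3 x 2 Jacobian matrix: rows correspond to the elements of the
     # expression's value, columns to the elements of v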
     The gradient of a list is a list of the same form, with gradients
     as described above replacing the numeric elements.

     For a gradient with respect to a list variable, the gradient
     value is a list (or nested lists) of the same form, with elements
     that are the gradient of the expression value with respect to
     that element of the list; note that the gradient of an expression
     value could itself be a list.

     When the 'with gradient' or 'track gradient' construct has more
     than one variable, the gradient will be a list of gradient
     values, with names corresponding to the variables.  Note that
     this is the same form as would be obtained if the values of these
     variables were combined into a named list which was then used as
     a single variable in the 'with gradient' or 'track gradient'
     construct (see the example below).

Examples:

     a <- with gradient (x=3) sin(x)
     attr(a,"gradient")               # should be cos(3)

     a <- with gradient (x=3,y=2) sin(x+2*y)
     attr(a,"gradient")$x             # should be cos(7)
     attr(a,"gradient")$y             # should be 2*cos(7)

     x <- 3
     a <- with gradient (x) { r <- sin(x); r^2 }
     attr(a,"gradient")               # should be 2*sin(3)*cos(3)

     sqr <- function (y) y^2          # gradients can be tracked through sqr
     x <- 3
     a <- with gradient (x) { r <- sin(x); sqr(r) }
     attr(a,"gradient")               # should be 2*sin(3)*cos(3)

     funny <- function (x,y) {        # has a discontinuity
         q <- no_gradient(2*x)        # gradient of 2*x won't be tracked
         if (q>y/2)                   # gradient of y/2 won't be tracked
             sin(x+y)
         else
             cos(x+y)
     }

     track gradient (a = 3) {
         print (gradient_of(funny(a,a)))    # prints 2*cos(3+3)
         print (gradient_of(funny(a,8*a)))  # prints -9*sin(3+24)
     }

     sigmoid <- function (x)
         compute gradient (x) { v <- 1 / (1+exp(-x)); v } as v * (1-v)

     sigmoid(1)                       # no gradient computed, only value
     with gradient (x=1) sigmoid(x)   # both value and gradient computed

     track gradient (x=1)             # should compute the same gradient
         gradient_of (1/(1+exp(-x)))  # as above

     track gradient (x=c(-3,1,2))     # gradient will be a diagonal jacobian
         gradient_of (1/(1+exp(-x)))  # matrix, given by its diagonal elements

     set.seed(123); with gradient (r=5) rexp(1,r)
     set.seed(123); v1<-rexp(1,4.999)
     set.seed(123); v2<-rexp(1,5.001)
     (v2-v1) / 0.002                  # should be close to rexp gradient above

     r <- with gradient (a=7) list(a,a^2)
     attr(r,"gradient")               # should be a list of 1 and 14

     r <- with gradient (a=7) list(square=a^2,cube=a^3)
     attr(r,"gradient")$cube          # gradients of lists retain names

     with gradient (a=7,b=9) {
         r <- list()
         r$aplusb <- a+b
         r$atimesb <- a*b
         r$asqbsq <- r$atimesb^2
         list (r, a^2*b^2)            # a^2*b^2 should be the same as asqbsq
     }

     with gradient (a=7) {
         L <- list(square=a^2,cube=a^3)
         list (L$square*L$cube, a^5)  # both values and gradients of the two
     }                                # list elements should be the same

     with gradient (a=3)              # only the gradient with respect to a
         with gradient (b=10*a)       # will be shown, but the chain rule
             c(a,b,a*b)               # is used to convert derivatives
                                      # w.r.t. b into derivatives w.r.t. a

     with gradient (a=5,b=6) c(a^2,a*b)                # these give the same
     with gradient (x=list(a=5,b=6)) c(x$a^2,x$a*x$b)  # value and gradient

     with gradient (a=2) {            # find derivatives of powers of a, from
         L <- list(a)                 # a^1 to a^10, evaluated at a=2
         for (i in 2..10) L[[i]] <- L[[i-1]] * a
         L
     }

     with gradient (a=2) {            # same as above, but with a real vector
         V <- a                       # rather than a list
         for (i in 2..10) V[i] <- V[i-1] * a
         V
     }

     L <- as.list(seq(0,1,length=11)) # list for use below...

     with gradient (L) {              # tracks gradient w.r.t. 11 elements of L
         p <- 0
         for (i along L) p <- p + i*L[[i]]
         p^2 + p^3 + p^4 + p^5        # every operation computes derivatives
     }                                # w.r.t. all 11 elements of L

     with gradient (L) {              # compute same result more efficiently...
         p <- 0
         for (i along L) p <- p + i*L[[i]]
         back gradient (p)            # operations in the expression below
             p^2 + p^3 + p^4 + p^5    # compute derivative w.r.t. p only, then
     }                                # chain rule gives gradient w.r.t. L

     V <- seq(0,1,length=11)          # same as above, but using a real vector
     with gradient (V) {              # the gradient will be a 1 x 11 matrix
         p <- 0
         for (i along V) p <- p + i*V[i]
         back gradient (p)
             p^2 + p^3 + p^4 + p^5
     }

     track gradient (a=c(7,5))
         gradient_of (c(a,a^2,a^3,a[1]*a[2]))  # produces a 7 x 2 Jacobian
                                               # matrix

     # Make table of (a+j)^i and derivatives w.r.t. a, for a equal to 1.1.

     tbl <- function (a) {            # a function to compute such a table
         T <- matrix (nrow=4, ncol=5)
         for (i, j along T)
             T[i,j] <- if (i == 1) a+j else T[i-1,j] * (a+j)
         T
     }

     T <- with gradient (a=1.1) tbl(a)  # call tbl asking for the gradient
     T                                  # note the gradient is a simple vector,
     array(attr(T,"gradient"),dim(T))   # but dimensions can be added if needed

     # Gradient tracking for numeric data in a data frame.

     a <- c(3,1,5)
     b <- c(9,2,7)
     r <- with gradient (a,b) data.frame(a^2,a*b,b^2)
     print(r)                         # no gradient seen, since
                                      # print.data.frame doesn't print
                                      # attributes...
     attr(r,"gradient")               # ...but it's there

     # Defining and using a function to combine a value and a gradient.

     val_and_grad <- function (v)     # the call of gradient_of here will be
         list (v, gradient_of(v))     # valid when val_and_grad is called from
                                      # a place where gradient_of is valid...

     track gradient (a = 7) { b <- a^2; val_and_grad(10*b) }  # ...such as here