# Recent Posts

• Reverse ELBO — The ELBO can be written as a KL divergence. However, the KL divergence is sensitive to the order of arguments. So, what happens if we swap the arguments? Do we get a 'reverse ELBO'?
• The Chow-Robbins game with an unknown coin — The Chow-Robbins game is simple to describe: toss a fair coin repeatedly and stop whenever you want, receiving as a reward the fraction of heads accrued at the time you stop. It is all the more surprising that the expected payoff of this innocent-looking game is still not known exactly today. However, efficient numerical methods and tight bounds exist. This note shows how a related problem, where the coin bias is not known in advance, can be treated in a similar manner using dynamic programming.
• EVaR vs MaxEnt vs BI vs EM for PS — This note describes four approaches to policy search (PS), all of which can be understood as injecting optimism to facilitate exploration.
• The $\alpha$-Gaussian — The Gaussian distribution is known to be the maximum entropy distribution with fixed first and second moments. What is the distribution that maximizes the $\alpha$-entropy? The $\alpha$-Gaussian?
• LQR and the golden ratio — How suboptimal is the steady-state LQR feedback controller when applied for a finite amount of time? The golden ratio will provide an answer.
• EM as KL minimization — EM is an algorithm for ML estimation, so it obviously minimizes a KL. Although the interpretation presented here is equivalent to the one given by Neal and Hinton (1998) in terms of free energy, it may be easier to follow for the uninitiated.
• Forward kinematics by torus slicing — Already the simplest two-link planar robot arm has no unique inverse kinematics. To understand where exactly the non-uniqueness comes from, look into the geometry of the configuration space.
• Why is there a minus in the Lagrangian? — In the Lagrangian formulation of classical mechanics, there is a minus sign between the kinetic and the potential energy. On the other hand, in the Hamiltonian formulation there is a plus. Often, in reinforcement learning, we interpret a cost function as a kind of energy to be minimized. So, why do we always add terms and never subtract?
• Differentiable line segment vs disk (or circle) collision detection — It is straightforward to check whether a line intersects a disk. It is a bit harder with a line segment and a circle. And it becomes really interesting if you want to express such a condition as a differentiable inequality.
• Logistic regression as KL minimization — A minute of thought over a glass of wine reveals that the logistic regression objective is the KL divergence between a parametrized data distribution and an empirical one.
• Risk-averse linearly-solvable control — A risk-averse counterpart to the Kappen-Todorov linearly solvable control framework.
• Feynman-Kac formula — Summary with a view towards applications in controlled diffusion.
• Linearly constrained relative entropy minimization — Policy improvement by moment matching with minimal novelty.
• Expectation through CDF — A beautiful formula for computing expectations using the CDF instead of the PDF.
• Entropic proximal policy search — A generic algorithm.
• Discounting in ergodic MDPs — Does discounting make sense when there is no time?
• Ergodic policy search as a linear program — Summary.
• Deriving the HJB equation — This post provides an informal derivation of the Hamilton-Jacobi-Bellman equation that does not explicitly rely on Bellman's optimality principle.
• Smoothing and differentiation — When minimizing a noisy or highly oscillating function, it is reasonable to smooth it before computing the derivative. Linear smoothing and differentiation commute, so their order can be switched. Using the reparameterization trick to shift the dependence on the parameters from the function to the smoothing kernel, one can compute the derivative of the smoothed function even if the gradient of the function itself is not available.
• Gaussian process vs kernel ridge regression — We have seen in the previous post that estimating Gaussian conditional mean is the same as performing linear regression. In this post, we will show that kernel ridge regression plays the role of linear regression for Gaussian processes.
• Gaussian conditioning vs linear regression — Modeling inputs and outputs as jointly Gaussian and then conditioning on the inputs to predict expected outputs is equivalent to plain linear regression.
• $\alpha$-Divergence between Gaussians — Deriving the explicit formula for the $\alpha$-divergence between two univariate Gaussians.
• Change of variables and necessary conditions for optimality — For minimization of a multivariate function, often new variables get introduced to simplify calculations. However, not every change of variables is equally good.
• TD error vs advantage vs Bellman error — Clarifying the definitions of the TD error, the advantage, and the Bellman error in RL.
• All smooth $f$-divergences are locally the same — It is widely known that the KL divergence and the reverse KL locally look like the Fisher metric. This result can be generalized to any twice differentiable $f$-divergence.
• Geodesic distance between probability distributions is not the KL divergence — The Fisher metric allows one to compute the geodesic distance between probability distributions by line integration. Although the KL locally coincides with the Fisher metric, it is not the geodesic distance.
• The EM algorithm — A brief summary.
• Inertia tensor under affine change of basis — The inertia tensor defines a well-known bilinear form, the kinetic energy; hence, its coordinates transform under rotations as those of a bilinear form. Translations of the basis are described by the parallel axis theorem. By combining rotation and translation, one can express the inertia tensor in any basis given its coordinates in the frame located at the center of mass.
• Discounted reward vs average reward reinforcement learning — There are two approaches to formulating the goal in infinite-horizon MDPs. One way is to introduce a discount factor and maximize the expected sum of discounted future rewards. The other way is to maximize the expected reward under the stationary state distribution, assuming ergodicity. It is easy to show that the two approaches are equivalent.
• KL minimization vs maximum likelihood estimation — Maximum likelihood estimation can be seen as minimization of the KL divergence from a parametric distribution to an empirical distribution.
• Many faces of entropy — The definition of entropy should be explicit about the base measure. Then information-theoretic concepts such as KL divergence, joint entropy, and mutual information naturally follow.
• Bregman divergence of $\alpha$-divergence — The Bregman divergence of the KL divergence is the KL divergence again. What is the Bregman divergence of the $\alpha$-divergence? Is there another fixed point of the Bregman divergence?
• KL between trajectory distributions vs KL between policies — The KL divergence between trajectory distributions nicely decomposes into a sum of KL divergences between policies over time and state.
• Between line and parabola — What do functions between $y = x$ and $y = x^2$ look like? That's not so straightforward, because the ordinary definition $y = x^\alpha = \exp(\alpha \log x)$ doesn't work for negative $x$.
• Determinant of exponential and exponent of adjoint — Two notable equalities from Lie theory.
• Circle definitions — Generative models are similar in spirit to parametric definition of a circle. Discriminative models are akin to definition through constraints. Are there other ways to define a circle?
• Transition probability function — Definition and properties.
• Probability distribution vs cumulative distribution function — Several terms in probability theory may give the impression that they define the same thing. One should, however, carefully discern between abstract entities and their representations. Here is a brief summary of basic probability theory notions that clarifies the distinctions.
• Tensor powers at the service of humanity — Some funny identities relating tensor powers, direct products, and exponentiation.
• Types, morphisms, and concepts in C++ — From a set-theoretic point of view.
• Quaternionic change of basis — Imagine you are given coordinates of a quaternion in a coordinate frame. If you know how to get to that frame from a base frame, what are the coordinates of the quaternion in the base frame?
• Isometries of the $l_1^n$ space — Isometries of the Euclidean space are well known: they are reflections, translations, and rotations. What are their analogs for the $l_1^n$ space?
• Distance between rotations — Engineers are used to thinking that everything is a vector in a high-dimensional vector space. Rotations, however, are better thought of as a Lie group acting on vector spaces. It is crystal clear what the distance between vectors in a Euclidean space is. The definition of distance between elements of a rotation group, on the other hand, may very well surprise you.
• Is a Circle a kind of an Ellipse in C++? — Intuition suggests that one should inherit Circle from Ellipse to represent the relationship "is a kind of" between them. However, there are flaws in such reasoning.
• Fisher metric vs KL-divergence — The Fisher information metric can be viewed as the second derivative of the KL-divergence.
• CSS styles for blogging — Links to popular CSS frameworks.
• Jacobian transpose vs pseudoinverse — Understanding differences between the two inverse kinematics algorithms.
• Tensor product vs direct product vs Cartesian product — All of these are different ways to build complex objects from simple ones.
• Time value of money for engineers — Why is \$1 today worth more than \$1 tomorrow? In this post, we'll see how to transform money in time, learn about the concepts of present value (PV) and future value (FV), and create a savings plan for a happy retirement.
• Change of basis vs linear transformation — Learn to discern rotation of a vector from rotation of a basis.
• Run matplotlib in a virtualenv on Ubuntu 16.04 with different backends — If figures do not show up, rebuild matplotlib from source.
• Jekyll syntax highlighting themes — Beautiful pastel code highlighting for rouge.
• Fourier transforms explained — DFT vs DTFT vs FFT. WTF?
• Recover hard bricked OnePlus One — Reset to the factory state when the phone does not respond to any input.
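One entry above refers to the expectation-through-CDF identity: for a nonnegative random variable $X$ with CDF $F$, $\mathbb{E}[X] = \int_0^\infty (1 - F(x))\,dx$. As a quick sanity check of this identity (a minimal sketch, not code from any of the posts; the Exponential distribution and helper function are chosen here purely for illustration):

```python
import math

def expectation_through_cdf(cdf, upper=50.0, n=200_000):
    """Approximate E[X] = integral of (1 - F(x)) over [0, upper]
    with the midpoint rule, for a nonnegative random variable."""
    dx = upper / n
    return sum((1.0 - cdf((i + 0.5) * dx)) * dx for i in range(n))

# Exponential(rate) has CDF F(x) = 1 - exp(-rate * x) and mean 1 / rate.
rate = 2.0
mean = expectation_through_cdf(lambda x: 1.0 - math.exp(-rate * x))
print(mean)  # close to 1 / rate = 0.5
```

The truncation at `upper=50.0` is harmless here because the exponential tail decays fast; for heavy-tailed distributions the upper limit would need more care.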

#### Quotes

The first and most necessary topic in philosophy is the practical application of principles; as, "We ought not to lie": the second is that of demonstrations; as, "Why it is that we ought not to lie": the third, that which gives strength and logical connection to the other two; as, "Why this is a demonstration". For what is demonstration? What is consequence? What contradiction? What truth? What falsehood? The third point is then necessary on the account of the second; and the second on the account of the first. But the most necessary, and that whereon we ought to rest, is the first. But we do just the contrary. For we spend all our time on the third point, and employ all our diligence about that, and entirely neglect the first. Therefore, at the same time that we lie, we are very ready to show how it is demonstrated that lying is not right.