Lecture Script

Introduction

Reference material

In terms of background, please refer to the Maths for Intelligent Systems [<https://www.user.tu-berlin.de/mtoussai/teaching/Lecture-Maths.pdf>] as well as the Intro to Robotics [<https://www.user.tu-berlin.de/mtoussai/teaching/Lecture-Robotics.pdf>] lecture scripts. Here a list of further teaching material:

Coding Getting Started

Please follow the instructions at github/robotics-course [<https://marctoussaint.github.io/robotics-course/>] for setting up the python package. This includes a series of tutorials, which can also be downloaded here [<https://github.com/MarcToussaint/rai-tutorials>].

Scene & Robot Description

Generally speaking, a scene is a collection of objects (including robot parts). We typically assume objects to be rigid bodies with fixed shape – which clearly is a simplification relative to real world. More about this below, in section 1.14.1.

However, formally we define a scene as a collection of frames, which is short for coordinate frames. We can think of these frames as oriented locations in 3D space – and various things can be associated to these frames. If a rigid shape and mass is associated to a frame, then it makes a typical rigid body. But frames can also be associated to robot joint locations or virtual landmarks.

Transformations

Let \(i=1,..,m\) enumerate \(m\) frames in a scene. Each frame has a pose \(X_i\in SE(3)\), where \(SE(3) = {\mathbb{R}}^3 \times SO(3)\) is the group of 3D transformations, namely the cross-product of translations and rotations. We always assume a world origin to be defined and use the word pose specifically for the transformation from world origin to the object frame.

Transformations in \(A\in SE(3)\) are tuples \(A = (t, r)\), where \(t\in{\mathbb{R}}^3\) is a translation and \(r\in SO(3)\) a rotation – see Appendix 1.10 for more details. Rotations can be represented as matrix \(R\) (see the Maths script on properties of rotation matrices), and a pose as the \(4\times 4\) homogeneous transform \(\left(\begin{array}{cc}R & t \\ 0 & 1\end{array}\right)\). However, more commonly in code we represent rotations as a 4D quaternion \(r\in{\mathbb{R}}^4\) with unit length \(|r| = 1\). I always use the convention \(r=(r_0,\bar r)\), where the first entry \(r_0 = \cos(\theta/2)\) relates to the total rotation angle \(\theta\), and the last three entries \(\bar r = \sin(\theta/2)~ \underline w\) relate to the unit length rotation axis \(\underline w\).

Euler angles and the scaled rotation vector are alternative rotation representations – but never use them. The appendix 1.10 introduces to all these representations and derives conversion equations to relate them.

The illustrates how you can manually define frames in a configuration and set absolute or relative transformations.

Coordinates and Composition of Transformations

|image|

[figTransforms] Composition of transforms.

Consider Fig. [figTransforms], were we have three frames \(1,2,3\) in addition to the world origin frame \(W\). Each frame has a global pose \(X_1, X_2, X_3\), and relative transforms \(Q_{W\to 1}, Q_{1\to 2}, Q_{2\to 3}\). We have

\[\begin{split}\begin{aligned} X_1 &= Q_{W\to 1} \\ X_2 &= Q_{W\to 1} \circ Q_{1\to2} \\ X_3 &= Q_{W\to 1} \circ Q_{1\to2} \circ Q_{1\to3} ~.\end{aligned}\end{split}\]

Note that when composing relative transforms, we concatenate (append) them on the right! Intuitively, this describes a concatenation of turtle commands, where a turtle is initially placed on the world origin, then translates, then rotations, then translates relative to its own pose, then rotations relative to its own pose, etc, and ends up in pose \(X_3\).

Now consider the position of a point in 3D space. It can be given in world coordinates \(x^W\), but also in relative coordinates \(x^1, x^2, x^3\). We have

\[\begin{aligned} x^W &= Q_{W\to 1}~ Q_{1\to2}~ Q_{1\to3}~ x^3 = X_3~ x^3 ~.\end{aligned}\]

Now you might want to ask: “does \(Q_{1\to 2}\) describe the forward or the backward transformation from frame \(1\) to frame \(2\)?” But this question is somewhat ill-posed. The situation is:

  • \(Q_{1\to 2}\) describes the translation and rotation of frame \(2\) relative to \(1\). So you may call it the “forward FRAME transformation”.

  • \(Q_{1\to 2}\) describes the coordinate transformation from \(x^2\) to \(x^1 = Q_{1\to 2} x^2\). So you may call it the “backward COORDINATE transformation”.

In the view of fundamental linear algebra, this should not surprise as basis vectors transform covariant, while coordinates transform contra-variant. The appendix 1.10.2.1 explains this again in more detail and with an explicit example.

Scene Tree or Forest

Scenes are typically represented as trees, with the world origin as a root, and the pose of children specified by a relative transformation from the parent. For instance, a scene with a book on a table on the ground on the world, would have four frames with poses \(X_0, X_1, X_2, X_3\) (of the world, ground, table, book), but the scene would typically be represented by relative transforms \(Q_1, Q_2, Q_3\) such that

\[X_i = X_{i{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1}} \circ Q_i ~.\]

Note that every frame can only have a single parent, and we can abbreviate the notation \(Q_i \equiv Q_{\text{parent}(i)\to i}\).

Scenes can also be a forest of frames, where some frames have no parent and their pose \(X_i\) must be specified, while for non-roots the relative transform \(Q_i\) is specified. We usually only talk about trees, but include meaning forests.

The also demonstrates how to define a fram a child of another, thereby defining a frame tree. Instead of the absolute pose X, you typically specify the relative transformation Q for such a child frame.

Kinematics

Robots as Parameterized Trees

The key thing in robotics is that some relative transforms (between robot links) are “motorized” and can be moved. Formally, this means that some of the relative transforms \(Q_i\) in our scene have degrees of freedom (dof) \(q_i \in {\mathbb{R}}^{d_i}\).

For typical robots (with hinge or linear joints) each \(q_i\) is just a single number (the joint dimensionality \(d_i=1\)). E.g., a hinge joint around the (local) \(x\)-axis has a single dof \(q_i\in{\mathbb{R}}\) that parameterizes the relative transform

\[\begin{split}\begin{aligned} Q_i(q_i) = \left(\begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & \cos(q_i) & -\sin(q) & 0 \\ 0 & \sin(q_i) & \cos(q) & 0 \\ 0 & 0 & 0 & 1\end{array}\right) ~.\end{aligned}\end{split}\]

And a prismatic (or translational) joint along the (local) \(x\)-axis parameterizes

\[\begin{split}\begin{aligned} Q_i(q_i) = \left(\begin{array}{cccc} 1 & 0 & 0 & q \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{array}\right) ~.\end{aligned}\end{split}\]

Other joint types (universal, cylindrical) are less common.

A bit special are ball (spherical) joints: They parameterize arbitrary rotations within \(Q_i\) – in principle they could be described as having 3 dofs (as the Lie group \(SO(3)\) is a 3D manifold), however, in code it is practice to again use quaternions to parameterize rotations, which means \(q_i\in{\mathbb{R}}^4\) for ball joints. However, note that this is an over parameterization: If \(q_i\) is externally “set” by a user or some algorithm, it may not (exactly) be normalized but \(Q_i(q_i)\) is defined to be the proper rotation that corresponds to the quaternion \(q_i/|q_i|\). Note that if a user or algorithms sets such a quaternion parameter to zero, that’s a singularity and strict error.

In the , when a joint is define for the first time, play around with alternative joint types, e.g. a quatBall. The tutorial also lists which joint types are pre-defined.

In the scene tree, some of the relative transforms \(Q_i\) are parameterized by dofs, \(Q_i(q_i)\). Note that \(X_\text{parent$(i)$}\) is the joint origin, i.e., determines the location and orientation of the joint axis, while \(X_i = X_\text{parent$(i)$} Q_i\) is the joint (output) frame. In a robot structure one typically has chains of alternating rigid and parameterized transforms, e.g.,

a rigid transform \(Q_{\pi(i)}\) from world into the origin of joint \(i\)

a parameterized transform \(Q_i(q_i)\) representing the joint motion (We call this one the joint frame, as it hosts the joint dofs.)

a rigid transform \(Q_{i \to \pi(j)}\) from the output of \(i\) into the origin of a following joint \(j\)

a parameterized transform \(Q_j(q_j)\)

etc

There is a minimalistic convention of describing robot structures, called Denavit-Hartenberg convention. These describe the rigid transformations between joints using only 4 numbers instead of 6 (which pre-determines the zero calibration as well as the “lateral” positioning of the following joint origin). But there is no need to use this convention and the above notation is conceptually cleaner and leads to intuitive, freely user-defined joint origins.

In the you find a section on interactively editing a scene description file mini.g. Using this you can try to invent your own robot and environment. The tutorial also shows how to load pre-define robot models. The appendix [secConfigFiles] provides a more formal specification of the yaml-style file syntax.

Forward Kinematics

We use the word configuration for an “articulated scene”, i.e., where some relative transforms \(Q_i(q_i)\) are parameterized by dofs \(q_i \in {\mathbb{R}}^{d_i}\) (and also other dofs such as forces or timings might be represented). A configuration can include multiple robots – from our perspective there is no difference between one or multiple robots. It’s just a parameterized forest of frames.

We define the joint vector \(q\in{\mathbb{R}}^n\) to be the stacking of all dofs \(q_i\) (all dofs of a configuration). Given the joint vector, we can forward chain all relative transformations in the scene and thereby compute the absolute pose \(X_i(q)\) of every frame as a function of \(q\).

This function \(q \mapsto X_i(q)\) is the core of forward kinematics. It describes how the joint vector \(q\) determines the pose of all frames in the configuration.

The precise definition of the term forward kinematics varies across textbooks. I find the most concise definition to be the mapping from all dofs \(q\) to the full configuration state \(\{X_i(q)\}_{i=1}^m\), which so far we described in terms of all frame poses. This definition is consistent with the formal description of kinematics as the theory of possible motions of a system configuration (see 1.14.2).

But in practice, the word forward kinematics is often used simply as the mapping from \(q\) to one particular “feature” of the configuration. For instance, if \(X_i(q)=(t_i(q),r_i(q))\) is the pose of some frame \(i\) (e.g. the “end-effector”), forward kinematics can describe the mapping

  • \(q\mapsto t_i(q)\)   to the position of frame \(i\)

  • \(q\mapsto r_i(q) \cdot \textbf{e}_x\)   to the \(x\)-axis of frame \(i\) (where \(\textbf{e}_x = (1,0,0)^{\!\top\!}\)).

  • \(q\mapsto X_i(q) p\)   to the world coordinate of a point attached to frame \(i\) with fixed relative offset \(p\).

Each of these are 3-dimensional features. Let introduce a more formal notation for these three basic features:

\[\begin{split}\begin{aligned} q \mapsto \phi^{\textsf{pos}}_{i,p}(q) &= X_i(q)~ p \quad\in {\mathbb{R}}^3 ~, \\ q \mapsto \phi^{\textsf{vec}}_{i,v}(q) &= r_i(q) \cdot v \quad\in {\mathbb{R}}^3 ~, \\ q \mapsto \phi^{\textsf{quat}}_{i}(q) &= r_i(q) \quad\in {\mathbb{R}}^4 ~,\end{aligned}\end{split}\]

where \(\phi^{\textsf{pos}}_{i,p}(q)\) is the (world) position of a point attached to frame \(i\) with relative offset \(p\), \(\phi^{\textsf{vec}}_{i,v}(q)\) is the world coordinates of a vector \(v\) attached to frame \(i\), and \(\phi^{\textsf{quat}}_{i}(q)\) is the 4D quaternion orientation of frame \(i\). From these three, many others features can be derived.

E.g., also the \(3\times 3\) rotation matrix is a useful basic feature (as it is often used in equations). We can easily construct it by concatenating columns, \(\phi^{\textsf{rot}}_i = (\phi^{\textsf{vec}}_{i,e_x}, \phi^{\textsf{vec}}_{i,e_y}, \phi^{\textsf{vec}}_{i,e_z}) \in {\mathbb{R}}^{3\times 3}\) for basis vectors \(e_x,e_y,e_z\) of frame \(i\). (Note that the Jacobian (defined later) of this is a \(3\times 3 \times n\) tensor.)

The output space of the kinematic map is also called task space. However, I often just call it kinematic feature.

The illustrates how you get the joint vector \(q\) and set it. This way you can animate the configuration. Also the positions and orientations of all frames can be queried directly – realizing the most basic kind of forward kinematics.

Jacobians

We will use kinematic features \(\phi\) to formulate differentiable constraint and optimization problem. Therefore, we assume all kinematic features \(\phi\) are differentiable and we can efficiently compute the Jacobian

\[\begin{aligned} J(q) = \frac{\partial}{\partial q}\phi(q) ~.\end{aligned}\]

If \(y = \phi(q)\), then this Jacobian tells us how a velocity \(\dot q\) in joint space implies a velocity \(\dot y\) in task space,

\[\begin{aligned} \dot y = J(q) \dot q ~.\end{aligned}\]

Recall that the forward kinematics is essentially implemented by forward chaining the relative transforms \(Q_i\). If we use an auto-differentiable programming language for this, we’d directly have the Jacobians. However, the Jacobians can also directly be expressed analytically and their computation turns out simpler and more efficient than the forward chaining itself. To implement a kinematic engine we essentially need to figure out how the different joint types contribute to the Jacobians of the three basic features above. This is covered by considering the following cases:

Rotational Joint

Consider that somewhere on the path from world to frame \(i\) there is a rotational (hinge) joint \(j\) positioned at \(p_j\) and with unit axis vector \(a_j\) (both in world coordinates). Now consider a point attached to frame \(i\) at world coordinate \(p\). (Note that we needed forward kinematics to determine \(p_j, a_j\), and \(p\).) Then the velocity \(\dot p\) relates to the joint angle velocity \(\dot q_j\) by

\[\dot p = [a_j \times (p - p_j)]~ \dot q_j ~.\]

Now assume a vector \(v\) attached to frame \(i\). Its velocity is

\[\dot v = [a_j \times v]~ \dot q_j = [-{\text{skew}}(v)~ a_j]~ \dot q_j ~.\]

Now consider the quaternion \(r_i\) of frame \(i\). Its velocity (much less obvious, see appendix Eq. ([eqQuatVel])) is

\[\dot r_i = {\frac{1}{2}}[(0,a_j)\circ r_i]~ \dot q_j ~.\]

Recall that \(q\in{\mathbb{R}}^n\) is the full joint vector. Let \(j\) be the dof index of our rotational joint such that \(q_j \in {\mathbb{R}}\) is the scalar joint angle. Further, let \(p_j,a_j\) be the joint position and axis, and \(p\) a world query point. We define two matrices that are zero except for the provided columns:

\[\begin{split}\begin{aligned} J^{\textsf{ang}}\in {\mathbb{R}}^{3 \times n} \quad\text{with}\quad &J^{\textsf{ang}}_{:,j} = a_j ~, \\ J^{\textsf{pos}}(p) \in {\mathbb{R}}^{3 \times n} \quad\text{with}\quad &J^{\textsf{pos}}_{:,j} = a_j \times (p - p_j) ~.\end{aligned}\end{split}\]

With these two matrices we can rewrite the above equations as

\[\begin{split}\begin{aligned} \dot p &= J^{\textsf{pos}}(p)~ \dot q \\ \dot v &= [-{\text{skew}}(v)~ J^{\textsf{ang}}(p)]~ \dot q \\ \dot r &= {\frac{1}{2}}[\text{Skew}(r)~ \bar J^{\textsf{ang}}(p)]~ \dot q \quad\text{where}\quad \text{Skew}(w,x,y,z) = \left(\begin{array}{cccc} +w & -x & -y & -z \\ +x & +w & +z & -y \\ +y & -z & +w & +x \\ +z & +y & -x & +w\end{array}\right) ~, \label{eqQuatRate}\end{aligned}\end{split}\]

where by convention the cross-product \([A\times v]\) for a \(3\times n\) matrix with a 3-vector takes the cross-products row-wise (could perhaps better be written \([-v\times A]\)). The last equation is derived in the appendix with Eq. ([eqQuatVel]), where we discuss how an angular velocity translates to a quaternion velocity. The bar in \(\bar J^{\textsf{ang}}\) makes this a \(4\times n\) matrix by inserting a zero top row (analogous to \((0,w)\) in ([eqQuatVel])). The \(\text{Skew}\) is an unusual definition of a skew matrix for quaternions, so that quaternion multiplication \(a \circ b\) can be written linearly as \(\text{Skew}(b)~ a\).

Now, if in our scene tree we have more than one rotational joint between world and frame \(i\), each of these joints simply contribute non-zero columns to our basic matrices \(J^{\textsf{ang}}, J^{\textsf{pos}}(p)\). So this is the core of what we have to implement for rotational joints.

Translational Joint

A translational (prismatic) joint on the path from world to frame \(i\) also contribute a column to the basic matrix \(J^{\textsf{pos}}(p)\), but contributes notion to \(J^{\textsf{ang}}\) (as it does not imply rotational velocity in the sub-branch). Specifically, let \(a_j\) the translational axis of the joint with dof index \(j\), then it simply contributes a column

\[\begin{aligned} J^{\textsf{pos}}_{:,j} = a_j ~.\end{aligned}\]

That’s it for translational joints.

Quaternion Joint

Trickier, but important for ball and free joints is to also know how a quaternion joint contributes columns to \(J^{\textsf{ang}}\) and \(J^{\textsf{pos}}(p)\). Modifying a quaternion parameterization \(q_j\in{\mathbb{R}}^4\) of a relative transform \(Q_j(q_j)\) implies in some way a rotational velocity down the branch. So the effect should be similar to a rotational joint, but without fixed axis and modulated by the normalization of \(q_j\). The solution is derived in the appendix with Eq. ([eqQuatJac]) and summarized here: Let \(X_j\) be the output pose of the quaternion joint. (Yes, output!) And let \(R_j\) be the \(3\times 3\) rotation matrix for the world pose \(X_j\), and let \(r_j \in {\mathbb{R}}^4\) be the quaternion of the relative joint transform \(Q_j\). Then

\[\begin{aligned} \label{eqQuatJoint1} J^{\textsf{ang}}_{:,j} = \frac{1}{|q|} R_j J(r_j) ~,\quad\text{where}\quad J(r)_{:,k} &= -2 (e_k \circ r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1})_{1:3} ~.\end{aligned}\]

Here, \(e_i\) for \(k=0,..,3\) are the unit quaternions and the matrix \(J(r)\in{\mathbb{R}}{3 \times 4}\) describes how a variation of a quaternion \(r\) induces a 3D rotation vector relative to the output space of \(r\). I call this the quaternion Jacobian. The derivation is found in the appendix when discussion how a quaternion velocity implies and angular velocity. The multiplication with \(R_j\) transforms this rotation vector to world coordinates. The division by \(|q_j|\) accounts when the dof \(q_j\) is not (exactly) normalized.

As we figured out the angular vector induced by a variation of a quaternion joint, this also defines the column it contributes to the positional Jacobian:

\[\begin{aligned} J^{\textsf{pos}}_{:,j}(p) = [\frac{1}{|q|} R_j J(r_j)] \times (p - p_j) ~,\end{aligned}\]

where \(p_j\) is the position of the quaternion joint.

Note how differently we treat the quaternion \(q_j\) as a joint parameterization \(Q_j(q_j)\) and the quaternion \(r_i\) as a kinematic (“output”) feature of frame \(i\). For instance, we can have the Jacobian of the quaternion \(r_i\) w.r.t. the quaternion joint parameterization \(q_j\), by inserting ([eqQuatJoint1]) into ([eqQuatRate]). And even if all other transformation in the scene are identities and the output quaternion \(r_i\) is “essentially identical” to the joint quaternion \(q_j\), the Jacobian is still not exactly identity, as it accounts for normalization (and potential flip of sign).

General Concept of Differentiable Features

In the previous sections we focussed on the 3 mappings \(\phi^{\textsf{pos}}_{i,p}(q), \phi^{\textsf{vec}}_{i,v}(q), \phi^{\textsf{quat}}_{i}(q)\). The Jacobians of these are given via \(J^{\textsf{pos}}_{:,j}(p)\) and \(J^{\textsf{ang}}_{:,j}(p)\). If these are given (e.g. implemented by an efficient core kinematics engine), then many other features can be computed based on them.

We assume a single configuration \(q\), or a whole set of configurations \(\{q_1,..,q_T\}\), with each \(q_i \in\mathbb{R}\) the DOFs of that configuration.

In general, a (0-order) feature \(\phi\) is a differentiable mapping

\[\phi: q \mapsto \mathbb{R}^D\]

of a single configuration into some \(D\)-dimensional space.

The introduces to features that are readily implemented in the rai code.

(In C++, new features can be implemented by overloading the abstract Feature class. Implementing new features is typically done by first evaluating existing features and then “forward chaining” the computation of the new feature – quite similar to how models are defined in pyTorch or similar autodiff frameworks. The C++ code uses autodiff (which forward chains Jacobians directly at computation) for most features.)

When using features in code, one can additionally specify a target and scale, which defines a subsequent linear transformation:

\[\phi(q) \gets \texttt{scale} \cdot (\phi(q) - \texttt{target})\]

Note that the scale can be a matrix, which projects the feature. E.g., if you want to define a 2D feature which is the \(xy\)-position of frame \(i\), then you can use a matrix \(\texttt{scale}= \left(\begin{array}{ccc}1 & 0 & 0 \\ 0 & 1 & 0\end{array}\right)\).

Further, a feature can also be of higher order, which by default means a finite difference of a 0-order features. In general, a higher-order features is a differentiable mapping

\[\phi: (q_0,q_1,..,q_k) \mapsto \mathbb{R}^D\]

of a \((k+1)\)-tuple of configurations to a \(D\)-dimensional space. This is typically used in the context of path configurations, which is a sequence of configurations used in path optimization.

Given any 0-order feature \(\phi\), by default that defines its 1st and 2st order feature as

\[\phi(q_0,q_1) = \frac{1}{\tau}(\phi(q_1) - \phi(q_0))\]

and

\[\phi(q_0,q_1,q_2) = \frac{1}{\tau^2}(\phi(q_2) - 2 \phi(q_1) + \phi(q_0)) ~,\]

which are the finite difference approximations of the feature’s velocity and acceleration. However, one can also directly implement higher-order features, e.g. to represent dynamics constraints, or more elaborate acceleration/torque cost features.

Summary: Implementing a Kinematic Engine

The above provides all essentials necessary to implement a rather general kinematic engine. To summarize:

  • Represent a scene configuration as a tree of frames, where for each frame we store the absolute pose \(X\) and relative transform \(Q\). We also annotate which relative transforms \(Q\) have dofs and how many. We need to maintain an index mapping that tells us which entries \(q_j\) of the full joint vector parameterize a given relative transformation \(Q_j(q_j)\) (essentially mapping between \(q\)-indices and frame indices).

  • An efficient implementation of forward chaining transformations: Given the absolute poses \(X\) of all root frames and all relative transforms \(Q\), implement an efficient algorithm to forward chain transformations to ensure any \(X_i\). Do this lazily on demand: Only when an absolute frame \(X_i\) is actually queried call this forward chaining for this \(X_i\) only.

  • An efficient implementation of the matrices \(J^{\textsf{pos}}\) and \(J^{\textsf{ang}}\), which, for any query frame \(i\), determines which joints are on the path from \(i\) to a root frame and for each of these joints contributes the corresponding columns to \(J^{\textsf{pos}}\) and \(J^{\textsf{ang}}\). To account for large systems (esp. path configurations, see below) matrices should be returned in sparse format.

Based on this, one provides more convenient user functions that allow to query kinematic features for any frame \(i\), including the pose \(X_i\), and on demand also provide the Jacobian of that feature.

Inverse Kinematics

|image|

We can “puppeteer” a robot by defining optimization problems with task space constraints and solve for the joint state.

We introduced forward kinematics as a mapping from an \(n\)-dimensional joint vector \(q\in{\mathbb{R}}^n\) to some \(d\)-dimensional kinematic feature \(y=\phi(q) \in{\mathbb{R}}^d\). Inverse kinematics roughly means to invert this mapping, i.e., given a desired target \(y^*\) in task space, find a joint vector \(q\) such that \(\phi(q) = y^*\). As often \(n>d\), the inversion is under-specified (leading to what is called “redundancy”). But just as the pseudo-inverse of a linear transformation addresses this, we can generalize this to a non-linear \(\phi\) – namely in an optimality formulation.

Given \(\phi\) and a target \(y^*\), a good option is to define inverse kinematics as the non-linear mathematical program (NLP)

\[\begin{aligned} \label{eqIKNLP} q^* = \argmin_q f(q) ~~\text{s.t.}~~\phi(q) = y^* ~.\end{aligned}\]

The cost term \(f(q)\) is called regularization and indicates a preference among all solutions that satisfy \(\phi(q) = y\). One might typically choose it as a squared distance \(f(q) = |\!|q-q_0|\!|^2_W\) to some “default” \(q_0\), which could be the homing state of a robot or its current state.

In practice, I recommend always using a proper NLP solver to solve inverse kinematics. As discussing optimization is beyond this script we are here already done with describing inverse kinematics! It is “nothing more” than defining a constraint problem of the sort ([eqIKNLP]) and passing it to a solver. In the coding part below I will discuss the richness in options to define such constraint problems with our differentiable features.

Only for educational purpose we will also derive the classical pseudo-inverse Jacobian solution to IK below.

Building an NLP from features

Eq. ([eqIKNLP]) describes IK as an NLP. Appendix 1.12.1 provides a technical reference of how we define NLPs mathematically and in code. Essentially, an NLP is specified by adding objectives, where each objective is given by a feature function \(\phi_i\) and an indicator \(\varrho_i\) that defines whether the feature contributes a linear cost (f), sum-of-squares cost (sos), equality constraint (eq), or inequality constraint (ineq) to the NLP.

The illustrates how an Inverse Kinematics problem can be specified as NLP. The core is the addObjective method, which adds a kinematic feature (optimally with transformed by scaling and target) as a cost or constraint (depending on the f, sos, eq, or ineqindicator) to the NLP.

Classical Derivation of Pseudo-Inverse Jacobian Solution

I strongly recommend using an NLP solver and general constraint and cost formulations to tackle IK problems – and you can skip over this section. However, for completeness I provide here also the basic derivation of classical pseudo-inverse Jacobian solutions.

Pseudo-inverse Jacobian.

We first simplify the problem to minimize

\[\begin{aligned} \label{eqSoft} f(q) = |\!|\phi(q) - y^*|\!|^2_C + |\!|q-q_0|\!|^2_W ~.\end{aligned}\]

Instead of exactly ensuring \(\phi(q) = y^*\), this only minimizes a penalty \(|\!|\phi(q) - y^*|\!|^2_C\). Here \(C\) is the norm’s metric, i.e., \(|\!|v|\!|^2_C = v^{\!\top\!}C v\), but you may instead simply assume \(C\) is a scalar. For finite \(C\) and \(W\) this approximate formulation might be undesirable. But later we will actually be able to investigate the limit \(C\to\infty\).

Since this problem is a least squares problem, the canonical approach is Gauss-Newton. The gradient, approximate Hessian, and Gauss-Newton step are

\[\begin{split}\begin{aligned} \frac{\partial}{\partial q} f(q) &= 2 (\phi(q)-y^*)^{\!\top\!}C J + 2 (q-q_0)^{\!\top\!}W = {\nabla_{\!\!f}}(q)^{\!\top\!}\\ {\nabla_{\!\!f}^2}(q) &\approx 2 (J^{\!\top\!}C J + W) \\ \delta(q) &= - [{\nabla_{\!\!f}^2}(q)]^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} {\nabla_{\!\!f}}(q) = (J^{\!\top\!}C J + W)^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} [J^{\!\top\!}C (\phi(q)-y^*) + W (q-q_0) ]\end{aligned}\end{split}\]

With some identities, this can be rewritten as

\[\begin{split}\begin{aligned} \delta(q) &= J^\sharp (y^* - \phi(q)) + (I - J^\sharp J)~ (q_0 - q) \label{eqIK} \\ J^\sharp &= (J^{\!\top\!}C J + W)^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} J^{\!\top\!}C = W^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} J^{\!\top\!}(J W^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} J^{\!\top\!}+ C^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1})^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} \text{(Woodbury identity)}\end{aligned}\end{split}\]

The matrix \(J^\sharp\) is also called (regularized) pseudo-inverse of \(J\). In its second form (RHS of Woodbury), we can take the hard limit \(C\to\infty\), where \(J^\sharp \to W^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} J^{\!\top\!}(J W^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} J^{\!\top\!})^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1}\) or, for \(W={\rm\bf I}\), \(J^\sharp \to J^{\!\top\!}(J J^{\!\top\!})^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1}\).

Eq. ([eqIK]) says that, to jump to the (approx.) Gauss-Newton optimum, we should make a step \(\delta\) in joint space proportional to the error \((y^*-\phi(q))\) in task space, and (optionally) combined with a homing step towards \(q_0\) projected to the task null space via the projection \((I - J^\sharp J)\).

Performing a single step \(\delta\) is approximate due to the non-linearity of \(\phi\). To solve inverse kinematics exactly we have to iterate Gauss-Newton steps. If lucky, we can use full stepsizes (\(\alpha= 1\) in the speak of line search) and iterate \(q_{k{{\hspace{-0.0pt}\textrm{+}\hspace{-0.5pt}}1}} \gets q_k + \delta(q_k)\) until convergence, and will have an exact IK solution. If \(\phi\) is very non-linear, we may have to do line searches along the step directions to ensure convergence. If \(\phi\) is non-convex, we may converge to a local optimum that depends on the initialization.

On the fly IK.

Inverse kinematics is sometimes being used to generate robot motion on the fly. In a sense, rather than letting an optimization algorithm find an IK solution and then start moving the robot to it (we we’ll do it below), you let the robot directly move (generate a smooth path) towards an IK solution. This is heuristic, and I eventually don’t recommend it. But it’s standard practice, so let’s mention it:

Let the robot be in state \(q\), and we have a task space target \(y^*\). We may compute a desired robot motion

\[\begin{aligned} \dot q = \alpha\Big[ J^\sharp (y^* - \phi(q)) + (I - J^\sharp J) (q_0 - q) \Big] ~.\end{aligned}\]

In a sense, this mimics performing (integrating over time) infinitesimal Gauss-Newton steps towards the IK solution. Often the regularization \((I - J^\sharp J) (q_0 - q)\) is also dropped, which is the same as saying \(q_0 = q\), i.e., you always set the homing state \(q_0\) to be the current state \(q\), adapting it on the fly. Doing this, you will loose a precise definition of where you’ll eventually converge to – and sometimes this leads to undesired drift in nullspace. All not recommended.

Singularity.

The limit \(C\to\infty\) mentioned above is only robust when \(\det (J J^{\!\top\!}) > 0\), or equivalently, when \(J\) has full rank (namely rank \(d\)). \(J\) is called singular otherwise, and the pseudo inverse \(J^\sharp\) is ill-defined.

Intuitively this means that, in state \(q\), certain task space directions cannot be generated, i.e., no motion in these task space directions is possible. A stretched arm that cannot extend further is a typical example.

In the original NLP formulation, this corresponds to the case where \(\phi(q) = y^*\) is simply infeasible, and a proper NLP-solver should return this information.

The soft problem formulation ([eqSoft]), where \(C\) is finite (not \(\infty\)) is one way to address a singularity: For finite \(C\), \(J^\sharp\) is well defined and defines steps towards a optimal solution of the trade-off problem ([eqSoft]). This is also called regularized IK or singularity-robust IK. But it only leads to an approximate IK solution.

3D Transformations, Rotations, Quaternions

Rotations

There are many ways to represent rotations in \(SO(3)\). We restrict ourselves to three basic ones: rotation matrix, rotation vector, and quaternion. The rotation vector is also the most natural representation for a “rotation velocity” (angular velocities). Euler angles or raw-pitch-roll are an alternative, but they have singularities and I don’t recommend using them in practice.

A rotation matrix

is a matrix \(R\in{\mathbb{R}}^{3\times3}\) which is orthonormal (columns and rows are orthogonal unit vectors, implying determinant 1). While a \(3\times3\) matrix has 9 degrees of freedom (DoFs), the constraint of orthogonality and determinant 1 constraints this: The set of rotation matrices has only 3 DoFs (\(\sim\) the local Lie algebra is 3-dim).

  • The application of \(R\) on a vector \(x\) is simply the matrix-vector product \(R x\).

  • Concatenation of two rotations \(R_1\) and \(R_2\) is the normal matrix-matrix product \(R_1 R_2\).

  • Inversion is the transpose, \(R^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} = R^{\!\top\!}\).

A rotation vector

is an unconstrained vector \(w\in{\mathbb{R}}^3\). The vector’s direction \(\underline w = \frac{w}{|w|}\) determines the rotation axis, the vector’s length \(|w|=\theta\) determines the rotation angle (in radians, using the right thumb convention).

  • The application of a rotation described by \(w\in{\mathbb{R}}^3\) on a vector \(x\in{\mathbb{R}}^3\) is given as (Rodrigues’ formula)

    \[\begin{aligned} w \cdot x &= \cos\theta~ x + \sin\theta~ (\underline w\times x) + (1-\cos\theta)~ \underline w(\underline w^{\!\top\!}x)\end{aligned}\]

    where \(\theta=|w|\) is the rotation angle and \(\underline w=w/\theta\) the unit length rotation axis.

  • The inverse rotation is described by the negative of the rotation vector.

  • Concatenation is non-trivial in this representation and we don’t discuss it here. In practice, a rotation vector is first converted to a rotation matrix or quaternion.

  • Conversion to a matrix: For every vector \(w\in{\mathbb{R}}^3\) we define its skew symmetric matrix as

    \[\begin{split}\begin{aligned} \hat w = \text{skew}(w) = \left(\begin{array}{ccc}0 & -w_3 & w_2 \\ w_3 & 0 & -w_1 \\-w_2 & w_1 & 0\end{array}\right) ~.\end{aligned}\end{split}\]

    Note that such skew-symmetric matrices are related to the cross product: \(w \times v = \hat w~ v\), where the cross product is rewritten as a matrix product. The rotation matrix \(R(w)\) that corresponds to a given rotation vector \(w\) is:

    \[\begin{split}\begin{aligned} \label{eqRodriguez} R(w) &= \exp(\hat w) \\ &= \cos\theta~ I + \sin\theta~ \hat w/\theta+ (1-\cos\theta)~ w w^{\!\top\!}/\theta^2\end{aligned}\end{split}\]

    The \(\exp\) function is called exponential map (generating a group element (=rotation matrix) via an element of the Lie algebra (=skew matrix)). The other equation is called Rodrigues’ equation: the first term is a diagonal matrix (\(I\) is the 3D identity matrix), the second terms the skew symmetric part, the last term the symmetric part (\(w w^{\!\top\!}\) is also called outer product).

Angular velocity & derivative of a rotation matrix:

We represent angular velocities by a vector \(w\in{\mathbb{R}}^3\), the direction \(\underline w\) determines the rotation axis, the length \(|w|\) is the rotation velocity (in radians per second). When a body’s orientation at time \(t\) is described by a rotation matrix \(R(t)\) and the body’s angular velocity is \(w\), then

\[\begin{aligned} \label{eqDotR} \dot R(t) = \hat w~ R(t)~.\end{aligned}\]

(That’s intuitive to see for a rotation about the \(x\)-axis with velocity 1.) Some insights from this relation: Since \(R(t)\) must always be a rotation matrix (fulfill orthogonality and determinant 1), its derivative \(\dot R(t)\) must also fulfill certain constraints; in particular it can only live in a 3-dimensional sub-space. It turns out that the derivative \(\dot R\) of a rotation matrix \(R\) must always be a skew symmetric matrix \(\hat w\) times \(R\) – anything else would be inconsistent with the constraints of orthogonality and determinant 1.

Note also that, assuming \(R(0)=I\), the solution to the differential equation \(\dot R(t) = \hat w~ R(t)\) can be written as \(R(t)=\exp(t \hat w)\), where here the exponential function notation is used to denote a more general so-called exponential map, as used in the context of Lie groups. It also follows that \(R(w)\) from ([eqRodriguez]) is the rotation matrix you get when you rotate for 1 second with angular velocity described by \(w\).

Quaternion

(I’m not describing the general definition, only the “quaternion to represent rotation” definition.) A quaternion is a unit length 4D vector \(r\in{\mathbb{R}}^4\); the first entry \(r_0\) is related to the rotation angle \(\theta\) via \(r_0=\cos(\theta/2)\), the last three entries \(\bar r\equiv r_{1:3}\) are related to the unit length rotation axis \(\underline w\) via \(\bar r = \sin(\theta/2)~ \underline w\).

  • The inverse of a quaternion is given by negating \(\bar r\), \(r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} = (r_0,-\bar r)\) (or, alternatively, negating \(r_0\)).

  • The concatenation of two rotations \(r\), \(r'\) is given as the quaternion product

    \[\begin{aligned} \label{eqQuat} r \circ r' = (r_0 r'_0 - \bar r^{\!\top\!}\bar r',~ r_0 \bar r' + r'_0 \bar r + \bar r' \times \bar r)\end{aligned}\]
  • The application of a rotation quaternion \(r\) on a vector \(x\) can be expressed by converting the vector first to the quaternion \((0,x)\), then computing

    \[\begin{aligned} r \cdot x = (r \circ (0,x) \circ r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1})_{1:3} ~,\end{aligned}\]

    I think a bit more efficient is to first convert the rotation quaternion \(r\) to the equivalent rotation matrix \(R\):

  • Conversion to/from a matrix: A quaternion rotation \(r\) convertes to the rotation matrix

    \[\begin{split}\begin{aligned} R &= \left(\begin{array}{ccc} 1-r_{22}-r_{33} & r_{12}-r_{03} & r_{13}+r_{02} \\ r_{12}+r_{03} & 1-r_{11}-r_{33} & r_{23}-r_{01} \\ r_{13}-r_{02} & r_{23}+r_{01} & 1-r_{11}-r_{22} \end{array}\right) \\ & ~ r_{ij} := 2 r_i r_j ~.\end{aligned}\end{split}\]

    (Note: In comparison to ([eqRodriguez]) this does not require to compute a \(\sin\) or \(\cos\).) Inversely, the quaternion \(r\) for a given matrix \(R\) is

    \[\begin{split}\begin{aligned} r_0 &= {\frac{1}{2}}\sqrt{1+{\rm tr}R}\\ r_3 &= (R_{21}-R_{12})/(4 r_0)\\ r_2 &= (R_{13}-R_{31})/(4 r_0)\\ r_1 &= (R_{32}-R_{23})/(4 r_0) ~.\end{aligned}\end{split}\]
Angular velocity \(\to\) quaternion velocity

Given an angular velocity \(w\in{\mathbb{R}}^3\) and a current quaternion \(r(t)\in{\mathbb{R}}^4\), what is the time derivative \(\dot r(t)\) (in analogy to Eq. ([eqDotR]))? For simplicity, let’s first assume \(|w|=1\). For a small time interval \(\delta\), \(w\) generates a rotation vector \(\delta w\), which converts to a quaternion

\[\begin{aligned} \Delta r = (\cos(\delta/2), \sin(\delta/2) w) ~.\end{aligned}\]

That rotation is concatenated LHS to the original quaternion,

\[\begin{aligned} r(t+\delta) = \Delta r \circ r(t) ~.\end{aligned}\]

Now, if we take the derivative w.r.t. \(\delta\) and evaluate it at \(\delta=0\), all the \(\cos(\delta/2)\) terms become \(-\sin(\delta/2)\) and evaluate to zero, all the \(\sin(\delta/2)\) terms become \(\cos(\delta/2)\) and evaluate to one, and we have

\[\begin{aligned} \label{eqQuatVel} \dot r(t) &= {\frac{1}{2}}( - w^{\!\top\!}\bar r,~ r_0 w + \bar r \times w ) = {\frac{1}{2}}(0,w) \circ r(t)\end{aligned}\]

Here \((0,w)\in{\mathbb{R}}^4\) is a four-vector; for \(|w|=1\) it is a normalized quaternion. However, due to the linearity the equation holds for any \(w\).

Quaternion velocity \(\to\) angular velocity

The following is relevant when taking the derivative w.r.t. quaternion parameters, e.g., of a ball joint represented as quaternion. Given \(\dot r\), we have

\[\begin{aligned} \label{eq37} \dot r \circ r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} &= {\frac{1}{2}}(0,w) \circ r \circ r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} = {\frac{1}{2}}(0,w) ~,\quad w = 2~ [\dot r \circ r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1}]_{1:3}\end{aligned}\]

which allows us to read off the angular velocity \(w\) induced by a change of quaternion \(\dot r\). However, the RHS zero will hold true only iff \(\dot r\) is orthogonal to \(r\) (where \(\dot r^{\!\top\!}r = \dot r_0 r_0 + \dot{\bar r}{}^{\!\top\!}\bar r = 0\), see ). In case \(\dot r^{\!\top\!}r \not=0\), the change in length of the quaternion does not represent any angular velocity; in typical kinematics engines a non-unit length is ignored. Therefore one first orthogonalizes \(\dot r \gets \dot r - r(\dot r^{\!\top\!}r)\).

As a special case of application, consider computing the partial derivative w.r.t. quaternion parameters, where \(\dot r\) is the 4D unit vectors \(e_0,..,e_3\). In this case, the orthogonalization becomes simply \(\dot r \gets e_i - r r_i\) and ([eq37]) becomes

\[\begin{aligned} (e_i - r_i r) \circ r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} &= e_i \circ r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} - r_i (1,0,0,0) ~,\quad w_i = 2~ [e_i \circ r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1}]_{1:3} ~,\end{aligned}\]

where \(w_i\) is the rotation vector implied by \(\dot r = e_i\). In case the original quaternion \(r\) wasn’t normalized (which could be, if a standard optimization algorithm searches in the quaternion parameter space), then \(r\) actually represents the normalized quaternion \(\bar r = \frac{1}{\sqrt{r^2}} r\), and (due to linearity of the above), the rotation vector implied by \(\dot r = e_i\) is

\[\begin{aligned} \label{eqQuatJac} w_i &= \frac{2}{\sqrt{r^2}}~ [e_i \circ r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1}]_{1:3} ~.\end{aligned}\]

This defines a \(3\times 4\) quaternion Jacobian \(J_{:i} = w_i\) with 4 columns \(w_i\), so that \(w = J \dot r\) is the angular velocity induced by a quaternion velocity \(\dot r\) (accounting for all implicit normalizations).

Transformations

We can represent a transformation as:

A homogeneous matrix

is a \(4\times 4\)-matrix of the form

\[\begin{split}\begin{aligned} T = \left(\begin{array}{cc}R & t \\ 0 & 1\end{array}\right) \end{aligned}\end{split}\]

where \(R\) is a \(3\times 3\)-matrix (rotation in our case) and \(t\) a \(3\)-vector (translation).

In homogeneous coordinates, vectors \(x\in{\mathbb{R}}^3\) are expanded to 4D vectors \(\left(\begin{array}{c}x\\1\end{array}\right) \in {\mathbb{R}}^4\) by appending a 1.

Application of a transform \(T\) on a vector \(x\in{\mathbb{R}}^3\) is then given as the normal matrix-vector product

\[\begin{split}\begin{aligned} x' = T \cdot x &= T~ \left(\begin{array}{c}x \\ 1\end{array}\right) = \left(\begin{array}{cc}R & t \\ 0 & 1\end{array}\right) ~ \left(\begin{array}{c}x \\ 1\end{array}\right) = \left(\begin{array}{c}Rx + t \\ 1\end{array}\right) ~.\end{aligned}\end{split}\]

Concatenation is given by the ordinary 4-dim matrix-matrix product.

The inverse transform is

\[\begin{split}\begin{aligned} T^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} &= \left(\begin{array}{cc}R & t \\ 0 & 1\end{array}\right) ^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} = \left(\begin{array}{cc}R^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} & -R^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} t \\ 0 & 1\end{array}\right) \end{aligned}\end{split}\]
Translation and quaternion:

A transformation can efficiently be stored as a pair \((t,r)\) of a translation vector \(t\) and a rotation quaternion \(r\). Analogous to the above, the application of \((t,r)\) on a vector \(x\) is \(x' = t + r\cdot x\); the inverse is \((t,r)^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} = (-r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1}\cdot t, r^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1})\); the concatenation is \((t_1,r_1) \circ (t_2,r_2) = (t_1 + r_1\cdot t_2, r_1 \circ r_2)\).

Sequences of transformations

by \(T_{A\to B}\) we denote the transformation from frame \(A\) to frame \(B\). The frames \(A\) and \(B\) can be thought of coordinate frames (tuples of an offset (in an affine space) and three local orthonormal basis vectors) attached to two bodies \(A\) and \(B\). It holds

\[\begin{aligned} T_{A\to C} = T_{A\to B} \circ T_{B\to C}\end{aligned}\]

where \(\circ\) is the concatenation described above. Let \(p\) be a point (rigorously, in the affine space). We write \(p^A\) for the coordinates of point \(p\) relative to frame \(A\); and \(p^B\) for the coordinates of point \(p\) relative to frame \(B\). It holds

\[\begin{aligned} p^A = T_{A\to B}~ p^B ~.\end{aligned}\]

A note on “forward” vs. “backward” of frame and coordinate transforms

Instead of the notation \(T_{A\to B}\), other text books often use notations such as \(T_{AB}\) or \(T^A_B\). A common question regarding notation \(T_{A\to B}\) is the following:

The notation :math:`T_{Ato B}` is confusing, since it transforms coordinates from frame :math:`B` to frame :math:`A`. Why is the notation not the other way around?

I think the notation \(T_{A\to B}\) is intuitive for the following reasons. The core is to understand that a transformation can be thought of in two ways: as a transformation of the coordinate frame itself, and as transformation of the coordinates relative to a coordinate frame. I’ll first give a non-formal explanation and later more formal definitions of affine frames and their transformation.

Think of \(T_{W\to B}\) as translating and rotating a real rigid body: First, the body is located at the world origin; then the body is moved by a translation \(t\); then the body is rotated (around its own center) as described by \(R\). In that sense, \(T_{W\to B} = \left(\begin{array}{cc}R & t \\ 0 & 1\end{array}\right)\) describes the “forward” transformation of the body. Consider that a coordinate frame \(B\) is attached to the rigid body and a frame \(W\) to the world origin. Given a point \(p\) in the world, we can express its coordinates relative to the world, \(p^W\), or relative to the body \(p^B\). You can convince yourself with simple examples that \(p^W = T_{W\to B}~ p^B\), that is, \(T_{W\to B}\) also describes the “backward” transformation of body-relative-coordinates to world-relative-coordinates.

Formally: Let \((A,V)\) be an affine space. A coordinate frame is a tuple \((o,\boldsymbol e_1,..,\boldsymbol e_n)\) of an origin \(o \in A\) and basis vectors \(\boldsymbol e_i \in V\). Given a point \(p\in A\), its coordinates \(p_{1:n}\) w.r.t. a coordinate frame \((o,\boldsymbol e_1,..,\boldsymbol e_n)\) are given implicitly via

\[\begin{aligned} p = o + \sum\nolimits_i p_i \boldsymbol e_i ~.\end{aligned}\]

A transformation \(T_{W\to B}\) is a (“forward”) transformation of the coordinate frame itself:

\[\begin{aligned} (o^B,\boldsymbol e^B_1,..,\boldsymbol e^B_n) &= (o^W + t, R\boldsymbol e^W_1,..,R\boldsymbol e^W_n)\end{aligned}\]

where \(t\in V\) is the affine translation in \(A\) and \(R\) the rotation in \(V\). Note that the coordinates \((\boldsymbol e^B_i)^W_{1:n}\) of a basis vector \(\boldsymbol e^B_i\) relative to frame \(W\) are the columns of \(R\):

\[\begin{aligned} \boldsymbol e^B_i &= \sum_j (\boldsymbol e^B_i)^W_j \boldsymbol e^W_j = \sum_j R_{ji} \boldsymbol e^W_j\end{aligned}\]

Given this transformation of the coordinate frame itself, the coordinates transform as follows:

\[\begin{split}\begin{aligned} p &= o^W + \sum_i p^W_i~ \boldsymbol e^W_i \\ p &= o^B + \sum_i p^B_i~ \boldsymbol e^B_i \\ &= o^W + t + \sum_i p^B_i~ (R \boldsymbol e^W_i) \\ &= o^W + \sum_i t^W_i~ e^W_i + \sum_j p^B_j~ (R \boldsymbol e^W_j) \\ &= o^W + \sum_i t^W_i~ e^W_i + \sum_j p^B_j~ (\sum_i R_{ij}~ \boldsymbol e^W_i) \\ &= o^W + \sum_i \Big[t^W_i + \sum_j R_{ij}~ p^B_j\Big]~ e^W_i \\ \Rightarrow &~ p^W_i = t^W_i + \sum_j R_{ij}~ p^B_j ~.\end{aligned}\end{split}\]

Another way to express this formally: \(T_{W\to B}\) maps covariant vectors (including “basis vectors”) forward, but contra-variant vectors (including “coordinates”) backward.

Splines

A spline is a piece-wise polynomial path \(x:[0,T] \maps {\mathbb{R}}^n\).

Let’s first clearly distinguish the use of words knot, waypoint, and control point:

  • A knot \(t_i\) is a point in time, \(t_i \in {\mathbb{R}}\) (usually \(t_i \in [0,T]\)). The path is polynomial between knots. We have a non-decreasing sequence of knots \(t_0,..,t_m\) (usually with \(t_0=0\) and \(t_m=T\)) which partition the time interval \([0,T]\) into pieces \([t_i, t_{i{{\hspace{-0.0pt}\textrm{+}\hspace{-0.5pt}}1}}]\) so that the path is just polynomial in each piece. Note that we may often have double or triple knots, meaning that several consecutive knots \(t_i = t_{i{{\hspace{-0.0pt}\textrm{+}\hspace{-0.5pt}}1}}\) are equal, especially at the beginning and end.

  • A waypoint \(x_i\) is a point on the path, typically at a knot, \(x_i = x(t_i)\). So a path really passes through a waypoint. At waypoints, we often also care about velocities \(v_i\) (or accelerations), where \(v_i = \dot x(t_i)\).

  • A control point \(z_j\) is (usually) not a point on the path, but it indirectly defines the path as an linear combination of several control points. B-splines, defined below, make this explicit.

In robotics, there are two main conventions to define and parameterize splines: Hermite splines and B-splines. Hermite splines are defined by the knot sequence and explicitly prescribing waypoints \(x_i\) and (for cubic) velocities \(v_i\) at each knot (for quintic also acceperations \(a_i\)). In contrast, B-splines are specified by the knot sequence and \(K\) control points \(z_j\). As in B-splines we do not need to provide velocities as part of the specification, they are sometimes easier to use in practical robotics. However, the resulting path does not go (exactly) through the provided control points – the actual waypoints are implicit and ensuring exact prescribed waypoints implies solving a subproblem.

Cubic splines are a common choice in robotics, as they have a still continuous (piece-wise linear) acceleration profile, and therefore limited jerk (3rd time derivative).

Let’s start with Cubic Hermite splines.

Code References

NLP interface

A general non-linear mathematical program (NLP) is of the form

\[\begin{aligned} \min_{b_l\le x \le b_u}~ f(x) ~~~\text{s.t.}~~~ g(x)\le 0,~ h(x) = 0 ~,\end{aligned}\]

with \(x\in{\mathbb{R}}^n\), \(f:~ {\mathbb{R}}^n \to {\mathbb{R}}\), \(g:~ {\mathbb{R}}^n \to {\mathbb{R}}^{d_g}\), \(h:~ {\mathbb{R}}^n \to {\mathbb{R}}^{d_h}\), \(b_l,b_u\in{\mathbb{R}}^n\). However, we want to explicitly account for least squares costs (sum-of-squares), so that we extend the form to

\[\begin{aligned} \min_{b_l\le x \le b_u}~ f(x) + r(x)^{\!\top\!}r(x) ~~~\text{s.t.}~~~ g(x)\le 0,~ h(x) = 0 ~,\end{aligned}\]

with \(r:~ {\mathbb{R}}^n \to {\mathbb{R}}^{d_r}\). In technical terms, the solver needs to be provided with:

  • the problem “signature”: dimension \(n\), dimensions \(d_r, d_g, d_h\), bounds \(b_l, b_u \in {\mathbb{R}}^n\),

  • functions \(f, r, g, h\),   Jacobians for all,   Hessian for \(f\),

  • typically also an initialization sampler \(x_0 \sim p(x)\), that provides starting \(x_0\).

However, instead of providing a solver with separate functions \(f, r, g, h\), we instead provide only a single differentiable feature function \(\phi: X \to {\mathbb{R}}^K\), which stacks all \(f,r,g,h\) components to a single vector,

\[\begin{split}\begin{aligned} \phi(x) = \left(\begin{array}{c}f_1(x) \\ r_1(x) \\ h_1(x) \\ g_1(x) \\ h_2(x) \\ \vdots\end{array}\right) ~,\quad \rho = \left(\begin{array}{c}\texttt{f}\\ \texttt{sos}\\ \texttt{eq}\\ \texttt{ineq}\\ \texttt{eq}\\ \vdots\end{array}\right) ~,\end{aligned}\end{split}\]

where the indicator vector \(\rho\) informs the solver which components of \(\phi\) have to be treated as linear cost (f), sum-of-squares cost (sos), equality constraint (eq), or inequality constraint (ineq). (The order of stacking does not matter.) In this convention, the NLP reads

\[\begin{aligned} \min_{b_l\le x \le b_u}~ {{\bf 1}}^{\!\top\!}\phi_\texttt{f}(x) + \phi_\texttt{sos}(x)^{\!\top\!}\phi_\texttt{sos}(x) ~~\text{s.t.}~~\phi_\texttt{ineq}(x) \le 0,~ \phi_\texttt{eq}(x) = 0 ~,\end{aligned}\]

where \(\phi_\texttt{sos}\) is the subsets of sos-features, etc. Based on these conventions, the solver needs to be provided with:

  • the problem “signature”: dimension \(n\), feature types \(\rho\), bounds \(b_l, b_u \in {\mathbb{R}}^n\),

  • a single differentiable feature function \(\phi: X \to {\mathbb{R}}^K\), with Jacobian function \(J = \partial_x \phi(x)\),

  • optionally a Hessian function for the sum of all f-terms,

  • and typically also an initialization sampler \(x_0 \sim p(x)\), that provides starting \(x_0\).

In the rai code, an NLP is therefore declared as

//signature
uint dimension;  ObjectiveTypeA featureTypes;  arr bounds_lo, bounds_up;

//essential method
virtual void evaluate(arr& phi, arr& J, const arr& x);

//optional
virtual arr  getInitializationSample(const arr& previousOptima={});
virtual void getFHessian(arr& H, const arr& x);

Yaml-Graph Files

We use yaml-style files throughout. These are the file representation of internal data structures such as dictionaries (anytype key-value maps) used for parameter files or other structure data, but esp. also graphs. The (semantic) extensions relative to yaml are:

  • An @Include@ node allows to hierarchically include files. This means that while each local file can be parsed with a standard yaml parser, an outer loop has to check for @Include@ nodes and coordinate loading sub-files.

  • As an implication of the above, we allow for a special @path@ type, as URLs embraced by <...>. This becomes necessary as file values need to be interpreted relative to the path of the loading file. So when such a file is parsed we not only store the filename string, but also the path of the loading file to ensure we know its absolute path.

  • We also allow @Edit@ and @Delete@ tags, which allow us to edit/overwrite the value of previously defined nodes, as well as delete previously defined nodes.

  • Finally, the name of a node can include a list of parents: E.g. @A (B C): shape: box@ denotes a node with key @A@ that is a child of @B@ and @C@. The semantics of this is that @A@ is a (directed) edge between B and C. This is analogous to a dot declaration @B -> C [ shape=box ]@.

  • Note that all of the above is still yaml syntax, the outer parser only adds additional interpretation (post-processing) of @Include, Edit, Delete@ tags, @<..>@ values, and @(..)@ in names.

Within rai, .g-files are used to represent parameter files, robotic configurations (\(\sim\) URDF), 1st order logic, factor graphs, optimization problems. The underlying data structure is used, e.g., as any-type container, Graph, or auto-convertion to python dictionaries.

Subgraphs may contain nodes that have parents from the containing graph, or from other subgraphs of the containing graph. Some methods of the Graph class (to find nodes by key or type) allow to specify whether also nodes in subgraphs or parentgraphs are to be searched. This connectivity across (sub)-graphs e.g. allows to represent logic knowledge bases.

The shows this file syntax is used to specify robot/environment configurations.

Cameras

Image, Camera, & World Coordinates

In this section, we use the following notation for coordinates of a 3D point:

  • world coordinates \(X\),

  • camera coordinates \(x\) (so that \(X = T x\), where \(T\equiv T_{W\to C}\) is the camera position/orientation, also called extrinsic parameter),

  • image coordinates \(u=(u_x,u_y,u_d)\), with the pixel coordinates \((u_x,u_y)\) and depth coordinate \(u_d\), details as followed.

The pixel coordinates \((u_x,u_y)\) indicate where a point appears on the image plane. The \(x\)-axis always points to the right, but there are two conventions for the \(y\)-axis:

  • \(y\)-up: The \(y\)-axis points upward. This is consistent to how a diagram is typically drawn on paper: \(x\)-axis right, \(y\)-axis up. However, a consequence is that the \(z\)-axis then points backward, i.e., pixels in front of the camera have negative depth \(u_d\).

  • \(y\)-down: The \(y\)-axis points down. This is consistent to how pixels are typically indexed in image data: counting rows from top to bottom. So when defining pixel coordinates \((u_x,u_y)\) literally to be pixel indices in image data, \(y\)-down is the natural convention. A consequence is that the \(z\)-axis points forward, i.e., pixels in front of the camera have a positive depth \(u_d\), which might also be more intuitive.

The transformation from camera coordinates \(x\) to image coordinates \(u\) is involves perspective projection. For better readability, let’s write (only in this equation) \(x \equiv (\texttt{x},\texttt{y},\texttt{z})\). Then the mapping is

\[\begin{split}\begin{aligned} \label{eqxtou} u = \left(\begin{array}{c}u_x \\ u_y \\ u_d\end{array}\right) &= \left(\begin{array}{c}(f_x \texttt{x}+ s \texttt{y})/\texttt{z}+ c_x\\ f_y \texttt{y}/\texttt{z}+ c_y \\ \texttt{z}\end{array}\right) ~.\end{aligned}\end{split}\]

Here, the five so-called intrinsic parameters \(f_x,f_y,c_x,c_y,s\) are the focal length \(f_x,f_y\), the image center \(c_x,c_y\), and a image skew \(s\) (which is usually zero). E.g., for an image of height \(H\) and width \(W\), and vertical full view angle \(\alpha\), we typically have an image center \(c_x \approx H/2, c_y \approx W/2\) and a focal length \(f_y = \frac{H}{2 \tan(\alpha/2)}\), e.g., for \(\alpha=90^\circ\), \(f_y = H/2\). For a typical camera \(f_x \approx f_y\).

Inversely, if we have image coordinates \(u\) and want to convert to cartesian camera coordinates, we have (assuming \(s=0\))

\[\begin{split}\begin{aligned} x &= \left(\begin{array}{c}(u_x - c_x) u_z / f_x\\ (u_y - c_y) u_z / f_y \\ u_z\end{array}\right) ~.\end{aligned}\end{split}\]

Homogeneous coordinates & Camera Matrix \(P\)

First a brief definition: A homogeneous coordinate :math:`boldsymbol x=(x_1,..,x_n,w)` is a (redundant) description of the :math:`n`-dim point

\[\begin{split}{\cal P}(\boldsymbol x)= \left(\begin{array}{c}x_1/w \\ \vdots \\ x_n/w\end{array}\right) ~.\end{split}\]

Note that two coordinates \((x_1,..,x_n,w)\) and \((\lambda x_1,..,\lambda x_n,\lambda w)\) are “equivalent” in that they describe the same point. The operation \({\cal P}\) is non-linear and called perspective projection. In this section, we write homogeneous coordinates in bold \(\boldsymbol x\).

Back to our camera setting: Let \(\boldsymbol x\) and \(\boldsymbol X\) be homogeneous camera and world coordinates of a point (typically both have \(w=1\) as last entry). Then the pose transformation \(T\) can be written as \(4\times\) matrix such that

\[\boldsymbol x = T^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} \boldsymbol X ~.\]

Given camera coordinates \(x = ``(x,y,z)''\), we can write ([eqxtou])

\[\begin{split}\begin{aligned} \boldsymbol u &= K x = \left(\begin{array}{c}f_x x + s y + c_x z\\ f_y y + c_y z \\ z\end{array}\right) ~,\quad K = \left(\begin{array}{ccc}f_x & s & c_x \\ & f_y & c_y \\ & & 1 \end{array}\right) ~,\quad {\cal P}(\boldsymbol u) = \left(\begin{array}{c} (f_x x + s y)/z + c_x\\ f_y y/z + c_y \end{array}\right) ~,\end{aligned}\end{split}\]

where \(\boldsymbol u\) are homogeneous pixel coordinates, and \({\cal P}(\boldsymbol u)\) the actual pixel coordinates, which would have to be augmented with \(z\) again to get the \(u\) including depth coordinate.

The \(3\times 3\) matrix \(K\) includes the 5 general intrinsic parameters. Writing the inverse transformation \(T^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1}\) as a \(3\times 4\) matrix \(\left(\begin{array}{cc}R^{\!\top\!}& -R^{\!\top\!}t\end{array}\right)\) with rotation \(R\) and translation \(t\), we can write the relation between \(\boldsymbol u\) and homogeneous world coordinates \(\boldsymbol X\) as

\[\begin{split}\begin{aligned} \boldsymbol u = P \boldsymbol X ~,\quad\text{with~} P = \left(\begin{array}{cc}K & 0\end{array}\right) ~ T^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} = \left(\begin{array}{cc}K & 0\end{array}\right) ~ \left(\begin{array}{cc}R^{\!\top\!}& -R^{\!\top\!}t \\ & 1\end{array}\right) = \left(\begin{array}{cc}KR^{\!\top\!}& -KR^{\!\top\!}t\end{array}\right) ~,\end{aligned}\end{split}\]

where \(P\) is the \(3\times 4\) camera matrix, which subsumes 5 intrinsic and 6 extrinsic (3 rotation, 3 translation) parameters. Except for absolute scaling (the 1 in the definition of \(K\)), this fully parameterizes a general affine transform.

Calibration as Estimating \(P,K,R,t\) from Depth Data

Assuming we have data of pairs \((\boldsymbol u, \boldsymbol X)\), we can use the basic equation \(\boldsymbol u = P \boldsymbol X\) to retrieve \(P\) in closed from, and in a second step retrieve the intrinsic and extrinsic camera parameters from \(P\). Note that here we discuss the situation where we have the “right” \(\boldsymbol u\) in the data – and not only the pixel coordinates \({\cal P}(\boldsymbol u)\)! This means that we assume we have data entries \(\boldsymbol u = (u_x u_d, u_y u_d, u_d)\) which includes the true depth \(u_d\). So this method is only applicable when we want to calibrate a depth camera.

Given data \(D = \{(\boldsymbol u_i, \boldsymbol X_i)\}_{i=1}^n\), we want to minimize the squared error

\[\begin{aligned} \argmin_P \sum_i (\boldsymbol u_i - P \boldsymbol X_i)^2 = [U - P X]^2 ~,\end{aligned}\]

where \(U\) and \(X\) are the stacked \(\boldsymbol u_i\) and \(\boldsymbol X_i\), respectively. The solution is \(P = U^{\!\top\!}X (X^{\!\top\!}X)^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1}\). Comparing with the form of \(P\) above, we can decompose it and extract explicit \(K, R, t\) using

\[\begin{split}\begin{aligned} (K,R^{\!\top\!}) &\gets \text{RQ-decomposition}(P_{1:3,:}) \\ t &\gets -(K R^{\!\top\!})^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} P_{4,:}\end{aligned}\end{split}\]

However, when defining \(\bar u = (\boldsymbol u,1) = (u_x u_z, u_y u_z, u_z, 1)\) (with additional 1 agumented), we can also write the inverse linear relation to the non-homogeneous world coordinate \(X\):

\[\begin{aligned} X = P^+ \bar u~,\quad\text{with~} P^+ = \left(\begin{array}{cc}R K^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} & t\end{array}\right) \bar u~,\end{aligned}\]

Using data \(X\) (\(3\times n\)) and \(U\) (\(4\times n\)) the optimum is \(P^+ = X^{\!\top\!}U (U^{\!\top\!}U)^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1}\). We can decompose it using

\[\begin{split}\begin{aligned} t &\gets P^+_{3,:} \\ (K,R^{\!\top\!}) &\gets \text{RQ-decomposition}( [P^+_{1:3,:}]^{{\hspace{-0.0pt}\textrm{-}\hspace{-0.5pt}}1} ]\end{aligned}\end{split}\]