Solving AOC 19 with optimization

Day 19 of this year's Advent of Code is a nice geometrical problem. We are given sets of points detected by different scanners in $3d$ space. Each scanner has its own coordinate system and does not know its own position with respect to the others. Based on some common detected points that scanners share, the task is to retrieve their relative positions in a single coordinate system. A longer and more accurate description can be found at https://adventofcode.com/2021/day/19.

This task can be split into several smaller steps:

- reading the input data;
- finding which sets of points overlap (i.e., share at least 12 points);
- finding the transformation that maps one coordinate system into another;
- putting everything together to answer the puzzle questions.

In this post, I'm going to talk about how I solved this problem using some algorithms from convex optimization :)

Reading the input data

First, we should read and analyze the data to know what we're dealing with. I saved the example data given in the problem description to a text file, which can be easily parsed as done below
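A minimal parsing sketch is given below. The file name `example.txt` is just a placeholder for wherever you saved the example input:

```python
import numpy as np

def read_scanners(path):
    """Parse the input into a list of (n_points, 3) integer arrays, one per scanner."""
    scanners = []
    with open(path) as f:
        blocks = f.read().strip().split("\n\n")
    for block in blocks:
        lines = block.splitlines()[1:]  # skip the "--- scanner i ---" header
        scanners.append(np.array([list(map(int, line.split(","))) for line in lines]))
    return scanners

scanners = read_scanners("example.txt")
print(len(scanners), scanners[0].shape)
```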

We can also take a look at some of the points detected by the scanners
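For instance, a quick 3D scatter plot of the first two point clouds (in their own coordinate systems) could be obtained as follows, using the `scanners` list from the sketch above:

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(projection="3d")
for idx in (0, 1):
    ax.scatter(*scanners[idx].T, label=f"scanner {idx}")
ax.set_xlabel("x"); ax.set_ylabel("y"); ax.set_zlabel("z")
ax.legend()
plt.show()
```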

Find the relations between different sets

Once we have read the data, we should find the sets which share at least 12 points, as stated in the problem description. To do so, we can compute the pairwise distances among all points in the same set (according to some norm). Then, if two sets share at least 66 such distances, it means there are at least 12 points in common between them (since the number of pairwise distances between 12 points is 66, i.e., 12 choose 2). We call two sets which share at least 12 points a "couple". Note that I'm using Python sets, which could give wrong results: if more than one pair of points lies at the same distance (for example with (0,0,0), (1,1,1) and (2,2,2)), only one distance would be counted, and we could miss some related sets (if this causes the number of shared distances to drop below 66). Nevertheless, this is easy to fix, although I didn't do it and got the right answer anyway!
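A minimal sketch of this idea is shown below (the `scanners` list comes from the parsing sketch above; squared Euclidean distances are used so that everything stays an integer):

```python
from itertools import combinations

def distance_set(points):
    """Set of squared Euclidean distances between all pairs of points in a scan.
    Using a set may undercount if several pairs share the same distance,
    as discussed above."""
    return {int(((p - q) ** 2).sum()) for p, q in combinations(points, 2)}

dist_sets = [distance_set(s) for s in scanners]

# Two scanners form a "couple" if they share at least 66 = C(12, 2) distances.
couples = [
    (i, j)
    for i, j in combinations(range(len(scanners)), 2)
    if len(dist_sets[i] & dist_sets[j]) >= 66
]
print(couples)
```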

Another problem we might encounter with the above approach is that two different sets of points which share the same distances might be considered overlapping even if they're not. For this Advent of Code puzzle I'm not really sure this is an issue (how else could we detect that the sets share some points, if not by considering pairwise distances?), but in a real situation it could be.

Once we have all the connected "couples" of sets, we can build a graph. This graph will be used later to determine the order in which to visit the sets of points. For example, the sets in the example data give rise to the following graph.
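The graph can be represented with a simple adjacency structure; here is a minimal sketch built from the `couples` list computed above:

```python
from collections import defaultdict

# Adjacency list of the "couples" graph: an edge connects two scanners
# that share at least 12 detected points.
graph = defaultdict(set)
for i, j in couples:
    graph[i].add(j)
    graph[j].add(i)
```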

As already said, we need to determine the order in which to visit the sets of points. Indeed, in order to translate all the points into a single coordinate system, we can start from one set of points and iteratively transform the other sets into the system of the first one, following their relations in the graph we built. For example, in the example dataset, we can start from 0. From here, we can only visit 1 (since it's the only set which shares at least 12 points with 0). From 1, we can either go to 4 or 3. Suppose we go to 3; from there, there are no new sets to visit. At this point we visit 4 and finally 2. This procedure corresponds to a Breadth First Search (BFS) visit of the graph of relations.
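A possible sketch of the BFS visit, returning each set together with the already-visited set it will be aligned against, could look like this (the exact order of 3 and 4 depends on how the neighbours are iterated):

```python
from collections import deque

def bfs_order(graph, start=0):
    """Visit the graph of couples in BFS order, returning (parent, node) pairs:
    `node` will later be aligned against the already-visited `parent`."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neigh in graph[node]:
            if neigh not in visited:
                visited.add(neigh)
                order.append((node, neigh))
                queue.append(neigh)
    return order

print(bfs_order(graph))  # e.g. [(0, 1), (1, 3), (1, 4), (4, 2)] on the example data
```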

Find the correct transformation

Up to this point, we haven't described how to translate one coordinate system into another, given that two sets share at least 12 points. We are going to do that next. However, first we need to identify the common points between two sets which share at least 12 points. This part is a bit boring: I do it with some manipulations on the indices of the (symmetric) matrix of pairwise distances. If you're interested in the process, the code is given below:
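The original snippet isn't reproduced here, but a rough sketch of the same idea (matching points by the set of distances to their neighbours, rather than by index manipulation on the distance matrix) could look like this:

```python
def common_points(points_a, points_b, min_shared=11):
    """Match points between two overlapping scans.  For each point, compute the
    set of squared distances to all other points in the same scan; two points are
    considered the same beacon if they share at least `min_shared` distances
    (11 distances to the other 11 common beacons)."""
    dists_a = [{int(((p - q) ** 2).sum()) for q in points_a} for p in points_a]
    dists_b = [{int(((p - q) ** 2).sum()) for q in points_b} for p in points_b]
    pairs = []
    for i, da in enumerate(dists_a):
        for j, db in enumerate(dists_b):
            if len(da & db) >= min_shared:
                pairs.append((i, j))
                break
    return pairs
```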

Here comes the interesting part! Once we have the shared points, we can actually solve the problem. At this point, we have to find the transformation which maps one point cloud onto the other. This transformation is given by a rotation matrix and a translation vector. If we denote by $x$ and $x'$ the points in the two different coordinate systems, we will have:

$$ x = R x' + t $$

where $R$ is a $ 3 \times 3 $ rotation matrix and $ t $ is a translation vector. We can actually focus on the rotation matrix, as the translation vector is easy to recover once the points are "aligned".

The problem of finding the rotation matrix can be formalized as an optimization problem. First, we center the point clouds by subtracting their centroids, i.e., their mean vectors. We define the centered points as $\{y_i\}_{i=1}^n$ and $\{y_i' \}_{i=1}^n$. Then we want to minimize (with respect to the rotation matrix) the average distance between corresponding points. In mathematical terms, we have the following objective to minimize

$$ f(A) = \frac{1}{n} \sum_{i=1}^n \| y_i - A y_i' \| $$

where $A$ belongs to the set of real-valued $3 \times 3$ matrices.

Is this a convex objective? It is easy to verify. Consider the function $g_i(A) = \| y_i - A y_i' \|$. From the definition of convexity:

$$ \begin{align*} g_i(\lambda A_1 + (1-\lambda) A_2) &= \| y_i - \lambda A_1 y_i' - (1-\lambda) A_2 y_i' \| \\ &= \| \lambda (y_i - A_1 y_i') + (1-\lambda) (y_i - A_2 y_i') \| \\ &\leq \lambda \| y_i - A_1 y_i' \| + (1-\lambda) \| y_i - A_2 y_i' \| = \lambda g_i(A_1) + (1-\lambda) g_i(A_2) \end{align*} $$

where we used the triangle inequality in the third step. Note that $f(A) = \frac{1}{n} \sum_{i=1}^n g_i(A)$ and the sum of convex functions is still a convex function. Hence, $f$ is a convex function, and any local minimum we find is guaranteed to be a global one.

We didn't specify the norm used in $f$, but it could be any norm. For simplicity, we will consider the Euclidean norm. The correct matrix $R$ will be the one minimizing the objective function $f$,

$$ R = \arg\min_{A \in \mathbb{R}^{3 \times 3}} f(A) $$

To be more precise, the set over which we are minimizing should be the set of rotation matrices in $3d$ space, i.e., $3 \times 3$ orthogonal matrices with determinant equal to 1. However, if we use a gradient-based method, this more accurate formulation would involve an expensive projection onto that set at every iteration of the optimization process. In any case, I found that the unconstrained problem above worked well for the task at hand.

Now that we have our optimization problem formulated, we can try to solve it by using any optimization algorithm. To this aim, we can for example try any of the optimization algorithms contained in pytorch. One nice feature of pytorch is that it computes the gradients automatically, once we define the loss function. Below, we use the most famous optimization algorithm, Stochastic Gradient Descent (SGD). The stopping condition for the optimization is the following: we stop when the rounded matrix, multiplied by the points to be rotated, gives back the points we want to recover. If this does not happen within 10,000 steps, an error is raised.
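A minimal sketch of this optimization loop is shown below. The learning rate value is a placeholder and has to be tuned, as discussed shortly; the small tolerance in the stopping check only accounts for floating-point error:

```python
import torch

def find_rotation_sgd(y, y_prime, lr=1e-4, max_steps=10_000):
    """Find the 3x3 matrix A minimising f(A) = mean_i ||y_i - A y_i'|| with SGD.
    `y` and `y_prime` are (n, 3) float tensors of centred, matched points."""
    A = torch.eye(3, requires_grad=True)
    optimizer = torch.optim.SGD([A], lr=lr)
    for step in range(max_steps):
        optimizer.zero_grad()
        loss = torch.norm(y - y_prime @ A.T, dim=1).mean()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            # Stop once the rounded matrix maps y' (numerically) onto y.
            R = A.round()
            if torch.allclose(y_prime @ R.T, y, atol=1e-3):
                return R
    raise RuntimeError("optimization did not converge within max_steps")
```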

We can try to visualize the optimization process by plotting the transformed points as the objective function is minimized. We can see that, as the optimal solution is approached, the transformed red points approach the original blue points. This only works in an interactive Jupyter notebook though. In the next cell we export the result to an HTML video (see below).

One disadvantage of SGD is that it has a learning rate to be set, which can influence the speed of convergence of the optimization process. Even worse, if the learning rate is completely wrong, the iterates will never converge to the optimal solution (for example, you can download the notebook and try to use lr=1e-2)! For this reason, finding a good learning rate is a time-consuming process, since different values have to be tried until a good one is found. On the other hand, there are also optimization algorithms which do not need parameters such as the learning rate to be tuned and can still converge to the optimal solution. One such algorithm is Cocob, which has also been used to train neural networks. Let's see how it performs on this problem.
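Here is a sketch of the same loop with Cocob. The class name and import path below are assumptions, not a specific library's API: use whatever Cocob implementation you have available, as long as it follows the standard `torch.optim.Optimizer` interface:

```python
# The import path and class name are placeholders for a Cocob implementation
# following the torch.optim interface.
from parameterfree import COCOB  # hypothetical import

def find_rotation_cocob(y, y_prime, max_steps=10_000):
    """Same as find_rotation_sgd, but no learning rate has to be chosen."""
    A = torch.eye(3, requires_grad=True)
    optimizer = COCOB([A])  # note: no learning rate to pass
    for step in range(max_steps):
        optimizer.zero_grad()
        loss = torch.norm(y - y_prime @ A.T, dim=1).mean()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            R = A.round()
            if torch.allclose(y_prime @ R.T, y, atol=1e-3):
                return R
    raise RuntimeError("optimization did not converge within max_steps")
```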

Note how we don't have to set any $lr$ parameter above!

There's another subtle difference which can be noticed by looking at the videos above. If you look at the speed of the moving points when using SGD and Cocob, you can see that it is not the same. With SGD it is constant, and the red points slowly approach the blue ones. On the other hand, with Cocob the red points start moving slowly, but then their speed greatly increases as they approach the blue ones. What is going on here?

The explanation is that the learning rate in SGD is fixed (at least in the pytorch version), while in Cocob it is not! But wait... didn't we say that Cocob does not have any learning rate? Well, Cocob doesn't have any learning rate to be set, but that doesn't mean it has no learning rate at all. The truth is that this algorithm is able to automatically determine the optimal learning rate, which is not fixed in advance, but changes depending on the data fed to it. In particular, it can be shown that finding the optimal learning rate is more or less equivalent to betting on the outcomes of a coin in a game where the goal of the algorithm is to increase its total amount of money, hence the name Cocob (which stands for Continuous Coin Betting). Furthermore, if the objective function is fixed, the amount of money the algorithm bets (which plays the role of the learning rate) grows exponentially, and the algorithm approaches the optimal solution exponentially fast! If you're interested in a more detailed explanation you can watch this video. Coin betting is a fascinating and theoretically grounded area of research in optimization!

Solving the original problem

Now that we have all the pieces needed, we can solve the original problem! In particular, there are 2 tasks:

- count how many distinct points (beacons) are detected overall, once all the sets are expressed in a single coordinate system;
- find the largest Manhattan distance between any two scanners.

Below, we show how to use the functions described until now to solve the 2 tasks:
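As a rough sketch (relying on the helpers introduced above, such as `bfs_order`, `common_points` and the `graph` of couples), the whole pipeline could look like this:

```python
from itertools import combinations

import numpy as np
import torch

def solve(scanners, find_rotation):
    """Align every scan with scanner 0 following the BFS order,
    then answer the two puzzle questions."""
    aligned = {0: scanners[0]}
    positions = {0: np.zeros(3, dtype=int)}
    for parent, node in bfs_order(graph):
        pairs = common_points(aligned[parent], scanners[node])
        a = np.array([aligned[parent][i] for i, _ in pairs])
        b = np.array([scanners[node][j] for _, j in pairs])
        # Centre the matched points, find the rotation, then recover the translation.
        ya = torch.tensor(a - a.mean(axis=0), dtype=torch.float32)
        yb = torch.tensor(b - b.mean(axis=0), dtype=torch.float32)
        R = find_rotation(ya, yb).numpy().astype(int)
        t = np.round(a.mean(axis=0) - b.mean(axis=0) @ R.T).astype(int)
        aligned[node] = scanners[node] @ R.T + t
        positions[node] = t
    # Task 1: number of distinct beacons across all scans.
    n_beacons = len({tuple(p) for pts in aligned.values() for p in pts})
    # Task 2: largest Manhattan distance between two scanner positions.
    largest = max(np.abs(p - q).sum() for p, q in combinations(positions.values(), 2))
    return n_beacons, largest

print(solve(scanners, find_rotation_cocob))
```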

And now the real puzzle, both with Cocob and SGD:

We can also measure how long the two optimization algorithms take to find the solution:
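For example, a simple (and admittedly crude) timing comparison could be:

```python
import time

# Wall-clock time for the full pipeline with each optimizer.
for name, finder in [("SGD", find_rotation_sgd), ("Cocob", find_rotation_cocob)]:
    start = time.perf_counter()
    solve(scanners, finder)
    print(f"{name}: {time.perf_counter() - start:.2f} s")
```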

It seems that Cocob is faster. However, the time taken to find a solution is not really meaningful since it is subject to many different variables. Instead, we can measure how many iterations the algorithms need:

Summary

This was definitely a cool problem to solve in this year's Advent of Code! I hope you enjoyed the explanation of optimization algorithms and especially coin betting. I didn't go into much detail, but if you're interested, this is a very nice topic in optimization and machine learning. You can find many resources online, and there was even a tutorial at last year's ICML. Despite their practical success, I believe these algorithms are still not very well known to practitioners. One of the reasons is that there are not many implementations available. For this reason, last year we started implementing some of them in pytorch, and a library is currently available on my Github profile. The plan is to add more parameter-free algorithms to the library (not only in torch but also in JAX) and extensively test them against other more famous optimization algorithms such as SGD or Adam.