Today, terabytes of video are uploaded to websites such as YouTube and Facebook, while users have very limited time to watch them. One way to address this problem is to generate a "speed-up" video, so users can get a sense of the content in a short time. This is called a "timelapse" video: pick one frame per long time interval and use these frames to generate the output video. Impressive results can be obtained if a stationary camera is used when capturing the video.
Here is a good timelapse video which captures the beauty of Iceland.
When a video is shot with a stationary camera, naive frame skipping is quite effective; however, if there is camera motion, the speed-up process causes severe jitter. Unfortunately, camera motion is very common in most videos distributed today, which are usually taken with hand-held devices like smartphones.
“Hyperlapse” videos aim to solve this problem by smoothing the camera motion in addition to speeding up the video. Our goal is to help people create effective and robust hyperlapse videos.
Hardware-based and software-based methods are the two main approaches to creating hyperlapse videos. We briefly describe them as follows:
The key idea of Joshi's work is to pick a set of frames that are close to the target speed but can also be aligned well, and thus stabilized more easily, in the final hyperlapse video. To meet this requirement, an appropriate cost metric and optimization method must be designed carefully. As illustrated in the following figure, Joshi's method consists of three main stages:
\[C_{r}(i,j) = \frac{1}{n} \displaystyle\sum_{p=1}^{n} \|(x_{p}, y_{p})_{i}^{T} - T(i,j)(x_{p}, y_{p})_{j}^{T}\|_{2}\]
where $(x_{p}, y_{p})$ are corresponding key points selected by the RANSAC process, and $T(i,j)$ is the estimated homography. The cost function penalizes frames that cannot be aligned well or have significant transformations.
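As a concrete illustration, the alignment cost can be computed as below. This is a minimal sketch: the `Homography`, `Point`, `warp`, and `alignmentCost` names are our own, and plain arrays stand in for the OpenCV matrices and the key points produced by the RANSAC step.

```cpp
#include <array>
#include <cmath>
#include <vector>

// Hypothetical minimal types standing in for the OpenCV matrices
// used in our actual implementation.
using Homography = std::array<std::array<double, 3>, 3>;
struct Point { double x, y; };

// Apply a 3x3 homography to a point in homogeneous coordinates.
Point warp(const Homography& T, const Point& p) {
    double xh = T[0][0] * p.x + T[0][1] * p.y + T[0][2];
    double yh = T[1][0] * p.x + T[1][1] * p.y + T[1][2];
    double w  = T[2][0] * p.x + T[2][1] * p.y + T[2][2];
    return {xh / w, yh / w};
}

// Alignment cost C_r(i, j): mean distance between the key points of
// frame i and the corresponding key points of frame j warped by T(i, j).
double alignmentCost(const Homography& T,
                     const std::vector<Point>& pointsI,
                     const std::vector<Point>& pointsJ) {
    double sum = 0.0;
    for (size_t p = 0; p < pointsI.size(); ++p) {
        Point q = warp(T, pointsJ[p]);
        sum += std::hypot(pointsI[p].x - q.x, pointsI[p].y - q.y);
    }
    return sum / pointsI.size();
}
```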
The second term is an overlap cost, which measures the size of the overlap region between frames:
\[ C_{o}(i,j) = \| (x_{0}, y_{0})^{T} - T(i,j)(x_{0}, y_{0})^{T} \|_{2} \]
where $(x_{0}, y_{0})$ is the center of the image. This is equivalent to the magnitude of the translation of the image center between frames.
The two costs are combined into a motion cost:
\[ C_{m}(i,j) = \begin{cases} C_{o}(i,j) &\quad C_{r}(i,j) < \tau_{c} \\ \gamma &\quad C_{r}(i,j) \geq \tau_{c} \\ \end{cases} \]
where $\gamma$ is the maximal cost, and $\tau_{c}$ is a pre-defined threshold that determines whether the homography is reliable.
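Once the homography is known, both costs are cheap to evaluate. The sketch below uses our own hypothetical helper names; in practice $T(i,j)$ comes from OpenCV's homography estimation.

```cpp
#include <array>
#include <cmath>

// Minimal 3x3 homography type (stand-in for an OpenCV matrix).
using Homography = std::array<std::array<double, 3>, 3>;

// Overlap cost C_o(i, j): how far the image center (x0, y0) moves under
// the homography T(i, j); a large translation means a small overlap region.
double overlapCost(const Homography& T, double x0, double y0) {
    double xh = T[0][0] * x0 + T[0][1] * y0 + T[0][2];
    double yh = T[1][0] * x0 + T[1][1] * y0 + T[1][2];
    double w  = T[2][0] * x0 + T[2][1] * y0 + T[2][2];
    return std::hypot(x0 - xh / w, y0 - yh / w);
}

// Motion cost C_m(i, j): trust the overlap cost only when the frames align
// well (C_r below tau_c); otherwise fall back to the maximal cost gamma.
double motionCost(double cr, double co, double tauC, double gamma) {
    return (cr < tauC) ? co : gamma;
}
```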
Figure 4 shows an example of the corresponding motion cost matrix when analyzing frames.
The first one is a speed-up cost:
\[ C_{s}(i,j,\nu) = \min(\| (j-i) - \nu \|_{2}^{2}, \tau_{s}) \]
where $\tau_{s}$ is a pre-defined maximal cost. The term measures the difference between the actual jump from frame $i$ to frame $j$ and the target speed-up rate $\nu$.
The second one is an acceleration cost:
\[ C_{a}(h,i,j) = \min(\| (j-i) - (i-h) \|_{2}^{2}, \tau_{a}) \]
where $\tau_{a}$ is a pre-defined maximal cost. The term suppresses the perceptible visual jumps caused by sudden changes of acceleration.
Finally, the overall cost function used to select optimal frames is the combination of the motion cost, speed-up cost, and acceleration cost:
\[ C(h,i,j,\nu) = C_{m}(i,j) + \lambda_{s}C_{s}(i,j,\nu) + \lambda_{a}C_{a}(h,i,j) \]
where $\lambda_{s}$ and $\lambda_{a}$ are parameters the user can adjust to prefer either smoother camera motion or a more precise speed-up rate.
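Putting the pieces together, the per-step cost can be sketched as follows. The function names are our own, and the motion cost is assumed to be computed elsewhere and passed in directly.

```cpp
#include <algorithm>

// Speed-up cost C_s: squared deviation of the actual jump (j - i) from
// the target rate v, truncated at tau_s.
double speedupCost(int i, int j, double v, double tauS) {
    double d = (j - i) - v;
    return std::min(d * d, tauS);
}

// Acceleration cost C_a: squared change between the successive jumps
// h -> i and i -> j, truncated at tau_a.
double accelerationCost(int h, int i, int j, double tauA) {
    double d = (j - i) - (i - h);
    return std::min(d * d, tauA);
}

// Overall cost C(h, i, j, v) = C_m + lambda_s * C_s + lambda_a * C_a.
double totalCost(double motionCost, int h, int i, int j, double v,
                 double lambdaS, double lambdaA, double tauS, double tauA) {
    return motionCost + lambdaS * speedupCost(i, j, v, tauS)
                      + lambdaA * accelerationCost(h, i, j, tauA);
}
```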
After computing the cost functions, we can determine the optimal frames with a dynamic programming (DP) algorithm in the style of dynamic time warping (DTW). The optimal path is found by locating the minimal accumulated cost and then tracing back to recover the whole sequence of output frames. More details, including pseudocode, can be found in Joshi's paper.
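The selection step can be sketched as the dynamic program below. This is our own simplified reading of the paper's approach, not a faithful reproduction: jumps are limited to a window of `w` frames, the path is forced to end at the last frame, and the first jump passes `h = i` to the cost callback rather than dropping the acceleration term.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <limits>
#include <vector>

// D[i][j] holds the minimal accumulated cost of a path ending with the
// jump i -> j; cost(h, i, j) is the per-step cost C(h, i, j, v).
std::vector<int> optimalPath(int numFrames, int w,
                             const std::function<double(int, int, int)>& cost) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double>> D(numFrames, std::vector<double>(numFrames, INF));
    std::vector<std::vector<int>> prev(numFrames, std::vector<int>(numFrames, -1));

    // Initialization: the first jumps out of frame 0.
    for (int j = 1; j <= w && j < numFrames; ++j)
        D[0][j] = cost(0, 0, j);

    // Forward pass: extend the best path ending with h -> i by a jump i -> j.
    for (int i = 1; i < numFrames; ++i)
        for (int j = i + 1; j <= i + w && j < numFrames; ++j)
            for (int h = std::max(0, i - w); h < i; ++h)
                if (D[h][i] + cost(h, i, j) < D[i][j]) {
                    D[i][j] = D[h][i] + cost(h, i, j);
                    prev[i][j] = h;
                }

    // Find the cheapest path ending at the last frame, then trace it back.
    int last = numFrames - 1, i = -1;
    for (int h = std::max(0, last - w); h < last; ++h)
        if (i < 0 || D[h][last] < D[i][last])
            i = h;

    std::vector<int> path{last};
    int j = last;
    while (i >= 0) {
        path.push_back(i);
        int h = prev[i][j];
        j = i;
        i = h;
    }
    std::reverse(path.begin(), path.end());
    return path;
}
```

For example, with a cost that simply prefers jumps of 2 frames, the DP recovers the evenly spaced path 0, 2, 4, ….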
We use C++ and OpenCV [10] to implement the algorithm, and MFC (Microsoft Foundation Classes) to build our testbed.
In addition, our implementation supports a preview mode, where the user can preview the hyperlapse result and use a slider to interactively change the speed-up rate. To achieve this, we offer a discrete set of speed-up rates (1, 4, 8, 12, 16, 20) and compute the optimal paths for all of these settings. When the user changes the target speed-up rate, we adjust the output accordingly by finding the most appropriate frame in the target optimal path. A more detailed description of our testbed can be found in the Supplementary section.
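The preview lookup itself is straightforward. A sketch with hypothetical names, assuming one precomputed optimal path per discrete rate and a non-empty map:

```cpp
#include <cstdlib>
#include <map>
#include <vector>

// When the slider moves, pick the nearest precomputed speed-up rate, then
// the frame in that path closest to the frame currently being shown, so
// playback continues from roughly the same spot in the video.
int previewFrame(const std::map<int, std::vector<int>>& pathsBySpeed,
                 int requestedSpeed, int currentFrame) {
    // Nearest precomputed speed-up rate.
    int bestSpeed = pathsBySpeed.begin()->first;
    for (const auto& kv : pathsBySpeed)
        if (std::abs(kv.first - requestedSpeed) < std::abs(bestSpeed - requestedSpeed))
            bestSpeed = kv.first;

    // Frame in that path closest to the current position (paths are
    // sorted, so a binary search would also work).
    const std::vector<int>& path = pathsBySpeed.at(bestSpeed);
    int best = path.front();
    for (int f : path)
        if (std::abs(f - currentFrame) < std::abs(best - currentFrame))
            best = f;
    return best;
}
```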
We use several videos to evaluate our algorithm. Table 1 is a brief report of analysis speed. We also implemented the speed optimization method proposed in the paper. The idea is to approximate the overlap cost $C_{o}(i,j)$ by chaining the overlap costs between frames:
\[ C_{o}(i,j) = C_{o}(i,i+1) + C_{o}(i+1,i+2) + \cdots + C_{o}(j-1,j) \]
The formula implies that only the costs between neighboring frames need to be computed, which greatly reduces the amount of computation.
However, when the chained-cost method is applied, the hyperlapse drifts more, especially when the window size is large. Our modification is to use chained costs only when the current scene in the video is nearly still. This heuristic offers a good trade-off, maintaining hyperlapse quality while reducing computation.
We ran our test videos with different optimization methods (no optimization, hybrid, and fully chained cost). The detailed report is shown in Table 1. Strangely, our analysis performance is not as fast as what is claimed in the paper. We suspect the reason is that the Harris corner detection implementation in OpenCV is not optimized. With more time, we would investigate this issue.
The spec of our test machine is:
Video | Resolution | No Optimization (FPS) | Hybrid (FPS) | Fully Chained (FPS) |
---|---|---|---|---|
bike - 1 | 1280×960 | 3.61 | 4.92 | 13.16 |
bike - 2 | 1280×960 | 3.43 | 4.32 | 13.21 |
running | 1280×960 | 3.16 | 3.68 | 12.78 |
climbing | 1280×960 | 3.57 | 4.36 | 13.05 |
building | 1920×1080 | 2.81 | 3.77 | 8.97 |
walking | 1920×1080 | 2.79 | 3.75 | 9.02 |
Average | | 3.23 | 4.13 | 11.70 |
Next, we show some of our hyperlapse videos created with the implemented algorithm. We compare our results with naive sampling. In each video, the result of naive sampling is on the left side, and our result is on the right side. Here is an example:
You can observe that our result is smoother than the naive sampling result; rapid motion changes are much more common with naive sampling.
Here is another example:
In this example, the difference between optimal frame sampling and naive sampling is less significant, although optimal frame sampling is still smoother. We think the reason is that the camera motions in this video are larger and longer than in the other videos, so the default window size is not large enough for the algorithm to skip all frames with large motion. A larger window size should improve the result.
Here is the last example:
Naive sampling also performs well in this example because the camera motions in the video are not very large, so they can be compensated for by a video stabilizer. Compared with optimal frame sampling, however, you can still see more dramatic camera motion changes in the naive sampling video.
Our method might fail in the following situations:
Given the limited period of time, we implemented most features of the algorithm. With more time, we would explore the following directions:
We have implemented various features in our testbed to help us develop the algorithm. We use the following figure to explain what we have implemented.
To be clearer, we split all functionalities of the testbed into five sections. Similar functionalities belonging to the same section are grouped together in the control panel.