Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras


seoyoung02 2022. 2. 20. 16:27

This paper proposes a method that jointly learns depth, egomotion, object motion, and camera intrinsics, with depth predicted from a single image. A motion-prediction network learns the camera motion, the motion of every pixel with respect to the background, and the camera intrinsics (focal lengths, offsets, and distortion), while a separate network learns depth maps. The paper also proposes a loss that accounts for changes in occlusion. The overall loss and the underlying concepts build on a paper I summarized in an earlier post, so reading that one first should help.

Learning the intrinsics

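To reconstruct one frame from another, we need to find where each point lands in the other image (essentially a warping step), and the equation below is used for that. Here $p$ and $p'$ are the homogeneous pixel coordinates before and after the motion, $z$ and $z'$ are the corresponding depths, $K$ is the intrinsic matrix, $R$ is the rotation matrix, and $t$ is the translation vector.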

$z'p' = KRK^{-1}zp+Kt$

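Gradients of the loss therefore propagate through $p'$ to $K$, $R$, $z$, and $t$. Note that $p'$ depends on these quantities only through the products $KRK^{-1}$ and $Kt$, so $KRK^{-1}$ and $Kt$ can be accurate even when the individual estimates of $K$, $R$, and $t$ are not.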
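As a minimal sketch of the warping equation above (not the paper's implementation), the NumPy snippet below warps a single pixel. The function name `warp_pixel` and every numeric value are made up for illustration. It also checks that, with no rotation, a wrong focal length paired with a compensating translation gives the same warp, because only the product $Kt$ enters the equation.

```python
import numpy as np

def warp_pixel(p, z, K, R, t):
    """Apply z'p' = K R K^{-1} z p + K t to one homogeneous pixel p = (u, v, 1)."""
    q = K @ R @ np.linalg.inv(K) @ (z * p) + K @ t
    z_new = q[2]                 # depth of the point after the motion
    return q / z_new, z_new      # normalized pixel (u', v', 1) and z'

# Illustrative values only (assumptions, not from the paper).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
p = np.array([100.0, 50.0, 1.0])
z = 4.0
t = np.array([0.1, 0.0, 0.2])
R = np.eye(3)                    # pure translation, no rotation

p_new, z_new = warp_pixel(p, z, K, R, t)

# A deliberately wrong focal length, with t adjusted so that K_wrong @ t_wrong == K @ t.
K_wrong = K.copy()
K_wrong[0, 0] = K_wrong[1, 1] = 250.0
t_wrong = np.linalg.inv(K_wrong) @ (K @ t)
p_alt, z_alt = warp_pixel(p, z, K_wrong, R, t_wrong)
```

Both calls return the same warped pixel and depth, which is one way to see that translation alone does not pin down $K$.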

Learning object motion

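A pretrained segmentation model provides a mask that marks objects that are "possibly moving". $R$ is assumed to have the same value regardless of object motion, whereas $t$ is allowed to take a different value where an object moves than where the scene is static. In addition, an L1 smoothing penalty is applied to $t$.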

The mask $m(x,y)$ obtained here is combined with the global translation vector $t_{0}$ and the residual translation $\delta t(x,y)$ to build the translation field.

$t(x,y) = t_{0} + m(x,y)\delta t(x,y)$
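The translation field formula can be sketched in a few lines of NumPy. The function name `translation_field`, the array shapes, and the toy values are assumptions chosen for illustration.

```python
import numpy as np

def translation_field(t0, delta_t, mask):
    """Compute t(x, y) = t0 + m(x, y) * delta_t(x, y).

    t0:      (3,)      global (camera) translation
    delta_t: (H, W, 3) per-pixel residual translation
    mask:    (H, W)    1 where an object may be moving, 0 elsewhere
    """
    return t0[None, None, :] + mask[..., None] * delta_t

# Toy 2x2 example: only the top-left pixel belongs to a possibly moving object.
t0 = np.array([0.0, 0.0, 1.0])
delta_t = np.full((2, 2, 3), 0.5)
mask = np.array([[1.0, 0.0],
                 [0.0, 0.0]])
field = translation_field(t0, delta_t, mask)
```

Pixels outside the mask keep the global translation $t_{0}$, and only masked pixels receive the residual.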

Occlusion-aware consistency

When the camera moves, an object that was visible can become occluded, and an object that was hidden can be revealed. Given the depth map and the motion field, we can identify where occlusion will occur and exclude the occluded areas from the consistency loss.

Figure: An illustration of the proposed method for handling occlusions

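Occlusions are found as follows. First, the predicted depth $z_{ij}$ and the camera intrinsic matrix are used to obtain the point $(x_{ij}, y_{ij}, z_{ij})$ in space. Points that the motion field marks as moving are displaced accordingly. Each point is then projected onto the next image at $(i', j')$ with transformed depth $z_{i',j'}'$, and the depth of the next frame at that location, $z_{i',j'}^{t}$, is computed by interpolation. The case $z_{i',j'}' \leq z_{i',j'}^{t}$ means the point is not occluded, and only these points are included in the loss. In other words, the transformed source pixel has to lie in front of (or on) the surface seen in the target frame.

The same computation is repeated with source and target swapped. This is what the paper calls the "occlusion-aware" loss. The figure represents the camera's movement over time as L and R.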
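The visibility test described above can be sketched as follows, assuming the projection step has already produced the fractional target coordinates and the transformed depths. The helper names `bilinear_sample` and `occlusion_aware_mask` are hypothetical, and clamping out-of-image coordinates to the border is a simplification of this sketch, not something taken from the paper.

```python
import numpy as np

def bilinear_sample(depth, rows, cols):
    """Bilinearly interpolate a depth map of shape (H, W) at fractional positions."""
    h, w = depth.shape
    rows = np.clip(rows, 0, h - 1)   # simplification: clamp to the image border
    cols = np.clip(cols, 0, w - 1)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    fr = rows - r0
    fc = cols - c0
    top = (1 - fc) * depth[r0, c0] + fc * depth[r0, c1]
    bottom = (1 - fc) * depth[r1, c0] + fc * depth[r1, c1]
    return (1 - fr) * top + fr * bottom

def occlusion_aware_mask(z_warped, rows, cols, target_depth):
    """True where the transformed source depth z' is not behind the target depth z^t."""
    z_target = bilinear_sample(target_depth, rows, cols)
    return z_warped <= z_target

# Toy example: a flat target surface at depth 5.
target_depth = np.full((4, 4), 5.0)
z_warped = np.array([3.0, 7.0])      # first point is in front, second is behind
rows = np.array([1.5, 2.0])
cols = np.array([1.5, 2.0])
visible = occlusion_aware_mask(z_warped, rows, cols, target_depth)
```

Only entries where `visible` is true would contribute to the consistency loss. Running the same function with the roles of the two frames swapped gives the symmetric mask.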

Networks, losses and regularizations

Networks

The paper uses two networks: one predicts depth, and the other predicts the egomotion, motion field, and other quantities described so far. The depth network is based on UNet and ResNet18, and the motion estimation network is based on FlowNet.

Losses

$
\begin{aligned}
L_{Total} = {} & w_{reconstruction}L_{reconstruction} + w_{smooth}L_{smooth} + w_{ssim}L_{ssim} \\
& + w_{motion \: smoothing}L_{motion \: smoothing} + w_{depth \: consistency}L_{depth \: consistency} \\
& + w_{rotation}L_{rotation} + w_{translation}L_{translation}
\end{aligned}
$

The full loss is written out above. The reconstruction loss and the smoothness loss are similar to those in the paper I linked earlier.
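Structurally, the total loss is just a weighted sum of the individual terms. The sketch below shows that bookkeeping. The function `total_loss`, the term names, and all numbers are placeholders for illustration and are not the weights used in the paper.

```python
def total_loss(terms, weights):
    """Weighted sum of loss terms; both arguments are dicts keyed by term name."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical per-batch loss values and weights (not taken from the paper).
terms = {"reconstruction": 2.0, "smooth": 1.0, "ssim": 0.5}
weights = {"reconstruction": 0.5, "smooth": 0.25, "ssim": 1.0}
loss = total_loss(terms, weights)
```

Keeping the weights in a dictionary makes it easy to switch a term off by setting its weight to zero.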

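Structural similarity (SSIM) is an influential metric for assessing images. It compares the luminance, contrast, and structure of two images, where structure refers to the image after normalization by its mean and standard deviation. These three components are combined into a single SSIM score. Here it can be read as a measure of how similar two images are.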
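To make the three components concrete, here is a simplified SSIM that uses statistics of the whole image instead of the local sliding windows used in practice. The function name `global_ssim` is hypothetical. The constants follow the commonly used defaults $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$, where $L$ is the data range.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM computed from global means, variances, and covariance."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()              # luminance
    var_x, var_y = x.var(), y.var()              # contrast
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()    # structure (correlation)
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

rng = np.random.default_rng(0)
img = rng.random((8, 8))
same = global_ssim(img, img)             # identical images
inverted = global_ssim(img, 1.0 - img)   # contrast-inverted copy
```

Identical images score 1, and the inverted copy scores lower because its covariance with the original is negative. A loss term would typically penalize low similarity, for example through a quantity that decreases as SSIM increases.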

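Depth consistency is the difference between the source depth map and the warped target depth map. The rotation and translation terms measure the discrepancy between the motions estimated for the two images. They seem to be there to keep rotation and translation consistent, since these values should not change unless the camera itself changes.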
