Objects are Different: Flexible Monocular 3D Object Detection(MonoFlex)

seoyoung02 2022. 2. 19. 23:27

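To recover the world center of an object, we need the center of the 3D object projected onto the image ($u_{c}$, $v_{c}$) and the world depth ($z$). The paper estimates these two quantities separately. It can be seen as an extension of CenterNet, which consists of a heatmap head, an offset head, and a dimension head. The heatmap head and offset head of MonoFlex play a similar role, but they locate the projected 3D point, and they compute the center differently for objects whose center lies inside the image and for truncated objects whose center falls outside it. The dimension head differs in that it also predicts orientation and 2D detection along with the dimensions. A depth estimation scheme is added on top of this to obtain the world center. Let us look at each part in more detail.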
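To make the first sentence concrete, here is a minimal sketch of how a projected center and a depth give a 3D center under a plain pinhole camera model. The principal point `c_u`, `c_v` and the function name `unproject` are my own assumptions for illustration; the post itself only names $u_{c}$, $v_{c}$, $z$, and later the focal length $f$.

```python
def unproject(u_c, v_c, z, f, c_u, c_v):
    """Back-project an image point with known depth into camera coordinates.

    Assumes an ideal pinhole camera with a single focal length f (pixels)
    and principal point (c_u, c_v); lens distortion is ignored.
    """
    x = (u_c - c_u) * z / f
    y = (v_c - c_v) * z / f
    return (x, y, z)


# Hypothetical numbers: a point 70 px right of the principal point, 10 m away.
center_3d = unproject(770.0, 200.0, 10.0, f=700.0, c_u=700.0, c_v=200.0)
print(center_3d)  # (1.0, 0.0, 10.0)
```

This is why the rest of the post splits into "where is the projected center" and "how deep is it": once both are known, the 3D position follows directly.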

Decoupled Representations of Objects

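Previous works denote the projected center as $x_{c}$ and the center of the 2D bbox as $x_{b}$, and use the difference between them as the offset $\delta_{c}=x_{c}-x_{b}$. Here, instead, the heatmap head outputs the projected 3D center, and objects are split into inside and outside depending on whether this point lies inside or outside the image, with a different offset for each case.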

Inside

For inside objects, the network regresses the discretization error caused by the downsampling ratio $S$.

$\delta_{in}=\frac{x_{c}}{S} - \lfloor \frac{x_{c}}{S} \rfloor$

Outside

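$x_{I}$ is the intersection between the image edge and the line from $x_{b}$ to $x_{c}$, and it is obtained from the edge heatmap. The 2D center is a confusing target because it only reflects the part of the truncated object that is visible inside the image, whereas $x_{I}$ provides a strong boundary prior.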

$\delta_{out}=\frac{x_{c}}{S} - \lfloor \frac{x_{I}}{S} \rfloor$
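The two offset targets above can be sketched in a few lines. This is only an illustration of the formulas: the function names and the sample coordinates are made up, and the stride `S = 4` is an assumed value.

```python
import math


def inside_offset(x_c, stride):
    """delta_in: fractional part of the projected center on the feature map."""
    return tuple(c / stride - math.floor(c / stride) for c in x_c)


def outside_offset(x_c, x_i, stride):
    """delta_out: projected center relative to the discretized edge point x_I."""
    return tuple(c / stride - math.floor(i / stride) for c, i in zip(x_c, x_i))


# Hypothetical coordinates in input-image pixels, stride S = 4.
delta_in = inside_offset((101.0, 50.0), 4)
delta_out = outside_offset((-20.0, 50.0), (0.0, 48.0), 4)
print(delta_in)   # (0.25, 0.5)
print(delta_out)  # (-5.0, 0.5)
```

Note that $\delta_{in}$ always stays in $[0, 1)$, while $\delta_{out}$ can be large because the projected center may lie far outside the image.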

Edge Fusion

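If the edge region is processed with the same kernels as the interior, it is hard to make location-dependent predictions, so an edge fusion layer is added. Edge feature vectors are built from the four boundaries of the feature map, and 1D convolutions over them learn the features of truncated objects.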
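A rough sketch of the idea, in plain Python: collect the border cells of the feature map into one 1D sequence and run a 1D convolution along it. The clockwise ordering and the circular padding are assumptions I chose for the sketch, and the helper names are hypothetical; a real implementation would do this per channel with learned kernels.

```python
def boundary_indices(height, width):
    """Clockwise (row, col) indices of the feature-map border, each cell once.

    Assumes height >= 2 and width >= 2.
    """
    top = [(0, c) for c in range(width)]
    right = [(r, width - 1) for r in range(1, height)]
    bottom = [(height - 1, c) for c in range(width - 2, -1, -1)]
    left = [(r, 0) for r in range(height - 2, 0, -1)]
    return top + right + bottom + left


def conv1d_circular(signal, kernel):
    """1D convolution along the border; the sequence wraps around (assumed)."""
    n, k = len(signal), len(kernel)
    half = k // 2
    return [
        sum(kernel[j] * signal[(i + j - half) % n] for j in range(k))
        for i in range(n)
    ]


# Single-channel toy feature map (values are arbitrary).
feature = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
edge_vector = [feature[r][c] for r, c in boundary_indices(3, 4)]
fused = conv1d_circular(edge_vector, [1, 1, 1])
```

The interior cells (6 and 7 here) never enter `edge_vector`, which is the point: the border gets its own kernels.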

Loss function

$L_{off}=
\begin{cases}
\left| \delta_{in} - \delta_{in}^{\ast} \right| & \text{ if inside} \\
\log(1+\left| \delta_{out} - \delta_{out}^{\ast} \right|)& \text{ otherwise }
\end{cases}$
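As a sketch, the piecewise loss looks like this. I treat $\left| \cdot \right|$ as the L1 norm over the two image coordinates, which is an assumption on my part; the function name is also hypothetical.

```python
import math


def offset_loss(pred, target, inside):
    """L1 loss for inside objects, log-scale L1 loss for outside objects."""
    err = sum(abs(p - t) for p, t in zip(pred, target))
    return err if inside else math.log(1.0 + err)


# Inside object: small fractional offsets, plain L1.
print(offset_loss((0.5, 0.5), (0.25, 0.5), inside=True))  # 0.25
# Outside object: the error can be large, so the log damps it.
loss_out = offset_loss((-5.0, 0.5), (-3.0, 0.5), inside=False)
```

The log keeps the large offsets of truncated objects from dominating the loss.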

Visual Properties Regression

Recovering the 3D box requires other properties such as dimension and orientation, so the network has heads for them.

2D Detection

The 2D detection follows FCOS (Fully Convolutional One-Stage Object Detection). FCOS predicts the distances from a center point to the bbox boundaries, so the bbox is found from four offsets $l$, $r$, $t$, $b$. MonoFlex regresses these distances from the representative point $x_{r} = (u_{r}, v_{r})$.
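Decoding such a box is simple. The sketch below assumes the usual image convention (u grows to the right, v grows downward) and returns corners as `(left, top, right, bottom)`; the function name is made up.

```python
def decode_box(x_r, distances):
    """Turn a representative point and (l, r, t, b) distances into a 2D box."""
    u_r, v_r = x_r
    l, r, t, b = distances
    return (u_r - l, v_r - t, u_r + r, v_r + b)


# Hypothetical values in pixels.
print(decode_box((100.0, 60.0), (10.0, 20.0, 5.0, 15.0)))
# (90.0, 55.0, 120.0, 75.0)
```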

Dimension

The training-set averages $\begin{bmatrix} \bar{h} & \bar{w} & \bar{l} \end{bmatrix}$ are computed in advance (per class $c$), and the network regresses a log-scale offset $\delta_{k}$ from them.

Loss

$L_{dim} = \sum_{k \in \{ h,w,l \}} \left| \bar{k}_{c}e^{\delta_{k}} - k^{\ast} \right|$
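A small sketch of the decoding and the loss. The mean dimensions used here are placeholder numbers, not statistics from any dataset, and the function names are hypothetical.

```python
import math


def decode_dims(mean_dims, deltas):
    """Apply log-scale offsets to the mean (h, w, l)."""
    return tuple(m * math.exp(d) for m, d in zip(mean_dims, deltas))


def dim_loss(mean_dims, deltas, gt_dims):
    """L1 loss between decoded and ground-truth dimensions."""
    pred = decode_dims(mean_dims, deltas)
    return sum(abs(p - g) for p, g in zip(pred, gt_dims))


mean_hwl = (1.5, 1.6, 3.9)  # placeholder means, in meters
print(decode_dims(mean_hwl, (0.0, 0.0, 0.0)))  # (1.5, 1.6, 3.9)
loss = dim_loss(mean_hwl, (0.0, 0.0, 0.0), (1.5, 1.6, 4.0))
```

A zero offset reproduces the mean, so the network only has to learn how much an object deviates from the typical size, and the exponential keeps the decoded size positive.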

Orientation

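As in many 3D object detectors, only the yaw is estimated. The local orientation $\alpha$ is estimated with the MultiBin loss, and the viewing ray is estimated from the center point. MultiBin divides the angle range into N overlapping bins, selects one of them, and regresses the residual rotation within that bin. The final global orientation is $r_{y}=\alpha + \arctan(x/z)$.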
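Here is a simplified sketch of the decoding. The bin centers, the scalar residual (MultiBin implementations often regress the sine and cosine of the residual instead), and the function names are all assumptions for illustration. `atan2(x, z)` equals $\arctan(x/z)$ for objects in front of the camera ($z > 0$).

```python
import math


def decode_alpha(bin_scores, residuals, bin_centers):
    """Pick the most confident bin and add its residual to the bin center."""
    best = max(range(len(bin_scores)), key=bin_scores.__getitem__)
    return bin_centers[best] + residuals[best]


def global_yaw(alpha, x, z):
    """r_y = alpha + arctan(x / z), wrapped to [-pi, pi)."""
    r_y = alpha + math.atan2(x, z)
    return (r_y + math.pi) % (2.0 * math.pi) - math.pi


# Four hypothetical bins centered at 0, 90, 180 and -90 degrees.
centers = [0.0, math.pi / 2, math.pi, -math.pi / 2]
alpha = decode_alpha([0.1, 0.7, 0.2, 0.0], [0.0, 0.1, 0.0, 0.0], centers)
r_y = global_yaw(alpha, x=1.0, z=1.0)
```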

Keypoints

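The keypoints are the points of the 3D bbox projected onto the image, and they are predicted by a different layer from the three properties above. The resulting keypoints are used to compute depth. There are 10 points in total: the 8 corners plus 2 more (the centers of the top and bottom faces). Only the keypoints that fall inside the image after projection contribute to the L1 loss.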

Loss

$L_{key}=\frac{\sum_{i=1}^{N_{k}}I_{in}(k_{i})\left| \delta_{ki} - \delta_{ki}^{\ast} \right|}{\sum_{i=1}^{N_{k}}I_{in}(k_{i})}$
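The masked average can be sketched as follows. Returning 0 when no keypoint is visible is my own choice to avoid dividing by zero, and the function name is hypothetical.

```python
def keypoint_loss(pred, target, inside):
    """Mean L1 error over the keypoints whose projection is inside the image."""
    total = sum(
        abs(pu - tu) + abs(pv - tv)
        for (pu, pv), (tu, tv), visible in zip(pred, target, inside)
        if visible
    )
    count = sum(1 for visible in inside if visible)
    return total / count if count else 0.0


# Two hypothetical keypoint offsets; the second one is outside the image.
pred = [(1.0, 2.0), (5.0, 5.0)]
target = [(0.0, 0.0), (0.0, 0.0)]
print(keypoint_loss(pred, target, [True, False]))  # 3.0
```

The second keypoint has a large error, but it is masked out and does not affect the result.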

Adaptive Depth Ensemble

The geometric solutions from the keypoints (M of them) and the direct regression (1) are combined adaptively.
The geometric depth is regressed from the height, because height has the smallest relative error among the dimensions.

Direct regression

The network output $z_{o}$ is converted to the absolute depth $z_{r}$ with an inverse sigmoid transformation.

$z_{r}=\frac{1}{\sigma(z_{o})}-1, \sigma(x)=\frac{1}{1+e^{-x}}$

The loss uses $\sigma_{dep}$, which is the uncertainty output of the depth regression head. (Gaussian YOLOv3 says that the variance represents the uncertainty, so maybe that is why the symbol $\sigma$ is used..?) The final loss function is as follows.

$L_{dep}=\frac{\left| z_{r} - z^{\ast} \right|}{\sigma_{dep}}+\log(\sigma_{dep})$
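A minimal sketch of the depth decoding and the uncertainty-weighted loss, following the two formulas above. The function names are made up, and `sigma` is assumed to be strictly positive.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_depth(z_o):
    """Inverse sigmoid transformation: z_r = 1 / sigmoid(z_o) - 1."""
    return 1.0 / sigmoid(z_o) - 1.0


def depth_loss(z_r, z_gt, sigma):
    """L1 error scaled by the uncertainty, plus a log penalty on sigma."""
    return abs(z_r - z_gt) / sigma + math.log(sigma)


print(decode_depth(0.0))            # 1.0
print(depth_loss(10.0, 12.0, 1.0))  # 2.0
```

Substituting the sigmoid shows that $z_{r}$ simplifies to $e^{-z_{o}}$, so the decoded depth is always positive. In the loss, a larger $\sigma_{dep}$ shrinks the error term but is penalized by $\log(\sigma_{dep})$, so the network cannot simply claim to be uncertain everywhere.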

Depth from keypoints

The world depth is computed from the object height $H$ given by the dimension head, the pixel height $h_{l}$, and the focal length $f$.

$z_{l}=\frac{f \times H}{h_{l}}$

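The pixel height is obtained from the keypoints in three ways (the center edge and the two diagonal pairs of edges), and each pixel height is converted to a depth with $z_{l}=\frac{f \times H}{h_{l}}$. Here $k_{i}$ denotes the vertical image coordinate of the $i$-th keypoint.

$z_{c} = \frac{f \times H}{h_{c}}, h_{c} = k_{9} - k_{10}$

$z_{d_{1}} = \frac{z_{1} +z_{3}}{2} , h_{1} = k_{5} - k_{1}, h_{3} = k_{7} - k_{3}$

$z_{d_{2}} = \frac{z_{2} +z_{4}}{2} , h_{2} = k_{6} - k_{2}, h_{4} = k_{8} - k_{4}$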
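Putting the pieces together: each vertical edge gives a pixel height, the pixel height gives a depth through $z = f \times H / h$, and the depths are averaged per group. The sketch below follows the keypoint pairing written above; taking the absolute value of the difference is my own choice, since the sign depends on which keypoint of each pair is on top. The function names are hypothetical.

```python
def depth_from_height(f, H, h_pix):
    """Pinhole relation between object height, pixel height and depth."""
    return f * H / h_pix


def keypoint_depths(f, H, v):
    """Return (z_c, z_d1, z_d2) from vertical keypoint coordinates.

    v[0] is the vertical coordinate of k_1, ..., v[9] that of k_10.
    """
    def edge_depth(a, b):
        return depth_from_height(f, H, abs(v[a - 1] - v[b - 1]))

    z_c = edge_depth(9, 10)
    z_d1 = (edge_depth(5, 1) + edge_depth(7, 3)) / 2.0
    z_d2 = (edge_depth(6, 2) + edge_depth(8, 4)) / 2.0
    return z_c, z_d1, z_d2


# Hypothetical box whose vertical edges are all 50 px tall.
v = [200.0, 200.0, 200.0, 200.0, 150.0, 150.0, 150.0, 150.0, 150.0, 200.0]
print(keypoint_depths(700.0, 1.5, v))  # (21.0, 21.0, 21.0)
```

In a real scene the three estimates disagree, which is what the ensemble step is for.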

The loss uses the uncertainty in the same way as the direct regression, and a depth is not used when its keypoints are not inside the image.

$L_{kd} = \sum_{k \in \{ c, d_{1}, d_{2} \}} [\frac{\left| z_{k} - z^{\ast} \right|}{\sigma_{k}} + I_{in}(z_{k})\log(\sigma_{k})]$

Uncertainty Guided Ensemble

The depth that is actually used is a weighted combination of the M+1 depths, where the more confident estimates (those with lower uncertainty) receive a larger weight.

$z_{soft} = (\sum_{i=1}^{M+1}\frac{z_{i}}{\sigma_{i}})/(\sum_{i=1}^{M+1}\frac{1}{\sigma_{i}})$
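The weighted average is short enough to write out. The function name and the sample values are made up; `sigmas` are assumed to be strictly positive.

```python
def soft_depth(depths, sigmas):
    """Inverse-uncertainty weighted average of the depth estimates."""
    weights = [1.0 / s for s in sigmas]
    return sum(w * z for w, z in zip(weights, depths)) / sum(weights)


# Equal uncertainty: plain average.
print(soft_depth([20.0, 22.0], [1.0, 1.0]))  # 21.0
# The second estimate is 4x less certain, so the result stays near 20.
print(soft_depth([20.0, 30.0], [1.0, 4.0]))  # 22.0
```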

Integral corner loss

Once the depth has been obtained as above, the 8 corners can be computed from the estimated dimension, orientation, offset, and soft depth, and an L1 loss on these corners is added as one more loss term.

$L_{corner}=\sum_{i=1}^{8}\left| v_{i} - v_{i}^{\ast} \right|$
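Below is a sketch of how the 8 corners could be built and compared. It assumes a KITTI-style camera frame (x right, y down, z forward, yaw about the y axis) and takes the box center as the 3D center; these conventions and the function names are assumptions, not something this post specifies.

```python
import math


def box_corners(center, dims, yaw):
    """8 corners of a 3D box from its center, (h, w, l) and yaw about y."""
    cx, cy, cz = center
    h, w, l = dims
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dy in (-h / 2, h / 2):
        for dx, dz in ((l / 2, w / 2), (l / 2, -w / 2),
                       (-l / 2, -w / 2), (-l / 2, w / 2)):
            # Rotate the local (dx, dz) offset by yaw, then translate.
            corners.append((cx + c * dx + s * dz, cy + dy, cz - s * dx + c * dz))
    return corners


def corner_loss(pred, target):
    """Sum of L1 distances between matching corners."""
    return sum(abs(p - t) for pc, tc in zip(pred, target) for p, t in zip(pc, tc))


gt = box_corners((0.0, 0.0, 10.0), (2.0, 2.0, 4.0), 0.0)
pred = box_corners((0.0, 0.0, 11.0), (2.0, 2.0, 4.0), 0.0)
print(corner_loss(pred, gt))  # 8.0
```

A depth error of 1 m moves all 8 corners by 1 m, so this loss ties the separately predicted quantities together: an error in any one of them shows up in the corners.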

License: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)
