SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation

seoyoung02 2022. 2. 18. 15:26

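This paper proposes a model that recovers the real-world bounding box of an object (a vehicle) detected in an image by splitting the box into center, size, and angle and estimating each part.

It proposes a one-stage architecture that obtains a vehicle's real-world information from a single image (3D object detection), together with a scheme that estimates the center, size, and angle separately (the multi-step disentanglement approach).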

Backbone

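- Uses a modified DLA-34 (Deep Layer Aggregation)

- Replaces every hierarchical aggregation connection with DCN (Deformable Convolutional Networks)

Hierarchical aggregation is a connection that combines different features within the same level of the hierarchy.

DCN learns an offset layer so that, when building a feature map, the conv layer is not tied to fixed positions and can sample values from locations outside its regular nxn window.

- BN → GN (Group Norm)

With a small batch, BN performs considerably worse than with a large batch, and GN is a way to compensate for this. It splits the channels into N groups and normalizes within each group.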
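To make the grouping concrete, here is a minimal NumPy sketch of the normalization step only. It is an illustration under my own assumptions, not the paper's code: the function name `group_norm` and the `(batch, channels, height, width)` layout are hypothetical, and the learnable scale and shift that a real GN layer has are left out.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize x of shape (batch, channels, height, width) within channel groups.

    Sketch only: the learnable per-channel scale/shift of a real GN layer is omitted.
    """
    b, c, h, w = x.shape
    assert c % num_groups == 0, "channels must be divisible by num_groups"
    # Split the channel axis into (groups, channels_per_group).
    g = x.reshape(b, num_groups, c // num_groups, h, w)
    # Statistics are computed per sample and per group, so batch size does not matter.
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(b, c, h, w)

x = np.random.default_rng(0).normal(size=(2, 8, 4, 4))
y = group_norm(x, num_groups=4)
```

Because the mean and variance never mix samples, the result for one image is the same whether the batch holds 2 images or 64, which is the property that makes GN attractive for small batches.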

3D Detection Network

The feature map from the backbone is passed through two parallel branches: a keypoint branch and a regression branch.

Keypoint branch

The keypoint is the 3D center projected onto the image plane ($x_{c}$, $y_{c}$). As in the expression below, it is obtained by projecting the actual object center (in camera coordinates) with the intrinsic parameters.
$\begin{bmatrix}
z\cdot x_{c}\newline
z\cdot y_{c}\newline
z
\end{bmatrix}=K_{3\times 3}\begin{bmatrix}
x\newline
y\newline
z
\end{bmatrix}$
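The projection above can be sketched in a few lines of NumPy. The intrinsic matrix `K` and the 3D point are made-up values for illustration, and `project_center` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical intrinsics (focal length 700 px, principal point at (600, 180)).
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])

def project_center(K, center_3d):
    """Project a 3D center [x, y, z] in camera coordinates to pixel (x_c, y_c)."""
    uvw = K @ np.asarray(center_3d, dtype=float)  # equals [z*x_c, z*y_c, z]
    return uvw[:2] / uvw[2]                       # divide by depth z

x_c, y_c = project_center(K, [2.0, 1.0, 20.0])
```

With these numbers the product is [13400, 4300, 20], so the keypoint lands at pixel (670, 215).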

Regression branch

To obtain the 3D information, the branch produces the following 8 outputs.
$$
\begin{equation}
\tau = [\delta_{z}, \delta_{x_{c}}, \delta_{y_{c}}, \delta_{h}, \delta_{w}, \delta_{l}, \sin(\alpha), \cos(\alpha)]
\end{equation}
$$

Let's look at each output in detail.
Position

$\delta_{z}$ is the depth offset. The depth $z$ is computed as $z=\mu_{z}+\delta_{z}\sigma_{z}$, where $\mu_{z}$ and $\sigma_{z}$ are predefined values. To recover the actual center $\begin{bmatrix} x & y & z \end{bmatrix}$, the projected 3D center found by the keypoint branch is used. $\delta_{x_{c}}$ and $\delta_{y_{c}}$ compensate for the difference introduced by downsampling (the discretization offset). $\begin{bmatrix} x & y & z \end{bmatrix}$ can therefore be obtained as follows.

$\begin{bmatrix}
x\newline
y\newline
z
\end{bmatrix}=K_{3\times 3}^{-1}\begin{bmatrix}
z\cdot (x_{c}+\delta_{x_{c}})\newline
z\cdot (y_{c}+\delta_{y_{c}})\newline
z
\end{bmatrix}$
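As a rough sketch of this back-projection, assuming made-up values for `K`, $\mu_{z}$ and $\sigma_{z}$ (the real ones depend on the camera and dataset), and ignoring any rescaling between the downsampled feature map and the full image:

```python
import numpy as np

# Hypothetical intrinsics and depth statistics, for illustration only.
K = np.array([[700.0, 0.0, 600.0],
              [0.0, 700.0, 180.0],
              [0.0, 0.0, 1.0]])
MU_Z, SIGMA_Z = 28.0, 16.0

def decode_position(K, x_c, y_c, delta_z, delta_xc, delta_yc,
                    mu_z=MU_Z, sigma_z=SIGMA_Z):
    """Recover [x, y, z] from the keypoint and the regressed offsets."""
    z = mu_z + delta_z * sigma_z                 # depth from the offset
    pixel = np.array([z * (x_c + delta_xc),
                      z * (y_c + delta_yc),
                      z])
    # Solving K p = pixel is equivalent to multiplying by K^{-1}.
    return np.linalg.solve(K, pixel)

position = decode_position(K, x_c=669.5, y_c=214.75,
                           delta_z=-0.5, delta_xc=0.5, delta_yc=0.25)
```

Here $z = 28 - 0.5 \cdot 16 = 20$ and the corrected keypoint is (670, 215), so the recovered center is [2, 1, 20]. `np.linalg.solve` is used instead of an explicit inverse purely as a numerical habit.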

Size
$\delta_{h,w,l}$ are the residual dimensions. For a given dataset, the mean dimensions per category ($\begin{bmatrix} \bar{h} & \bar{w} & \bar{l} \end{bmatrix}$) are computed in advance, and applying the residual dimension offsets gives the actual size as follows.
$\begin{bmatrix}
h\newline
w\newline
l
\end{bmatrix}=\begin{bmatrix}
\bar{h}\cdot e^{\delta_{h}}\newline
\bar{w}\cdot e^{\delta_{w}}\newline
\bar{l}\cdot e^{\delta_{l}}
\end{bmatrix}$
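A small sketch of this decoding step follows. The mean dimensions are placeholder numbers I picked for a car-like category, not values from the paper.

```python
import numpy as np

def decode_dimensions(mean_dims, deltas):
    """Scale the category mean [h, w, l] by exp(delta) per dimension."""
    return np.asarray(mean_dims, dtype=float) * np.exp(np.asarray(deltas, dtype=float))

mean_dims = [1.5, 1.6, 3.9]   # hypothetical mean h, w, l in meters
dims = decode_dimensions(mean_dims, [0.0, 0.1, -0.1])
```

A zero offset returns the mean itself, a positive offset enlarges the dimension and a negative one shrinks it, and the exponential keeps every size positive.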

Angle

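For the angle, pitch and roll are assumed to be 0 and only yaw is estimated. The yaw is split into two parts: the local angle, which is the angle at which the vehicle appears in the image, and the ray angle, which is the angle between the camera and the actual vehicle center.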

Figure: local angle $\theta_{l}$ and ray angle $\theta_{ray}$ (from https://arxiv.org/pdf/1612.00496.pdf)

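In this paper, $\alpha$ corresponds to $\theta_{l}$ in the figure above. $\alpha$ is recovered from the regression branch outputs $\sin(\alpha)$ and $\cos(\alpha)$, and $\theta_{ray}$ is computed as $\arctan(x/z)$. The final object angle is given by the formula below, where $\alpha_{z}$ is the observation angle obtained from $\alpha$.

$\theta = \alpha_{z}+\arctan(\frac{x}{z})$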
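A minimal sketch of the yaw computation, with two simplifications of mine: the angle recovered from the sin/cos pair is used directly as $\alpha_{z}$, and the result is wrapped to $[-\pi, \pi)$ for convenience, which the formula itself does not require.

```python
import numpy as np

def decode_yaw(sin_alpha, cos_alpha, x, z):
    """Combine the local angle with the ray angle arctan(x / z)."""
    alpha = np.arctan2(sin_alpha, cos_alpha)    # angle from its sin/cos pair
    theta = alpha + np.arctan(x / z)            # add the ray angle
    # Wrap to [-pi, pi); an added convenience, not part of the formula.
    return (theta + np.pi) % (2.0 * np.pi) - np.pi

theta_front = decode_yaw(0.0, 1.0, x=0.0, z=10.0)
theta_side = decode_yaw(1.0, 0.0, x=10.0, z=10.0)
```

An object straight ahead with $\alpha = 0$ gets yaw 0, while $\alpha = \pi/2$ seen along a 45 degree ray gets $\pi/2 + \pi/4 = 3\pi/4$.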

Final
The position, size, and angle obtained above give the 8 corner points of the real 3D bbox.
$B=R_{\theta}\begin{bmatrix}
\pm h/2\newline
\pm w/2\newline
\pm l/2
\end{bmatrix}+\begin{bmatrix}
x\newline
y\newline
z
\end{bmatrix}$
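The corner construction can be sketched as below. Two things are assumptions on my part: $R_{\theta}$ is taken to be a rotation about the camera's y axis (the yaw axis), and the h, w, l ordering is copied literally from the equation, which may differ from the axis convention a particular dataset or implementation uses.

```python
import numpy as np
from itertools import product

def box_corners(position, dims, theta):
    """Return the 8 corners (shape (8, 3)) of a yaw-rotated 3D box."""
    h, w, l = dims
    c, s = np.cos(theta), np.sin(theta)
    # Assumed: yaw is a rotation about the camera y axis.
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    # All 8 sign combinations of (+-h/2, +-w/2, +-l/2), ordered as in the equation.
    signs = np.array(list(product([1.0, -1.0], repeat=3)))
    offsets = signs * np.array([h, w, l]) / 2.0
    return offsets @ R.T + np.asarray(position, dtype=float)

corners = box_corners([2.0, 1.0, 20.0], (1.5, 1.6, 3.9), theta=0.0)
rotated = box_corners([2.0, 1.0, 20.0], (1.5, 1.6, 3.9), theta=0.7)
```

Whatever the yaw, the corners stay centered on the position, and the extent along the rotation axis is unchanged.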

Loss function

Keypoint classification loss

A penalty-reduced focal loss is used. $s_{i,j}$ is the predicted score at heatmap location ($i$, $j$), and $y_{i,j}$ is the ground-truth value for the keypoint branch at that location. $\breve{y_{i,j}}$ and $\breve{s_{i,j}}$ are defined as follows.

$$
\breve{y_{i,j}}=
\begin{cases}
0 & \text{ if } y_{i,j}= 1\newline
y_{i,j}& \text{ otherwise }
\end{cases},
\breve{s_{i,j}}=
\begin{cases}
s_{i,j} & \text{ if } y_{i,j}= 1\newline
1-s_{i,j}& \text{ otherwise }
\end{cases}
$$

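For simplicity, only a single object class is considered here. In the loss, $N$ is the number of keypoints in the image, and the exponents $\alpha$ and $\beta$ are focal-loss hyperparameters, unrelated to the angle $\alpha$ above.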

$$
L_{cls}=-\frac{1}{N}\sum_{i,j=1}^{h,w}(1-\breve{y_{i,j}})^{\beta}(1-\breve{s_{i,j}})^{\alpha}\log(\breve{s_{i,j}})
$$
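Here is a NumPy sketch of this loss for a single-class heatmap. The default exponents `alpha=2` and `beta=4` are a common choice for this kind of focal loss and are my assumption, as are the clipping constant and the function name.

```python
import numpy as np

def keypoint_focal_loss(s, y, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss for one heatmap.

    s: predicted scores in (0, 1); y: ground-truth heatmap, exactly 1 at keypoints.
    alpha/beta defaults are assumed values, not taken from the post.
    """
    s = np.clip(np.asarray(s, dtype=float), eps, 1.0 - eps)  # keep log finite
    y = np.asarray(y, dtype=float)
    pos = (y == 1.0)
    y_b = np.where(pos, 0.0, y)        # breve y: 0 at keypoints, y elsewhere
    s_b = np.where(pos, s, 1.0 - s)    # breve s: s at keypoints, 1 - s elsewhere
    n = max(int(pos.sum()), 1)         # N = number of keypoints (at least 1)
    return -np.sum((1.0 - y_b) ** beta * (1.0 - s_b) ** alpha * np.log(s_b)) / n

y = np.zeros((3, 3))
y[1, 1] = 1.0
loss_good = keypoint_focal_loss(0.98 * y + 0.01, y)       # confident and correct
loss_bad = keypoint_focal_loss(np.full((3, 3), 0.5), y)   # undecided everywhere
```

At a keypoint the term reduces to $(1-s)^{\alpha}\log(s)$, and elsewhere to $(1-y)^{\beta}s^{\alpha}\log(1-s)$, so locations near a keypoint (where $y$ is close to 1) are penalized less for a high score.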

Regression loss

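Each feature map location yields an 8D tuple $\tau$, and an activation function is applied per channel: a sigmoid for the size (dimension) channels and L2 normalization for the angle channels. In the expression below, $o$ denotes the raw network output.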

$\begin{bmatrix}
\delta_{h}\newline
\delta_{w}\newline
\delta_{l}
\end{bmatrix} = \sigma(\begin{bmatrix}
o_{h}\newline
o_{w}\newline
o_{l}
\end{bmatrix})-\frac{1}{2}, \begin{bmatrix}
\sin(\alpha)\newline
\cos(\alpha)
\end{bmatrix}=\begin{bmatrix}
o_{sin}/\sqrt{o_{sin}^{2}+o_{cos}^{2}}\newline
o_{cos}/\sqrt{o_{sin}^{2}+o_{cos}^{2}}
\end{bmatrix}$
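A sketch of these channel-wise activations, assuming the 8 raw outputs are ordered as in $\tau$ (depth, two center offsets, three dimensions, sin, cos). The function name and the small guard against a zero norm are mine.

```python
import numpy as np

def activate_regression(o):
    """Apply sigmoid - 1/2 to the size channels and L2-normalize the angle pair."""
    o = np.asarray(o, dtype=float)
    out = o.copy()                                     # channels 0..2 pass through
    out[3:6] = 1.0 / (1.0 + np.exp(-o[3:6])) - 0.5     # delta_h, delta_w, delta_l
    norm = max(np.sqrt(o[6] ** 2 + o[7] ** 2), 1e-12)  # guard against a zero vector
    out[6:8] = o[6:8] / norm                           # (sin, cos) on the unit circle
    return out

tau = activate_regression([0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 3.0, 4.0])
```

The shifted sigmoid bounds each size offset to $(-0.5, 0.5)$, and the normalization guarantees $\sin^{2}(\alpha)+\cos^{2}(\alpha)=1$; in the example the raw pair (3, 4) becomes (0.6, 0.8).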

The regression loss is the L1 distance between the predicted $\hat{B}$ and the ground-truth $B$.
$$
\begin{equation}
L_{reg}=\frac{\lambda}{N}\left\Vert \hat{B}-B \right\Vert_{1}
\end{equation}
$$

Final loss
$$
L = L_{cls}+\sum_{i=1}^{3}L_{reg}(\hat{B_{i}})
$$
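One way to read the sum over $i = 1, 2, 3$ is as the disentanglement mentioned at the top: the three groups are position, size, and angle, and each $\hat{B_{i}}$ is built from the prediction for one group while the other two groups come from the ground truth. The sketch below follows that reading. The helper names, the dict layout, the y-axis rotation, and `lam = 1` are all assumptions for illustration.

```python
import numpy as np
from itertools import product

def box_corners(position, dims, theta):
    """8 corners of a box rotated about the (assumed) camera y axis."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    signs = np.array(list(product([1.0, -1.0], repeat=3)))
    offsets = signs * np.asarray(dims, dtype=float) / 2.0
    return offsets @ R.T + np.asarray(position, dtype=float)

def l1_corner_loss(pred_corners, gt_corners, lam=1.0, n=1):
    """L_reg = (lam / N) * || B_hat - B ||_1 over all corner coordinates."""
    return lam / n * np.abs(pred_corners - gt_corners).sum()

def disentangled_reg_loss(pred, gt, lam=1.0, n=1):
    """Sum of three L1 losses, each using one predicted group and ground truth for the rest."""
    gt_corners = box_corners(gt["position"], gt["dims"], gt["theta"])
    total = 0.0
    for key in ("position", "dims", "theta"):
        mixed = dict(gt)          # start from ground truth
        mixed[key] = pred[key]    # swap in the prediction for one group only
        corners = box_corners(mixed["position"], mixed["dims"], mixed["theta"])
        total += l1_corner_loss(corners, gt_corners, lam, n)
    return total

gt = {"position": np.array([2.0, 1.0, 20.0]), "dims": (1.5, 1.6, 3.9), "theta": 0.3}
pred = dict(gt, position=gt["position"] + np.array([1.0, 0.0, 0.0]))
loss = disentangled_reg_loss(pred, gt)
```

In the example only the position is off by 1 m along x, so only the position term is non-zero: 8 corners times 1 m gives a loss of 8. An error in one group therefore never leaks into the loss terms of the other two.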

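Evaluation metric
- AP: computed from the precision-recall curve. There are 11-point, all-point, and 40-point methods.
- $R_{11} = \{0, 0.1, 0.2, \ldots, 1\}$: the mean of the precision at the recall values in the set $R_{11}$
- all-point: apparently the usual area under the curve
- $R_{40} = \{1/40, 2/40, \ldots, 1\}$: as with $R_{11}$, the mean of the precision at the recall values in the set
- AOS: average orientation score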
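The 11-point and 40-point variants differ only in the set of recall values, which a short sketch makes visible. I assume the usual interpolated precision here (the maximum precision at any recall at or above the sampled value); the function name and the toy curves are made up.

```python
import numpy as np

R11 = np.linspace(0.0, 1.0, 11)          # {0, 0.1, ..., 1}
R40 = np.linspace(1.0 / 40.0, 1.0, 40)   # {1/40, 2/40, ..., 1}

def interpolated_ap(recall, precision, recall_points):
    """Mean interpolated precision over the given recall sample points."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    total = 0.0
    for r in recall_points:
        mask = recall >= r - 1e-9        # small tolerance for float sample points
        # Assumed interpolation: best precision at recall >= r, or 0 if unreachable.
        total += precision[mask].max() if mask.any() else 0.0
    return total / len(recall_points)

ap11 = interpolated_ap([0.5], [1.0], R11)
ap40 = interpolated_ap([0.5], [1.0], R40)
```

For a detector that reaches recall 0.5 with perfect precision and nothing beyond, $R_{11}$ gives 6/11 (the point at recall 0 counts as a hit) while $R_{40}$ gives 20/40, which shows how dropping the recall-0 point changes the score.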
