arXiv:2606.12047v1 Announce Type: cross
Abstract: In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes the accident...