arXiv:2606.12412v1 Announce Type: cross Abstract: Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a...
Läs hela artikeln hos källan.
Kommentarer (0)
Inga kommentarer ännu. Bli först med att kommentera!