INTELLIGENT CROSS-MODAL SEARCH: THE EVOLUTION OF VIDEO-TEXT RETRIEVAL TECHNOLOGY

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.



