INTELLIGENT CROSS-MODAL SEARCH: THE EVOLUTION OF VIDEO-TEXT RETRIEVAL TECHNOLOGY
DOI: https://doi.org/10.23960/jitet.v13i3.6607
Abstract
This study examines scientific trends and current approaches in the field of video-text retrieval through a bibliometric analysis. The urgency of this topic is driven by the rapid growth of audiovisual data and the increasing need for retrieval systems that can understand the semantic relationship between text and video. Bibliometric data were collected with Harzing's Publish or Perish software using Google Scholar as the main source, then visualized and analyzed with VOSviewer to identify thematic clusters, keyword distributions, and author collaboration patterns. The InternVid dataset was used as an exploration reference because it provides millions of video clips with semantically rich text annotations. The analysis reveals five main clusters that illustrate the thematic directions of the field, including the development of cross-modal representation models, retrieval performance evaluation, and large-scale dataset construction. Human-perception-based benchmarks such as VBench are also used to enrich the evaluative perspective beyond quantitative metrics. This study contributes to mapping the knowledge structure of the multimodal retrieval field and opens opportunities for the development of more contextual, adaptive, and user-oriented intelligent retrieval systems.
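To make the data-processing step concrete, the following minimal Python sketch (not part of the study) illustrates one way to count keyword co-occurrences from a Publish or Perish CSV export before the resulting network is visualized in VOSviewer. The file name "pop_google_scholar_export.csv", the "Keywords" column, and the semicolon separator are assumptions about the export format, not details reported by the authors.

# Minimal sketch (assumed export format): keyword co-occurrence counts
# from a Publish or Perish CSV export, as a precursor to VOSviewer mapping.
import csv
from collections import Counter
from itertools import combinations

def keyword_cooccurrence(csv_path: str) -> Counter:
    """Count how often pairs of keywords appear together in one record."""
    pairs = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Normalize and de-duplicate keywords within a single record.
            keywords = sorted({
                kw.strip().lower()
                for kw in row.get("Keywords", "").split(";")
                if kw.strip()
            })
            for a, b in combinations(keywords, 2):
                pairs[(a, b)] += 1
    return pairs

if __name__ == "__main__":
    cooc = keyword_cooccurrence("pop_google_scholar_export.csv")
    for (a, b), n in cooc.most_common(10):
        print(f"{a} -- {b}: {n}")

The ten most frequent keyword pairs printed by this sketch correspond to the strongest co-occurrence links that VOSviewer would group into thematic clusters.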
References
Z. Chen et al., “InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 24185–24198, 2024, doi: 10.1109/CVPR52733.2024.02283.
A. Yang et al., “Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2023-June, pp. 10714–10726, 2023, doi: 10.1109/CVPR52729.2023.01032.
Q. Sun et al., “Emu: Generative Pretraining in Multimodality,” 12th Int. Conf. Learn. Represent. (ICLR), pp. 1–29, 2024.
J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang, “ModelScope Text-to-Video Technical Report,” 2023, [Online]. Available: http://arxiv.org/abs/2308.06571
R. Huang, J. Huang, D. Yang, Y. Ren, M. Li, and Z. Ye, “Make-An-Audio: Text-to-Audio Generation with Prompt-Enhanced Diffusion Models,” 2023.
Y. Wang et al., “InternVid: A Large-Scale Video-Text Dataset for Multimodal Understanding and Generation,” 12th Int. Conf. Learn. Represent. (ICLR), pp. 1–23, 2024.
Z. Huang et al., “VBench: Comprehensive Benchmark Suite for Video Generative Models,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1–12, 2024, doi: 10.1109/CVPR52733.2024.02060.
J. Wang et al., “Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 16551–16560, 2024, doi: 10.1109/CVPR52733.2024.01566.
Y. Wang, X. Meng, J. Liang, Y. Wang, Q. Liu, and D. Zhao, “HawkEye: Training Video-Text LLMs for Grounding Text in Videos,” pp. 1–23, 2024, [Online]. Available: http://arxiv.org/abs/2403.10228
K. Kahatapitiya, A. Arnab, A. Nagrani, and M. S. Ryoo, “VicTR: Video-conditioned Text Representations for Activity Recognition,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 18547–18558, 2024, doi: 10.1109/CVPR52733.2024.01755.
M. Kim, H. B. Kim, J. Moon, J. Choi, and S. T. Kim, “Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval,” Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 13894–13904, 2024, doi: 10.1109/CVPR52733.2024.01318.
X. Yang, L. Zhu, X. Wang, and Y. Yang, “DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval,” Proc. AAAI Conf. Artif. Intell., vol. 38, no. 7, pp. 6540–6548, 2024, doi: 10.1609/aaai.v38i7.28475.
M. Cao, H. Tang, J. Huang, P. Jin, C. Zhang, and R. Liu, “RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter,” Findings Assoc. Comput. Linguist. ACL 2024, 2024.
D. Luo, J. Huang, S. Gong, H. Jin, and Y. Liu, “Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models,” pp. 5464–5473, 2024.
O. Thawakar et al., “Composed Video Retrieval via Enriched Context and Discriminative Embeddings,” pp. 26896–26906, 2024.
L. Ventura, A. Yang, C. Schmid, and G. Varol, “CoVR: Learning Composed Video Retrieval from Web Video Captions,” Proc. AAAI Conf. Artif. Intell., vol. 38, no. 6, pp. 5270–5279, 2024, doi: 10.1609/aaai.v38i6.28334.
K. Tian et al., “Holistic Features are almost Sufficient for Text-to-Video Retrieval,” pp. 17138–17147, 2024.
K. Li et al., “VideoChat: Chat-Centric Video Understanding,” pp. 1–16, 2023, [Online]. Available: http://arxiv.org/abs/2305.06355
Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai, “PandaGPT: One Model To Instruction-Follow Them All,” Proc. 1st Work. Taming Large Lang. Model. Control. Era Interact. Assist. TLLM 2023, pp. 11–23, 2023.
License
Copyright (c) 2025 Jurnal Informatika dan Teknik Elektro Terapan

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.