GLM-4.1V-Thinking: an open-source visual reasoning model for complex multimodal tasks
GLM-4.1V-Thinking is an open-source vision-language model developed by the KEG Lab at Tsinghua University (THUDM), focused on multimodal reasoning. Built on the GLM-4-9B-0414 base model, GLM-4.1V-Thinking uses reinforcement learning and a chain-of-thought reasoning mechanism to...
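A minimal sketch of querying such a reasoning model through an OpenAI-compatible endpoint (for example, one exposed by a local inference server); the URL, port, and model identifier below are assumptions for illustration, not the official API.

```python
from openai import OpenAI

# Assumed local OpenAI-compatible server (e.g. vLLM); adjust base_url as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="GLM-4.1V-9B-Thinking",  # assumed model identifier on the local server
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "What trend does this chart show? Explain your reasoning."},
            ],
        }
    ],
)

# A "thinking" model typically emits intermediate chain-of-thought steps
# before the final answer in the returned message.
print(response.choices[0].message.content)
```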
VideoMind
VideoMind is an open-source multimodal AI tool focused on reasoning, question answering, and summarization for long videos. It was developed by Ye Liu of The Hong Kong Polytechnic University together with a team from Show Lab at the National University of Singapore. The tool mimics how humans understand video by splitting the task into planning, localization, checking...
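An illustrative sketch (not VideoMind's actual code) of this kind of role-based decomposition, with stubbed planner, grounder (localization), verifier (checking), and answerer steps; all function names and return values here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Moment:
    start_s: float
    end_s: float

def planner(question: str) -> list[str]:
    # Decide which roles this question needs (assumed heuristic).
    return ["grounder", "verifier", "answerer"]

def grounder(video_path: str, question: str) -> Moment:
    # Localize the video segment most relevant to the question (stubbed).
    return Moment(start_s=12.0, end_s=27.5)

def verifier(video_path: str, moment: Moment, question: str) -> bool:
    # Check that the proposed segment actually supports an answer (stubbed).
    return True

def answerer(video_path: str, moment: Moment, question: str) -> str:
    # Answer using only the verified segment (stubbed).
    return f"Answer based on {moment.start_s:.1f}s-{moment.end_s:.1f}s of the video."

def run_pipeline(video_path: str, question: str) -> str:
    steps = planner(question)
    moment = grounder(video_path, question) if "grounder" in steps else None
    if moment and "verifier" in steps and not verifier(video_path, moment, question):
        moment = None  # fall back to reasoning over the whole video
    return answerer(video_path, moment or Moment(0.0, 0.0), question)

print(run_pipeline("demo.mp4", "When does the speaker mention the results?"))
```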
DeepSeek-VL2
DeepSeek-VL2 is a series of advanced Mixture-of-Experts (MoE) vision-language models that significantly improve on its predecessor, DeepSeek-VL. The models excel at tasks such as visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. De...
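A minimal sketch of the Mixture-of-Experts idea behind such models (a deliberate simplification, not DeepSeek-VL2's actual architecture): a router scores each token and only the top-k experts process it, so parameter count can grow without every parameter being active per token.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, dim: int = 64, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim); route each token to its top-k experts.
        scores = self.router(x).softmax(dim=-1)         # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(8, 64)
print(TinyMoELayer()(tokens).shape)  # torch.Size([8, 64])
```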