Trackers is an open source Python library focused on multi-object tracking in video. It integrates several leading tracking algorithms, such as SORT and DeepSORT, and lets users pair them with different object detection models (e.g., YOLO, RT-DETR) for flexible video analysis. Users can easily...
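Under the hood, libraries like this follow the tracking-by-detection pattern: a detector proposes boxes on each frame, and the tracker associates them with existing tracks. The sketch below illustrates that association step with a simple greedy IoU matcher; it is a conceptual example only, and the names (Track, GreedyIouTracker) are illustrative rather than the Trackers library's actual API.

```python
# Conceptual tracking-by-detection sketch (illustrative only, not the Trackers API):
# each frame, match new detections to existing tracks by IoU, then spawn/expire tracks.
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    box: tuple        # (x1, y1, x2, y2) of the last matched detection
    misses: int = 0   # consecutive frames without a match

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

class GreedyIouTracker:
    def __init__(self, iou_threshold=0.3, max_misses=30):
        self.iou_threshold, self.max_misses = iou_threshold, max_misses
        self.tracks, self.next_id = [], 1

    def update(self, detections):
        """detections: list of (x1, y1, x2, y2) boxes from any detector (YOLO, RT-DETR, ...)."""
        unmatched = list(detections)
        for track in self.tracks:
            best = max(unmatched, key=lambda d: iou(track.box, d), default=None)
            if best is not None and iou(track.box, best) >= self.iou_threshold:
                track.box, track.misses = best, 0
                unmatched.remove(best)
            else:
                track.misses += 1
        for det in unmatched:                       # leftover detections start new tracks
            self.tracks.append(Track(self.next_id, det))
            self.next_id += 1
        self.tracks = [t for t in self.tracks if t.misses <= self.max_misses]
        return self.tracks
```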
Describe Anything is an open source project developed by NVIDIA and several universities, with the Describe Anything Model (DAM) at its core. The tool generates detailed descriptions of regions that the user marks in an image or video (such as points, boxes, scribbles, or masks). It does not ...
Find My Kids is an open source project hosted on GitHub and created by developer Tomer Klein. It combines DeepFace face recognition technology with the WhatsApp Green API, and is designed to help parents monitor their children's WhatsApp groups through...
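The face-matching half of such a pipeline can be sketched with DeepFace's verify call. In the sketch below, the WhatsApp side (fetching group images via the Green API) is stubbed out, and the file paths and overall structure are assumptions rather than the project's actual code.

```python
# Sketch of the face-matching step using DeepFace; the WhatsApp fetching is stubbed out.
from deepface import DeepFace

def fetch_group_images():
    """Placeholder: in the real system these images would come from the WhatsApp Green API."""
    return ["downloads/group_photo_1.jpg", "downloads/group_photo_2.jpg"]

reference_photo = "known_faces/my_child.jpg"   # parent-supplied reference image (assumed path)

for image_path in fetch_group_images():
    try:
        result = DeepFace.verify(img1_path=reference_photo, img2_path=image_path)
    except ValueError:
        continue  # DeepFace raises if no face is detected in an image
    if result["verified"]:
        print(f"Possible match for the child found in {image_path}")
```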
YOLOE is an open source project developed by the Multimedia Intelligence Group (THU-MIG) at the Tsinghua University School of Software, with the full name "You Only Look Once Eye". Built on the PyTorch framework, it extends the YOLO series and can detect and segment arbitrary objects in real time. The project is hosted on GitHu...
SegAnyMo is an open source project developed by a team of researchers at UC Berkeley and Peking University, including members such as Nan Huang. This tool focuses on video processing and can automatically recognize and segment arbitrary moving objects in a video, such as people, animals or vehicles. It combines TAPNet, DINO...
RF-DETR is an open source object detection model developed by the Roboflow team. It is based on the Transformer architecture, and its core feature is real-time efficiency: it is the first real-time model to exceed 60 AP on the Microsoft COCO dataset, and it also performs well on the RF100-VL benchmark...
HumanOmni is an open source large multimodal model developed by the HumanMLLM team and hosted on GitHub. It focuses on analyzing human-centered video and processes both visuals and audio to help understand emotions, actions, and conversational content. The project used 2.4 million human-centered video clips and 14 million .....
Vision Agent is an open source project developed by LandingAI (Andrew Ng's team) and hosted on GitHub to help users quickly generate code for computer vision tasks. It uses an advanced agent framework and multimodal models to build efficient vision AI agents from simple prompts...
Make Sense is a free online image annotation tool designed to help users quickly prepare datasets for computer vision projects. It requires no installation: it runs in the browser, works across operating systems, and is well suited to small deep learning projects. Users can use it to add images to...
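Annotations drawn in Make Sense can be exported in common formats such as YOLO. Assuming a YOLO-format export (one normalized "class cx cy w h" line per box), a downstream training script might read the labels back roughly like this; the file path here is purely illustrative.

```python
# Reading a YOLO-format label file as exported by an annotation tool
# (one "class cx cy w h" line per box, values normalized to [0, 1]).
def load_yolo_labels(label_path, image_width, image_height):
    boxes = []
    with open(label_path) as f:
        for line in f:
            cls, cx, cy, w, h = line.split()
            cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
            x1 = (cx - w / 2) * image_width    # convert center/size to pixel corners
            y1 = (cy - h / 2) * image_height
            x2 = (cx + w / 2) * image_width
            y2 = (cy + h / 2) * image_height
            boxes.append((int(cls), x1, y1, x2, y2))
    return boxes

# Example (assuming the exported file exists): labels for a 640x480 image
print(load_yolo_labels("labels/img_001.txt", 640, 480))
```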
YOLOv12 is an open source project developed by GitHub user sunsmarterjie, focused on real-time object detection. Built on the YOLO (You Only Look Once) family of frameworks, it introduces an attention mechanism to improve on traditional convolutional neural networks (CNNs), not only ...
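As a rough illustration of what "adding attention on top of a CNN" means, the block below runs multi-head self-attention over the spatial positions of a convolutional feature map. It is a generic PyTorch sketch, not YOLOv12's actual attention module.

```python
# Generic illustration of mixing self-attention into a CNN feature map (not YOLOv12's module).
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.conv(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C): one token per spatial position
        q = self.norm(tokens)
        attended, _ = self.attn(q, q, q)        # every position attends to every other
        tokens = tokens + attended              # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)

features = torch.randn(1, 64, 32, 32)           # dummy feature map
print(ConvAttentionBlock(64)(features).shape)   # torch.Size([1, 64, 32, 32])
```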
VLM-R1 is an open source vision-language model project developed by Om AI Lab and hosted on GitHub. It applies DeepSeek's R1 approach to the Qwen2.5-VL model, significantly improving it through reinforcement learning (R1) and supervised fine-tuning (SFT) techniques in...
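For grounding-style tasks, R1-style training typically relies on a simple verifiable reward rather than a learned reward model. The snippet below shows one such reward based on box IoU; it is illustrative only, and VLM-R1's actual reward functions and answer formats may differ.

```python
# Illustrative verifiable reward for R1-style RL on a visual grounding task:
# the predicted box earns reward 1.0 if it overlaps the ground truth enough, else 0.0.
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(predicted_box, ground_truth_box, iou_threshold=0.5):
    return 1.0 if box_iou(predicted_box, ground_truth_box) >= iou_threshold else 0.0

print(grounding_reward((10, 10, 100, 100), (12, 8, 105, 98)))  # 1.0
```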
HealthGPT is a state-of-the-art medical large vision-language model designed to provide unified medical visual understanding and generation through heterogeneous knowledge adaptation. The project aims to integrate medical visual understanding and generation into a single autoregressive framework, significantly improving the efficiency and accuracy of medical image processing...
MedRAX is a state-of-the-art AI agent designed specifically for chest X-ray (CXR) analysis. It integrates leading CXR analysis tools with multimodal large language models to handle complex medical queries dynamically, without additional training. Through its modular design and strong technical foundation, MedRAX provides a...
Agentic Object Detection is an advanced object detection tool from Landing AI. It greatly simplifies traditional object detection by detecting from text prompts, with no data labeling or model training required. Users simply upload an image and enter a detection prompt, and the AI agent can .....
CogVLM2 is an open source multimodal model developed by the Tsinghua University Data Mining Research Group (THUDM). Based on the Llama3-8B architecture, it is designed to deliver performance comparable to, or better than, GPT-4V. The model supports image understanding, multi-turn conversation, and video understanding, can process content up to 8K in length, and supports...
Gaze-LLE is a gaze target prediction tool built on large-scale learned encoders. Developed by Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, and James M. Rehg, it is designed to use pre-trained visual...
Video Analyzer is a comprehensive video analysis tool that combines computer vision, audio transcription, and natural language processing to produce detailed descriptions of video content. It works by extracting key frames from the video, transcribing the audio content, and...
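The two extraction stages can be sketched in a few lines with OpenCV and Whisper. This is a conceptual pipeline rather than Video Analyzer's own code; the sampling interval, model choice, and file path are assumptions.

```python
# Conceptual sketch of the extraction steps such a tool performs:
# sample frames from the video with OpenCV and transcribe the audio with Whisper.
import cv2
import whisper

def extract_frames(video_path, every_n_seconds=5):
    """Grab one frame every few seconds as simple 'key frames'."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % int(fps * every_n_seconds) == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

video = "input/talk.mp4"                                   # assumed example path
frames = extract_frames(video)
transcript = whisper.load_model("base").transcribe(video)["text"]
print(f"{len(frames)} frames sampled; transcript starts: {transcript[:80]!r}")
# The sampled frames and transcript would then be passed to a language model for the summary.
```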
Twelve Labs is a multimodal AI company focused on video understanding, dedicated to helping users understand and process large amounts of video content through advanced AI technologies. Its core technologies include video search, generation, and embedding, which are able to extract key features from video such as actions, objects, on-screen text, speech, and character...