The platform can process six major types of heterogeneous data sources:
- Documents: PDF (paragraph and table extraction), Word (conversion that preserves formatting), TXT (automatic encoding detection)
- Images: JPG, PNG, and other common formats, with OCR text recognition and metadata extraction
- Audio: MP3, WAV, and similar formats, with automatically generated timestamped subtitles (e.g., in "00:01-opener" format)
- Video: MP4, MOV, and similar formats, extracting visual frame information (thumbnails) and converting speech to text at the same time
- Web pages: support for dynamically rendered pages, form submissions, infinite-scroll (waterfall) loading, and other complex structures
- API data: direct parsing of JSON/XML responses, with support for custom field mapping
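As a rough illustration of how the file-based categories above could be told apart, the following minimal sketch routes a file name to a category by its extension. The category names, the `EXTENSION_CATEGORIES` table, and the `classify_source` function are hypothetical and not part of Supametas.AI; web pages and API responses are not files and would need a different check (for example, the response content type), which is not shown here.

```python
from pathlib import PurePosixPath

# Hypothetical lookup table: the extensions mirror the formats named in the
# list above (plus the usual .doc/.docx and .jpeg variants); the category
# labels are assumptions made for this sketch.
EXTENSION_CATEGORIES = {
    ".pdf": "document", ".doc": "document", ".docx": "document", ".txt": "document",
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".mp3": "audio", ".wav": "audio",
    ".mp4": "video", ".mov": "video",
}


def classify_source(name: str) -> str:
    """Return the assumed category for a file name, or "unknown"."""
    # Compare extensions case-insensitively so "REPORT.PDF" matches ".pdf".
    suffix = PurePosixPath(name).suffix.lower()
    return EXTENSION_CATEGORIES.get(suffix, "unknown")


if __name__ == "__main__":
    for example in ("contract.PDF", "interview.wav", "demo.mov", "notes"):
        print(example, "->", classify_source(example))
```

A table lookup keeps the routing rule in one place, so adding another format means adding one entry rather than another branch.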
For large files, the platform uses segmented (chunked) processing:
- The Basic Edition supports single files of up to 200MB
- The Enterprise Edition can handle 4K video files over 500MB and legal documents running to hundreds of pages
- Oversized files are automatically processed in chunks; a progress bar shows the processing status, and interrupted transfers can be resumed
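The chunking and resume behavior described above can be sketched in general terms as follows. This is an assumption-laden illustration, not the platform's implementation: the 50MB default chunk size and the names `plan_chunks`, `remaining_chunks`, and `progress` are all hypothetical. The idea is to split a file into fixed-size byte ranges, remember which chunk indexes have finished, and on restart process only the ones still missing.

```python
from typing import Iterator, List, Set, Tuple

MB = 1024 * 1024
Chunk = Tuple[int, int, int]  # (index, start_byte, end_byte), end exclusive


def plan_chunks(file_size: int, chunk_size: int = 50 * MB) -> Iterator[Chunk]:
    """Yield consecutive byte ranges that together cover the whole file."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for index, start in enumerate(range(0, file_size, chunk_size)):
        # The last chunk is shorter when the size is not an exact multiple.
        yield index, start, min(start + chunk_size, file_size)


def remaining_chunks(file_size: int, done: Set[int],
                     chunk_size: int = 50 * MB) -> List[Chunk]:
    """Return the chunks still to be processed after an interruption."""
    return [c for c in plan_chunks(file_size, chunk_size) if c[0] not in done]


def progress(file_size: int, done: Set[int], chunk_size: int = 50 * MB) -> float:
    """Fraction of chunks completed, suitable for driving a progress bar."""
    total = -(-file_size // chunk_size)  # ceiling division
    return len(done) / total if total else 1.0


if __name__ == "__main__":
    size = 120 * MB
    finished = {0, 1}  # indexes recorded before the interruption
    print(remaining_chunks(size, finished))
    print(f"{progress(size, finished):.0%} complete")
```

Tracking completed chunk indexes rather than a single byte offset lets chunks finish out of order, which matters if they are processed in parallel.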
Note that audio and video processing consumes more tokens, so connecting an external model (e.g., OpenAI's Whisper) is recommended to improve efficiency. For sensitive data, the upcoming Docker-based private deployment version will provide fully offline processing.
This answer comes from the article "Supametas.AI: Extracting Unstructured Data into LLM Highly Available Data".