Alibaba Unveils Lightweight AI Model for Image and Video Processing on Mobile Devices

tech360.tv

3 days ago · 2 min read

Alibaba Group Holding has launched a new multimodal artificial intelligence model, Qwen2.5-Omni-7B, capable of processing text, images, audio and video directly on smartphones, tablets and laptops.

Credit: ALIBABA

The model, introduced on Thursday, is the latest addition to Alibaba’s Qwen family and is designed to run locally on devices with limited computing power. With only 7 billion parameters, it enables real-time responses in text or audio without requiring an internet connection.
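To give a sense of why a 7-billion-parameter model is a plausible fit for consumer hardware, the sketch below estimates how much memory the weights alone would occupy at common numeric precisions. The bytes-per-parameter values are standard storage sizes, not figures published by Alibaba, and the estimate ignores activations, caches and runtime overhead, so actual requirements would be higher.

```python
# Rough weight-memory estimate for a 7-billion-parameter model.
# Assumption: bytes-per-parameter values are generic storage sizes,
# not figures reported by Alibaba for Qwen2.5-Omni-7B.
PARAMS = 7_000_000_000

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(params: int, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, size in BYTES_PER_PARAM.items():
        print(f"{name}: {weight_memory_gb(PARAMS, size):.1f} GB")
```

Under these assumptions, the weights take roughly 14 GB at 16-bit precision and about 3.5 GB at 4-bit precision, which is why reduced-precision formats are commonly used when running models of this size on laptops and phones.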

Cartoon bear named Ethan talking, saying "I can explain PPTs, web materials, and more." Background shows blurred document and waveforms. — Credit: ALIBABA

Qwen2.5-Omni-7B is open-source and available on Hugging Face, Microsoft’s GitHub and Alibaba’s ModelScope. It is also integrated into Alibaba’s Qwen Chat.

Alibaba highlighted potential applications such as providing real-time audio descriptions for visually impaired users and offering cooking guidance by analysing ingredients. The model’s ability to handle multiple input types reflects growing demand for AI systems that extend beyond text generation.

In benchmark tests, Qwen2.5-Omni-7B scored 56.1 on OmniBench, outperforming Google’s Gemini-1.5-Pro, which scored 42.9. It also achieved 92.4 on the CV15 (Common Voice 15) audio benchmark, surpassing Alibaba’s earlier Qwen2-Audio model by one point.

For image-related tasks, the model scored 59.2 on the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark (MMMU), beating Alibaba’s Qwen2.5-VL vision-language model.

The release aligns with a broader industry trend toward efficient, multimodal AI models that prioritise portability and data privacy. These models can operate without cloud-based processing, reducing reliance on external servers.

Other tech firms are also advancing in this space. OpenAI recently added image generation to its GPT-4o model, while ByteDance introduced InfiniteYou, a tool that re-crafts images while preserving subjects’ identities. In January, DeepSeek released Janus-Pro, an updated version of its multimodal model.

Alibaba’s Qwen models have become popular among AI developers in mainland China, positioning the company as a key competitor to DeepSeek, the maker of the V3 and R1 models.

Alibaba launched Qwen2.5-Omni-7B, a multimodal AI model for mobile devices
The model processes text, images, audio and video locally without internet
It outperformed Google’s Gemini-1.5-Pro in benchmark tests

Source: SCMP