Google Open Sources Magika: AI-Powered File Identification Tool

Ravie LakshmananFeb 17, 2024Artificial Intelligence / Data Protection

Google has announced that it's open-sourcing Magika, an artificial intelligence (AI)-powered tool to identify file types, to help defenders accurately detect binary and textual file types.

"Magika outperforms conventional file identification methods providing an overall 30% accuracy boost and up to 95% higher precision on traditionally hard to identify, but potentially problematic content such as VBA, JavaScript, and Powershell," the company said.

The software uses a "custom, highly optimized deep-learning model" that enables the precise identification of file types within milliseconds. Magika implements inference functions using the Open Neural Network Exchange (ONNX).

Google said it internally uses Magika at scale to help improve users' safety by routing Gmail, Drive, and Safe Browsing files to the proper security and content policy scanners.

In November 2023, the tech giant unveiled RETVec (short for Resilient and Efficient Text Vectorizer), a multilingual text processing model to detect potentially harmful content such as spam and malicious emails in Gmail.

Amid an ongoing debate on the risks of the rapidly developing technology and its abuse by nation-state actors associated with Russia, China, Iran, and North Korea to boost their hacking efforts, Google said deploying AI at scale can strengthen digital security and "tilt the cybersecurity balance from attackers to defenders."

It also emphasized the need for a balanced regulatory approach to AI usage and adoption in order to avoid a future where attackers can innovate, but defenders are restrained due to AI governance choices.

"AI allows security professionals and defenders to scale their work in threat detection, malware analysis, vulnerability detection, vulnerability fixing and incident response," the tech giant's Phil Venables and Royal Hansen noted. "AI affords the best opportunity to upend the Defender's Dilemma, and tilt the scales of cyberspace to give defenders a decisive advantage over attackers."

That said, concerns have also been raised about generative AI models' use of web-scraped data for training purposes, which may also include personal data.

"If you don't know what your model is going to be used for, how can you ensure its downstream use will respect data protection and people's rights and freedoms?," the U.K. Information Commissioner's Office (ICO) pointed out last month.

What's more, new research has shown that large language models can function as "sleeper agents" that may be seemingly innocuous but can be programmed to engage in deceptive or malicious behavior when specific criteria are met or special instructions are provided.

"Such backdoor behavior can be made persistent so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it), researchers from AI startup Anthropic said in the study.

Found this article interesting? Follow us on Google News, Twitter and LinkedIn to read more exclusive content we post.