Date: 2023-12-14

From timid beginnings to the launch of Gemini: Google's breakthrough in the era of generative AI. Let's explore what sets the multimodal chatbot apart.

Not even a decade ago, tiny bits of machine learning quietly crept into the digital lives of all of us.

We are mostly talking about small "tricks", such as subject detection in the camera viewfinder or sentence suggestions of questionable usefulness. Today, as generative artificial intelligence approaches a peak, the buzz around it is growing ever louder. It is in this scenario that Google raises the bar with its new "multimodal" model called Gemini. Google launched Gemini on December 6, 2023, offering it in three sizes: Ultra, the most powerful, which for now is being held back from widespread commercial use; Pro; and Nano, the last of these designed to run on mobile devices.

In recent years, the search giant has struggled to respond to the hype around OpenAI and GPT, and to the potential threat that AI-powered services pose to its core business. Because such services can draw on a huge amount of information from the Internet, users could get the answers they need with a single question on a single web page and, above all, more easily and quickly than with a Google search.

That prospect is a worry in Mountain View, especially given how many eyes could slip past the adverts, for which customers pay considerable sums.

Between myths and false gods

To date, Large Language Models, or LLMs, have worked by analyzing input in one medium and extending it into output in a given media format. For example, OpenAI's Generative Pre-trained Transformer, or GPT, model handles text-to-text exchanges, while DALL-E translates text prompts into images.

Each model was tuned for one type of input and one type of output. This is where multimodality comes in: Gemini can receive text (including code), images, video, and audio and, with some direction, return something new in any of these formats. In other words, a multimodal LLM can theoretically perform the tasks of several dedicated single-purpose models. Google's demonstration video gives some idea of how refined interactions can be with a decently trained model of this type. A word of warning is in order, however: the video, and above all its slick editing, can easily mislead.

In reality, none of these interactions happens as quickly as it appears on screen. As Google itself admitted, the video demonstration was not performed in real time with voice prompts; instead, still frames from the raw footage were used, and text prompts were then supplied, to which Gemini responded. The intent was to showcase Gemini's multimodal capabilities, including its native ability to handle spoken conversational prompts based on image recognition. That would set Google's offering substantially apart from other chatbots.

What is unique is the prospect it holds out: a person having a fluid voice conversation with Gemini while it observes their surroundings and responds in real time to what is happening.