Easy-GPT4o - reproduce GPT-4o in less than 200 lines of code
Talk is Cheap, here is the code:
https://github.com/Chivier/easy-gpt4o

# Is GPT-4o the Solution We Need?
The new OpenAI GPT-4o API has certainly caught everyone's attention. With its impressive capabilities, one might wonder whether it truly meets our requirements. Upon closer examination, however, GPT-4o doesn't seem to offer any significant advantage over its predecessor, GPT-4. In fact, I believe I can build a GPT-4o-like system using the powerful combination of GPT-4 Turbo and other OpenAI APIs. And the best part? I managed to do it in less than 200 lines of code!
# How to Make It Happen
First, let's take a closer look at what GPT-4o is capable of. During the recent OpenAI Spring press conference, GPT-4o showcased its remarkable ability to interact with video in real time. To replicate this functionality, we can take a divide-and-conquer approach and break the complex problem down into smaller, manageable parts:
- Understanding the content of the video
- Comprehending questions and audio from the video file
- Facilitating interaction and engaging in a chat with the video file
- Reading the generated answer and converting it into audio
By addressing each of these smaller problems, we can achieve a GPT-4o-like system using the combination of GPT-4 Turbo and other OpenAI APIs.
Here is the flow graph:
Let's finish those parts one by one.

## Generating Clips from Video
To tackle the task of generating clips from videos, we can leverage OpenAI's Vision understanding API. One straightforward approach is to generate descriptions frame by frame and then combine them into a comprehensive text with detailed information.
To accomplish this, I turned to the mighty GPT-4 Turbo itself and prompted it to write the clip-generation code for me.
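As a rough illustration of the frame-sampling step involved, here is a minimal sketch, assuming OpenCV (`cv2`) and a sampling interval I chose arbitrarily; the repository's actual code may differ:

```python
import base64

import cv2  # pip install opencv-python


def sample_frames(video_path: str, interval_seconds: float = 2.0) -> list[str]:
    """Grab one frame every `interval_seconds` and return them as base64-encoded JPEGs."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(int(fps * interval_seconds), 1)

    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            encoded_ok, buffer = cv2.imencode(".jpg", frame)
            if encoded_ok:
                frames.append(base64.b64encode(buffer.tobytes()).decode("utf-8"))
        index += 1

    capture.release()
    return frames
```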
And now, I have the first part of my code ready. The next step is to generate descriptions for each image. However, due to GPT-4 Turbo's knowledge cutoff, it could not provide a direct answer here. To overcome this, I gave GPT-4 Turbo some OpenAI API references, so that it could better understand the desired output and produce more accurate results.
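For the image-description step itself, a call of roughly the following shape is involved; the model name, prompt wording, and `max_tokens` value here are illustrative rather than the repository's exact choices:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def describe_frame(base64_jpeg: str) -> str:
    """Ask a vision-capable GPT-4 model to describe a single video frame."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this video frame in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{base64_jpeg}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content
```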
So now, the vision part is solved (partially).

## Audio Transcription
OK, now let's focus on the audio part. I used a similar prompt to have GPT-4 Turbo help me build a transcription function.
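A minimal sketch of what such a transcription helper might look like with OpenAI's speech-to-text endpoint (the `whisper-1` model choice and function name are mine):

```python
from openai import OpenAI

client = OpenAI()


def transcribe_audio(audio_path: str) -> str:
    """Transcribe an audio (or small video) file with OpenAI's speech-to-text API."""
    with open(audio_path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    return transcription.text
```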
## Putting It All Together
Once we obtain the vision description text and the audio transcription, we can combine them. With a simple combining prompt, we can then have GPT-4 Turbo generate a comprehensive answer that incorporates both the visual and the auditory information. To add the finishing touch, we use a TTS (text-to-speech) model to generate the corresponding voice for the answer. Following this approach, we can create a seamless and immersive experience.
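Putting the pieces together, the final answer-and-voice step might look roughly like the sketch below; the system prompt, model names, and `alloy` voice are illustrative defaults, not necessarily what the repository uses:

```python
from openai import OpenAI

client = OpenAI()


def answer_and_speak(vision_text: str, audio_text: str, question: str,
                     out_path: str = "answer.mp3") -> str:
    """Combine the visual description and audio transcript, answer the question, then voice it."""
    answer = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": "You are watching a video. Use the visual description and the "
                        "audio transcript below to answer the user's question."},
            {"role": "user",
             "content": f"Visual description:\n{vision_text}\n\n"
                        f"Audio transcript:\n{audio_text}\n\nQuestion: {question}"},
        ],
    ).choices[0].message.content

    # Convert the textual answer into speech with OpenAI's TTS endpoint.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
    with open(out_path, "wb") as f:
        f.write(speech.content)
    return answer
```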
# The Hidden Efforts
At first glance, the story of this task may seem simple, as if it required minimal effort on my part. But the truth is, it took me more than 2 hours to complete.
You may wonder, how could a task that should have taken less than 10 minutes consume so much time?
Well, while GPT assisted me in generating most of the code, it wasn't always accurate. For instance, it struggled to comprehend the response result type in the "image-to-text" section, necessitating manual fixes on my part.
Moreover, OpenAI's speech-to-text API has certain limitations when it comes to file size. Sending small video files directly is feasible, but for larger files I first had to extract the audio track from the video using `ffmpeg`.
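For reference, a minimal sketch of that extraction step, calling `ffmpeg` through `subprocess`; the exact flags and codec are my own choice:

```python
import subprocess


def extract_audio(video_path: str, audio_path: str = "audio.mp3") -> str:
    """Strip the audio track out of a video so only the (much smaller) audio is uploaded."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "libmp3lame", audio_path],
        check=True,
    )
    return audio_path
```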
Additionally, there were numerous intricate debugging and testing tasks involved. Nevertheless, in the end, everything seemed to flow smoothly. I have updated the code and demo on GitHub for your reference:
https://github.com/Chivier/easy-gpt4o
# Roadmap for the Future
## Performance Enhancement
The objective of this project is to demonstrate one fundamental truth: GPT-4o may not possess the level of magnificence we have imagined; in reality, it is largely a repackaging of existing technologies. Leveraging its accumulated technical expertise, OpenAI has successfully mitigated the latency associated with multi-model collaboration. If we gradually replace these components, we could theoretically achieve a real-time end-to-end model ourselves.
Let us deconstruct the system into three modules, denoted Modules A, B, and C, as depicted in the diagram below. First, Modules A and B are independent, so these two parts can be executed completely in parallel. After extracting video frames, multiple CLIP models can be utilized to increase the parallel processing rate. Likewise, for audio transcription, a large file can be split into chunks that are transcribed in parallel and then merged.
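As a small sketch of that A/B parallelism, the two modules could be launched concurrently and joined afterwards. The helper below takes the module implementations as plain callables, since the concrete function names in the repository may differ:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def run_modules_in_parallel(
    describe_video: Callable[[str], str],    # Module A: video path -> textual description
    transcribe_audio: Callable[[str], str],  # Module B: audio path -> transcript
    video_path: str,
    audio_path: str,
) -> Tuple[str, str]:
    """Run the vision and audio modules concurrently instead of one after the other."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vision_future = pool.submit(describe_video, video_path)
        audio_future = pool.submit(transcribe_audio, audio_path)
        return vision_future.result(), audio_future.result()
```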
Another aspect that easily comes to mind concerns Module C. When converting the generated textual output into audio, we can employ OpenAI's streaming generation mode: as each sentence is generated, it can be incrementally sent to the TTS model for conversion and played back sequentially. Consequently, the time required for language-model inference can be hidden within the duration of audio playback.
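A rough sketch of that streaming idea with the OpenAI Python client, splitting the streamed answer at sentence boundaries; the model, voice, and writing each sentence to a file for playback are illustrative simplifications:

```python
from openai import OpenAI

client = OpenAI()


def _speak(text: str, index: int) -> None:
    """Convert one sentence to speech; a real app would queue these files for ordered playback."""
    audio = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
    with open(f"sentence_{index:03d}.mp3", "wb") as f:
        f.write(audio.content)


def speak_while_generating(prompt: str) -> None:
    """Stream the LLM answer and hand each finished sentence to the TTS model immediately."""
    stream = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    buffer, count = "", 0
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # As soon as a sentence boundary appears, ship that sentence to TTS.
        while any(p in buffer for p in ".!?"):
            cut = min(buffer.index(p) for p in ".!?" if p in buffer) + 1
            sentence, buffer = buffer[:cut].strip(), buffer[cut:]
            if sentence:
                _speak(sentence, count)
                count += 1
    if buffer.strip():  # flush any trailing text without end punctuation
        _speak(buffer.strip(), count)
```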
Let us reorganize our thoughts. An upgraded version of Easy-GPT4o may yield the following effects:
Theoretically, the minimum waiting time would be defined as:
\(\max\{\text{Inference Time}(\text{CLIP model}),\ \text{Inference Time}(\text{Audio Transcription model})\} + \text{Inference Time}(\text{LLM Prompt})\)
This encompasses the time required to process a single slice of audio/video, in addition to the inference time for a single prompt in the language model.
Thus, the entire workflow is left with only one remaining concern, which happens to be the sole limitation: total video length. In other words, our bottleneck lies in the interaction between the audio/video content sequence and the language model. The number of input tokens in the language model will become the solitary constraint.
## Improving Expressiveness
To be completely honest, my prior experience in the field of computer vision research was quite limited, with my only acquaintance being the CLIP model. The most flawed component of the entire project lies within Module A: the conversion of videos into textual descriptions. I warmly welcome any suggestions or solutions you may have to offer in order to enhance this aspect.
When it comes to video descriptions, my approach involved uniformly sampling frames at regular intervals, directly querying GPT-4 Turbo for a description of each selected image, and then concatenating these descriptions together. However, this design is far from optimal, particularly for continuously recorded videos. A more efficient strategy would be incremental questioning between adjacent images, such as "What unfolded between these two images?" or "Did you observe any new elements compared to the previous image?" This approach would significantly reduce the token consumption of video descriptions.
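To make the idea concrete, incremental questioning between adjacent frames could look roughly like the following sketch; the prompt wording, model name, and helper names are all illustrative:

```python
from openai import OpenAI

client = OpenAI()


def _ask(content: list) -> str:
    """Send one multimodal message and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": content}],
        max_tokens=200,
    )
    return response.choices[0].message.content


def describe_incrementally(frames: list[str]) -> list[str]:
    """frames are base64-encoded JPEGs: describe the first fully, then only what changed."""
    def image(b64: str) -> dict:
        return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

    descriptions = [_ask([{"type": "text", "text": "Describe this video frame in detail."},
                          image(frames[0])])]
    for previous, current in zip(frames, frames[1:]):
        descriptions.append(_ask([
            {"type": "text",
             "text": "These are two consecutive video frames. "
                     "Describe only what changed from the first to the second."},
            image(previous), image(current),
        ]))
    return descriptions
```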
Regarding audio descriptions, we unintentionally overlooked temporal and emotional information. To address this, we could utilize a more advanced transcription method to generate detailed transcriptions that incorporate both temporal and emotional cues. These transcriptions could then be seamlessly integrated with the video information.
Lastly, for the final module, it would be worth exploring solutions that offer greater flexibility. For instance, we could incorporate our own Text-to-Speech (TTS) model to generate speech with distinct vocal characteristics. This would add an extra layer of customization to the project.
# The Significance of this Project
This is just a toy project. I must admit that GPT-4o is an exceptionally user-friendly multimodal model that I have thoroughly enjoyed using, and it is the best multimodal model available. However, it is important to recognize that the "omni" implied by the "o" in GPT-4o is not as all-encompassing as it may appear. In terms of text generation in particular, the GPT-4 Turbo Preview model proves to be superior.
Therefore, it is crucial that we take a moment to reflect on whether we have truly maximized the potential of existing models. The examples above suggest that a product like GPT-4o could have been built much earlier. Personally, I make it a habit to browse GitHub Trending every morning, yet I often find the page cluttered with repetitive projects, including countless derivatives of Stable Diffusion. This monotony can be quite vexing and tedious.
Furthermore, during the development of this framework I made a flawed assumption: that the interaction between different modalities can be achieved solely through language. Language is inherently limited when it comes to conveying visual information, even though it is good at describing visual impressions and emotions. In Chinese there is a concept called "tonggan" (synesthesia), which describes associating one sensation with another, for instance feeling warmth from the color red. Sadly, this characteristic is often overlooked in multimodal models; pure text or image tokens are insufficient to capture such implicit information. By designing more sophisticated tokens, we could greatly enhance the capabilities of multimodal models.
Additionally, part of the significance of this project lies in exploring the possibility of building our own customized amalgamation model, similar to GPT-4o. The modules within such a model can be freely combined and substituted. For instance, we are free to use our own text-to-speech model, or, in certain specialized scenarios, to add customized requirements for extracting specific objects from the video content, thereby focusing on particular elements within the footage. Alternatively, we can integrate a model after transcription to provide multilingual translation. End-to-end functionality is, in reality, a double-edged sword: it offers low latency and impressive performance, but it simultaneously compromises our freedom to tailor functionality and achieve optimal results.
Lastly, extensive research on models with contextual understanding holds immense significance. Let's consider a simple calculation: if we were to use 500 tokens to describe a single video frame, our model could analyze it in great detail. However, this severely limits the duration of our videos. On the other hand, if we were to use only 100 tokens to describe a frame, our model may not possess the same level of visual acuity, but it would be able to observe changes over a longer time span. Introducing tokens that represent changes adds complexity to the program's behavior, thus impacting the overall performance of the process. If we wish to tackle multimodal challenges with existing technology, we cannot afford to overlook this issue.
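As a rough back-of-envelope check, assuming GPT-4 Turbo's 128,000-token context window and one sampled frame every two seconds (the sampling rate is my own assumption): at 500 tokens per frame, \(128000 / 500 = 256\) frames fit in the context, which covers only about 8.5 minutes of video, whereas at 100 tokens per frame, \(128000 / 100 = 1280\) frames fit, covering roughly 42 minutes.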
Fortunately, our research group has devised solutions to address these aforementioned problems. Currently, my focus is on developing strategies to reduce the latency of end-to-end inference. I invite you all to stay updated by following our repository for the latest progress and developments!
# Some Other Thoughts
## We need a Dataset! Let's build it together!
How to evaluate the generated voice-over for a video is itself a complex multimodal problem, and I think we need a better dataset and a benchmark for it. I really need help!