Independent Development Diary 24: I Built an AI Comic Drama App

This week has been a bit intense. I can now go two or three days without leaving the house, mainly because so many new things have come out recently. Today just happened to have nothing new.

Confirmo

You may remember that I open-sourced an R2 uploader a while ago. Its icon was a mean-looking owl, so I turned it into a sprite sheet and submitted it to Confirmo, a tool that monitors the status of your AI coding sessions. Confirmo was built by an expert on Twitter, but in practice its reminders are not very accurate. My guess is that there is currently no reliable way to detect when a task has finished.

OpenClaw

This project has gone by three names: clawdbot, then moltbot, then openclaw. It is so absurd that anyone who bought domain names to speculate on the keywords has probably lost money. To set my prejudices aside, I installed it on my Mac mini to see how usable it really is.

It has a web interface, which I exposed at a public address through Cloudflare Tunnel and Cloudflare Zero Trust. Cloudflare Tunnel binds a local port to a domain name. On top of that, I created an allowlist in Cloudflare Zero Trust, so the page opens only after verification with a code sent to an email address on that list. On Telegram I also set up an allowlist bound only to my own account, so security is basically not a concern.

Over the past few days I have been looking at what other people do with it. Many are chasing traffic or following the trend: after installing it, they ask it about the weather or have it post news every day. Maybe it is because I don't care much about news, but to me it all looks like bandwagoning.

Back to productivity. In the AI era your output goes up and one person can match a team, but your attention does not scale with it. For example, when I run two different projects at the same time, the cost of mental context switching is already high. I cannot take on any more.

Since it is positioned as a personal assistant, I want it to help me manage things and make decisions. I have used many such products and built similar ones myself, but none of them satisfied me. So I wrote a skill to link up and collect data, let OpenClaw connect to the services I configured, and then got what I wanted done through conversation.

Daily Operations Report

Thanks to the Posthog membership provided by lennysbundle, I now report all of my event tracking data to it, along with error logs and some management info. Then I asked OpenClaw to configure my work-skills (I have to set up the API key on the computer myself), pull the Posthog data on its own, and send me a daily report.

The biggest difference from before is that I no longer need to change a script through AI coding and manage cron myself. If I have a question about the report, I can just ask, and I can chat with it to update the report.

Automatic Bug Fixing

After I configured Github, I asked it to pull the previous day's Posthog errors, analyze the cause, fix the code, review the fix, run the tests, create a PR, and then message me.

I will watch the quality for a few days first. If it is good, I can widen the scope of what it is allowed to fix.

I haven't had time to look into other uses yet. The one I keep seeing is letting the AI trade on its own. I have a similar project, but that kind of thing takes time.

StudyThai

On to the project. With the Level 3 course launched, I am starting on the app. I spent a long time researching Tauri before and ran into so many problems that I wanted to cry. If it is just wrapping a webview, why make it so complicated? So I wrote it directly in Flutter, and I was moved to tears: it was packaged in less than half an hour and I could start making changes.

The work splits into a few parts: first making the title bar and tabs native, then making some features such as TTS and recording native, and finally a unified UI refresh across the whole app.

The UI design was done in Google Stitch. I asked Claude Code to go through all of our current pages, read the code to write Google Stitch prompts, and then call it through MCP to generate the designs. I fine-tuned them in Google Stitch, downloaded everything locally, and rebuilt the UI from the HTML files.

Many details may not be that good, but it already looks more designed than the current UI. Still, the rebuild ran into plenty of problems. For example, the desktop design is clearly different from the web and app versions, because the app uses a native title bar, and the desktop and web pages also differ from each other.

There are also problems with communication between web pages and native code, and with the app's page stack. All of this needs testing, which is quite tedious.

AI-drama

This project is also a lot of fun. There is a fairly well-known team on Twitter. I was interested, so I applied. We chatted for a while and they gave me a test assignment, roughly to see whether I can do the work and what my code looks like: build an AI comic drama.

I have been around long enough not to work for free, so I said I would not hand over the core code. They could review the front-end code to assess my overall code and project management skills. They agreed and gave me an image API key for testing.

I am also interested in this topic myself. For one thing, I have been curious about how it is implemented. For another, the completion rate of the AI reading feature in StudyThai is a bit low, so I have been thinking about expanding this area by turning the words you learn into scripted videos, articles, comics, and so on. So I used this occasion to spend some time researching it.

It is indeed more complicated than I expected. The pipeline from prompt to script to storyboard to video is not hard in itself; it is just a matter of integration and debugging. The hard part is the business logic: the implementation plan keeps changing, and that is where most of the time goes.

Below are some problems I ran into. If you are working on the same thing, they may be a useful reference, and if you have a better solution, please tell me.

Character consistency: each character needs images from several angles, and those images have to stay consistent. (Solution: pass in reference images and adjust the prompt so the generated angle views stay consistent.)

Incoherent storyboards (VGoT 5D): one approach is to attach extra information to each storyboard shot, such as character dynamics, background environment, opening and closing descriptions, character relationships, camera information, and lighting information. This makes the resulting videos more coherent.
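One way to picture the extra per-shot information listed above is as a small record that gets rendered into the video prompt. This is only a minimal sketch: the `Shot` class, its field names, and `to_prompt` are hypothetical names chosen for illustration, not part of VGoT or any tool mentioned in this post.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """Extra context for one storyboard shot (hypothetical field names)."""
    description: str         # what happens in the shot
    character_dynamics: str  # how the characters move or act
    background: str          # background environment
    opening: str             # how the shot begins
    closing: str             # how the shot ends
    relationships: str       # relationships between the characters
    camera: str              # lens / camera information
    lighting: str            # lighting information

    def to_prompt(self) -> str:
        """Render every non-empty field as a labeled line."""
        parts = [
            ("Shot", self.description),
            ("Character dynamics", self.character_dynamics),
            ("Background", self.background),
            ("Opening", self.opening),
            ("Closing", self.closing),
            ("Relationships", self.relationships),
            ("Camera", self.camera),
            ("Lighting", self.lighting),
        ]
        return "\n".join(f"{label}: {value}" for label, value in parts if value)
```

Keeping the fields separate, instead of one free-form paragraph, makes it easier to carry the same background or lighting text over from one shot to the next.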

Incoherent video: use a video generation model that supports first and last frames, such as the veo3.1 I am currently using. The image of the current storyboard shot is the first frame and the image of the next shot is the last frame, so each clip leads into the next one.
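The first/last-frame idea comes down to pairing each shot image with the one that follows it. Here is a minimal sketch of just that pairing step. The function name `chain_frames` and the file names are hypothetical, and the actual call to the video model is left out because its API is not described here.

```python
from typing import List, Optional, Tuple


def chain_frames(shot_images: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each shot image with the next one as (first_frame, last_frame).

    The final shot has no following image, so its last frame is None.
    """
    pairs = []
    for i, first in enumerate(shot_images):
        last = shot_images[i + 1] if i + 1 < len(shot_images) else None
        pairs.append((first, last))
    return pairs


# Example with hypothetical file names:
for first, last in chain_frames(["shot1.png", "shot2.png", "shot3.png"]):
    print(first, "->", last)
```

Each pair would then go to a first/last-frame video generation request, with the last clip generated from a first frame only.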

The hardest problem is matching the video to the dialogue, because you cannot predict what the generated video will contain.

veo3.1 has a very strict review mechanism. For example, if you give it lines of dialogue, audio filtering may be triggered and the video cannot be generated. So I tried just making the character's mouth move, but unfortunately that also triggered restrictions related to the mouth. The approaches I looked up all modify the lips afterward to create the effect of speaking. I didn't study them in detail, and they seemed rather troublesome.

For now I describe expressions in natural language without mentioning the mouth, so it is not obvious that the character is speaking; you only see the expression. Sometimes there are also subtitles and English lines, which veo adds on its own.

Then I went looking for professional prompts and, unexpectedly, found a good one in the official PureACT tutorial. It is divided into about 7 sections, structured as JSON, and then submitted to veo. The overall story description helps, but it is the dialogue section that solves the problem.

However, our video and audio still do not line up. We may ask for the protagonist to speak from 0.5s to 2.5s in the video, but the video does not necessarily follow our instructions. The audio then sits at a different time, and the two do not match.

Then I tried using elevenlabs to get word-level timestamps and pass them to the video generation. After trying for a long time, I found that elevenlabs' support for Chinese is too poor. Then I found Microsoft's edge tts for voice generation. It has dedicated Chinese voices, and the results were pretty good.

Finally, I used ffmpeg to strip the sound from the video and merge my audio in. However, the timing did not match. The result was that the English in the video and the Chinese in the audio did not line up, so I adjusted the prompt to hint at the language. But there is currently no good way to make it stick to our dialogue timeline. You have to keep re-rolling generations, and each roll is not cheap.
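For reference, the strip-and-merge step can be sketched as a single ffmpeg call wrapped in Python. This is a minimal sketch with hypothetical file names, not the exact command used here. It keeps only the video stream from the first input and the audio stream from the second, so the generated soundtrack is dropped. It does not fix the timing mismatch described above.

```python
import subprocess
from typing import List


def build_replace_audio_cmd(video: str, audio: str, output: str) -> List[str]:
    """Build an ffmpeg command that swaps a video's audio track."""
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", video,         # input 0: generated video (its audio is discarded)
        "-i", audio,         # input 1: TTS audio
        "-map", "0:v:0",     # take the video stream from input 0
        "-map", "1:a:0",     # take the audio stream from input 1
        "-c:v", "copy",      # do not re-encode the video
        "-c:a", "aac",       # encode the audio as AAC for an mp4 container
        "-shortest",         # stop at the end of the shorter stream
        output,
    ]


def replace_audio(video: str, audio: str, output: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_replace_audio_cmd(video, audio, output), check=True)
```

Calling `replace_audio("shot1.mp4", "shot1_tts.mp3", "shot1_final.mp4")` requires ffmpeg on the PATH. Because of `-shortest`, whichever stream is longer gets cut off.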

At present, AI comic dramas have not yet reached the point where they can be used in products. I am very curious how other people do it. Do they re-roll the video and audio themselves and then handle the rest in post-production editing?

Finally

Having said all that, here are a couple of tips.

If you have a Google account and can use Gemini CLI, you can create a translation skill in Claude Code that calls Gemini. I translate a lot of things this way now, so as not to waste Claude Code tokens.

The other tip is to set up a code review skill that uses Codex. Codex's code review is generally better than Claude Code's, so I have Claude Code call it as well.