Sanvi

Building the Mainland China Version of StudyThai: A Week of Alibaba Cloud ECS Pitfalls

Two weeks ago I finished writing up my notes on moving StudyThai to Singapore. I was planning to take a couple of days off when I discovered that ICP filing cannot be avoided. Filing requires the server to be in mainland China; as far as the regulator is concerned, a setup whose main domain is hosted in Singapore simply does not exist. To be compliant in mainland China, I had to open a second ECS instance and run the same code again on Alibaba Cloud.

I thought this was a one-evening job. It took a week, at an average of one or two hours a day wrestling with Alibaba Cloud. I am writing it down so that I don't hit the same pitfalls when I open a new instance three months from now.

Why can’t I continue to use the Singapore one?

Put simply, the ICP filing number is the "admission ticket" to the mainland China ecosystem. Without one:

  • Can't integrate with WeChat mini programs (rejected outright at the domain verification step)

  • Can't get into the WebView entry point of any domestic app store

  • Can't run feed ads (Ocean Engine / Tencent Ads / Xiaohongshu)

  • Access from mainland China gets intermittently throttled by carriers: not necessarily down, but very slow and erratic

So I decided to keep the main domain in Singapore for international users, register a separate .com.cn domain to run the mainland-compliant version, and deploy that to Alibaba Cloud ECS. The branch is called studythai-cn.

The logic is simple. Then I started working, and the very first day of deployment brought me to my knees.

Monday night: Coolify can't pull its own helper image

The instance was up, Coolify was installed, and I pushed code to trigger a deployment. The progress indicator in the Coolify dashboard spun for 8 minutes and then turned red:

Preparing container with helper image: ghcr.io/coollabsio/coolify-helper:1.0.14

My first reaction was "Did I misspell my image name?" It took me a long time to realize that this line was not about my image at all. It refers to Coolify's own "foreman" image: on every deployment, Coolify first does a docker run of it and has it clone the code, run the build, and push the result. If the helper can't be pulled, the deployment hasn't started at all, and my code hasn't even been "tried".

I later worked out the root cause: Alibaba Cloud ECS reaches ghcr.io over international links, and packet loss is severe during the evening peak, so an image of several hundred MB can't be pulled within the default timeout. Unlike docker.io, ghcr.io has no official Alibaba Cloud mirror. It is a de facto blind spot for domestic ops: if something breaks, nobody will fix it for you.

The fix took some detours. First I edited /etc/docker/daemon.json, added registry-mirrors pointing at dockerproxy, restarted docker, and tried again. It still didn't work. After another half hour in the documentation I understood why: Docker's registry-mirrors only takes effect for docker.io. ghcr.io is a different registry host, so configuring a mirror does nothing for it. That stretch was a complete waste of time.

In the end I used the ghcr mirror run by Nanjing University: pull the image manually, then tag it under its ghcr.io/... name:

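docker pull ghcr.nju.edu.cn/coollabsio/coolify-helper:1.0.14
docker tag ghcr.nju.edu.cn/coollabsio/coolify-helper:1.0.14 ghcr.io/coollabsio/coolify-helper:1.0.14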

Docker looks for images locally first: once an image with that name exists locally (even though it was actually pulled from nju), Coolify's docker run hits the local cache and never touches the network.

That took 5 minutes to solve, but it is long-term debt. Every Coolify upgrade changes the helper version number, which invalidates the cache, so the whole procedure has to be repeated. I later added a cron script that pre-pulls the latest helper version every day, but fundamentally I am still racing against the fact that Coolify hardcodes the helper image.
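A minimal sketch of what such a pre-pull script could look like, assuming the NJU mirror keeps the same path layout as ghcr.io (as it did for the helper image above). The function names `mirror_ref` and `prepull` are illustrative, and working out which helper tag the next Coolify release will ask for is deliberately left out.

```shell
#!/bin/sh
# Sketch only: pre-pull a ghcr.io image through a mirror, then alias it
# under its original name so a later `docker run ghcr.io/...` finds it locally.
MIRROR_HOST="ghcr.nju.edu.cn"   # the mirror named in this post

# Map ghcr.io/<path> to <mirror>/<path>; leave any other reference unchanged.
mirror_ref() {
  case "$1" in
    ghcr.io/*) printf '%s/%s\n' "$MIRROR_HOST" "${1#ghcr.io/}" ;;
    *)         printf '%s\n' "$1" ;;
  esac
}

# Pull via the mirror, then tag the result with the original ghcr.io name.
prepull() {
  src=$(mirror_ref "$1")
  docker pull "$src" && docker tag "$src" "$1"
}
```

Called from cron as `prepull ghcr.io/coollabsio/coolify-helper:1.0.14`, this repeats the manual pull-and-tag step; the tag still has to be updated whenever Coolify changes it.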

At the end of the first day I thought I was done.

Tuesday: The application image gets stuck too, and the previous trick doesn't work

The next day I triggered the deployment again and got a new failure line:

Pulling image: ghcr.io/sanvibyfish/studythai-cn-dev:abc1234

This time it was pulling my own application image, the one GitHub Actions builds and pushes to ghcr.io. Same network issue. I thought "just do the nju mirror + retag once more" and then realized it wouldn't work this time.

Two reasons:

First, the application image is private, because I don't want the code to be public. Third-party community mirrors only serve public images. Auth tokens for private images are issued by GitHub and a third party cannot relay them, so pulling through the mirror is rejected with a 401.

Second, the application image tag is different on every deployment. The helper trick of "pre-pull once and keep using it" does not work for my own image: tomorrow's commit SHA doesn't exist yet, so how could I pre-pull it?

That night I ran through the options in my head:

  • Make the image public: a lot of business logic is bundled in the code and I don't want it public. Rejected.

  • Self-hosted ghcr reverse proxy: needs a rented overseas VPS running nginx with SSL, which is a hassle to maintain

  • Docker Hub free plan: only 1 private repo + 200 pulls/6h, so frequent deployments would be rate-limited

  • Alibaba Cloud ACR Enterprise Edition: 100+ yuan/month, not worth it for a personal project

  • Local docker save + scp: too crude, set aside for now

Only halfway through the research did I think to search for "ACR Personal Edition". It turns out Alibaba Cloud ACR Personal Edition is completely free: 3 namespaces + 300 repos + 5GB per image, which is plenty for a personal project like mine.

That was a bit deflating. I had spent a long time researching solutions that "seemed more universal", only to discover that Alibaba Cloud itself has a ready-made free tier. Tencent Cloud and Huawei Cloud have one too. Cloud vendors offer these free tiers to lock customers in; I just happen to benefit from it, and there is no shame in that. Next time my first reaction should be "Does this cloud vendor have a ready-made free option?" rather than "Can I build one myself?"

The pipeline change is small: GitHub Actions pushes one extra copy to ACR, and Coolify switches to ACR's VPC intranet address. Because my ECS and ACR are in the same region (Hangzhou), pulls go over the Alibaba Cloud intranet: hundreds of MB arrive in a few seconds, with zero traffic fees. Push goes over the public network (local machines and CI sit outside the VPC and have no other route), and pull goes over the VPC (ECS inside the VPC gets the intranet). This is the core pattern of ACR's design, and once I understood it I found it quite elegant.
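The push-public, pull-VPC split can be sketched in shell as below. This is an illustration under assumptions, not the actual workflow file: `my-namespace` is a placeholder, the function names are made up, and the `registry.<region>.aliyuncs.com` / `registry-vpc.<region>.aliyuncs.com` hostnames follow the long-standing Personal Edition pattern. Newer instances may show different, instance-specific hostnames, so copy the real ones from the ACR console. A `docker login` against both registries is assumed to have happened already.

```shell
#!/bin/sh
# Sketch only: one image, two registries, two ACR endpoints.
ACR_REGION="cn-hangzhou"
ACR_NAMESPACE="my-namespace"    # placeholder, not the real namespace
GHCR_OWNER="sanvibyfish"

# Public endpoint: what CI and local machines push to.
acr_push_ref() {
  printf 'registry.%s.aliyuncs.com/%s/%s\n' "$ACR_REGION" "$ACR_NAMESPACE" "$1"
}

# VPC endpoint: what an ECS instance in the same region pulls from.
acr_pull_ref() {
  printf 'registry-vpc.%s.aliyuncs.com/%s/%s\n' "$ACR_REGION" "$ACR_NAMESPACE" "$1"
}

# Tag a locally built image for both registries and push both copies.
dual_push() {
  image="$1"                                  # e.g. studythai-cn-dev:abc1234
  ghcr_ref="ghcr.io/$GHCR_OWNER/$image"
  acr_ref=$(acr_push_ref "$image")
  docker tag "$image" "$ghcr_ref" && docker push "$ghcr_ref" &&
  docker tag "$image" "$acr_ref"  && docker push "$acr_ref"
}
```

CI would call `dual_push studythai-cn-dev:abc1234` after the build, while the image reference configured in Coolify would be the output of `acr_pull_ref studythai-cn-dev:abc1234`.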

The following week: Coolify turns red at irregular intervals

The image problems were fully fixed and deployments succeeded. I thought it was finally time for a break.

Instead, over the next few days the ECS server in the Coolify dashboard turned red every so often: the "Server is reachable" check failed. Sometimes it recovered within a few minutes, and sometimes it stayed down for half an hour. The most frustrating part was the irregularity: fine in the morning, suddenly down at noon, fine again in the evening.

I could SSH in. Docker was fine. The Coolify container was running. But Coolify's health check just wouldn't pass.

I dug through three layers, and learned something new at each one.

**The first layer is the Alibaba Cloud security group.** This was the most annoying one. An Alibaba Cloud security group is an allowlist, and by default port 22 is only open to the IPs ticked under "Quick Configuration". When I created the instance, I ticked "My Current IP" to save time. Then my home broadband IP changed, and the egress IP of the Coolify server (in Singapore) had never been on the allowlist, so from the very beginning it only worked occasionally. It worked "occasionally" because the Singapore VPS's egress sometimes matched an older rule I had added. Pure luck.

I temporarily opened port 22 to 0.0.0.0/0 in the security group and it was stable immediately. I used to blame the "constant disconnects" on Alibaba Cloud being unstable, but it turns out Alibaba Cloud is too stable: so stable that it enforces every rule I misconfigured. Afterwards I narrowed the rule down to the public IP of the Coolify machine in Singapore, which is both safe and convenient.
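The same narrow rule can also be added from the command line instead of the console. A sketch, assuming the `aliyun` CLI is installed and configured and using the ECS `AuthorizeSecurityGroup` API; the helper names, the security group ID, and the IP below are placeholders, not values from this setup.

```shell
#!/bin/sh
# Sketch only: allow SSH from exactly one source IP on a security group.

# Turn a single IPv4 address into a /32 CIDR block.
to_cidr32() {
  printf '%s/32\n' "$1"
}

# Add an inbound rule for 22/tcp from one address.
allow_ssh_from() {
  region="$1"    # e.g. cn-hangzhou
  sg_id="$2"     # placeholder security group ID
  ip="$3"        # e.g. the Coolify server's fixed public IP
  aliyun ecs AuthorizeSecurityGroup \
    --RegionId "$region" \
    --SecurityGroupId "$sg_id" \
    --IpProtocol tcp \
    --PortRange 22/22 \
    --SourceCidrIp "$(to_cidr32 "$ip")"
}
```

For example, `allow_ssh_from cn-hangzhou sg-xxxxxxxx 203.0.113.10` would add one rule for a documentation-range IP; it would be run once for the Coolify server and once for each admin IP.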

**The second layer is Alibaba Cloud's "burstable performance" instance trap.** I was running an entry-level t6, which looked cheap (tens of yuan per month), but then I noticed a pattern: after every Coolify deployment, SSH responded very slowly for the next 1-2 hours and the health check occasionally timed out.

After looking it up, I learned that t5 / t6 CPUs run on a 10% baseline plus accumulated credits: normally you only get 0.2 CPUs, and you have to bank enough credits before you can run at full speed for a short time. A single Next.js build eats all the credits. After the build, sshd is throttled to the 10% baseline, so of course it is slow.

This Alibaba Cloud tier is cheap on the surface but a trap in practice: any service that has to build continuously or run background tasks should not use a burstable type. In the end I upgraded to a regular g6, which costs 2-3 times more but will pay for itself once the domestic business takes off. If you really want to save money, use Alibaba Cloud ECI or preemptible instances and avoid the burstable category altogether.

**The third layer is cross-border SSH jitter.** Coolify is in Singapore, ECS is in Hangzhou, and the international gateway sits in between. Mainland China's usage peak runs from 8 pm to midnight, when packet loss on international links is often 20%+. If the SSH handshake can't complete within the default 10s timeout, it is reported as a failure.

There is no cure for this, short of moving Coolify to the mainland as well (which would mean another filing). In the end I did two things to cut the noise: raised the SSH timeout to 30s, and lengthened the health check interval from 1 minute to 5 minutes. Fewer false alarms, but slower detection. That is the trade-off.
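To see how much the timeout matters on a given link, the handshake can be probed by hand from the Coolify host. This is a sketch of a manual check using standard OpenSSH options, not Coolify's own health check or its settings; `probe_ssh` and the target are placeholders.

```shell
#!/bin/sh
# Sketch only: attempt one SSH handshake with an explicit timeout.
# BatchMode=yes keeps ssh from waiting on a password prompt;
# ConnectTimeout bounds connection setup and the initial handshake.
probe_ssh() {
  target="$1"              # e.g. root@<ecs-public-ip> (placeholder)
  timeout_s="${2:-30}"     # 30s is the value settled on above
  if ssh -o BatchMode=yes -o ConnectTimeout="$timeout_s" "$target" true; then
    echo ok
  else
    echo fail
  fi
}
```

Running `probe_ssh root@<ecs-public-ip> 10` and `probe_ssh root@<ecs-public-ip> 30` a few times during the evening peak gives a rough feel for how many failures are just slow handshakes.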

One more hidden landmine: the disk

The Singapore article covered the incident in March: Coolify let old docker images pile up until the disk hit 100%, PostgreSQL crashed, and the entire site was unavailable for 15 minutes.

This time, when creating the ECS instance on Alibaba Cloud, I proactively chose a 100GB system disk (the default is 40GB) and added a cron job in the first week:

# Every Sunday early morning: clean images unused for 7 days, stopped containers, and dangling volumes
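A sketch of a cleanup job matching that comment. The schedule, script path, and function name are assumptions. The three prune commands are kept separate because `docker system prune` does not accept the `until` filter together with `--volumes`. Note that `until=168h` filters on image creation time, so it removes unused images created more than 7 days ago.

```shell
#!/bin/sh
# Sketch only. Example crontab entry (04:00 every Sunday, hypothetical path):
#   0 4 * * 0 /usr/local/bin/docker-cleanup.sh >> /var/log/docker-cleanup.log 2>&1

docker_cleanup() {
  # Remove all stopped containers.
  docker container prune -f
  # Remove unused images created more than 7 days (168h) ago.
  docker image prune -af --filter "until=168h"
  # Remove volumes no longer referenced by any container
  # (recent Docker versions only remove anonymous ones by default).
  docker volume prune -f
}
```

The script file that cron runs would end with a call to `docker_cleanup`.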

Coolify doesn't have its own "auto-clean" switch. This is a long-standing product gap that almost every Coolify user eventually falls into; the March incident taught me that lesson once already.

A few things I'd tell myself a week ago

Closing the laptop and thinking back on this week, these are the things I most want to tell the me of a week ago:

**For ECS, go straight to a regular g6; don't be tempted by the tens of yuan a t6 saves.** If you really want to save money, use ECI or preemptible instances. Don't touch burstable: it is not a "cheap version", it is a "trap that looks cheap".

**Make the system disk 100GB+.** The default 40GB will be used up within three months. When you are rescuing PostgreSQL in the early hours, you will wish you had spent the extra money now.

Open ACR Personal Edition on day one (it's free). Configure the VPC intranet address from ECS to ACR and save yourself all the time later spent fighting ghcr.io.

In the security group, don't tick "My Current IP" under "Quick Configuration". Explicitly enter the fixed public IP of the Coolify server, plus your own admin IP. Don't take the 0.0.0.0/0 shortcut, and don't let Alibaba Cloud guess for you.

Make GitHub Actions dual-push by default: ghcr.io (visible for development) + ACR (production pull source). Don't "wait until ghcr can't be pulled and then add ACR". By then you will already be putting out fires in the early hours of the morning.

After installing Coolify, the first thing to do is add docker system prune to cron, not configure the domain name.

When Coolify manages servers across regions, set the SSH timeout to 30s and the health check interval to 5 minutes. The defaults are designed around same-datacenter latency and are not reasonable across countries and networks.

The whole week actually comes down to one sentence

The summary of the Singapore case was "pre-show + one-line DNS rollback" - technical elegance.

This one is far less elegant: domestic cloud carries a hidden tax called "environmental cost". Slow GitHub, burstable-CPU traps, complicated security groups, cross-border jitter... None of them is big on its own, but stacked together they have had me putting out fires for the past two weeks.

I used to think this kind of "environmental cost" came from a lack of engineering skill; I always felt that "other, more experienced people wouldn't fall into these pitfalls". This week I realized that these pitfalls are part of the infrastructure itself and have nothing to do with skill level. Acknowledging them and building them into budgets and processes is much cheaper than trying to "do it the right way" and work around them.

A compliance tax is a compliance tax. Next time I open a new instance, I will run through the checklist above on day one and leave the rest of the time for real business.