Moving StudyThai to Singapore: Engineering Notes from a Cross-Region Infrastructure Migration
I wasn't aiming for a "zero-downtime switch." I gave myself a 5-10 minute maintenance window and a minimal rollback path: if something goes wrong, change one DNS record back. Technically it's not a lot of work, but there were a few pitfalls I didn't anticipate. I wrote them down for myself and for any colleagues passing by.
1. Why move to Singapore?
Before the migration, I checked the 30-day user distribution in PostHog: on the web side, mainland China accounts for 54% of MAU and Thailand for 13%. Adding other Asian regions such as Hong Kong, Taiwan, Singapore, Malaysia, Japan, and the Philippines, East Asia + Southeast Asia totals 80%+, while the Americas + Europe add up to less than 10%. On the mobile side, Thailand's share is higher (19%), since Southeast Asia is a typical mobile-first market. The server belongs in Asia-Pacific, not in the original North American data center.
A trap that almost led me astray: GSC lied
Before the migration, I almost used Google Search Console to determine user distribution. GSC showed mainland China at only 3%; going by that data, the server should have gone to Thailand or Taiwan. But PostHog showed mainland China at 54%, an 18x difference.
The root cause is that Google is restricted in mainland China. GSC only sees traffic that arrives through Google and completely misses the mainland users who come in through Xiaohongshu, official accounts, or app stores. If I had decided based on GSC, I would have put the server in the wrong place.
This incident taught me to reflexively check PostHog first for any "where are the users" question. SEO tools have their place, but they don't show where your users are; they only show the people who can find you through search.
Another push: the P0 incident in March
The old server was a small RackNerd VPS in Seattle with a 135GB disk, and builds, the application, and the database were all squeezed onto it. One day in March, old docker images accumulated by Coolify filled the disk to 100%. PostgreSQL could not write its WAL and went into a crash loop. The entire site was unavailable for about 15 minutes. This incident made me determined to redo the whole architecture: I could no longer rely on the single-point combination of "one machine for everything + cheap VPS + small disk".
2. How do I choose suppliers?
The research went quickly, because it was driven by three hard requirements: "users in Asia-Pacific + controllable monthly cost + full control over the stack".
Big-vendor clouds (AWS/GCP/Azure): several times more expensive, the control plane is too complicated for a small project, and the egress fees are staggering. Ruled out.
PaaS (Vercel/Heroku/Railway): I have used Vercel before, but the problem is that Next.js server components still have to go back to the origin in us-east, so Asia-Pacific users still circle the earth; pay-per-use also becomes more expensive than a VPS once user volume grows. Ruled out.
Self-hosted VPS + Coolify: Coolify is an open source alternative to Heroku/Railway. You install it on your own VPS and it provides the "GitHub push → automatic build and deployment → SSL renewal → domain binding" experience. It is simpler than k8s, more convenient than Docker Compose, and natively supports the "build server" concept, which is a key prerequisite for the split architecture described next.
Two machines, two providers: Linode + Contabo
After deciding on Coolify, I made a choice that differs from most tutorials: the application machine and the build machine come from two different VPS providers.
The application machine is on Linode Singapore. It faces users directly, so I'm willing to pay a premium for stability. Linode is noticeably more stable than budget VPS providers on "infrastructure quality" measures such as network, control plane, and disk I/O. The Singapore data center has fast routes to mainland China and Southeast Asia.
The build machine is on Contabo Singapore. It only gets busy when I push code. At the same price, Contabo offers 50%+ more CPU, memory, and disk than Linode, which makes it ridiculously cheap. The trade-off is that the network occasionally jitters and the control plane is not as smooth as Linode's, but none of that matters for a machine that only builds images. At worst, a deployment is a bit slower.
The logic is simple: the two machines have very different stability requirements, so there is no need to pay top-tier prices for both. If the build machine goes down, I just wait for the next deployment. If the application machine goes down, every user sees an error page. For the same money, buying machines of different grades gets more out of the budget than buying everything at one grade.
After the build machine builds the image, it pushes it to ghcr.io (GitHub's container registry), and the application machine pulls the image from ghcr.io and runs it. The two machines are coupled only through the image, so a failure on either one does not affect the other.
A pitfall that had been buried for a long time: Coolify's default template runs prisma migrate deploy during the build phase, which means the build machine must be able to connect to the production database. That is not a robust assumption in an architecture that separates build from application. See the "pitfalls" section below for details.
3. How to move file storage and voice services
For the database, pg_dump is enough, but Cloudflare R2 has no "cross-region move" tool. My first instinct was to "sync on the day of the switch," but I quickly rejected it: the old bucket would still be taking production writes while the sync ran, so the old and new data would never match. If I discovered after switching the domain that the sync was incomplete, rolling back would require a reverse sync. Two-way sync is a nightmare.
In the end, I used shadow bucket + pre-sync + link change on the day:
- 7 days before switching: create a new bucket cdn-apac in the APAC region, but do not bind a domain name to it (so it receives no traffic); run Cloudflare Super Slurper to copy all data from the old bucket over
- Switching day: run Super Slurper once more as an incremental sync to make up the difference; switch the media.studythai.ai domain from the old bucket to the new bucket; change one environment variable, R2_BUCKET_NAME=cdn-apac, and restart the application
The core benefit of this method is that the new bucket receives no traffic at all while it has no domain bound to it. If the sync goes wrong or the data does not match, the entire new bucket can be thrown away and rebuilt without affecting the old one. Rollback is just pointing the domain back and reverting the environment variable. The data in the old bucket is never touched.
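Before binding the domain, it helps to confirm that the shadow bucket really matches the old one. The following is a minimal sketch of such a check, not the procedure actually used in this migration. It assumes you have already exported an object listing (key, size, ETag) from each bucket by whatever means you prefer; the ObjectInfo shape and the function names are hypothetical.

```typescript
// Hypothetical listing entry: one object's key, size in bytes, and ETag.
interface ObjectInfo {
  key: string;
  size: number;
  etag: string;
}

interface BucketDiff {
  missing: string[];    // in the old bucket, absent from the new one
  mismatched: string[]; // present in both, but size or ETag differs
  extra: string[];      // only in the new bucket
}

// Compare two listings by key, then by size and ETag.
function diffBuckets(oldObjs: ObjectInfo[], newObjs: ObjectInfo[]): BucketDiff {
  const newByKey = new Map<string, ObjectInfo>();
  for (const o of newObjs) newByKey.set(o.key, o);

  const oldKeys = new Set<string>();
  const diff: BucketDiff = { missing: [], mismatched: [], extra: [] };

  for (const o of oldObjs) {
    oldKeys.add(o.key);
    const n = newByKey.get(o.key);
    if (n === undefined) {
      diff.missing.push(o.key);
    } else if (n.size !== o.size || n.etag !== o.etag) {
      diff.mismatched.push(o.key);
    }
  }
  for (const n of newObjs) {
    if (!oldKeys.has(n.key)) diff.extra.push(n.key);
  }
  return diff;
}

// The switch is only safe when nothing is missing or different.
// Extra objects in the new bucket are reported but do not block the switch.
function safeToSwitch(diff: BucketDiff): boolean {
  return diff.missing.length === 0 && diff.mismatched.length === 0;
}
```

One caveat: ETags may not be comparable for objects uploaded in multiple parts, so comparing key and size alone may be the more practical fallback for those objects.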
The same idea also applied to Azure Speech: build a shadow endpoint in the Southeast Asia region in advance, run a connectivity test locally, and on switch day change only 4 environment variables and restart. All resources are prepared in advance, so nothing has to be created on the spot at the moment of switching.
4. How do I block users during the 5 minutes of cutover?
At the moment of switching, the old server has stopped serving and the new server has not yet taken over. This gap lasts about 5 minutes. I don't want users to see a Cloudflare 522 error page, because that looks like the site has been hacked.
The most intuitive approach is to configure a maintenance page in the origin's nginx. But at the moment of cutover the origin itself is unreachable, so the maintenance page has to run outside the origin. I used a Cloudflare Worker bound to all paths of studythai.ai that reads a secret called MAINTENANCE_MODE. When the secret is true, traffic is intercepted and the maintenance page is returned; when it is false, traffic is passed through to origin. The switch is wrangler secret put MAINTENANCE_MODE, which takes effect globally within a few seconds and does not require rebuilding or redeploying the application.
A mobile-aware detail: API paths must return JSON
In the first version, every path returned the HTML maintenance page. The mobile app received HTML, tried to parse it as JSON, and crashed. I changed it to split by path:
- HTML paths → 503 + branded maintenance page HTML
- /api/* → 503 + JSON (application/problem+json format, aligned with the error schema of the mobile fetch client)
- /api/health, /api/app/config, /api/app/version → passed through to origin
The last rule is particularly important: on cold start, the mobile app fetches config and version for its version check. If those requests are blocked, the user sees a blank "Unable to connect to the server" screen, which is much worse than an "under maintenance" prompt.
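The path-splitting rules above can be sketched as a pure decision function. This is a minimal illustration rather than the Worker actually deployed: the Decision type, the decide function, the placeholder HTML, the problem-document fields, and the Retry-After value of 300 seconds are all assumptions made for the example.

```typescript
// What the edge should do with a request: forward it, or answer it directly.
type Decision =
  | { kind: "passthrough" }
  | { kind: "respond"; status: number; headers: Record<string, string>; body: string };

// Paths the mobile app needs at cold start; these always reach origin.
const ALWAYS_PASS = new Set(["/api/health", "/api/app/config", "/api/app/version"]);

// Placeholder markup (assumption); the real branded page is not shown in this post.
const MAINTENANCE_HTML =
  "<!doctype html><title>Maintenance</title><h1>We'll be right back</h1>";

// maintenanceMode is the raw secret value, which arrives as a string.
function decide(pathname: string, maintenanceMode: string | undefined): Decision {
  if (maintenanceMode !== "true") return { kind: "passthrough" };
  if (ALWAYS_PASS.has(pathname)) return { kind: "passthrough" };

  if (pathname === "/api" || pathname.startsWith("/api/")) {
    // Problem-details body; the field values are illustrative.
    const problem = {
      type: "about:blank",
      title: "Service under maintenance",
      status: 503,
    };
    return {
      kind: "respond",
      status: 503,
      headers: { "Content-Type": "application/problem+json", "Retry-After": "300" },
      body: JSON.stringify(problem),
    };
  }

  // Everything else is treated as a page request and gets the HTML page.
  return {
    kind: "respond",
    status: 503,
    headers: { "Content-Type": "text/html; charset=utf-8", "Retry-After": "300" },
    body: MAINTENANCE_HTML,
  };
}
```

Inside a Worker's fetch handler, a "respond" decision would become new Response(body, { status, headers }) and a "passthrough" decision would become fetch(request). Keeping the decision separate from the handler makes the routing easy to test without deploying anything.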
An unexpected bonus: the maintenance page is useful beyond cutover
Later, during the 15-minute 522 incident in production, I was able to switch the user experience from "CF 522" to "branded maintenance page" with a single command. This is the biggest by-product of the migration: emergency resources placed on the edge are more robust than those placed on origin. The next time I design an emergency plan, the first question will be "can this live on the edge?"
5. Two unexpected pitfalls
The "invisible firewall" of cloud vendors
One day, not long after the migration went live, monitoring suddenly blew up: Cloudflare was reporting error 522 and the entire site was unreachable. I quickly SSHed in to check.
This is where it got strange: SSH on port 22 worked, ping worked, but 80/443 timed out. The containers were healthy, the reverse proxy was listening, ufw was inactive, and the iptables INPUT policy was ACCEPT. After 20 minutes of poking around the server, I was more and more confused, because everything that should have been green was green.
What finally woke me up was the pattern of "blocking selected ports". SSH was reachable but 80/443 were blocked. With no firewall active on the machine, the machine itself could not be doing this kind of selective blocking; it would either let everything through or nothing. This can only happen at a layer outside the machine. I went straight to the Linode control panel, and sure enough, the Cloud Firewall configuration and its binding status did not match. This firewall layer runs on the hypervisor and is completely invisible to the VM. Once I fixed it, the site recovered immediately.
In hindsight, I already knew about this pitfall: AWS Security Groups, GCP Firewall, and Hetzner Cloud Firewall are all products of the same kind. But in panic mode, the first reaction is to get into the server and dig through its configuration, forgetting that there is another layer outside the VM. The next time I see "everything is fine inside the machine + specific ports are blocked from outside", I will look at the cloud control panel first.
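The rule of thumb from this incident can be written down as a small triage function. It is a rough heuristic under stated assumptions, not a diagnostic tool: the Evidence shape and the verdict names are hypothetical, and it assumes you have already probed the ports from outside and checked listeners and the host firewall from inside.

```typescript
type ProbeResult = "open" | "timeout" | "refused";

interface Evidence {
  external: Record<number, ProbeResult>; // port -> result seen from outside the VM
  listeningInside: number[];             // ports with a listener, checked on the VM
  hostFirewallActive: boolean;           // e.g. ufw enabled or restrictive iptables rules
}

type Verdict =
  | "healthy"
  | "check-service"
  | "check-host-firewall"
  | "check-cloud-firewall";

function diagnose(e: Evidence): Verdict {
  const ports = Object.keys(e.external).map(Number);
  const blocked = ports.filter((p) => e.external[p] !== "open");
  if (blocked.length === 0) return "healthy";

  // A refused connection, or no listener on the port, points at the service itself.
  const serviceProblem = blocked.some(
    (p) => e.external[p] === "refused" || !e.listeningInside.includes(p),
  );
  if (serviceProblem) return "check-service";

  // A listener exists but packets are dropped: the host firewall is the next suspect.
  if (e.hostFirewallActive) return "check-host-firewall";

  // Listener up, host firewall off, packets still dropped:
  // the filtering is probably happening outside the VM.
  return "check-cloud-firewall";
}
```

For the incident above, the input would be ports 22, 80, and 443 all listening inside, 22 open and 80/443 timing out from outside, and no host firewall, which lands on "check-cloud-firewall".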
Database migration command stuck in the build stage
When the new build machine deployed for the first time, prisma migrate deploy timed out after 5 seconds during the docker build phase, unable to connect to the database. psql could connect with the same connection string, and a minimal connectivity test also passed, but Prisma just would not work in the build context. In the end I traced it to the connection string that Coolify injected: it contained character escapes / trailing whitespace, so it "looked the same but was actually different".
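A "looks the same but is actually different" string is easier to catch with a character-level comparison than by eye. The sketch below is one way to do that, assuming you can get both values (the one you typed and the one the build actually sees) into the same process; the function names and the report shape are hypothetical. It reports positions and hints only, so the secret itself never needs to be printed.

```typescript
interface StringReport {
  equal: boolean;
  lengthA: number;
  lengthB: number;
  firstDiffIndex: number; // -1 when the strings are identical
  suspicious: string[];   // human-readable hints about invisible differences
}

// Flag the usual culprits: stray whitespace, control characters, quotes, escapes.
function describeInvisible(label: string, s: string): string[] {
  const notes: string[] = [];
  if (s !== s.trim()) notes.push(`${label}: leading or trailing whitespace`);
  if (/[\r\n\t]/.test(s)) notes.push(`${label}: contains CR, LF or tab`);
  if (/^["'].*["']$/.test(s)) notes.push(`${label}: wrapped in quotes`);
  if (/\\/.test(s)) notes.push(`${label}: contains a backslash (possible escaping)`);
  return notes;
}

function compareSecrets(a: string, b: string): StringReport {
  // Walk both strings until the first position where they differ.
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  const equal = a === b;
  return {
    equal,
    lengthA: a.length,
    lengthB: b.length,
    firstDiffIndex: equal ? -1 : i,
    suspicious: [...describeInvisible("a", a), ...describeInvisible("b", b)],
  };
}
```

A length mismatch alone is often enough to spot a trailing newline or space; the hints narrow down which side carries it.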
In fact, this pitfall had been there all along. The old machine ran fine only because the build and the database were on the same VM, so connectivity came for free. After switching to the separated architecture, this unspoken assumption broke immediately. I later realized that putting the database migration in the image build stage was a design error in itself: a build should not depend on runtime resources and should be able to run on a CI machine with no external network access. The migration did not create this bug; the migration exposed it.
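One way to act on this is to run the migration when the container starts instead of when the image is built. The following is a minimal sketch of that ordering, not the setup actually used here: the Step and Exec types are hypothetical, and "node server.js" is an assumed entry point. The command runner is injected so the sequencing can be tested without a database.

```typescript
interface Step {
  name: string;
  cmd: string;
  args: string[];
}

// Runs a command and returns its exit code. Injected so tests can fake it.
type Exec = (cmd: string, args: string[]) => number;

// Startup order for the application container: migrate first, then serve.
function startupSteps(): Step[] {
  return [
    { name: "migrate", cmd: "npx", args: ["prisma", "migrate", "deploy"] },
    { name: "serve", cmd: "node", args: ["server.js"] }, // assumed entry point
  ];
}

// Run steps in order and stop at the first failure,
// so the server never starts against an unmigrated schema.
function runSteps(steps: Step[], exec: Exec): { ok: boolean; failedAt?: string } {
  for (const step of steps) {
    const code = exec(step.cmd, step.args);
    if (code !== 0) return { ok: false, failedAt: step.name };
  }
  return { ok: true };
}
```

In a real container, exec could be a thin wrapper around spawnSync from node:child_process that returns the process exit status. The migration then runs on the application machine, which has to reach the database anyway, and the image build needs no database access at all.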
Closing thoughts
Looking back a few days after the migration, what really took time was not the 5 minutes of switching but the 7 days of preparation before it: repeatedly creating resources ahead of time, computing md5sum for secrets, and adjusting DNS TTLs. Infrastructure migration is 10% technology; the other 90% is breaking each irreversible step down into reversible ones and turning each assumption into something verifiable.
The most valuable takeaway this time is not Coolify or R2 but the habit of "shadow resources + pre-sync + only change links on the day". It turns "migration" from a tense live operation into a small switch that can be rehearsed repeatedly. The same idea will still apply to the next cross-region, cross-account, or cross-vendor migration.