Originally published as a technical write-up at https://puzzle-f20.integirls.org/page/tech.
To quickly introduce myself for those of you who haven’t seen me around on the forum, I’m Mingjie (usually @itsmingjie on the interwebs), a first-year intended CS student at UC Berkeley currently still caving at home in Rockville, Maryland, and I use he/him pronouns. Again, thanks for everyone’s participation over this weekend’s puzzle hunt! It was one of the rare opportunities that we could all have some fun in the hectic year of 2020.
In this post, I would like to spend some time talking about the design decisions I made for Infinity, the puzzle hunt platform that I built over the last 4 months specifically for this puzzle hunt, and how these decisions affected the outcome of the hunt itself. This post will lean a little toward the technical side, but please feel free to reach out to me if you have any questions! I’m always happy to share more.
Let’s dive in...
Let’s start with Spring 2020.
This spring, we launched our first attempt at an online puzzle hunt. I joined the team a few weeks before the competition weekend, and to be frank, I didn’t think it was that big of a deal — it’s just a simple server where a few students attempt to solve puzzles... What could go wrong?
We started the platform-selection process by asking, “Which platform is the easiest to deploy, and would probably require the least effort to maintain before and during the event?” Since I had prior experience running cybersecurity “Capture the Flag (CTF)” competitions and didn’t think a puzzle hunt was much different, we went with CTFd, a commonly used platform for CTF events.
Server setup went well — we were able to easily deploy an instance of the platform and get our puzzles online and running, get everything that we needed all hooked up, and allow teams to register and start competing. It did take our staff quite a while to actually be able to understand how the platform is designed, especially because CTFd is specifically designed for cybersecurity challenges and is bloated with features that we might not need.
So... What went wrong?
If you read the wrap-up post from the Spring 2020 Puzzle Hunt, you already know what happened: we were absolutely overwhelmed by the amount of traffic that we received throughout the event. According to our system logs, our servers received almost 300,000 requests per hour at peak time (generally the first and last hour of the event)! To help you visualize, that comes down to about 5,000 requests per minute, or about 83 requests per second. Some of them were simple calls that asked for a certain image or small webpage, but others required our servers to perform slower tasks like hashing and comparing a solution attempt, or a series of database-heavy authentication checks or lookups.
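The conversion is plain division. As a quick check of the figures quoted above:

```javascript
// Convert the peak hourly request count into per-minute and per-second rates.
const requestsPerHour = 300000;
const requestsPerMinute = requestsPerHour / 60;
const requestsPerSecond = requestsPerMinute / 60;

console.log(requestsPerMinute);             // 5000
console.log(Math.round(requestsPerSecond)); // 83
```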
We were totally caught off guard by the popularity of this event. In fact, our server (notice how it is unfortunately... singular at this point) went completely unresponsive within the first minute after the puzzles were released. Upon receiving the downtime alert, I went to work, starting to identify what exactly was causing the slowdown. The result wasn’t shocking, but it did concern us quite a lot: with just a few hundred teams active, our server’s CPU and memory usage both surged to 100%, and the average load per core exceeded capacity by at least 400%.
We made multiple attempts to rescue the poorly deployed CTFd instance before giving up on it, including massively scaling up the server and splitting the divisions onto different servers. Interestingly, splitting into two servers didn’t ease my pressure at all... it made things worse, since I then had 2 struggling servers to take care of. Eventually, we went with Matt’s back-up “Google Sheets” solution which, to my utter surprise, worked quite well despite the jankiness.
But notice my wording here, “poorly deployed”, so don’t blame CTFd. CTFd is a very nicely written piece of software that is maintained by the community (and very stable if deployed correctly). Unfortunately, I lacked the networking and load-balancing knowledge to deploy it correctly on the right infrastructure, nor did we have the time and budget to make that happen. So we had to look for simpler alternatives.
A custom solution?
We chatted for a little bit after the competition, and as someone who had little experience solving puzzles myself, I was finally able to understand what it takes to run an engaging online puzzle hunt challenge. Everything comes down to these criteria:
- Customizable: the puzzle team should easily be able to fit the platform to their own needs
- Minimal: a puzzle-hunt platform is different from a CTF platform, and we will only need a fraction of the features needed by a CTF competition
- Easy-to-use: it shouldn’t take the puzzle team the entire duration of the puzzle hunt to understand how the platform functions. In fact, it should almost feel like the platform does not exist... and in Steve Jobs' words, “it just works”.
I soon offered to write a custom solution for the next event. Not only did I want my platform to be easy to deploy and use, I also wanted it to be fail-proof: if anything went wrong, we shouldn’t lose access to the admin panel completely like we did last time, and we should be able to deploy hot fixes and modify database entries to correct information in real time. And that turned out to be quite handy.
The Infinity platform itself is built with Node.js, rendered with the Handlebars templating engine, and routed with Express.js. You can find a thread of me posting updates when building Infinity on the inteGIRLS Forum.
A confession: I really wanted to use Next.js + React for this, but as much as I already know about Next.js, I'm not confident enough in my React knowledge to build an application at this scale with it. Therefore I went with Handlebars, a lightweight templating framework that's practically plain HTML.
The platform itself relies heavily on Express.js middleware. For every page load, a number of middleware functions are invoked to check for sufficient permissions to access the page (e.g. whether the current puzzle is locked for your team, or whether the page you're accessing is under lockdown), and the user's team information is reloaded from the database.
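To make the middleware idea concrete, here is a minimal sketch of such a chain. It is not Infinity's actual code: the team data, the `reply` helper, and the tiny `handle` runner are all hypothetical stand-ins so the example runs without Express, though the `(req, res, next)` shape matches how Express middleware is written.

```javascript
// Hypothetical in-memory stand-in for the team table in the real database.
const teams = new Map([
  ["team-1", { name: "Integrals", unlockedPuzzles: new Set(["p1"]) }],
]);

// Reload the team's record on every request so later checks never see stale data.
function loadTeam(req, res, next) {
  const team = teams.get(req.teamId);
  if (!team) return res.reply(401, "Please log in");
  req.team = team;
  next();
}

// Refuse access when the requested puzzle is still locked for this team.
function requireUnlocked(req, res, next) {
  if (!req.team.unlockedPuzzles.has(req.puzzleId)) {
    return res.reply(403, "Puzzle is locked");
  }
  next();
}

// Final handler: only reached when every check above called next().
function showPuzzle(req, res) {
  res.reply(200, `Puzzle ${req.puzzleId} for ${req.team.name}`);
}

// Tiny runner that imitates how a middleware chain is walked in order.
function handle(req, chain) {
  const res = {
    status: null,
    body: null,
    reply(status, body) { this.status = status; this.body = body; },
  };
  let i = 0;
  const next = () => {
    const fn = chain[i++];
    if (fn) fn(req, res, next);
  };
  next();
  return res;
}

const chain = [loadTeam, requireUnlocked, showPuzzle];
const ok = handle({ teamId: "team-1", puzzleId: "p1" }, chain);
const locked = handle({ teamId: "team-1", puzzleId: "p2" }, chain);
console.log(ok.status, locked.status); // 200 403
```

Because each check either replies or calls `next()`, a locked puzzle never reaches the final handler.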
It's the safest way to handle information (and to prevent potential exploits that take advantage of unsynced data). But that is also exactly what caused problems: our database was limited to 100 concurrent connections, and because some temporary connections were not being properly closed, we repeatedly ran out of available connections, so staff members had to manually reboot part of the server once in a while to get everything working again.
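One common way to avoid leaking connections is to release them in a `finally` block, so they are returned even when a query throws. The sketch below illustrates that pattern with a hypothetical `ConnectionPool` class and `withConnection` helper; it is not the database client Infinity used, and a real driver's pool API will differ.

```javascript
// Minimal pool with a hard cap, standing in for a database limited to N connections.
class ConnectionPool {
  constructor(limit) {
    this.limit = limit;
    this.inUse = 0;
  }
  acquire() {
    if (this.inUse >= this.limit) throw new Error("No connections available");
    this.inUse += 1;
    // Fake connection whose query just echoes the SQL it was given.
    return { query: async (sql) => `result of ${sql}` };
  }
  release() {
    this.inUse -= 1;
  }
}

// Run `work` with a connection and always hand it back, even if `work` throws.
async function withConnection(pool, work) {
  const conn = pool.acquire();
  try {
    return await work(conn);
  } finally {
    pool.release();
  }
}

async function demo() {
  const pool = new ConnectionPool(2);
  // A failing query must not leak its connection.
  await withConnection(pool, async () => {
    throw new Error("query failed");
  }).catch(() => {});
  await withConnection(pool, (conn) => conn.query("SELECT 1"));
  return pool.inUse;
}

demo().then((inUse) => console.log("connections still in use:", inUse));
```

With this shape, every code path that borrows a connection goes through one helper, which makes a leak much easier to rule out.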
Interestingly, although we had fewer teams competing in this season's puzzle hunt, we received a lot more attempts during peak hours... Thanks for treating my platform well, everyone!
Infinity didn't work on its own. Because of certain design and deployment choices I made, I have also written and set up a few microservices to offload the complexity of the platform itself:
Yes, Infinity depends on Airtable.
This sounds like a terrible idea at first, but it actually ended up being quite handy. We were able to spin up multiple test servers before we set up our production servers, and we didn't have to link in the real puzzles: we had a test Airtable containing all the puzzles used throughout development and testing, and all we had to do before the competition was swap out the Airtable access keys. It also made puzzles a lot easier to import. Traditionally, puzzles are written directly in the codebase, which would have made me the only person who could import all the puzzles, on top of trying to keep the servers up.
We also made sure, since Airtable stores everything in plain text, that none of the puzzle solutions were stored in plain text. All solutions were hashed into something that looks like
$2b$10$UPKTpsSoGAKulzr2idoUsezGymWG1d.k7EbZdAhkyB3jnADRltW8G, and since hashing is one-way, even if all the hashed solutions were leaked, an attacker wouldn't be able to use them to unlock puzzles, nor would they be able to calculate the actual plain-text solutions. And all we had to do to regenerate everything was to re-hash the keys.
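The `$2b$10$` prefix above is the bcrypt format. To keep the following sketch free of external packages, it swaps in scrypt from Node's built-in `crypto` module instead of bcrypt; the idea (store only a salted one-way hash, then re-hash each attempt and compare) is the same. The function names and the answer normalization are assumptions for illustration, not Infinity's actual code.

```javascript
const crypto = require("crypto");

// Assumed normalization: ignore case and surrounding whitespace in answers.
function normalize(answer) {
  return answer.trim().toLowerCase();
}

// Hash a solution with a random salt; only "salt:hash" is stored, never the plain text.
function hashSolution(answer) {
  const salt = crypto.randomBytes(16);
  const hash = crypto.scryptSync(normalize(answer), salt, 32);
  return `${salt.toString("hex")}:${hash.toString("hex")}`;
}

// Re-hash the attempt with the stored salt and compare in constant time.
function checkAttempt(attempt, stored) {
  const [saltHex, hashHex] = stored.split(":");
  const expected = Buffer.from(hashHex, "hex");
  const actual = crypto.scryptSync(
    normalize(attempt),
    Buffer.from(saltHex, "hex"),
    expected.length
  );
  return crypto.timingSafeEqual(actual, expected);
}

const stored = hashSolution("integral");
console.log(checkAttempt("Integral ", stored));  // true
console.log(checkAttempt("derivative", stored)); // false
```

Since each solution gets its own random salt, two puzzles with the same answer would still have different stored values.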
Using Airtable as our puzzle storage came with concerns. As CodeDay's ED Tyler tweeted just this summer, when we were using Airtable to power our entire summer program website:
But we weren't too concerned, since our servers automatically cache all the puzzle content until they reboot or we manually request a "restock". Therefore, as long as Airtable comes back up within an hour or two, we should expect minimal service interruptions. We can also very easily swap the Airtable service out with a simple JSON file, which can be done in just a few minutes if Airtable does not come back up in time.
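A minimal sketch of this caching behavior, assuming the puzzle source is wrapped in a single async loader function. The names `createPuzzleCache`, `get`, and `restock` are hypothetical, and the fake loader below stands in for the Airtable call (or for a JSON-file loader used as a fallback).

```javascript
// Build a cache around any async loader, so the source can be swapped freely.
function createPuzzleCache(loadPuzzles) {
  let cached = null;
  return {
    // Serve from memory; only hit the source on the first call after a boot.
    async get() {
      if (cached === null) cached = await loadPuzzles();
      return cached;
    },
    // Manual "restock": refetch, but keep the old copy if the source is down.
    async restock() {
      try {
        cached = await loadPuzzles();
        return true;
      } catch (err) {
        return false;
      }
    },
  };
}

async function demo() {
  let calls = 0;
  let sourceUp = true;
  // Fake puzzle source that counts fetches and can be switched "down".
  const fakeSource = async () => {
    calls += 1;
    if (!sourceUp) throw new Error("source unavailable");
    return [{ id: "p1", title: "Warm-up" }];
  };

  const cache = createPuzzleCache(fakeSource);
  await cache.get();
  await cache.get(); // served from memory, no second fetch
  sourceUp = false;
  const restocked = await cache.restock(); // fails, old copy is kept
  const puzzles = await cache.get();
  return { calls, restocked, count: puzzles.length };
}

demo().then((result) => console.log(result));
```

Swapping the source for a JSON file then only means passing a different loader to `createPuzzleCache`.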
A centralized socket.io server that receives requests from the Admin portal of Infinity and broadcasts messages to everyone or to specific teams in real time. This one was really interesting to write (personally, I really love writing real-time communication software; it's just really cool to watch). Every time a new user gets online, they get assigned a unique client ID, and if they are logged in, they get placed into the same "room" as their teammates, so the admins have precise control over who receives our broadcast messages. For example, we can remind a specific team to stay hydrated...
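The room idea can be sketched without any networking at all. The `Broadcaster` class below is a hypothetical in-memory stand-in, not socket.io's API: it only shows how assigning client IDs and grouping clients by team makes targeted broadcasts possible.

```javascript
// In-memory stand-in for rooms: room name -> set of connected clients.
class Broadcaster {
  constructor() {
    this.rooms = new Map();
    this.nextId = 1;
  }
  // Register a client with a unique ID; logged-in users also join their team's room.
  connect(teamId, onMessage) {
    const client = { id: `client-${this.nextId++}`, onMessage };
    this.join("everyone", client);
    if (teamId) this.join(`team:${teamId}`, client);
    return client.id;
  }
  join(room, client) {
    if (!this.rooms.has(room)) this.rooms.set(room, new Set());
    this.rooms.get(room).add(client);
  }
  // Deliver a message to every client in one room; returns how many received it.
  broadcast(room, message) {
    const members = this.rooms.get(room) || new Set();
    for (const client of members) client.onMessage(message);
    return members.size;
  }
}

const broadcaster = new Broadcaster();
const inbox = { alice: [], bob: [], guest: [] };
broadcaster.connect("red", (m) => inbox.alice.push(m));
broadcaster.connect("blue", (m) => inbox.bob.push(m));
broadcaster.connect(null, (m) => inbox.guest.push(m)); // not logged in

broadcaster.broadcast("everyone", "Hunt starts in 5 minutes!");
broadcaster.broadcast("team:red", "Stay hydrated!");
console.log(inbox.alice.length, inbox.bob.length, inbox.guest.length); // 2 1 1
```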
Last time, we saw a surprisingly high amount of participation from teams in India, but due to certain internet regulations of the country, some of our assets cannot be accessed at all. Prox is nothing more than a content proxy — it proxies our image URL (prox.joy.integirls.org — now disabled) directly to Imgur's host. And since it downloads the image asset to our local server first, all the images will appear to be served from our own proxy server, instead of Imgur. This worked out well, but unfortunately, a few other elements throughout the puzzle hunt like videos and interactive components were also blocked by certain countries, so I might consider adding a universal proxy to the system in the future.
Our friends at Puzzle Potluck suggested that I deploy a service like this on Google Cloud, where things can automatically scale based on our usage. But after a few attempts... I just could not understand how Google Cloud's deployment dashboard works. So we went back to self-hosting.
This time, we used a self-hosted PaaS solution called CapRover to its maximum capabilities. Deployments are fully automatic from pushes to our GitHub repository, and it even supports server clustering with Docker Swarm. We deployed a few dedicated Droplets on DigitalOcean across different datacenters, including New York, San Francisco, and Singapore (!), and put a gateway in front of them that routes each user to the most suitable server based on server availability and the user's location.
This worked out extremely well while we were running the testing round with some friends of inteGIRLS, but what I did not anticipate was the database bottleneck problem happening again as I described earlier. This caused some nodes on our network to get tossed into a queue forever (since the database connections are almost always occupied until we manually terminate them), and that error was, unfortunately, never reported back to the gateway. So the gateway unknowingly bounced people to the "braindead" servers... causing people to receive the infinite loading loop until they clear their cache so the gateway starts sending them to a new server.
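One way to picture the missing piece is a gateway that only routes to nodes reporting themselves healthy, where "healthy" includes being able to get a database connection. This is a sketch under assumptions, not how the actual gateway worked: the node list, region labels, and the `reportHealth` and `pickNode` functions are all hypothetical.

```javascript
// Hypothetical node list; regions loosely mirror the datacenters mentioned above.
const nodes = [
  { name: "nyc", region: "us-east", healthy: true },
  { name: "sfo", region: "us-west", healthy: true },
  { name: "sgp", region: "asia", healthy: true },
];

// A node counts as healthy only if it can actually get a database connection,
// not merely because its web process is still accepting requests.
function reportHealth(node, dbConnectionsFree) {
  node.healthy = dbConnectionsFree > 0;
}

// Prefer a healthy node in the user's region; otherwise fall back to any healthy node.
function pickNode(userRegion) {
  const healthy = nodes.filter((n) => n.healthy);
  if (healthy.length === 0) return null;
  return healthy.find((n) => n.region === userRegion) || healthy[0];
}

console.log(pickNode("asia").name); // sgp
reportHealth(nodes[2], 0);          // Singapore runs out of connections
console.log(pickNode("asia").name); // nyc
```

The key design choice is that the health signal comes from the resource that actually fails (the database connection), so a stuck node drops out of rotation instead of receiving more users.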
The staff members, fortunately, figured it out ahead of time, so when things slowed down for us, we simply used the
CTRL/COMMAND + SHIFT + R shortcut to reset the site's data. But since we only found the cause of this problem almost a third of the way into the competition, it had become too big of a risk to take everything down and send in a hotfix to change how the nodes communicate with each other. My only attempt was during the second night, when I took down the entire system for an hour and tried to close connections properly... which still couldn't fully address the issue without changing how the gateway works.
As I was writing this wrap-up, I came across Teammate Hunt's technical wrap-up. Coincidentally, it also described the following incident...
One issue we identified two weeks before the hunt was a leak in database connections. For some reason, each new GameEngine thread opened a connection to Postgres but failed to close it. Despite our best efforts, we never isolated the cause, and ended up implementing a workaround that used connection pooling (via PgBouncer), an expiry time of 12 hours, and a cron job to monitor the number of open connections.
Thanks for sharing! Connection monitoring sounds like a good fix that can be applied to this issue in the future, but I'll definitely spend some more time looking for a long-term solution.
Let's talk about some lighthearted things. I wanted Infinity's design to be as clean as possible, and it also needed to be very customizable for our puzzle team to design puzzles. Every puzzle and level has a field where our puzzle team can inject custom CSS, and the puzzles themselves also have access to the styling that I pre-configured for the platform as a whole. This worked quite well, especially with all the cute artwork drawn by Maggie.
Not too much design effort was put into pulling together Infinity's branding itself — it is simply a combination of the infinity symbol ∞ and the integration sign ∫. The card-sized poster has the brand-colored gradient in the background, as well as a graph that is approaching infinity. The source Figma files are below — feel free to remix with credit!
Infinity isn't going away after this season! The interface may be replaced in the future, but I have made plans to optimize the verification server that powers the platform in the background. In the future, we are going to add ???????, as well as ?????? and ??????? to make puzzle hunts even more fun. With all these new features and capabilities, the next puzzle hunt is going to be quite interesting... (Don't try, this is not a puzzle.)
I also personally really care about the accessibility of sites, starting with screen-reader support. We added the ability to disable background artwork for all puzzles almost at the last minute, since the text was kind of hard to read when displayed on top of colorful artwork. I know that not everything at a puzzle hunt can be screen-reader friendly, but I am going to do my best to ensure that the majority of the site is as accessible as possible for future events.
I am nowhere close to my initial goal of building a platform that "just works". But it has been an incredible learning experience for me nonetheless.
And to end this technical write-up, here's the actual longest solution that we received on the server, submitted by our one and only Julie Steele <3, with a whopping 8400 characters...
And yes, indeed, I did not length validate.
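For next time, a length check could be as small as the sketch below. The 200-character cap and the `validateAttempt` name are assumptions for illustration; the right limit depends on the longest legitimate answer.

```javascript
// Assumed cap; the real limit would depend on the longest legitimate answer.
const MAX_ATTEMPT_LENGTH = 200;

// Reject non-string, empty, or oversized attempts before they reach the hashing step.
function validateAttempt(attempt) {
  if (typeof attempt !== "string") return { ok: false, reason: "not a string" };
  const trimmed = attempt.trim();
  if (trimmed.length === 0) return { ok: false, reason: "empty" };
  if (trimmed.length > MAX_ATTEMPT_LENGTH) return { ok: false, reason: "too long" };
  return { ok: true, value: trimmed };
}

console.log(validateAttempt("integral").ok);           // true
console.log(validateAttempt("a".repeat(8400)).reason); // too long
```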