Disclaimer: I'm speaking *about* my employer but I'm not speaking *for* them.
Kagi Assistant 2.0 was released almost exactly a year ago. At that time, it had already been through half a year of internal development which, for a variety of reasons, had gone a bit more chaotically than we'd hoped. The code has the scars to prove it. Somewhere in the back of my head I keep a fuzzy mental list of components that were written in ways that made total sense at the time but are now in dire need of refactoring. (I'm happy to report that we've worked most of our way through that list.)
During that initial development phase, we had people working on the frontend and people working on the backend, but nobody was working on both at the same time. We suspected that there was room for optimization in the way the client and the server interacted, but it took us a while to bring the right kind of full-stack attention to bear on the problem. It turns out there was room to make the app load twice as fast.
Here's what you would have seen if you had watched a warm-cached page load in Chrome dev tools a month ago:
In these diagrams, the red line is a hand-added "The App Is Actually Ready Now" metric.
After the markup loaded and the JS ran, we fired off ~~two~~ three AJAX requests. The app wouldn't be ready to go until they completed.
- `thread_open` populated the list of threads you have saved. If you were viewing a thread, it would populate the messages in the thread, too.
- `profile_list` populated the list of models that you can talk to.
- `summary.json` ... was only firing on page load due to a bug that was fixed shortly after this recording was taken. Please pretend it isn't there.
The whole process takes two network round trips if you have the JS cached or three if you don't.
One might ask, if you know you're going to request extra data immediately after the page loads, why not embed that data into the markup and save a round trip?
This is a very good question. The reason is that doing so requires coordinated changes between the frontend and the backend and different people work on those components. For almost a year, "optimize assistant loading" sat forgotten in the backlog and near the bottom of my to-do list.
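To make the idea concrete, here's a minimal sketch of embedding bootstrap data into the markup. Kagi's actual stack is different, and every name here (`render_page`, the `bootstrap` id, the field names) is hypothetical; the point is only that the server serializes the data into the page so the client can parse it instead of fetching it.

```python
import json


def render_page(threads: list[dict], profiles: list[dict]) -> str:
    """Render the page with bootstrap data embedded in the markup.

    The client reads the <script type="application/json"> block instead of
    firing thread_open / profile_list AJAX requests after load.
    """
    bootstrap = json.dumps({"threads": threads, "profiles": profiles})
    # "</" inside a <script> element would end the block early in the
    # browser's eyes; "\/" is a legal JSON escape for "/", so this is safe
    # to emit and JSON.parse on the client handles it transparently.
    bootstrap = bootstrap.replace("</", "<\\/")
    return (
        "<!doctype html><html><head>...</head><body>"
        f'<script type="application/json" id="bootstrap">{bootstrap}</script>'
        "<script>const data = JSON.parse("
        "document.getElementById('bootstrap').textContent);</script>"
        "</body></html>"
    )
```

The client-side half is just a `JSON.parse` of the embedded block, which costs microseconds instead of a network round trip.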
When we started work to add user-defined tags for organizing saved threads, we decided to render the tag sidebar on the server instead of adding a third `tag_list` AJAX request to the page load process.* Armed with a concrete example of how we'd like that particular part of the backend code to talk to the database, I started working on optimizations. My first changeset embedded the data from `profile_list` into the markup. Around the time that it deployed, we found out that it only worked when a certain feature flag was turned on. In my defense, at the time I had very little experience working in that part of the backend, and it was only a very short outage.
For the next few weeks, I dedicated a spare minute here and a spare minute there to building up a patch that would get rid of both AJAX requests at once, this time without catching fire. It lived on my local `main` branch through countless `git pull --rebase`s, accumulating test-hours while I worked on other things, slowly growing support for all the corner cases.
After we removed the feature flag that had gotten us into trouble the first time, I created a pull request. The code went through a review that was only slightly more intensive than normal, was merged, and deployed without major incident. I posted before and after pics on the Kagi Discord server.
Around this time I learned that even very simple DB calls were known to cost tens of milliseconds of latency in our environment. That was two orders of magnitude more than I had expected, and I wasn't quite sure if I believed it, but if that was the case, I had my next optimization target. I modified a procedure that runs on practically every request to consolidate its two queries into one. The pull request got stuck for a couple of weeks while we prioritized other things. In the end it didn't matter because the tens of milliseconds figure was (mostly) wrong.
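The post doesn't show the real procedure or schema, so here's a made-up illustration of the kind of consolidation, using SQLite: two counts that used to cost two round trips come back as labeled rows from a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE threads (user_id INT);
    CREATE TABLE tags    (user_id INT);
    INSERT INTO threads VALUES (1), (1), (2);
    INSERT INTO tags    VALUES (1);
""")

# Before: two round trips, each paying the per-query latency.
threads = conn.execute(
    "SELECT count(*) FROM threads WHERE user_id = ?", (1,)).fetchone()[0]
tags = conn.execute(
    "SELECT count(*) FROM tags WHERE user_id = ?", (1,)).fetchone()[0]

# After: one round trip returning both counts as labeled rows.
rows = conn.execute("""
    SELECT 'threads' AS what, count(*) AS n FROM threads WHERE user_id = ?
    UNION ALL
    SELECT 'tags', count(*) FROM tags WHERE user_id = ?
""", (1, 1)).fetchall()
counts = dict(rows)  # {'threads': 2, 'tags': 1}
```

If each round trip really cost tens of milliseconds, halving the query count this way would be worth it; as the post notes, that figure turned out to be (mostly) wrong.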
Here's what a typical server-side trace for a request to `/assistant` looked like:†
The DB call to load tags shouldn't cost any more than the call for profiles, but the traces always came out looking like that. I didn't know why.
What I did know was that it was taking the client way longer than I had expected to process the markup after we sent it down. I have a decent computer and it apparently still takes 16ms to load the JS and CSS from cache. (Yes, I checked that I wasn't disabling the cache when dev tools were open. I still hope that I'm just measuring wrong, because 16ms feels way too slow, but let's proceed assuming the numbers are right.)
Oh, and these client-side performance recordings? I've been turning adblock off to capture them. With adblock on, page load stalls for multiple frames while JS from the extension executes. (Please somebody tell me that things are only running this slowly because of some sort of profiler overhead.)
The server was rendering the markup and sending it down in one big chunk, but nothing in `<head>` depends on the results of those three DB calls. What if we were to break out the first part of the markup and send it down while the queries are still running? The client could start chewing on markup tens of milliseconds earlier, loading subresources and executing extension JS. When the server sends down the second part, the client will have enough of a head start to get it on screen a frame or two sooner than it would have otherwise. We've been doing something similar in spirit (though technologically distinct) on search pages since time immemorial.
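A minimal sketch of the early-flush idea, with `asyncio` stand-ins for the real server and DB calls (none of these names come from Kagi's codebase): kick off the queries, flush everything that doesn't depend on them, then send the rest when they resolve.

```python
import asyncio


async def fetch_threads() -> str:
    """Stand-in for the slow DB-backed part of the page."""
    await asyncio.sleep(0.02)  # ~20 ms of query latency
    return "<ul>...threads...</ul>"


async def render(write) -> None:
    # Start the DB work first so it overlaps with the early flush.
    queries = asyncio.gather(fetch_threads())
    # Flush <head> immediately: the client can begin fetching CSS/JS
    # (and running extension JS) while the queries are still in flight.
    write("<!doctype html><html><head>"
          '<link rel="stylesheet" href="/app.css">'
          '<script src="/app.js" defer></script>'
          "</head><body>")
    (threads,) = await queries
    write(threads + "</body></html>")


chunks: list[str] = []
asyncio.run(render(chunks.append))
# chunks[0] is the early flush; chunks[1] arrives after the queries finish.
```

In a real server, `write` would be the response stream, and the two chunks would hit the wire tens of milliseconds apart.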
I never got to measure the full effects of the TTFB optimization in isolation, because before the code deployed I found out why the latter two DB calls were taking 20ms longer than they ought to.
Our DB driver keeps a connection pool automatically. We haven't done much to tune it over the years. By default, it will keep at most one idle DB connection waiting.
If you happen to request three different DB connections all at once so you can run three queries in parallel, the most likely scenario is that it will give you the one it has sitting idle and then dial two more, which will be ready in, oh, 20 milliseconds or so.
When you're ready to return the connections, the pool will hold onto one and then close the others because it doesn't have room for them.
The next time someone tries to load the assistant, we'll do the whole thing again.
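The whole cycle can be modeled with a toy pool that counts slow dials instead of timing them. This isn't our driver's API; it just reproduces the keep-at-most-one-idle policy described above.

```python
class Pool:
    """Toy connection pool: dialing a fresh connection is the slow part
    (~20 ms in production); idle connections above max_idle are closed."""

    def __init__(self, max_idle: int):
        self.max_idle = max_idle
        self.idle: list[object] = []
        self.dials = 0  # how many slow connection setups we paid for

    def acquire(self) -> object:
        if self.idle:
            return self.idle.pop()
        self.dials += 1  # the ~20 ms penalty happens here
        return object()

    def release(self, conn: object) -> None:
        if len(self.idle) < self.max_idle:
            self.idle.append(conn)
        # otherwise the connection is closed and thrown away


def serve_request(pool: Pool) -> None:
    conns = [pool.acquire() for _ in range(3)]  # three parallel queries
    for c in conns:
        pool.release(c)


small, big = Pool(max_idle=1), Pool(max_idle=3)
for _ in range(10):
    serve_request(small)  # re-dials two connections on every request
    serve_request(big)    # only the very first request pays for dials
```

After ten requests, `small.dials` is 21 (three on the first request, two on each of the next nine) while `big.dials` stays at 3, which is exactly the shape of the connection-rate graph below.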
After we increased the size of the connection pool so that it kept enough connections sitting around to run a few queries in parallel, the traces started looking like this:
Here's a graph showing how many DB connections the us-west read replica sees opened per second. Somewhere over the course of this week, the connection pool change was deployed. Try to guess where.
The graph showing CPU load on the SQL instances wasn't as dramatic, but the drop was very definitely there.
This was by far the easiest optimization to make. It was only a six-line change. Of course, it was six lines of the sort of code that will make everything crash if you get it wrong, so we had a good long discussion in Zulip before merging.
I was out of town when the second batch of optimizations deployed. Through the power of questionable Wi-Fi backed by worse DSL, I could see that everything looked good on the server-side dashboards, but I waited to get back home to my ethernet connection before taking client-side measurements.
It doesn't look nearly as dramatic in this view, but you can see that TTFB has improved and subresource requests are firing earlier:
This exhausts the low-hanging fruit I know about. We can probably shave off another millisecond or two, but I doubt we'll be getting another ten that easily.
Measuring performance is hard.
The individual traces I've shown were picked to be representative of typical behavior. Some requests complete faster. Others hit lag spikes.
Feature work has been happening this whole time. I'm pretty sure we had an accidental regression a week or two before I started measuring for this project. On the other hand, one of the queries we use to load threads got meaningfully more expensive while all of this was going on.
* This post is long enough already, so I'm going to refrain from telling the separate story about how, when you're making multiple requests in parallel, it's very easy to write bugs that only trigger if the requests resolve in a different order than what you usually expect.↩︎
† One of the least-appreciated benefits of doing FE work on public-facing websites is that you can tell everyone details of what you've been up to without worrying that you're leaking anything secret. You could've taken those measurements in Chrome dev tools just as easily as I did. Enjoy my heavily-redacted MS Paint drawings of server-side traces.↩︎
2025-08-28