Elasticsearch Ingestion Daemon

I was still seeing requests come in through an old Cloudflare-Worker-based proxy I had set up before I released the current one, which rewrites m3u8 files on the fly in addition to proxying the segment files themselves (among other features). I had updated our website and app to use the new proxy, so where were these requests coming from? I inspected some requests as they came in using the Cloudflare interface, and I found that the User-Agent was always one of our apps, and different versions of it at that. We still had un-updated versions of our app out in the wild, but I also saw requests from the latest version! How could this be?

Then it dawned on me: I vaguely recalled seeing in the app code that it sometimes doesn’t fetch a lecture when it’s chosen, because the data is already available from the search results. Our search endpoints query Elasticsearch, which is not the source of truth; MySQL is. So it was obviously serving stale data. Lo and behold, I ran a query and found old-style lecture video URLs in Elasticsearch. The lecture documents had to be re-ingested from MySQL.

We had Logstash keeping Elasticsearch in sync with MySQL about a year and a half ago, but it kept running out of memory. I ended up writing a daemon in TypeScript to do the job. It works by only ingesting the latest data from MySQL (i.e. rows whose updated_at value is newer than the last ingestion time). But yesterday, when I re-defined the vimeo_video_links column in MySQL, I didn’t set the updated_at value to the current time on all the rows. Perhaps I should have, but it didn’t seem like the right thing to do, because the record wasn’t semantically updated; it was just re-structured, and represented the same data.
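
As a rough sketch of that idea (not the daemon’s actual code; the table name, pool setup, and the indexIntoElasticsearch helper are stand-ins, and only the updated_at-based filtering reflects how it works), one incremental run looks something like this:

```typescript
import { Pool, RowDataPacket } from "mysql2/promise";

// Hypothetical stand-in for the actual Elasticsearch bulk-indexing call.
async function indexIntoElasticsearch(rows: RowDataPacket[]): Promise<void> {
  // bulk-index `rows` into the lectures index
}

// One incremental run: re-ingest only rows touched since the last run.
async function ingestChangedLectures(pool: Pool, lastIngestionTime: Date): Promise<Date> {
  const [rows] = await pool.query<RowDataPacket[]>(
    "SELECT * FROM lectures WHERE updated_at > ? ORDER BY updated_at",
    [lastIngestionTime]
  );
  await indexIntoElasticsearch(rows);
  return new Date(); // becomes lastIngestionTime for the next run
}
```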

Anyway, I reset the daemon’s cursor state so it would start ingesting from the beginning.

The next problem was that, in order to prevent the daemon from OOMing and/or overloading MySQL, I have it fetch small batches of records (500 at a time). The next batch is supposed to begin at the row after the previous batch, and I used updated_at as the cursor state: the next batch began with the first lecture whose updated_at was after that of the last lecture in the previous batch.

I then discovered that many lectures had the same updated_at value (strange, yes, but many things have happened over the years, including bulk-update scripts; and in MySQL the updated_at column only had a resolution of one second), and many batch boundaries fell squarely among such lectures. Because each batch started at the first lecture with an updated_at strictly newer than that of the previous batch’s last lecture, any not-yet-ingested lectures sharing that boundary timestamp were skipped (and I couldn’t use >= as the comparison operator, because then the script would keep getting back the old batch of lectures). So I ordered by updated_at and by id, and adjusted the comparison to consider both updated_at and id. This solved the problem.
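
A sketch of what the fixed batch query looks like, under the same assumptions as the sketch above (the Cursor shape and function name are mine; the (updated_at, id) ordering, the two-column comparison, and the batch size of 500 are the actual fix):

```typescript
import { Pool, RowDataPacket } from "mysql2/promise";

interface Cursor {
  updatedAt: Date;
  id: number;
}

const BATCH_SIZE = 500;

// Fetch the next batch strictly "after" the cursor in (updated_at, id) order,
// so ties on updated_at are broken by id and no rows at a batch boundary are
// skipped or re-fetched.
async function fetchNextBatch(pool: Pool, cursor: Cursor): Promise<RowDataPacket[]> {
  const [rows] = await pool.query<RowDataPacket[]>(
    `SELECT * FROM lectures
      WHERE updated_at > ?
         OR (updated_at = ? AND id > ?)
      ORDER BY updated_at, id
      LIMIT ?`,
    [cursor.updatedAt, cursor.updatedAt, cursor.id, BATCH_SIZE]
  );
  return rows;
}
```

After each batch, the cursor becomes the (updated_at, id) of the batch’s last row; resetting it, as I did, just means starting it at something earlier than every row.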

By the way, I didn’t use a regular OFFSET instead of this cursor state, because records may change between batch fetches, so what was at “offset 501” from the point of view of the first query may not be at “offset 501” from the point of view of the second query. Namely, if a record from an already-done batch is updated (its updated_at changes to NOW(), which moves it to the end of the updated_at ordering), everything after it shifts down one position: what was at “offset 501” is now at “offset 500” in the second query. But the second query starts from OFFSET 501, so it skips that one record.
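
For contrast, an OFFSET-based version would have looked roughly like this (again just an illustrative sketch, not code I used); the comment spells out the skip scenario described above:

```typescript
import { Pool, RowDataPacket } from "mysql2/promise";

// Plain offset pagination, for comparison only. If a row from an earlier
// batch is updated between queries, its updated_at becomes NOW() and it moves
// to the end of the ordering; every row after it shifts down one position, so
// a fixed OFFSET silently skips one row.
async function fetchBatchByOffset(pool: Pool, offset: number): Promise<RowDataPacket[]> {
  const [rows] = await pool.query<RowDataPacket[]>(
    "SELECT * FROM lectures ORDER BY updated_at, id LIMIT 500 OFFSET ?",
    [offset]
  );
  return rows;
}
```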