Unity and VLDB Reading List #2
Normies Learn Software Licensing
If you're a software person and at all plugged in to the world of video games, you might have watched in horror as Unity performed one of the most aggressive killings-of-the-golden-goose we have yet seen in the tech industry. Seeing people I follow online for their opinions on whether Armored Core 6 is good suddenly start talking about software licensing and the risks of closed-source software threw me for a little bit of a loop. This crossover made me think about my own complicated feelings about software ownership as a concept.
I started out as a starry-eyed undergrad who was eager to have an ideology, and I bought in fully to the idea that being "open source" is a necessary condition for a piece of software to be good. A kind of utopian way for people to write software for people.
After I graduated, I worked at an open-source vendor (not really motivated by that ideology, but it helped), where I started to suspect that it wasn't so clear. In fact, maybe open source is not the leveling force I thought it was. When Amazon, of all organizations, had a public spat about whether their use of ElasticSearch was "the spirit of open source," it raised questions for me. It started to seem as though the people with the most capital are actually the ones most equipped to exploit "public good" software, rather than the underdog who needs a leg up.
Not to mention, the idea of "if this project goes under, you at least have the ability to take it over" started to ring a little hollow to me as I grasped the amount of resources and onboarding it takes to work on a complex piece of software even as a full-time job.
I'd like to think I've landed on a bit more of a nuanced position that's hard to sum up. Situations like this, where Unity just pulled the rug out from under smaller developers (and maybe they'll roll some of it back; I don't think they really deserve much kudos if they do), definitely push me in the direction of "proprietary software is a gigantic risk for a business or individual to take on." Not to say it's not often worth it, because it is, but the risk, empirically, is there.
I pledged to the Godot Engine Patreon around a year ago. The idea that perhaps a viable game engine alternative could exist funded by its users was really appealing to me, and I wanted to support that. Unfortunately, they took VC funding shortly after:
"The whole internet loves Community Project, the project that's supported by donations!" [5 seconds later] "We regret to inform you the project has taken VC funding"
— Justin (@justinjaffray) September 13, 2022
I don't know man. Don't become beholden to a corporation like Unity, I guess. They might do things that don't make any sense, it turns out.
VLDB Papers #2
Y'all ever slap a "#1" on the end of something because you think it's aesthetically nice, and then realize you priced yourself into at least a #2? Yeah, me neither.
Our Memory Model
I remember one particular epiphany moment when I was getting into systems software: I was looking at a piece of code that executed a SQL query and I realized that I understood what it did. It's not that I had a low assessment of my ability to read or write code in general, but certain pieces of industrial software, like real-world compilers or databases, had always felt special. Special in the sense that I always sort of figured there was some special-sauce barrier that separated them from the kinds of software I had written in the past (games, scripts, toy versions of...compilers and databases).
But when I first saw a piece of code that looked something like this in a production codebase:
enum SqlValue {
    Int(i64),
    String(String),
    Bool(bool),
    // ...
}
My mind was kind of blown: there isn't some weird magic happening behind the curtain here; this is just the same kind of code I already understand. Where's the magic trick that turns this into a useful tool? You're telling me that this whole time, all of this was just made of code? I can implement this crap however I want?
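To make the "it's just code" point concrete, here's a toy sketch of how an engine might evaluate SQL's + over values like these, just by pattern matching. This is entirely hypothetical and not from any real codebase:

#[derive(Debug, PartialEq)]
enum SqlValue {
    Int(i64),
    String(String),
    Bool(bool),
}

// Evaluate `left + right` the way a SQL engine might: add ints,
// concatenate strings, error on everything else.
fn eval_add(left: SqlValue, right: SqlValue) -> Result<SqlValue, String> {
    match (left, right) {
        (SqlValue::Int(a), SqlValue::Int(b)) => Ok(SqlValue::Int(a + b)),
        (SqlValue::String(a), SqlValue::String(b)) => {
            Ok(SqlValue::String(a + &b))
        }
        (l, r) => Err(format!("cannot add {:?} and {:?}", l, r)),
    }
}

fn main() {
    assert_eq!(
        eval_add(SqlValue::Int(2), SqlValue::Int(3)),
        Ok(SqlValue::Int(5))
    );
    assert!(eval_add(SqlValue::Bool(true), SqlValue::Int(1)).is_err());
}

Real engines have to worry about overflow, implicit casts, and NULLs, but none of that changes the basic shape: it's a match statement.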
I mean, yeah, sometimes it's convenient to call into someone else's magic. But if my analytics tool writes the right stuff into the socket, nobody can say it's doing the wrong thing. You can learn lots of cool tricks for doing this stuff better, like: oh yeah, it's better if I design my data structures to use offsets rather than pointers, because then I can just memcpy them around. Cool. I will take that fact into my bag of tricks and use it when I design the memory model for my next program.
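Here's a minimal sketch of that offsets-over-pointers trick; the StringArena name and layout are my own invention for illustration:

// Instead of a Vec<String> (heap pointers that are meaningless outside
// this process), store one byte arena plus (offset, len) pairs into it.
struct StringArena {
    bytes: Vec<u8>,         // all string data, back to back
    slots: Vec<(u32, u32)>, // (offset, len) into `bytes`
}

impl StringArena {
    fn push(&mut self, s: &str) {
        let offset = self.bytes.len() as u32;
        self.bytes.extend_from_slice(s.as_bytes());
        self.slots.push((offset, s.len() as u32));
    }

    fn get(&self, i: usize) -> &str {
        let (offset, len) = self.slots[i];
        let range = offset as usize..(offset + len) as usize;
        std::str::from_utf8(&self.bytes[range]).unwrap()
    }
}

fn main() {
    let mut arena = StringArena { bytes: Vec::new(), slots: Vec::new() };
    arena.push("hello");
    arena.push("world");
    // Because the slots hold offsets rather than addresses, the contents
    // of both Vecs can be memcpy'd, written to disk, or sent over a
    // socket, and the references stay valid on the other side.
    assert_eq!(arena.get(1), "world");
}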
But now they want me to store my data with a standardized serialization scheme! What's next, requiring a standardized format for processing data in my own damn address space?
Oh...
Well, it turns out that at the extremes of data processing, where you have a whole ecosystem of tools passing data between and embedding each other, it's actually pretty nice if we don't have to have a translation step before we can operate on that data in memory. And practically, the only way this can really work is if we all agree on the structure of said data.
There have been a number of attempts at defining this kind of standardized format for columnar data; my half-informed vibes are that Arrow is the most successful, and it's the one I would default to if I were making something of this flavour or if I needed GoldenEye speedrun tips.
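For a flavor of what agreeing on structure buys you, here's a toy, hand-rolled nullable-int32 column in the general style Arrow standardizes (a fixed-width values buffer plus a validity bitmap). This is an illustration of the layout idea, not Arrow's actual API or exact specification:

// One fixed-width buffer of values plus one bit per row saying whether
// the row is null. Any system that agrees on this layout can operate on
// the same buffers directly, with no translation step.
struct Int32Column {
    values: Vec<i32>,  // one slot per row, arbitrary where null
    validity: Vec<u8>, // one bit per row: 1 = present, 0 = null
}

impl Int32Column {
    fn from_options(rows: &[Option<i32>]) -> Self {
        let mut values = Vec::with_capacity(rows.len());
        let mut validity = vec![0u8; (rows.len() + 7) / 8];
        for (i, row) in rows.iter().enumerate() {
            values.push(row.unwrap_or(0));
            if row.is_some() {
                validity[i / 8] |= 1 << (i % 8);
            }
        }
        Int32Column { values, validity }
    }

    fn get(&self, i: usize) -> Option<i32> {
        if self.validity[i / 8] & (1 << (i % 8)) != 0 {
            Some(self.values[i])
        } else {
            None
        }
    }
}

fn main() {
    let col = Int32Column::from_options(&[Some(1), None, Some(3)]);
    assert_eq!(col.get(1), None);
    assert_eq!(col.get(2), Some(3));
}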
A Deep Dive into Common Open Formats for Analytical DBMSs is a really nice overview and comparison of the ideas in this space. I was actually only vaguely familiar with any of these formats beforehand, having watched the rise of Arrow with the initial negative reaction of "but it's my memory??" But after going through this I feel like I have a pretty good overview of the various design decisions they make, and why. If that's the kind of thing you're interested in, I don't think I could do justice to the layers upon layers of juicy facts in here by summarizing, so I will just say: I think this one is good.
TreeLine
TreeLine: An Update-In-Place Key-Value Store for Modern Storage describes a storage engine built on the premise that the tyranny of LSMs was brought upon us by a desire for sequential I/O that isn't necessarily warranted on 2023 storage devices. The authors describe the design, plus a number of optimizations they did to beat out the performance of some popular LSM engines.
I suppose people make these kinds of purely performance-oriented tradeoffs in practice, but as an LSM-understander and a BTree-not-really-understander, it seems to me there are a number of benefits to LSMs that don't come purely from write performance:
- It's so much easier to migrate data formats with an LSM, since you are constantly rewriting the data on disk anyway.
- Concurrency is much simpler. At least, it makes more sense to me to be able to concurrently read from a bunch of immutable files and not have to worry about locking; see the sketch after this list.
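Here's a minimal sketch of that second point, using a toy LSM whose runs are just immutable in-memory sorted maps (this has nothing to do with TreeLine's or any real engine's internals):

use std::collections::BTreeMap;
use std::sync::Arc;

// Once built, a run is never mutated; readers can hold on to it freely.
type Run = Arc<BTreeMap<String, String>>;

fn get(runs: &[Run], key: &str) -> Option<String> {
    // Consult runs from newest to oldest: the first hit wins, which is
    // how newer writes shadow older ones without touching the old runs.
    runs.iter().find_map(|run| run.get(key).cloned())
}

fn main() {
    let old_run: Run = Arc::new(
        [("a", "1"), ("b", "2")]
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .into(),
    );
    let new_run: Run = Arc::new(
        [("b", "20")].map(|(k, v)| (k.to_string(), v.to_string())).into(),
    );
    let runs = vec![new_run, old_run]; // newest first
    assert_eq!(get(&runs, "b"), Some("20".to_string()));
    assert_eq!(get(&runs, "a"), Some("1".to_string()));
    // Any number of threads could clone these Arcs and call `get`
    // concurrently with no locks, because the runs never change.
}

No locks on the read path, and compaction can build a replacement run off to the side while readers keep using the old ones.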