AI Meh-ptic
Another sporadic attempt to calibrate my attitude toward what we’re currently calling “artificial intelligence”, and its use as a prosthetic for writing software.
I spent about an hour the other night fooling around with ChatGPT to produce a very short computer program. I already have a version of said program (actually several); what I wanted to see was how easily an equivalent could be constructed out of materials you could expect to have lying around on an ordinary operating system. ChatGPT is appropriate in this scenario because the alternative would be poring over Web forums and documentation, to piece together something even as meager as this.
This is the fourteen-line bash script that I ended up with. It generates an RFC 6920-compliant cryptographic identifier for either a supplied file or standard input. I’d say the final result is 80% me, 20% bot. Maybe 70-30, but I wouldn’t have done anything differently from what it ended up doing.
I was already pretty sure about most of what needed to happen. There are two things, nevertheless, that the chatbot surfaced that I wasn’t familiar with. One was the idiom for negotiating whether the program should operate in a pipeline versus being passed a file name, which is customary for programs such as these. That is something (at least for bash) I had never bothered to learn how to do. The other was the existence of the program xxd, which was an integral piece of the pipeline. I was confident something like that had to exist—indeed there’s probably more than one way to do its job—but I didn’t know what it was called. The result is not as elegant as a “real” programming language, but that isn’t the point. The point is that for the last half-century there have been systems with an ocean of little gadgets lying around that afford the ad-hoc construction of tools which are useful in a pinch. At least in theory, because in practice you need to memorize the names and capabilities of the entire inventory for it to be of any use to you.
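For concreteness, here is a minimal sketch of the approach, assuming a GNU-ish userland with sha256sum, xxd, and base64 on hand. It is not the actual fourteen-line script, just the general shape of the pipeline:

#!/bin/bash
# Operate on the named file if one was given, otherwise read standard input.
input="${1:-/dev/stdin}"

# Hash the input, turn the hex digest back into raw bytes (this is where xxd
# earns its keep), then base64url-encode it without padding, per RFC 6920.
digest=$(sha256sum "$input" | cut -d' ' -f1 | xxd -r -p | base64 | tr -- '+/' '-_' | tr -d '=')

printf 'ni:///sha-256;%s\n' "$digest"

The real script presumably handles its arguments and errors more carefully; the point is just that every moving part already ships with an ordinary system.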
How I imagine the unassisted process of doing any kind of creative work. Classic S-curve: you start out slow, then accelerate into a groove, until you run out of stuff you can power through and taper off into noodling. At some point in that zone, you can probably call it “good enough” to consider it done.
What I mean by this is that at a scope as minuscule as this, by the time you’ve gone and probed the available primitives for what you want to do, you could have just busted out a “real” programming language and done the job “for real”. What the chatbot did was confirm what I was already pretty sure about—that the necessary pieces were present—which it did by making a working program out of them. It also alerted me to the existence of an essential part of said inventory, which I can bank mentally for next time. And it did this all in seconds. So what took the hour?
This is how I imagine the AI-assisted process: you immediately jump to 80%, then you cajole, plead, and threaten it (before giving up and doing it yourself) to get your project over the line. I decided to be nice and imagine this process monotonic, but there’s no reason to believe this is actually the case.
Well, it started because the bit at the beginning didn’t look right. I knew this because I’m already an expert, and immediately clocked that what it was doing was not only unnecessary, but a potential source of errors. This caused me to suspect what little remained of this tiny program. So I spent the rest of the time poking around, trying to see if the way it had proffered to solve the problem was actually the best way to go about it. At this point I had taken over, and the role of the bot shifted to something more like that of a librarian I could ask to confirm or deny, again, things I already had some inkling about, without having to go look them up myself.
This is in line with what others have reported, and is genuinely useful—when it works.
Now, if instrumentality was the point, I probably could have used the first artifact the bot produced in the initial seconds of this exercise, because by all accounts that did the job. Granted, it did other things besides the job, but the side effects were fairly benign, and I’d probably never have noticed them. Certainly if I didn’t know any better. The problem is, I do know better, and wasn’t satisfied with this result.
I suppose one way to deal with these minor failure modes is to just accept them. An almost cliché remark at this point is to consider yourself a middle manager coaxing the result you want out of an intern who has read every manual but is otherwise clueless. Cute comparison, except when it breaks character. Like, don’t nag me about an inefficient construct when you just committed the same offense but worse. Not only is this annoying, but it betrays the fact that this is an apophenia machine: one by one it excretes granules of text that are statistically likely to be intelligible, but there is in fact no intelligence there.
Incidentally, the program’s output should look like this:
ni:///sha-256;uU0nuZNNPgilLlLX2n2r-sSE7-N6U4DukIj3rOLvzek
The statistical nature of the machine was brought to the fore, in this experience, by the fact that it volunteered some putative output of the program it had just generated, which was wrong. A cryptographic hash, or fingerprint, which is what this program produced, has a fixed length. A trained eye could tell it was too short just by looking at it. (The fragment that was there was also wrong—it started off correct but then diverged from the actual value.) What almost certainly happened was that it had been exposed to a hash for a given input (hello world, to be precise) and attempted to reconstitute it. Higher-end service tiers actually spin up a sandbox and run the code. Which raises the central question of this article: Is this something I would pay for? My answer, so far, is meh.
It saves you time you were never going to spend
The most common refrain I’ve seen and heard from people who actually use these things (for code) is “I used it to do something that would otherwise not have been worth the time”. There is definitely a distinct character to the kind of project that would be worth doing if it took a few hours, but not if it took a few days. It’s usually a fairly simple thing ensconced in a bunch of fiddly boilerplate, which is the part that actually takes the time. But these projects are by definition low-value—or at least low-urgency—or they’d have been worth doing before AI was an option. I wonder, though: how much is it helping you do the work you actually planned to do?
I’m not crazy, right? Like if you have a thing that you’re not doing because it will take 10T units of time to do, you spend zero units on it. If some new invention comes along that makes it only take T units, and you spend that time, you’re actually out T units, and no closer to getting your actual planned work done. You’ve only “saved time” relative to the thing you weren’t going to do because it would have taken too much time, but you’ve presumably got some new artifact to show for it.
I want to underscore that I’m not knocking this per se. I am a strong believer in prioritizing by opportunity. A non-trivial number of projects I have done anyway, despite having an inexhaustible supply of far more urgent things to do, turned out to be way more valuable than they looked at the outset. It’s hard to say in retrospect, though, how many of them would have fit the profile. Maybe cast this as an optionality-generating thing rather than a time-saving thing?
I’m not trying to be glib here; I actually do think there’s a substantive distinction. When I think about all the things that I would do if only I could do them quickly, they aren’t the same things that I need to do irrespective of how long they take. These have different contours; different bottlenecks. The structure and interactions are usually a lot more complex. Another way to say this is: what do I prompt the chatbot with, if I myself am not completely certain what result I even want?
The answer, to the extent that there is one, is to invest a lot of time setting up the environment, so the AI can help you with the precursor materials too. People are developing entire methodologies around it. I think that’s neat, and I’m interested in seeing what results they come up with. I have one reservation, and that’s how wedded the methodology is to one SaaS-based chatbot product or other, and what happens to your workflow when you swap them.
It’s accurate, if you constrain what it can be accurate about
The usage pattern I keep seeing for these products—that actually seems to be cashing out—is to put the chatbot on rails, and use it as an interface that takes English on one side, and outputs a (potentially extremely verbose) formal configuration file on the other. This is used to drive a bunch of deterministic code which has been prepared in advance. An example of this is Paul Ford’s Aboard, which takes a prompt and generates a custom CRM, which can then be augmented and extended through conventional client services. I have also noticed a few enterprise security companies (Sublime, Socket, Dropzone) using LLMs to do similar tasks around mapping between complex rulesets and plain English.
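To make the shape of that concrete (purely illustrative; this has nothing to do with how Aboard or anybody else actually does it, and the JSON below is made up), the model’s only deliverable is a constrained, formal artifact, and everything that does actual work is ordinary deterministic code that consumes it:

# The kind of thing the chatbot is allowed to produce: a formal config file.
# (Invented format; not anybody's actual schema.)
cat > crm.json <<'EOF'
{
  "app": "dog-walking CRM",
  "entities": [
    { "name": "client", "fields": ["name", "phone", "dog"] },
    { "name": "walk",   "fields": ["client", "date", "duration"] }
  ]
}
EOF

# The part that does the work: deterministic code driven by that artifact.
jq -r '.entities[] | "CREATE TABLE \(.name) (" + (.fields | join(" TEXT, ")) + " TEXT);"' crm.json

The model never touches anything downstream; it only ever emits the artifact, which can be inspected, versioned, and validated like anything else.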
Leasing an LLM this way, unlike my previous example, can at least be weighed pretty straightforwardly like any other business decision—bracketing the risks I laid out in my previous newsletter.
For these companies, I guess my big question is how do you feel about outsourcing such an essential aspect of your product? Do you have an off-ramp if your vendor of choice becomes unreliable? Can you “hot-swap” LLM vendors? Do they work the same if you do? What about exposing the confidential information of your customers? But most of all, does your problem domain truly merit the permanent umbilical relationship? In other words, could you get away with a smaller model for what you’re trying to accomplish? One you can run on your own gear?
To be clear, under “own gear” I’ll permit anything up to and including rented cloud infrastructure, as long as it has a commodity interface.
How my thinking is shaping up
Here is a video featuring my provisional position statement on AI.
I have had no particular compulsion to integrate the use of these products that suck up all the oxygen surrounding the concept of “artificial intelligence” into my life in any meaningful way. This is in spite of having had my eye on the field at least as long as there has been something to look at. In a decade and change, its output has gone from bizarre curios, to impressive tricks, to several well-defined product categories. That’s cool I guess, but I suspect I’m just not the target user. What’s being sold I either don’t have much of a use for, or am not willing to front the investment to determine if it might—might—work out for me.
A common refrain among AI fans is that if you can’t make it work for you, then you aren’t doing it right. You’re prompting it wrong. You haven’t put enough effort into your setup. You should upgrade to the higher service tier. That kind of thing. Let’s set aside for a moment that grifters talk like this: it would be disappointing to go through all that effort setting up the environment, only to find out that the kind of work I do isn’t very well supported, which is something I have grounds to suspect. The last thing I want to do is waste my life fighting with a chatbot to do something that would have been quicker to do myself.
For me personally, moreover, it’s more than just going from an entire career of paying zero dollars to code to suddenly paying rent; it’s also submitting to an always-on, particularly intimate form of behavioural analysis. I have yet to see an articulation of the benefits that is crisp enough to offset these concerns.
I suppose before I continue, I should restate my understanding of the situation:
AI is not a product class or even a specific technology, but rather a technological goal. There are two fundamental, often competing, yet ultimately complementary strategies for achieving this goal: deterministic, and statistical. The deterministic strategy is (at least ideally) consistent—in that applying a given operation to a given input will always get you the same result—and transparent, in that you can trace the process from start to finish and understand what it’s doing every step of the way. The statistical strategy involves using an aggregate representation of data from the past to fit the input in the present (and some measure into the future) to derive the most appropriate output—a crude (and of questionable accuracy) emulation of a brain. In diametric contrast to the deterministic process, this one is both opaque and inconsistent, because it’s highly sensitive to initial conditions, and the sheer number of parameters makes it unclear what those initial conditions even are.
The first four decades of AI research were heavily concentrated in the deterministic realm. Even though the granddaddy of statistical methods—the perceptron—was invented early on, it was generally not considered viable until computing power and storage capacity caught up with the theory. Instead of programming a set of explicit instructions for dealing with this or that input—which may not be practical—you feed the system a bunch of examples. From there it can infer the correct output for a given input, at the risk of occasionally getting it wrong.
The way I think about the trade-off, from a nuts-and-bolts perspective, is to ask when it would take longer to code a deterministic function that would still be wrong more often (in the sense of not fully accounting for the universe of possible inputs) than its statistical counterpart. That unambiguously marks the time and place to go statistical. The canonical example, achievable with rudimentary techniques: handwriting recognition.
This risk of occasionally getting it wrong is the Achilles’ heel of the statistical strategy. Deterministic computations are just as reliable as their underlying hardware, which is to say that in the vanishingly unlikely event that one fails, you have bigger problems. Not the case for statistical methods. Furthermore, the way you get better accuracy in the latter is by pumping more and more data through bigger and bigger models, to create an artifact that can only run on racks and racks of ultra-high-end hardware. Whereas deterministic computation has long been too cheap to meter, the statistical stuff appears to be getting more expensive.
The fact that GPT-5 arrived with a splat suggests a plausible endgame: some equilibrium between the capability of a model and the hardware required to run it. Hugging Face is loaded with models you can download for free and run on a (beefy) PC, which I’m told are about six months behind the bleeding edge.
I am not, in principle, averse to outsourcing my needs to vendors, where concentrating the capital to do it myself would be an absurdity along multiple dimensions. I don’t want to get into the business of harvesting the entire internet so I can spend several months pushing its contents through several Costco-sized buildings crammed to the ceiling with $35,000 graphics cards. I’d be happy to rent the result of that process, if its idiosyncratic properties were what I actually needed.
The most compelling argument I’ve heard for hitching up to these vendors comes from Venkat Rao, and it’s on the order of “it has read everything”. If that’s what you need—and he genuinely appears to benefit from it—then vaya con dios. I imagine as well that just like any other SaaS product, it gets you to the proximate goal much more quickly than rolling your own, if the side effects don’t bother you.
The abductive, “it has read everything” characteristic of the so-called frontier models is visible when you prompt one with something like “I am trying to recall the name of a thing but don’t remember exactly, but it’s kind of like this…”. The alternative is frantically searching for every permutation you can remember, whereas the bot is likely to surface it instantaneously. And if it doesn’t? You’re no worse off. There is actual value in this, provided it isn’t making up the answer—which it sometimes does.
More accurately, generative AI is always making up the answer, and it’s only incidental if the answer matches the correct one. The incidence may in practice be most of the time, but compared to deterministic processes, we have to invert the polarity of our expectations. We can’t assume correctness and chalk errors up to some lower-level failure. With statistical methods, we have to assume the default is random garbage, and any deviation from that is the result of bigger (read: more expensive) models and longer (read: more expensive) training.
Considering the training process for an AI model begins by initializing it with literal random garbage, this isn’t far off.
The way I would characterize AI-generated output is that it gets impressionistic with the details. You see it in made-up citations—a common occurrence; I got one just the other day—and particularly in images and video. It’ll render a coherent subject, that is, but the background is like when an artist goes “ehhh, close enough”.
The thing is, though, it’s the details that matter. It isn’t a coincidence that these products are really impressive for demos, where any weirdness can be hand-waved away. This fudging of the details certainly isn’t an impediment when what you’re trying to do is fool people, or at least take advantage of their lack of attention. The problem is that when you try to use one of these things to accomplish something load-bearing, that’s when you run into what has developed into a quintessential pattern: you get 80% of the way there in a few seconds, then spend however long fighting with it to get the other twenty.
The ostensible palliative here is to sharply restrict what the AI can be wrong about. Code, and code-adjacent stuff like configuration data, is uniquely suited to this, because it can be verified relatively mechanistically. As such, if you get a malformed output, just throw it out and re-run the prompt. Of course, this all takes effort to set up.
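A sketch of what “throw it out and re-run” looks like when you mechanize it; ask_model here is a hypothetical stand-in for however you invoke whatever model you happen to be renting, and the field names are carried over from the made-up config earlier:

# Nothing downstream sees the model's output until it survives a mechanical
# check: jq has to parse it, and the keys we depend on have to be present.
for attempt in 1 2 3; do
  candidate=$(ask_model "$prompt")   # hypothetical wrapper; $prompt is whatever English you fed it
  if jq -e '.app and .entities' <<<"$candidate" >/dev/null 2>&1; then
    printf '%s\n' "$candidate" > config.json
    exit 0
  fi
  echo "attempt $attempt: malformed output, re-rolling" >&2
done
exit 1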
Again, any time an LLM chatbot disgorges correct information—however likely this may be—it should be treated as a coincidence. Code is much easier to verify than, say, an evidence-based argument. The “deep research” family of capabilities associated with these products does at least attempt to determine if the sources it cites are real, but this is of course an extra that you pay for. I just don’t believe you can get away without at least checking the work if you’re going to stake anything of value (like your reputation) on it. Verifying the output, though, means going and reading all of the cited works, and if you do that, then—oops—you’ve recapitulated most of the work of the bot.
This raises an important question: if you have to build up all this scaffolding just so you can use these products effectively, can the same effect be achieved more parsimoniously without them? If what you’re trying to do is on rails, can you not only get away with—but even get better results with—a less capable model? This seems like an empirical question that is worth some non-zero effort to answer. If “it has read everything” isn’t needed for your business outcomes, why—and I mean this sincerely, as there may be other reasons—stake your business on that relationship?
Epilogue
To the extent that I mess with AI on my own time, I have a project in mind. My website has (by my cursory count) 364 published articles and another (at least) 506 drafts. What I want to do is go in and extract all the references to people, places, things, companies, products, concepts, cited works, et cetera. I want to take these and put them into Sense Atlas, as well as go back into the corpus and link up all those references. So the first goal is to augment the content with hard references to the entities it’s talking about. The effect I’m hoping for is to be able to group articles thematically, like tags on steroids.
The next goal is to arrange those entities into a structure: people and products to organizations, people to each other, people to works, citation networks, semantic relations between concepts, and so on. I imagine this to be an ongoing project; something I prune over time like a bonsai. The reason to do this is to situate my own writing in its social and historical context, and potentially surface things I hadn’t noticed about it before.
The final augmentation to my site I have planned is to model the audiences addressed in the corpus. This is something I have wanted to do for a long time. I have found that even within the relatively tight radius of who I have come to call “digital media insiders”, there are constituencies that are like oil and water when it comes to the discussion of technical matters: some people insist on the details, while others run screaming from them. That’s the motivational basis for a technique I want to generalize. What I’ve gone and done is make a class of object that represents an audience, with a set of Likert-esque relations to a set of concepts. From this I should be able to determine—programmatically—whether a piece of content is suitable for a given audience.
The essential premise is that an audience isn’t going to want to read about a concept they don’t understand (unless the purpose of the document is to teach them about that concept—and even then…). Merely understanding a concept isn’t enough, however. The audience could understand a concept just fine; they simply don’t like it. To represent this, you can add a positive or negative valence to the relationship between an audience and a concept. Since audiences and documents are both related to concepts, the document’s audience can be inferred through that relationship. If, on the other hand, the document’s audience is explicitly asserted, I can go in the other direction: do a gap analysis to determine if the actual text serves the audience it’s supposed to, and bring it into alignment if needed.
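To make that a little less abstract, here is a toy version of the inference (invented data and scoring; this is not Sense Atlas’s actual schema): look up the audience’s valence for each concept the document touches, and let the worst one decide.

# Made-up audience model: signed, Likert-esque valences toward concepts.
cat > audience.json <<'EOF'
{ "cryptography": 2, "rdf": 1, "marketing-funnels": -2 }
EOF

# Made-up document model: the concepts a given article touches.
cat > document.json <<'EOF'
["cryptography", "rdf"]
EOF

# Worst-case valence across the document's concepts; a concept the audience
# has no relation to at all gets an arbitrarily low score.
jq -n --slurpfile a audience.json --slurpfile d document.json \
  '[$d[0][] as $c | ($a[0][$c] // -99)] | min'

A result below whatever threshold you pick means the document either isn’t for that audience, or is a candidate for the gap analysis just described.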
None of what I have described so far involves “AI” as we now know it—neither the general methodology of statistically-oriented computation, nor the specific category of LLM chatbot products. What I’ve described so far is 100% deterministic. Ironically, though, it would have been considered “AI”—or at least AI-adjacent—not even a couple decades ago.
I suppose the contemporary thing to do would be to dump the whole mess into ChatGPT and say have at it. Where’s the fun in that, though? (Also, easier said than done.) Besides, I already have all the infrastructure for doing bulk surgery on large quantities of documents, since that’s kind of part of my job. Where I actually see a role for AI-as-we-now-mean-it is where it’s meant to operate: in the fuzzy parts where my deterministic code can’t reach. Will I use one of the chatbot products, or will I DIY? I don’t know yet. More on this in a future newsletter.