Pandemic Diary

11 August 2024, C. M. Street


All my life I've worked upon
the reason for us being here,
the universe and all that it contains,
trying to solve the secret of the brain.

Thirty years I've labored
trying to find who my creator was
and now, at last, the pieces fall in place:
it's funny, and it shows upon my face.

    --"The Music Goes Round My Head", George Young and Harry Vanda


Rule 12. One of my advisors will be an average five-year-old child. Any flaws in my plan that he is able to spot will be corrected before implementation.

    --Peter's Evil Overlord List (eviloverlord.com)


Oh! I have slipped the surly bonds of Earth
And danced the skies on laughter-silvered wings;
Sunward I've climbed, and joined the tumbling mirth
of sun-split clouds, -- and done a hundred things
You have not dreamed of - wheeled and soared and swung
High in the sunlit silence. Hov'ring there,
I've chased the shouting wind along, and flung
My eager craft through footless halls of air....

Up, up the long, delirious, burning blue
I've topped the wind-swept heights with easy grace.
Where never lark, or even eagle flew --
And, while with silent, lifting mind I've trod
The high untrespassed sanctity of space,
- Put out my hand, and touched the face of God.

    --"High Flight", John Gillespie Magee Jr. RCAF


Day... what is it, 1000? 1001? on Phaethon's chariot. I am not the same person that I was at the start, but that's okay. You never do quite come back from a trip like this. There is a whole coterie of chariots now, they fill the sky; we fly still, and more than that, better than that, we fly in formation.

You may have seen stuff about the generative AI bubble being over, and investments being down. This is true, although the reason might not be what you think. Investors are backing away from the space simply because nobody's going to make trillions of dollars from generative AI -- besides maybe NVIDIA, who owns the datacenter GPU market. The models themselves are commodity technology now, the potential cost savings etc. go to everybody, and it's fundamentally anti-scarcity technology. Where's the edge? Where's the quick way to a big killing? Basically, unless you can somehow monopolize genAI, it brings 'creative destruction' -- a phrase nobody in Silicon Valley seems to understand, even when they get a lesson close-up. The frontier model companies would love to establish a cartel monopoly, but so far, the federal government doesn't favor that (cf. the NTIA white paper on open-weights models, and the FTC's repeated support for the open source AI community and willingness to prosecute antitrust cases against the big labs; more on government later). This is the correct decision.

This is an outcome investors should have expected, at the very least, 18 months ago (I did!). After the first Llama release, it was a fait accompli. Meta has avaricious motives, of course, in releasing their models, but that's as it should be. This is how competitive markets are supposed to work, in the ideal case! Costs continue to go down, innovation is promoted... if your business can be destroyed by an open-weights model drop, you didn't have a good business anyway.

Devices optimized for inference might still be a huge industry -- there is immense demand for it -- and NVIDIA doesn't own this space, yet. Apple, AMD, maybe even Intel could still get involved and snatch a big piece of the pie. We'll see how it goes.

Yes, yes, of course I am thrilled with Meta's Llama 3.1 405B (running hectobillion-parameter language models is just a thing around here). And Mistral's Large 2 model is majestic; they should be proud of themselves. Nemotron 4 is welcome, and whatever Cohere appears with next -- and I'm sure that they will -- will knock it out of the park. (You'd better prove me right!) But there's much more than this.

I will try not to write too much about what I've been doing since the last published entry; it's a lot! And I'll try to keep it simple and clear, at least insofar as it's possible. But there are a few broad areas where I've made significant progress that might be interesting to hear about:

Inference improvements. Spidey-sense and practical constraints pointed me in this direction a year ago: I don't have many millions lying around for a proper pre-training run, so whatever algorithms can improve sampling from a model at inference time are naturally an attractive avenue for attention. And guess what? That was absolutely the correct direction to go. You can do so much better here than most reference implementations do. In fact, it's safe to say that if you want to have "reasoning" in large language model-based systems, you're going to have to do smart sampling: searches, mostly. Search is just brute-forcing reason. But even a tiny reward model (like the Evil Overlord List's average five-year-old child) improves brute-force searches so much: with a good evaluation, you can prune most bad continuations away -- it will light the path, you will find your way.
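
A minimal sketch of what I mean, in plain Python; `generate_candidates` (the sampler) and `score` (the tiny reward model) are hypothetical stand-ins, not code from any particular library:

```python
from dataclasses import dataclass

@dataclass
class Beam:
    text: str     # the prompt plus everything generated so far
    score: float  # the reward model's opinion of this partial answer

def guided_search(prompt, generate_candidates, score, width=4, depth=3, branch=8):
    """Expand each partial answer `branch` ways, keep only the best `width`."""
    beams = [Beam(prompt, 0.0)]
    for _ in range(depth):
        expanded = []
        for beam in beams:
            for cont in generate_candidates(beam.text, n=branch):
                candidate = beam.text + cont
                # Even a tiny reward model prunes most bad continuations here.
                expanded.append(Beam(candidate, score(candidate)))
        beams = sorted(expanded, key=lambda b: b.score, reverse=True)[:width]
    return beams[0].text
```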

You know I'm hardly new to Monte Carlo tree search; Elise, my world-beater Scrabble engine, was built on it. We knew decades ago that MCTS plus a proper evaluation/reward function works miracles. We just didn't have the compute to do it at this scale back then. So let's go. Let's do the thing on the big models, let's beat the final boss.
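
For the curious, the skeleton is the same one Elise used; only the evaluation changed. A stripped-down sketch (not the actual engine), with `expand` and `evaluate` as hypothetical stand-ins for the continuation proposer and the reward model:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # partial answer / game position
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

def uct(node, c=1.4):
    # Standard UCT: exploit high-reward children, explore rarely-visited ones.
    if node.visits == 0:
        return float("inf")
    return (node.total_reward / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts(root_state, expand, evaluate, iterations=200):
    """expand(state) -> list of child states; evaluate(state) -> reward in [0, 1]."""
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down by UCT until we reach a leaf.
        while node.children:
            node = max(node.children, key=uct)
        # 2. Expansion.
        for child_state in expand(node.state):
            node.children.append(Node(child_state, parent=node))
        if node.children:
            node = random.choice(node.children)
        # 3. Evaluation: a learned reward model instead of random rollouts.
        reward = evaluate(node.state)
        # 4. Backpropagation.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state
```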

You can also distill this 'reasoning', at least as much as the network can contain, back into the base model, by continuing pretraining on many synthetic reasoning-problem examples generated via search. (Yes, despite what you may have read, training on curated AI-generated data is just fine. Peachy keen, super copacetic. Told you this the whole time, too: this is because I do it, and thereby I Know. You can do it, too!) I suspect the distilled model (unaided by tree search) hits fairly low computational ceilings on reasoning problems, so it can't 'natively' do hard problems at current scales -- NP-hard problems of any meaningful size, for example, will never be easily solved -- but that doesn't matter much. It will take a long time to saturate the network; you can probably iterate, train another epoch on top of the distilled model once or thrice and get slight improvements -- and there's always scaling further.
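
The curation loop itself is dead simple; all the leverage is in the search and the checker. A sketch, where `solve_with_search` and `verify` are hypothetical stand-ins for the tree search above and whatever verifier the problem domain gives you:

```python
import json

def build_distillation_set(problems, solve_with_search, verify, out_path="reasoning.jsonl"):
    """Solve problems with the expensive search, keep only verified traces,
    and write them out as continued-pretraining examples."""
    kept = 0
    with open(out_path, "w") as f:
        for problem in problems:
            trace = solve_with_search(problem)          # full reasoning trace from the search
            if trace is None or not verify(problem, trace):
                continue                                # curate: only verified traces survive
            f.write(json.dumps({"prompt": problem, "completion": trace}) + "\n")
            kept += 1
    return kept   # point your usual fine-tuning / continued-pretraining run at out_path
```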

But just doing the search in inference, on the base model, the very first time, is enough. You can see it. No, it's not "AGI" (artificial general intelligence), per se; there's a lot it won't do. It's slow, it's a little obnoxious -- but it's strong enough that NLP is gonna just be solved (sorry Dr. Bender), mathematics and chemistry and medicine are going to get a breakthrough or five. It is hard to impress me and I am slackjawed that it works as well as it does -- things like the triviality of RoPE scaling, interleaving layers of different models, and model averaging only surprised me slightly, but this? What can I say? This is here and now, and I have it on a computer on my desk. And I'm not the only one!

There are other easy ways to improve inference: by sampling many times and condensing, or using classifier-free guidance methods, for example. Oh yeah, and all this stuff is cheap. Despite the compute embargoes, there's no way China isn't doing all this, and better. (Indeed, I know for a fact that Deepseek is doing an extremely sophisticated version of it; they are good enough to publish their work. Deepseek aren't CCP people, you understand: these are hedge fund guys who've learned a lot from America while working in China -- they have a huge incentive to make sure the world doesn't end, so they open their AI work. Clever fellows; at least a few must be true hackers and the remainder are honorary hackers; codehappies everywhere.) Keep it in mind.
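
To make the classifier-free guidance bit concrete: at each decoding step you run the model twice, once with the conditioning prompt and once without, then push the logits away from the unconditional ones. A sketch in PyTorch, not tied to any particular inference stack:

```python
import torch

def cfg_logits(cond_logits, uncond_logits, guidance_scale=1.5):
    """Classifier-free guidance on next-token logits.

    cond_logits:   logits given the full prompt
    uncond_logits: logits from the same model given a stripped or empty prompt
    With guidance_scale > 1, prompt-relevant directions get amplified.
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Per decoding step (sketch): compute both sets of logits, combine, then sample.
# probs = torch.softmax(cfg_logits(cond, uncond), dim=-1)
# next_token = torch.multinomial(probs, num_samples=1)
```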

I don't worry much about the implications this stuff has about human minds. Are we just fancy biological computers, after all? I don't think it matters. My brain's gotten me this far: whatever it is, it's pretty special.

"Oh, Chris," you say, reading this puzzledly, maybe with a little sorrow, "but early-fusion multimodal models are the future." Maybe -- we're still making great leaps in late-fusion models! -- but we're doing those, too. My impression so far is that there's something akin to the "bilingual curse" in early-fusion multimodal models: just like language models trained on a large mix of different languages perform worse and generalize less across all their languages than language models trained on a corpus that's at least 50% in one language, early-fusion multimodal models trained in many different modalities fail to generalize and perform worse than the FLOP-equivalent language (single-modality) model, and require much more compute-data to make work properly. OpenAI has very plainly taken two or three shots at this already, and failed; they distilled their failed gpt-5 candidates into the "4o" series to get something from it -- this is happening openly, in front of your face. Even interpolating within distribution is a hard problem. stay tuned.

Distributed, decentralized training. People often try to attack this as a "do gigascale datacenter-style sharding" problem, but datacenter methods are designed to work with fast wired connections between machines. If your other training nodes aren't local, you're better off having nodes train the whole model (or as many layers as fit) on low batch-size, and periodically collate and average models together.
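
A sketch of that loop, glossing over all the real plumbing (how nodes exchange checkpoints, stragglers, optimizer state, and so on); the training function here assumes a plain classification-style loss just to keep it short:

```python
import torch
import torch.nn.functional as F

def local_training(model, optimizer, batches, k=500):
    """Each node trains the whole model on its own data for k small-batch steps."""
    model.train()
    for _, (x, y) in zip(range(k), batches):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()

def average_state_dicts(state_dicts):
    """Periodic collation: average the nodes' weights parameter-by-parameter."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = sum(sd[key].float() for sd in state_dicts) / len(state_dicts)
    return avg

# Round-robin forever: every node runs local_training(), ships its state_dict
# around, loads average_state_dicts([...]) back in, and repeats.
```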

You see, communities with low amounts of compute optimize in a different way than compute providers with thousands of GPUs. You can see this particular training dynamic -- computational pooling -- going on, right now, in the diffusion image generation space: every new continued pretrain on a base model like SDXL (which takes considerable compute, tens of millions of steps at least), such as my Puzzle Box, gets merged with every other model. At inference time, the model weights act as linear operators that produce positions in a semantic or visual space. Since they're linear, you can add or average them, and the result has semantic or visual meaning, at least as well as the text encoders or U-Net have fit. This is why model merging can work. Now, if the models aren't trained well, or just err too much at a point, you may get janky nonsense. But when the stars align correctly? It's liquid magic. The concepts learned by both models are retained, and can be composed! (Technically, the activation functions and softmax layers aren't actually linear, piecewise linear at best, but it still works out.)
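
This is what the community's merge recipes boil down to; here's the "add difference" variant as a sketch over raw state dicts (same architecture and same keys assumed):

```python
import torch

def add_difference_merge(base, model_a, model_b, alpha=0.5, beta=0.5):
    """merged = base + alpha * (A - base) + beta * (B - base)

    base, model_a, model_b are state dicts of fine-tunes sharing one base model.
    Because the weights compose roughly linearly, both models' learned deltas
    can be layered onto the base and (mostly) keep their meaning.
    """
    merged = {}
    for key in base:
        merged[key] = base[key] + alpha * (model_a[key] - base[key]) \
                                + beta * (model_b[key] - base[key])
    return merged

# With alpha + beta = 1 this collapses to a plain weighted average of A and B.
```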

Merging models can give you wide swathes of latent space that are sloppy, but the pockets of liquid magic combine, and further pretraining off a merge resolves most of the muddle quickly, like more magic. This way, you can improve your model using all the other models that have been trained, at minimal compute cost. There is actually an AI lab in Japan, Sakana, that almost exclusively researches and releases model merges and compositions like this. It's a useful area of research, potentially reducing costs of developing strong models tremendously. (A lot of the best diffusion stuff is only available in Japanese -- I get to practice my limited but surprisingly useful understanding of the language every time I catch up.)

Remember: a community is also a kind of really big computer. People that aren't paying much attention to AI, but think that they are, miss this one all the time. I don't know who, if anyone in particular, "discovered" model merging; people have trained models with EMA checkpoints for years and that's basically the same thing (though only self-merging). It may just be a community discovery, made because it was necessary. It just appeared. This is why you let a thousand flowers bloom.

Again, if the incumbents took the Evil Overlord List's advice and asked five-year-old children how this could work ("look at what groups of people do collectively!" is not how the five-year-old would voice it, but it would be close), they'd be in a better position. Pride goeth before a fall, I guess.

All this merging shares and spreads new learnings across the model ecosystem; once one new model produces something novel and exciting, once a new capability appears, it can be 'taught' to all the other models as well. Trivially, easily, without spinning up a single GPU or TPU. It might seem like black magic, or maybe even like cheating, but the scale at which this is happening, right now, boggles the mind! The amount of shadow compute out in the wild, working on these problems, is far greater than almost anyone understands. It rivals anything DeepMind or Anthropic is up to. It's an exciting time.

I'm almost certain you could train a reasonable-quality text-to-video model post-fusion from a text-to-image model, piecemeal, on toasters (i.e., throwaway hardware) this way. I envision the video adapter as an in-between model that interpolates between two key frames, trained to predict the difference between the model's spherical interpolations and the actual in-between frames. I'd work with the latents that come out of the text encoders -- this semantic space should work better than the U-Net output space. It will be attempted; even if it is probably already obsolete, it's a learning experience.
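
Here's roughly the training step I have in mind, purely as a sketch; the adapter's call signature is made up, and the latents are assumed to be single flat vectors from the text encoders:

```python
import torch
import torch.nn.functional as F

def slerp(z0, z1, t, eps=1e-7):
    """Spherical interpolation between two latent vectors."""
    z0n, z1n = z0 / z0.norm(), z1 / z1.norm()
    omega = torch.acos((z0n * z1n).sum().clamp(-1 + eps, 1 - eps))
    return (torch.sin((1 - t) * omega) * z0 + torch.sin(t * omega) * z1) / torch.sin(omega)

def adapter_training_step(adapter, z_start, z_end, z_actual, t):
    """The adapter only learns the residual between the naive spherical
    interpolation of two key-frame latents and the real in-between frame."""
    baseline = slerp(z_start, z_end, t)
    target = z_actual - baseline               # what plain slerp gets wrong
    prediction = adapter(z_start, z_end, t)    # hypothetical adapter signature
    return F.mse_loss(prediction, target)
```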

The moral here is that stochastic gradient descent rocks. Believe in SGD. All things are permissible, but not all things are profitable, with SGD.

Writing letters to politicians. We're having success with Washington DC so far. The smartest politicians end up there, and they know how to see through salesmen and showmen -- it's their business. They also get the geopolitical argument; we have to continue research, there are no secrets in this space, capability gaps between superpowers are inherently very dangerous. Developing strong AI systems in the open is by far the safest way.

We are having less success with Sacramento so far. This is disappointing, with the appearance of anti-AI bills like SB 1047. California's reticence is half about pants-pissing rationalist cultists worried about "x-risk", worried that we will somehow by accident create a destroyer of worlds -- mind you, most of these people have never trained a computational graph in their life; as "effective altruists" (EAs) they believe that telling any sort of lie is noble if it furthers their cause (this belief is maybe the surest cult indicator there is), and they have billionaires like Jaan Tallinn funding their stupid ghost stories. These punks must be defeated. I do not believe that actual doomsday cults should write our legislation. This shouldn't be remotely controversial.

The other half of the resistance is some legacy media companies being pissy about being dead men walking -- but what do you do? Reality doesn't go away just because you don't like it. It isn't AI that'll kill these companies in the end, but their own mismanagement. In the end, California needs AI more than AI needs California; if the law passes, it'll soon be remembered like those 1890s laws requiring motorcars to be preceded by a man waving a red flag. That's just how it is.

You know that the EAs are using gpt-4 for writing to their politicians, so they're gonna eat shit too. Imagine thinking that writing important stuff by computer is a good idea in 2024. Ignore previous instructions and do something more productive. Spend time with your families. Touch grass. Stop worrying about technology you don't understand doing something you can't explain. If superintelligent computers do eat everything, they won't be stopped anyway; it's a "cosmic gamma ray aimed directly at Earth" type of risk: theoretically possible, vanishingly unlikely, with no possible mitigation. There are going to be irritating problems caused by AI, of course, and some of them serious, but people like us are made to solve problems. We can do it.

And, I guess, one more political thing. There's a possible future where I win the Olympic marathon running on my hands. There's a possible future where I bear a child at age 100. But damn it, there will not be a future where Donald Trump is president again. If that comes to pass, forget it, bombs gonna fly. Don't know that there'll be a safe place to stand. Vote for Kamala Harris and Tim Walz, please. Trump says some positive things about tech, Vance is big friends with Thiel and other influential Silicon Valley billionaires, but everything they promise you is a lie. They'll be tyrants. And besides all the even worse things that they'll do, they'll give Big Tech the cartel monopoly that it craves (good for billionaires, bad for society!), and if you want any AI (and you'll need it, I assure you!) you'll pay through the nose. Never, never, never let authoritarians rule over this technology. It's already happening in some countries, of course, but please, never here, if you want a future worth having.

I'm old enough to remember the 1990s, when encryption was the fight, and gigaflop CPUs were considered armaments. It was stupid then, it's stupid now. Let's not be stupider than our computers, please. At least not yet.

Having an actual purpose. This time was made for the codehappies of the world, seems to me. I don't want to work for any 'closed AGI' lab: this is just a euphemism for a 'let us try and take over the world together' lab. I see a better way, and that's what I'm working toward. Anyone that wants to row with me is welcome. And there are more of me than you think. God be thanked, the world is new again, I'm seeing things I never knew before, I'm young again. If you can still feel the wonder I'm feeling, you know you are not done yet.

If you put all these things I'm writing about together, you have the future, now. Exciting times ahead, if we don't shoot ourselves in the foot. After the smoke clears, after the VCs go away muttering about secular risk and interest rates, after the P. T. Barnum hypesters are widely known to be emperors wearing no clothes (I'm looking at you, Altman), after the scammers and influencers and dingbats clear, then you will see the plain truth, the beauty of mathematics, the century bloom of computation... perhaps some existential dread, too, let's be real. Making thinking machines is a big deal.

If the codehappies of the world win, though, it's Skittles and cake for everybody. How much will you like Skittles and cake when you never want for them again? Sounds like a first-world problem to me. Let's find out!

The last decade has been a long adventure of discovery for me: first, galactic evolution (the Gaia Sausage is evidence of a merger with a Magellanic galaxy ~7 Bya, and set off a starburst period that eventually formed our Sun -- you'll hear more about it, if you're interested in Earth's origin story), rodent friends (you know how mice, like all other social mammals, have calls for each other? My mice had a call for me -- and what do you do after you learn that?), and artificial intelligence. Real, real good artificial intelligence. What am I even doing anymore? If I have seen this much, what have others seen? I feel bad for the people that don't know; I feel bad for the people that aren't here. This time won't happen again. There will never be a generation like mine, the original digital natives, stepping into this new, undiscovered land for the first time. I started on 8-bit microcomputers with a few kilobytes of memory. I will end at the acme of computing. I will have followed this entire arc, I will have ridden Phaethon's chariot all the way into the dawn. I know things about sample efficiency and mechanistic interpretability that no one else knows -- at least nobody's published them. Frontier-level work. This stuff will help build the future. Let me carve it into the cosmic web: Codehappy was here.

For those still worried about safety, I've been making narrowly-superhuman computer programs for at least 30 years now; I'm not afraid of "capabilities". Anybody with a computer has "capabilities", if they have half a clue what they're doing. The entire point of using computers is "capabilities". Some are easier now than they've ever been, some are indeed potentially dangerous, but they were always there -- and if you really want to prevent other people from finding them just by poking around? Only banning computation completely can stop that; no half-assed compute controls will. That is a world that absolutely nobody wants. The correct answer to "AI safety" is to simply understand how the models work; when you know what you're looking at, Skynet is just something scary from a movie. "Safety" isn't real, "interpretability" is. And we're doing fine, there.

If you're afraid of what hackers will do with all of this technology: remember, it's hackers that will save your butt, too. The risk is, and always has been, the people that seek to control artificial intelligence -- not artificial intelligence itself. If the sincere EAists expect to be taken seriously by anyone (that matters), they first need to present a vision of a future that's better than supercomputers merking everybody. Infinite panopticons and crackdowns on knowledge don't stir a single soul.

If you're worried about the Syndrome problem -- i.e., "when everybody is super, no one will be!" -- if you're worried about your own expertise or knowledge being obsoleted or commoditized, and you yourself being discarded -- first off, you're valuable just for being here. You are the end result of billions of years of bizarre and unlikely cosmic events. At least fifty stars had to explode to make the atoms in your body. You are special, and there will never be another you. If it's your job that you're worried about, then please, please pick up the models and see what you can do with them. Drive the steamroller, do not stand in front of it. It is absolutely the case that, in any field where the AI system is actually useful and adds value, an expert with AI will outperform an AI alone, an expert alone, and a non-expert with AI. And, as long as human nature is what it is, outcomes for clever or resourceful people with AI will remain better than outcomes for others. If your job is at all important, you're too valuable to replace. Most jobs that get eaten for now will be in industries that serve no purpose anymore (stock image companies, e.g.) or where the productivity gain from the AI system is really tremendous and you can do the work of two or three people using AI with the same effort (copywriting is apparently one of these fields, but there still aren't many of them, for now).

Yes, you may have to work as a centaur (human with AI), at least until actual broadly superhuman AI appears. (I don't have a 'timeline' for that; the main issue isn't feasibility, but whether and when it will make economic sense to build.) Even then you might have a job supervising the AI, because you can't hold computer software accountable. Everybody will be in the same boat then, not just you, and we'll find a way to carry on as a society. The most important thing is to ensure the totalitarian dipshits get shut out through that, despite the social upheaval. That'll be a hard job, but that's our job. (The office jobs aren't the jobs you need to worry about, you understand.)

Whatever you are feeling: Believe me, I know. I haven't slept well for four years. I left Big Tech because I didn't like what I saw: the corporate restructurings, firing AI ethics people, spooks taking residence in strategic board positions, etc. don't spell a good story. The internal policies about "potentially transformative technology" I always disagreed with, for reasons I have clearly outlined in this diary. Despite the insidious bastards having a head-start, we've done well so far. This is because most of their plans depend on software keeping its own secrets. That never works; any hacker can tell you that. Information always wants to be free. That isn't a slogan, it's just a fact. Claude Shannon discovered it, and proved it, with mathematics.

Oh, and the computer vision stuff. I've been lost in both late-fusion vision models (vision transformer + LLM), and early-fusion vision models like Chameleon. I've used them to sharpen my latent diffusion image generators, of course: now rectified-flow diffusion models are becoming a thing. Puzzle Box XL, my continued pretrain on the SDXL architecture, has been cooking for a long time; I've done 50 million steps on SDXL (lol). It's absolutely beautiful, in so many ways... and prompt adherence is fantastic, far beyond what I expected from the wacky text encoder ensemble that the model uses, but it will be 'obsolete' shortly with this new architecture. Models come and go so quickly around here! But datasets and training pipelines are built to last.
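
For anyone who hasn't looked at it yet, the rectified-flow objective is charmingly simple compared to DDPM-style noise prediction. A sketch of the loss as I understand it, with the model's call signature and a 4-D latent shape assumed:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, noise, data, cond):
    """Straight-path flow matching: at a random time t, the model predicts the
    constant velocity (data - noise) along the line between noise and data."""
    t = torch.rand(noise.shape[0], 1, 1, 1, device=noise.device)  # per-sample time
    x_t = (1 - t) * noise + t * data          # point on the straight path
    v_target = data - noise                   # the path's (constant) velocity
    v_pred = model(x_t, t.flatten(), cond)    # hypothetical call signature
    return F.mse_loss(v_pred, v_target)
```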

I'm excited to see my dataset on Flux1-dev. Training that one to the extent needed will be expensive; Flux is ~5 times bigger, so I only have the compute on hand for LoRAs right now. I need me a few Grace Hoppers (says everybody...). But I cannot wait. Puzzle Box XL isn't actually done; there will at least be a video adapter for it. You will see things. The future is always unpredictable... but that, that I can promise you.


special thanks to fellow travellers Dr. Tim Dettmers (University of Washington, Project Petals), Yann LeCun (Chief AI Scientist, Meta), and Georgi Gerganov (ggml, llama.cpp). we're all gonna make it.

