Learning the Macroeconomic Language
A new paradigm for macro. We will see if it's a better one.
Followers of the academic literature in macroeconomics have probably come across papers using deep learning to solve economic models. Though the technology is new (to us), conceptually this is no different from the usual solution methods. We have equilibrium conditions that agents’ policy functions must satisfy, so we search for functions that minimize the errors from those conditions holding exactly. I have been hoping to write a post on my experience using these methods, but unfortunately have not found the time. I think they’ll be incredibly useful for solving models that were previously deemed intractable, but because we have no baseline for what results “should” look like in an unexplored class of models, it’s hard to do sanity checks.1
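For the uninitiated, here is a minimal sketch of the idea in a stochastic growth model with log utility and full depreciation, chosen because the true policy is a constant savings rate of αβ, which gives a built-in sanity check. Everything below, from the architecture to the sampling scheme, is an illustrative choice of mine rather than anyone’s actual implementation:

```python
import torch

torch.manual_seed(0)
alpha, beta, rho, sigma = 0.36, 0.96, 0.9, 0.02

# Policy net maps the state (k, z) to a savings rate in (0, 1).
net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1), torch.nn.Sigmoid(),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def policy_step(k, z):
    """Consumption and next-period capital implied by the policy net."""
    s = net(torch.stack([k, z], dim=-1)).squeeze(-1)
    y = z * k**alpha
    return (1 - s) * y, s * y

for it in range(5000):
    # Sample states, then penalize squared Euler-equation residuals,
    # approximating the conditional expectation by Monte Carlo.
    k = torch.exp(torch.empty(256).uniform_(-1.0, 1.0))
    z = torch.exp(sigma / (1 - rho**2) ** 0.5 * torch.randn(256))
    c, k_next = policy_step(k, z)
    eps = sigma * torch.randn(64, 256)
    z_next = z**rho * torch.exp(eps)
    c_next, _ = policy_step(k_next.expand(64, -1), z_next)
    mpk = alpha * z_next * k_next ** (alpha - 1)   # marginal product of capital
    rhs = (mpk / c_next).mean(dim=0)               # E[ f'(k') / c' ]
    loss = ((beta * c * rhs - 1.0) ** 2).mean()    # Euler: 1/c = beta E[mpk/c']
    opt.zero_grad(); loss.backward(); opt.step()

# Sanity check: the true savings rate here is alpha * beta ≈ 0.346 at every state.
```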
This week there was a ton of discussion on Twitter about something seemingly related but actually very distinct. The spark was a post entitled “Can a Transformer “Learn” Economic Relationships?”, and the spark to the spark was earlier discourse on the following question: is causal inference a special case of prediction? The answer is clearly no by standard definitions, so much of the discussion boiled down to whether those definitions are useful. I’ll share some posts from that and some brief thoughts at the bottom (a lot of it did amount to more than semantics). The upshot of all this, besides unnerving the “mostly harmless” econometricians, was the question of whether advances in computing and methods will soon make it possible to just brute-force the problems that eternally plague the study of economics. Namely, will the Lucas Critique be solved just by giving a bunch of data to the right neural network in the right way? Note this is very different from the ML methods described in the first paragraph. Those methods rely on knowing what equilibrium conditions must hold, whereas this discussion assumes the neural net has the same information set as the econometrician (i.e., it does not know the economy’s true data generating process) yet can still overcome this. For clarity of reference throughout the post, I will call the first type oracle ML and the second econometrician ML.
In the original exercises for “Can a Transformer “Learn” Economic Relationships?”, the authors first generated several batches of data from a standard New Keynesian model under various parameter combinations.2 During training, they fed the architecture the history of structural shocks and the parameters used to generate the data. For testing, they gave the trained setup data not seen in training and asked it to forecast the time path and simulate IRFs. The simulation tracked realizations well, and the IRFs were wonky but defensible under some criteria.
While this was undoubtedly useful for exposing more people to these methods and broadening horizons for what can be done in macro, these results are exactly what we should have expected. This was covered well by posts I will link at the bottom, but I will briefly summarize the key points. Most notably, this exercise was painted as the second type of ML but was actually the first: the econometrician will not have access to the structural shocks or the parameters used to generate the model. In fact, standard DSGE models are nothing more than functions of the parameter vector and the history of structural shocks (broadly defined, see footnotes), so this is all an “oracle” would require.3,4 The data used for testing was generated from the same distribution used to draw the training parameters, making the testing exercise effectively in sample. Apoorva Lal ran exercises in which the parameters were genuinely out of sample, a truer test of extrapolative ability, and both the forecasting and IRF performance clearly deteriorate.
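To see concretely why this matters, note that once a model is solved, simulation is just a deterministic map from parameters and shock histories to outcomes, so a network handed both inputs is learning the model’s own simulator rather than inferring structure from data. Here is a toy sketch of the resulting training set, with an AR(1) standing in for a solved DSGE model; everything here is illustrative, not taken from the original exercise:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, shocks):
    """Outcome path as a deterministic function of parameters and shocks."""
    rho, sigma = theta
    y = np.zeros(len(shocks))
    for t in range(1, len(shocks)):
        y[t] = rho * y[t - 1] + sigma * shocks[t]
    return y

# Training pairs map (theta, shocks) -> y. The econometrician only ever
# observes y; handing the network theta and the shocks turns the task into
# supervised approximation of a map the "oracle" already knows.
T = 200
dataset = []
for _ in range(1000):
    theta = (rng.uniform(0.5, 0.99), rng.uniform(0.005, 0.02))
    shocks = rng.standard_normal(T)
    dataset.append(((theta, shocks), simulate(theta, shocks)))
```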
Lessons from the Labs
So what about the type of ML that puts the econometrician and the network on equal footing? In a possible act of sheer coincidence, a paper titled “Learning the Macroeconomic Language” from Siddhartha Chib and Fei Tan was posted on arXiv just today. At its heart is a constraint the major LLM labs are running up against just the same as macroeconomists: historical data is in finite supply. So what to do instead? Create synthetic data, which if done reasonably should still be useful. There’s nothing stopping the lowly human econometrician from doing this to try to improve their estimates, but making good choices about precisely how to use the simulated data is hard. So instead we can make an initial call on the class of models to select from and delegate the task of sorting through the subsequent reams of data to “a fellow econometrician” with a different set of skills. Chib and Tan generate several synthetic data sets using the posterior distribution of the Smets-Wouters model and use them in the training process. The idea is that gobs of synthetic data should contain rich dynamics and patterns we haven’t observed as much in our own sample (but could at some point). The question is whether the SW model approximates the world usefully enough to help with forecasting, and the answer seems to be yes, it’s definitely helpful. Abstracting from the finer details, the goal is to predict the evolution of 7 macroeconomic variables. The architecture is designed to guess which decile each variable will fall into next period relative to its empirical distribution, with the “decile prediction” an intentional discretization to avoid overfitting. The inputs depend only on realized macroeconomic data, making this an “econometrician ML” exercise. Here are the results:
The red and green coloring denotes whether the data realization fell into the predicted decile, and the shading of the decile boxes represents the network’s degree of confidence in each prediction. Overall, this looks pretty good; keep in mind this is an exercise that could have been run in real time beginning in 2017Q3. The authors also note this setup is small relative to the scale at the larger labs, so perhaps we will see even more gains from this approach in the future. Let’s take what’s been done at face value for now. It’s remarkable to me that even though the authors set 90% of the training data to be draws from the SW model, the predictions hit the zero lower bound pretty well. As in: only a fraction of a fraction of the actual data seen came from a ZLB period, but that was apparently all that was needed to pick up on how persistent these episodes are. At the same time, the more interesting prediction problems in this period were missed. When will we leave the ZLB? It does not do a good job here; confidence was actually higher in the prior period. Once we leave, the model tracks the liftoff well, but it also underpredicts the pace of adjustment the Fed chose. Could this have been used to predict the outbreak of inflation in real time? Clearly not, as the proverbial house was bet on extreme deflation when we had the largest inflation surge since the 80s. These are the types of forecast problems where we actually need help, and it’s unclear how much we’ll be able to get from this approach.
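Before moving on, here is a minimal sketch of how a decile-prediction target like the one described above could be constructed; the binning details are my guess at the concept, not Chib and Tan’s exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.standard_normal(300).cumsum() * 0.1   # stand-in for one macro variable

# Decile edges from the variable's own empirical distribution.
edges = np.quantile(series, np.linspace(0, 1, 11))

def to_decile(x):
    """Map a realization to a class label in {0, ..., 9}."""
    return int(np.clip(np.searchsorted(edges, x, side="right") - 1, 0, 9))

# Supervised pairs: the history up to t predicts the decile of t+1,
# turning forecasting into a 10-class classification problem.
targets = [to_decile(series[t + 1]) for t in range(len(series) - 1)]
```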
And what about the Lucas Critique?
Neither of these exercises really speaks to the econometrician’s dilemma after the rules of the game are changed. See Gauti Eggertsson’s great response to the original “Transformer” post:
In sum, I expect both oracle and econometrician ML to be very useful. But the eternal endogeneity problem remains. There is more I would like to say, but I will try to weave that into posts (hopefully in the not too distant future) about my own papers. Merry Christmas, etc. Here are some more things from Twitter I liked (some are threads, so click through to see the full thing).
Some more thoughts (mostly from others)
Prediction vs. Causal Inference
My take, borrowing from Sims: causal inference is just prediction until it’s not
More on the original “Transformers” exercise
For the LC
The “Transformers” authors also added a version where they feed only outcome data to the model, make the DGP a time-varying-parameter VAR, and compare the result to a Kalman filter.
They show it does better than the Kalman filter. This is indeed econometrician ML now, but these results still should not be unexpected. The Kalman filter only gets to see one sequence of data to learn the underlying structure, while the network has been trained across many simulated histories. And we should not expect either to do particularly well in this environment, so knowing which one does better seemingly will not help us make progress on the things we are interested in.
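For readers who want the benchmark spelled out, here is a minimal sketch of a Kalman filter for a scalar time-varying-parameter autoregression with a random-walk coefficient; the dimensions and variances are illustrative choices of mine, not the authors’ actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate: y_t = b_t * y_{t-1} + e_t, with b_t a random walk.
T, q, r = 300, 0.01**2, 0.1**2
b = 0.8 + np.cumsum(rng.normal(0, q**0.5, T))
y = np.zeros(T)
for t in range(1, T):
    y[t] = b[t] * y[t - 1] + rng.normal(0, r**0.5)

# Kalman filter over the coefficient; the filter sees only the single
# observed sequence y, in contrast to a network trained on many draws.
b_hat, P = 0.0, 1.0
filtered = np.zeros(T)
for t in range(1, T):
    P += q                              # predict: random-walk state
    x = y[t - 1]                        # the "regressor" at time t
    K = P * x / (x * P * x + r)         # Kalman gain
    b_hat += K * (y[t] - x * b_hat)     # update with the forecast error
    P *= 1 - K * x
    filtered[t] = b_hat
```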
I’ve unfortunately had many experiences where it looks like a model has been “solved” (very low values of the loss) but really hasn’t (and not because of a code typo; the neural net seems to get stuck on a path that will never converge to the true, unique policy functions).
The authors responded to feedback by doing some extensions. I briefly discuss one of them at the very bottom of this post.
We typically think of structural shocks as innovations to an autoregressive process. Of course, models may have a more complicated state space that extends beyond this, like endogenous Markov processes, in which case we would have to expand the notion of structural shocks to include things other than just white noise shocks. Also, if we introduce learning, “parameters” would also have to be defined broadly. And as complexity increases, this input vector will be sufficient to uncover the true representation only in an asymptotic sense.
See Azinovic-Yang and Žemlička (histories) and Kase, Melosi, and Rottner (parameters as states) for examples of papers trying to exploit this structure when using DL.