LSTMs could be trained in a self-supervised way, just not efficiently. Transformers let training be parallelized across the whole sequence, so you could scale up model size, which was the main breakthrough.
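roughly what I mean, as a toy numpy sketch (made-up shapes, gates and causal masking omitted, so this is an illustration, not a real implementation):

    import numpy as np

    T, d = 128, 64                     # sequence length, hidden size
    x = np.random.randn(T, d)          # one input sequence

    # RNN/LSTM-style: each step depends on the previous hidden state,
    # so the T steps have to run one after another.
    W, U = np.random.randn(d, d), np.random.randn(d, d)
    h = np.zeros(d)
    for t in range(T):                 # inherently sequential
        h = np.tanh(x[t] @ W + h @ U)

    # Attention-style: every position is computed from the whole sequence
    # at once, so the work is one big matmul that parallelizes trivially.
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    A = np.exp(Q @ K.T / np.sqrt(d))
    out = (A / A.sum(-1, keepdims=True)) @ V   # all T positions in parallel

the loop has a serial dependency on h; the attention path is one matmul over the whole sequence, which is what lets you saturate the hardware at training time.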
Interesting. I thought LSTMs also had constraints on their "memory" that limit their effectiveness on long text. So they would still need something like attention to get the kinds of results you get from a transformer.
they do have limits on "context window" (in an LSTM it's a state vector that can in theory carry information across unlimited context distance, but is bounded by the information capacity of that fixed-size vector; toy sketch below), but you could scale them up the same way... except then they get really hard to train.
With transformers, you just keep making them bigger and reap the benefits of the bitter lesson.
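here's a toy sketch of that capacity point (again made-up shapes, no real LSTM gating): no matter how long the input, the recurrent path has to squeeze everything into one d-dim vector, while attention keeps every past position around and can address it directly:

    import numpy as np

    d = 64
    history = np.random.randn(10_000, d)   # 10k tokens of "context"

    # LSTM-style: whatever survives of those 10k tokens must fit in h,
    # a fixed d floats, no matter how long the input gets.
    h = np.zeros(d)
    for tok in history:
        h = np.tanh(tok + h)               # new state overwrites old state

    # Attention-style: a query can pull from any of the 10k positions,
    # at the cost of keeping all of them around in memory.
    q = np.random.randn(d)
    scores = history @ q                   # one score per past position
    w = np.exp(scores - scores.max())
    ctx = (w / w.sum()) @ history          # weighted read over the whole history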