What are transformers, why is it so expensive to train a Transformer-based model, and what will the architecture of future LLMs look like?