Mile High View of How Gen AI Works — COURANT AI⚡🪁🔑

It starts with massive amounts of data: a large share of the world's digitized books, much of the open internet, and libraries of music, videos, audio, and images, most of it unlabeled and unstructured.

⬇️

Crawlers gather this data, and learning algorithms then process it to find statistical connections within it. This is called Deep Learning, and it’s a subset of Machine Learning.

⬇️

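The data is broken down into atomic units called Tokens. A short, common word usually maps to a single token, while longer or rarer words (and other kinds of data) are split into several tokens. An LLM (large language model, such as the models behind ChatGPT or Claude) typically has a vocabulary of tens to hundreds of thousands of distinct tokens and is trained on billions or even trillions of tokens of text.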
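
To make tokenization concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `VOCAB` and the helper names `tokenize` and `encode` are made up for illustration; production tokenizers learn their vocabularies from data and use more sophisticated schemes.

```python
# Toy greedy longest-match tokenizer.
# VOCAB is a hypothetical, hand-written vocabulary; real LLM tokenizers
# learn theirs from the training data.
VOCAB = {"the": 0, "bear": 1, "un": 2, "believ": 3, "able": 4, " ": 5}


def tokenize(text):
    """Split text into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink it.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in VOCAB:
                tokens.append(piece)
                i = end
                break
        else:
            # No vocabulary entry matched: keep the character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


def encode(text):
    """Map each token to its integer ID (-1 for pieces not in VOCAB)."""
    return [VOCAB.get(token, -1) for token in tokenize(text)]


if __name__ == "__main__":
    print(tokenize("the bear"))       # one token per short word, plus the space
    print(tokenize("unbelievable"))   # one long word split into several tokens
    print(encode("the bear"))
```

With this toy vocabulary, each short word becomes a single token while the longer word is split into three pieces.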

⬇️

Every token is encoded as a Vector (a list of numbers that gives the token a position in a high-dimensional space). The closer together two token vectors are, the more closely related the model treats them (think of bear : grizzly bear, which sit close together, vs bear : bare, which sound alike but sit far apart).
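
One common way to measure how close two vectors are is cosine similarity. The sketch below uses invented three-number vectors in a hypothetical `EMBEDDINGS` table purely for illustration; real models learn their vectors during training and use hundreds or thousands of dimensions.

```python
import math

# Hypothetical, hand-picked embeddings (not taken from any real model).
EMBEDDINGS = {
    "bear": [0.9, 0.8, 0.1],
    "grizzly": [0.85, 0.75, 0.2],
    "bare": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    related = cosine_similarity(EMBEDDINGS["bear"], EMBEDDINGS["grizzly"])
    unrelated = cosine_similarity(EMBEDDINGS["bear"], EMBEDDINGS["bare"])
    print(f"bear vs grizzly: {related:.3f}")
    print(f"bear vs bare:    {unrelated:.3f}")
```

A score near 1 means the vectors point in nearly the same direction; with these made-up numbers, bear scores much higher against grizzly than against bare.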

⬇️

All this training is used to create a complex, many-layered, weighted algorithm loosely inspired by the human brain, called a Deep Neural Network. It’s what allows LLMs to capture patterns and relationships in the data and produce human-like responses.
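
As a rough illustration of "many-layered and weighted", here is a sketch of a tiny two-layer network. The weights `W1`, `B1`, `W2`, and `B2` are hand-picked, hypothetical values; a real network learns them and has vastly more layers and weights.

```python
def relu(values):
    """Activation function: negative values become zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One layer: each output is a weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
B1 = [0.0, 0.1, -0.1]
W2 = [[0.7, -0.5, 0.2]]
B2 = [0.05]


def forward(inputs):
    """Pass the inputs through both layers and return the output list."""
    hidden = relu(dense(inputs, W1, B1))
    return dense(hidden, W2, B2)


if __name__ == "__main__":
    print(forward([1.0, 2.0]))
```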

⬇️

GPTs (OpenAI’s family of generative AI models) in particular use the Transformer architecture; it’s the “T” in GPT, which stands for Generative Pre-trained Transformer. At the core of transformers is a process called self-attention. Older recurrent neural networks (RNNs) read text one token at a time, in order. Transformer-based networks, on the other hand, process every token in the input at the same time and compare each token to all the others. This allows them to direct their “attention” to the most relevant tokens, no matter where they are in the text.
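
The comparison of each token to all the others can be sketched as scaled dot-product attention. This is a deliberately simplified version: it uses the token vectors directly as queries, keys, and values, whereas real transformers first pass them through learned projections and run many attention heads in parallel. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Simplified self-attention with no learned projections (an assumption)."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Compare this token with every token, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new vector is a weighted blend of all token vectors.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(dim)
        ]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in self_attention(tokens):
        print([round(v, 3) for v in row])
```

Every output row depends on all input rows, which is why position in the text does not limit what a token can attend to.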

⬇️

An LLM’s neural network can have hundreds of billions of Parameters (the adjustable numbers inside the network). The number of parameters is a common measure of the size and capacity of the model. These parameters encode the linguistic knowledge and patterns derived from the training data.
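
To get a feel for how parameter counts add up, the sketch below counts the weights and biases in a plain fully connected network. The `count_parameters` helper and the layer sizes are illustrative assumptions; LLMs use transformer layers, whose counts are computed differently.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-output connection
        total += n_out         # one bias per output unit
    return total


if __name__ == "__main__":
    print(count_parameters([2, 3, 1]))
    print(count_parameters([784, 128, 10]))
```

Even the small second example has over a hundred thousand parameters, which hints at how quickly the count grows with wider and deeper networks.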

⬇️

Most of these parameters are numbers called weights, and they typically start out as random values. The model learns by adjusting its weights during training. Rather than trying values at random, it uses a process called gradient descent to nudge each weight in the direction that reduces its mistakes. Over time, these adjustments help the model recognize patterns and improve its ability to generate accurate outputs.

⬇️

Gradient descent is a method used to help a model learn by finding the best values for its parameters. It works by calculating how much the model’s predictions are off (the error) and then adjusting the parameters step by step in the direction that reduces the error. Think of it like walking downhill to reach the lowest point in a valley, where the lowest point represents the best solution.
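
Here is a minimal sketch of gradient descent fitting a single weight, assuming a made-up dataset `DATA` in which the right answer is always twice the input. The learning rate and step count are arbitrary choices for this toy problem, not recommended settings.

```python
# Hypothetical training data following y = 2 * x, so the best weight is 2.0.
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def mean_squared_error(w):
    """How far off the predictions w * x are, averaged over the data."""
    return sum((w * x - y) ** 2 for x, y in DATA) / len(DATA)


def gradient(w):
    """Slope of the error with respect to w (derivative of the MSE)."""
    return sum(2.0 * (w * x - y) * x for x, y in DATA) / len(DATA)


def train(w=0.0, learning_rate=0.05, steps=200):
    """Repeatedly step 'downhill' against the gradient."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


if __name__ == "__main__":
    learned = train()
    print(f"learned weight: {learned:.4f}")
    print(f"error before: {mean_squared_error(0.0):.4f}")
    print(f"error after:  {mean_squared_error(learned):.8f}")
```

Each step moves the weight a little further downhill, and the error shrinks until the weight settles near the bottom of the valley.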

⬇️

The result of this workflow is a trained model that is ready when you need it. The model takes your Prompt, breaks it into tokens, and generates a response one token at a time, each time picking a token it predicts is likely to come next.
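
Under the hood, generating a response is usually described as repeated next-token prediction. The sketch below fakes the model with a hypothetical lookup table, `NEXT_TOKEN_PROBS`, and always picks the most likely next token. A real LLM computes these probabilities with its neural network from the entire prompt and often samples instead of always taking the top choice.

```python
# Hypothetical next-token probabilities standing in for a trained model.
NEXT_TOKEN_PROBS = {
    "the": {"bear": 0.6, "cat": 0.4},
    "bear": {"sleeps": 0.7, "runs": 0.3},
    "cat": {"sleeps": 0.5, "runs": 0.5},
    "sleeps": {"<end>": 1.0},
    "runs": {"<end>": 1.0},
}


def generate(prompt_tokens, max_tokens=10):
    """Greedily extend the prompt one token at a time."""
    tokens = list(prompt_tokens)
    if not tokens:
        return tokens
    for _ in range(max_tokens):
        # Simplification: look only at the last token, not the full context.
        options = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not options:
            break  # nothing known about what follows this token
        next_token = max(options, key=options.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    print(" ".join(generate(["the"])))
```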

🤿 Deeper dive explanation: For the best explanation of how ChatGPT works that I’ve found so far, check out “How does ChatGPT work as explained by the ChatGPT team” by Gergely Orosz.