It’s been almost one year since a new breed of artificial intelligence took the world by storm. The capabilities of these new generative AI tools, most of which are powered by large language models (LLMs), forced every company and employee to rethink how they work. Was this new technology a threat to their job or a tool that would amplify their productivity? If you don’t figure out how to make the most of GenAI, are you going to get outclassed by your peers?
This paradigm shift placed a dual burden on engineering and technical leaders. First, there’s the internal demand to understand how your organization is going to adopt these new tools and what you need to do to avoid falling behind your competitors. Second, if you’re selling software and services to other companies, you’re going to find that many have paused spending on new tools while they sort out exactly what their approach should be to the GenAI era.
There is a ton of hype, and it can be exhausting trying to figure out where to direct your resources. Before you can dive into the details of what to do with the answers or art your GenAI is creating, you need a robust foundation to ensure it’s operating well. To help, we’ve come up with four key areas you’ll need to understand to make the most of the time and resources you invest.
- Vector Databases
- Embedding Models
- Retrieval Augmented Generation
- Knowledge Bases
These are almost certain to be fundamental pieces of your AI stack, so read on below to learn more about the four pillars needed for effectively adding GenAI to your organization.
Vector Databases
To make use of a large language model, you’re going to need to vectorize your data. That means the text you feed into the model is going to be reduced to arrays of numbers, and those numbers are going to be plotted as a vector on a map, albeit one with thousands of dimensions. Finding similar text is reduced to finding the distance between two vectors. This allows you to move from the old-fashioned approach of lexical keyword search (typing a few terms and getting back results that share those keywords) to semantic search: typing a query in natural language and getting back a response that understands a coding question about Python is probably referring to the programming language and not the large snake.
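To make "finding the distance between two vectors" concrete, here is a minimal sketch using cosine similarity, one common way to compare vectors. The three-dimensional vectors and the `most_similar` helper are made up for illustration; a real embedding model would produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy vectors (assumption: in practice an embedding model
# generates these, and a vector database stores them).
toy_vectors = {
    "How do I sort a list in Python?": [0.9, 0.1, 0.0],
    "Python list comprehension syntax": [0.8, 0.2, 0.1],
    "What do pythons eat in the wild?": [0.1, 0.1, 0.9],
}


def most_similar(query_vector, vectors):
    """Return the stored text whose vector points in the most similar direction."""
    return max(vectors, key=lambda text: cosine_similarity(query_vector, vectors[text]))


# Pretend this is the embedding of a user's question about sorting in Python.
query_vector = [1.0, 0.1, 0.0]
best = most_similar(query_vector, toy_vectors)
print(best)
```

The snake question shares the word "python" with the query, but its vector points in a different direction, so it scores lowest. That is the gap between keyword matching and semantic search.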
“Traditional data structures, typically organized in structured tables, often fall short of capturing the complexity of the real world,” says Weaviate’s Philip Vollet. “Enter vector embeddings. These embeddings capture features and representations of data, enabling machines to understand, abstract, and compute on that data in sophisticated ways.”
How do you choose the right vector database? In some cases, it may depend on the tech stack your team is already using. Stack Overflow went with Weaviate in part because it allowed us to continue using PySpark, which was the initial choice for our OverflowAI efforts. On the other hand, you may have a database provider, like MongoDB, which has been serving you well. Mongo now includes vectors as part of their OLTP DB, making it easy to integrate with your existing deployments. Expect this to be standard for database providers in the future. As Louis Brady, VP of Engineering at Rockset, explained, most companies will find that a hybrid approach combining a vector database with their existing system offers the most flexibility and the best results.
Embedding Models
How do you get your data into the vector database in a way that accurately organizes it by content? For that, you’ll need an embedding model. This is the software system that takes your text and converts it to the array of numbers you store in the vector database. There are a lot to choose from, and they vary greatly in cost and complexity. For this article, we’ll focus on embedding models that work with text, although embedding models can also be used to organize information about other types of media, like images or songs.
As Dale Markowitz wrote on the Google Cloud blog, “If you’d like to embed text–i.e. to do text search or similarity search on text–you’re in luck. There are tons and tons of pre-trained text embeddings free and easily available.” One example is the Universal Sentence Encoder, which “encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.” With just a few lines of Python code, you can prepare your data for a GenAI chatbot-style interface. If you want to take things a step further, Dale also has a great tutorial on how to prototype a language-powered app using nothing more than Google Sheets and a plugin called Semantic Reactor.
You’ll need to evaluate the tradeoffs between the time and cost of putting huge amounts of text into your embedding model and how thinly you slice the text, which is usually chunked into sections like chapters, pages, paragraphs, sentences, or even individual words. The other tradeoff is the precision of the embedding model: how many decimal places to store for each number in a vector, since every extra digit of precision increases the size of the vector. Over thousands of vectors for millions of tokens, this adds up. You can use techniques like quantization to reduce that footprint, but it’s best to consider the amount of data and degree of detail you’re looking for before you choose which embedding method is right for you.
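Here is a rough sketch of both tradeoffs. The chunking helpers are deliberately naive, and the figures in the storage estimate (one million chunks, 768 dimensions, 4-byte floats versus 1-byte integers) are illustrative assumptions, not recommendations. The `quantize_int8` function shows one simple form of scalar quantization and assumes vector values fall between -1.0 and 1.0.

```python
import re


def chunk_by_paragraph(text):
    """Split text on blank lines, dropping empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def chunk_by_sentence(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def storage_bytes(num_chunks, dimensions, bytes_per_value):
    """Raw vector storage: one stored value per dimension per chunk."""
    return num_chunks * dimensions * bytes_per_value


def quantize_int8(vector):
    """Map floats in [-1.0, 1.0] to integers in [-127, 127], clamping anything outside."""
    return [max(-127, min(127, round(v * 127))) for v in vector]


sample = (
    "Vectors are arrays of numbers. Similar text has nearby vectors.\n\n"
    "Chunk size is a tradeoff. Smaller chunks mean more vectors."
)
paragraphs = chunk_by_paragraph(sample)  # fewer, larger chunks
sentences = chunk_by_sentence(sample)    # more, smaller chunks

# Hypothetical corpus: 1,000,000 chunks embedded into 768 dimensions.
float32_size = storage_bytes(1_000_000, 768, 4)
int8_size = storage_bytes(1_000_000, 768, 1)
print(len(paragraphs), len(sentences), float32_size, int8_size)
```

Finer chunks give more precise matches but multiply the number of vectors you pay to create and store. Lower numeric precision shrinks each vector at some cost in accuracy.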
Retrieval Augmented Generation (RAG)
Big AI models read the internet to gain knowledge. That means they know the earth is round…and they also know that it’s flat.
One of the main problems with large language models like ChatGPT is that they were trained on a massive set of text from across the internet. That means they’ve read a lot about how the earth is round, and also a lot about how the earth is flat. The model isn’t trained to understand which of these assertions is correct, only the probability that a certain response to a question will be a good match for the query the user enters. It also blends those inputs into a new, statistically probable output, which is where hallucinations can occur. The answer it gives may not match any of the text it was trained on, which is why checking sources matters.
With RAG, you get two immediate benefits. First, you can limit the dataset the model searches, meaning the model hopefully won’t be drawing on inaccurate data. Second, you can ask the model to cite its sources, allowing you to verify its answer against the ground truth. At Stack Overflow, that might mean restricting queries to just the questions on our site with an accepted answer. When a user asks a question, the system first searches for Q&A posts that are a good match. That’s the retrieval part of this equation. A hidden prompt then instructs the model to do the following: synthesize a short answer for the user based on the answers you found that were validated by our community, then provide the short summary along with links to the three posts that were the best match for the user’s search.
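The retrieval-then-prompt flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not Stack Overflow’s actual implementation: the `Post` class, the example.com URLs, and the prompt wording are hypothetical, and `keyword_overlap` is a toy stand-in for the vector similarity search a real system would run. The call to the LLM itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Post:
    url: str
    question: str
    answer: str
    has_accepted_answer: bool


def keyword_overlap(query, text):
    """Toy relevance score: count of shared lowercase words (stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query, posts, top_k=3):
    """Keep only posts with an accepted answer, then return the top_k best matches."""
    candidates = [p for p in posts if p.has_accepted_answer]
    ranked = sorted(candidates, key=lambda p: keyword_overlap(query, p.question), reverse=True)
    return ranked[:top_k]


def build_prompt(query, retrieved):
    """Assemble the hidden prompt: instructions, retrieved sources, then the user's question."""
    sources = "\n".join(
        f"[{i}] {p.url}\n{p.answer}" for i, p in enumerate(retrieved, start=1)
    )
    return (
        "Synthesize a short answer using only the sources below, "
        "then list the source links.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {query}"
    )


# Hypothetical posts; the third has no accepted answer and is never retrieved.
posts = [
    Post("https://example.com/q/1", "how to reverse a list in python", "Use xs[::-1].", True),
    Post("https://example.com/q/2", "how to reverse a string in python", "Use s[::-1].", True),
    Post("https://example.com/q/3", "reverse a list in python quickly", "Unverified tip.", False),
    Post("https://example.com/q/4", "how to center a div", "Use flexbox.", True),
]
query = "how to reverse a list in python"
retrieved = retrieve(query, posts, top_k=2)
prompt = build_prompt(query, retrieved)
print(prompt)
```

Because the links travel with the retrieved text, the model can return them alongside its summary, and the user can check the answer against the original posts.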
A third benefit of RAG is that it allows you to keep the data the model is using fresh. Training a large model is costly. Many of the popular models available today are based on training data that ended months, or even years, ago. Ask a model about something that happened after its training cutoff, and it will happily hallucinate a convincing response, but it doesn’t have actual information to work with. RAG allows you to point the model at a specific dataset, one that you can keep up to date without having to retrain the entire model.
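As a small illustration of that last point, the sketch below uses a hypothetical in-memory `DocumentStore` as a stand-in for a vector database. Updating a document is a single write, and the very next search sees the new text. The word-overlap search is a toy assumption, not how a real vector database matches documents.

```python
class DocumentStore:
    """In-memory stand-in for a vector database that supports upserts."""

    def __init__(self):
        self._docs = {}

    def upsert(self, doc_id, text):
        """Insert a new document or overwrite an existing one with the same id."""
        self._docs[doc_id] = text

    def search(self, query):
        """Return (doc_id, text) pairs that share at least one word with the query."""
        words = set(query.lower().split())
        return [
            (doc_id, text)
            for doc_id, text in self._docs.items()
            if words & set(text.lower().split())
        ]


store = DocumentStore()
store.upsert("release-notes", "the latest version is 2")
before = store.search("latest version")

# New information arrives: overwrite the document. No model retraining involved.
store.upsert("release-notes", "the latest version is 3")
after = store.search("latest version")
print(before, after)
```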
RAG means the user still gets the benefit of working with an LLM. They can ask questions using natural language and get back a summary that synthesizes the most relevant information from a vast data store. At the same time, drawing on a predefined data set helps to reduce hallucinations and gives the user links to the ground truth, so they can easily check the model’s output against something generated by humans.
Knowledge Bases
As mentioned in the previous section, RAG can constrain the text your model is drawing on when generating its response. Ideally, that means you’re giving it accurate data, not just a random sampling of things it’s read on the internet. One of the most important laws of training an AI model is that data quality matters. Garbage in, garbage out, as the old saying goes, holds very true for your LLM. Feed it low-quality or poorly organized text, and the results will be equally uninspiring.
At Stack Overflow, we kind of lucked out on the data quality issue. Question and answer is the format being adopted by most LLMs used inside organizations, and our dataset was already built that way. By looking at the number of votes and at which questions have an accepted answer, our Q&A couplets can show us which information is accurate and which still lacks a sufficient confidence score. Votes can also be used to determine which of three similar answers might be the most widely utilized and thus the most valuable. Last but not least, tags allow the system to better understand how different information in your dataset is related.
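To show how those signals might be used, here is a minimal sketch that filters answers by tag and ranks them by accepted status and votes. The `Answer` class, the sample data, and the choice to rank accepted answers ahead of higher-voted ones are all assumptions for illustration, not a description of how Stack Overflow scores content.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    votes: int
    accepted: bool
    tags: tuple


def filter_by_tag(answers, tag):
    """Keep only answers carrying the given tag."""
    return [a for a in answers if tag in a.tags]


def rank_answers(answers):
    """Order answers with accepted ones first, then by vote count (highest first)."""
    return sorted(answers, key=lambda a: (a.accepted, a.votes), reverse=True)


# Hypothetical answers to similar sorting questions.
answers = [
    Answer("Use a loop.", 3, False, ("python",)),
    Answer("Use sorted().", 12, True, ("python", "sorting")),
    Answer("Use list.sort().", 40, False, ("python", "sorting")),
    Answer("Use Array.prototype.sort.", 7, True, ("javascript", "sorting")),
]

python_answers = rank_answers(filter_by_tag(answers, "python"))
print([a.text for a in python_answers])
```

A knowledge base with this kind of metadata lets the retrieval step prefer validated content before the model ever sees it.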