A16Z: An Emerging Architecture for Large Model Applications

Editor's note: The explosion of generative artificial intelligence has the potential to disrupt many industries, and software is one of them. The rise of the large language model (LLM) has set off an explosion of related applications, with technology giants and start-ups alike launching LLM-based products. So what kinds of tools and design patterns do these applications use? This article, a compiled translation, summarizes them.

Image source: Generated by Unbounded AI

Large language models (LLMs) are a powerful new primitive for building software. But because LLMs are so new and behave so differently from normal computing resources, it's not always obvious how to use them.

In this article, we will share a reference architecture for the emerging LLM application stack. The architecture will showcase the most common systems, tools, and design patterns we've seen used by AI startups and top tech companies. This technology stack is still relatively primitive and may undergo major changes as the underlying technology advances, but we hope it can provide a useful reference for developers working on LLM development today.

This work is based on conversations with founders and engineers of AI startups. In particular, we rely on input from people including Ted Benson, Harrison Chase, Ben Firshman, Ali Ghodsi, Raza Habib, Andrej Karpathy, Greg Kogan, Jerry Liu, Moin Nadeem, Diego Oppenheimer, Shreya Rajpal, Ion Stoica, Dennis Xu, Matei Zaharia and Jared Zoneraich. Thanks for your help!

LLM technology stack

The current version of the LLM application stack looks like this:

The gray boxes are the key components, and the arrows represent different data flows: the dashed black line is the contextual data provided by the application developer to condition the output, the solid black line is the prompt and few-shot examples passed to the LLM, the solid blue line is the user query, and the solid red line is the output returned to the user.

For quick reference, the original article also links to common tools/systems for each key component of the application stack.

There are many ways to develop with LLMs, including training models from scratch, fine-tuning open-source models, or leveraging hosted APIs. The stack we present here is based on in-context learning, the design pattern we see most developers starting with (and one that is only possible with foundation models).

The next section briefly explains this design pattern.

Design pattern: In-context learning

The core idea of in-context learning is to use LLMs off the shelf (that is, without any fine-tuning), and then control their behavior through clever prompting and conditioning on private "contextual" data.

For example, say you're building a chatbot to answer questions about a set of legal documents. The naive approach is to paste all the documents into a ChatGPT or GPT-4 prompt and then ask questions about them. This may work for very small datasets, but it doesn't scale. The largest GPT-4 model can only process about 50 pages of input text, and performance (measured by inference time and accuracy) degrades badly as you approach this limit, called the context window.

In-context learning solves this problem with a neat trick: instead of sending all the documents with every LLM prompt, it sends only the handful that are most relevant. And who helps decide which documents are most relevant? You guessed it: the LLM.

At a very high level, this workflow can be broken down into three phases:

  • Data preprocessing/embedding: This phase stores private data (legal documents, in this example) for later retrieval. Typically, documents are split into chunks, passed through an embedding model, and stored in a specialized database called a vector database.
  • Prompt construction/retrieval: When a user submits a query (in this case, a legal question), the application constructs a series of prompts to submit to the language model. A compiled prompt typically combines a prompt template hard-coded by the developer; examples of valid output, called few-shot examples; any necessary information retrieved from external APIs; and a set of relevant documents retrieved from the vector database.
  • Prompt execution/inference: Once the prompts are compiled, they are submitted to a pre-trained LLM for inference, whether a proprietary model API or an open-source or self-trained model. Some developers also add operational systems such as logging, caching, and validation at this stage. (A minimal end-to-end sketch of this workflow follows the list.)
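To make these three phases concrete, here is a minimal, self-contained sketch in Python. The chunking logic, the prompt template, and the embed_fn/llm_fn callables are illustrative assumptions standing in for a real embedding model, vector database, and LLM API, not part of any particular library.

```python
from dataclasses import dataclass
from typing import Callable, List
import math

# Toy in-memory "vector store" illustrating the three phases of in-context learning.
# embed_fn and llm_fn are placeholders for a real embedding model and LLM API.

@dataclass
class Chunk:
    text: str
    vector: List[float]

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def preprocess(docs: List[str], embed_fn: Callable[[str], List[float]],
               chunk_size: int = 500) -> List[Chunk]:
    """Phase 1: split documents into chunks and embed each chunk."""
    pieces = [d[i:i + chunk_size] for d in docs for i in range(0, len(d), chunk_size)]
    return [Chunk(text=p, vector=embed_fn(p)) for p in pieces]

def answer(query: str, store: List[Chunk],
           embed_fn: Callable[[str], List[float]],
           llm_fn: Callable[[str], str], k: int = 4) -> str:
    """Phases 2 and 3: retrieve the k most relevant chunks, build a prompt, call the LLM."""
    qv = embed_fn(query)
    top = sorted(store, key=lambda c: cosine(qv, c.vector), reverse=True)[:k]
    context = "\n\n".join(c.text for c in top)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm_fn(prompt)
```

In production, the in-memory list would be replaced by a vector database, and embed_fn/llm_fn by hosted or self-hosted models, as the rest of this article describes.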

This may seem like a lot of work, but it is usually easier than the alternative: training or fine-tuning the LLM itself. You don't need a dedicated team of machine learning engineers to do in-context learning. You also don't need to host your own infrastructure or buy expensive dedicated instances from OpenAI. This pattern effectively reduces an AI problem to a data engineering problem that most startups and large corporations already know how to solve. It also tends to outperform fine-tuning for relatively small datasets, since a specific piece of information generally needs to appear at least ~10 times in the training set before a fine-tuned LLM will remember it, and in-context learning can incorporate new data in near real time.

One of the biggest open questions around in-context learning is: what happens if we just change the underlying model to increase the context window? It is indeed possible, and it is an active area of research. But this comes with trade-offs, mainly that the cost and time of inference scale quadratically with the length of the prompt. Today, even linear scaling (the best theoretical result) would be too costly for many applications. At current API rates, a single GPT-4 query over 10,000 pages would cost hundreds of dollars. Therefore, we do not foresee large-scale changes to the stack based on expanded context windows, but we will say more about this later.
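As a rough back-of-the-envelope illustration of that cost claim (the tokens-per-page and per-token price figures below are assumptions; actual rates vary by model and change over time):

```python
# Back-of-envelope cost of stuffing 10,000 pages into a single prompt.
# All figures are assumptions for illustration; real pricing varies over time.
pages = 10_000
tokens_per_page = 750                  # ~500 words/page at roughly 1.5 tokens per word
price_per_1k_input_tokens = 0.03       # assumed GPT-4-class input price, USD
cost = pages * tokens_per_page / 1_000 * price_per_1k_input_tokens
print(f"~${cost:,.0f} for one query")  # ~$225 with these assumptions
```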

In the rest of this article, we will walk through this technology stack using the workflow above as a guide.

Data preprocessing/embedding

Data preprocessing/embedding: data passes through a pipeline to an embedding model for vectorization and is then stored in a vector database

Contextual data for LLM applications includes text documents, PDFs, and even structured formats such as CSV files or SQL tables. The data loading and transformation (ETL) solutions used by the developers we interviewed vary widely. Most use traditional ETL tools such as Databricks or Airflow. Some also use the document loaders built into orchestration frameworks, such as LangChain (powered by Unstructured) and LlamaIndex (powered by Llama Hub). We believe this part of the stack is relatively underdeveloped, however, and there is an opportunity to build data-replication solutions purpose-built for LLM applications.
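As a hedged example of the document-loader route, here is what loading and chunking a PDF might look like with LangChain's built-in loaders (the interfaces shown are from the 0.0.x releases discussed below and may since have changed; the file path is hypothetical):

```python
# Load a PDF and split it into overlapping chunks for embedding.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("contracts/master_agreement.pdf")   # hypothetical file path
docs = loader.load()                                      # one document per page

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)
print(f"{len(docs)} pages split into {len(chunks)} chunks")
```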

As for embeddings, most developers use the OpenAI API, specifically the text-embedding-ada-002 model. It is easy to use (especially if you're already using other OpenAI APIs), gives reasonably good results, and is getting cheaper. Some larger enterprises are also exploring Cohere, whose product work focuses more narrowly on embeddings and which performs better in some scenarios. For developers who prefer open source, Hugging Face's Sentence Transformers library is the standard. It is also possible to create different types of embeddings tailored to different use cases; this is a niche practice today but a promising area of research.
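Here is a minimal sketch of both embedding routes, assuming the pre-1.0 OpenAI Python SDK that was current at the time (its interface has since changed) and an arbitrarily chosen Sentence Transformers model:

```python
# Embed the same texts with a hosted API and with an open-source library.
import openai
from sentence_transformers import SentenceTransformer

texts = ["Clause 12.3 limits liability to direct damages.",
         "Either party may terminate with 30 days written notice."]

# Hosted: OpenAI text-embedding-ada-002 (1536-dimensional vectors); pre-1.0 SDK call.
resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
openai_vectors = [item["embedding"] for item in resp["data"]]

# Open source: a Sentence Transformers model running locally (model choice is an assumption).
model = SentenceTransformer("all-MiniLM-L6-v2")
local_vectors = model.encode(texts)
```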

From a systems standpoint, the most important part of the preprocessing pipeline is the vector database, which is responsible for efficiently storing, comparing, and retrieving up to billions of embeddings (i.e., vectors). The most common choice we see is Pinecone. It is the default largely because it is fully cloud-hosted and therefore easy to get started with, and it has many of the features larger enterprises need in production (e.g., good performance at scale, single sign-on, and uptime SLAs).
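For illustration, storing and querying embeddings with the Pinecone client of that era might look like the following (API details vary by client version; the index name and credentials are placeholders):

```python
# Upsert and query embeddings in Pinecone (v2-era client; newer clients differ).
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-east-1-aws")  # placeholder credentials
index = pinecone.Index("legal-docs")                                # assumed index name

# Vectors would come from the embedding step; a dummy 1536-dim vector is used here.
vec = [0.01] * 1536
index.upsert(vectors=[("chunk-1", vec, {"text": "Clause 12.3 limits liability."})])

results = index.query(vector=vec, top_k=4, include_metadata=True)
for match in results.matches:
    print(match.score, match.metadata["text"])
```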

However, there are also a large number of vector databases available. Notable ones include:

  • Open-source systems such as Weaviate, Vespa, and Qdrant: These generally offer excellent single-node performance and can be customized for specific applications, making them popular with experienced AI teams that like to build bespoke platforms.
  • Local vector-management libraries such as Faiss: These offer a great developer experience and are easy to spin up for small applications and development experiments, but they don't necessarily substitute for a full database at scale (see the sketch after this list).
  • OLTP extensions such as pgvector: A good vector-support solution for developers who see a Postgres-shaped hole in every database problem, or for enterprises that buy most of their data infrastructure from a single cloud provider. It's not clear that tightly coupling vector and scalar workloads makes sense in the long run.
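Here is a small example of the local-library route using Faiss, with random vectors standing in for real embeddings:

```python
# Exact nearest-neighbor search with Faiss; random vectors stand in for embeddings.
import numpy as np
import faiss

dim = 1536                                               # e.g., ada-002 dimensionality
vectors = np.random.rand(10_000, dim).astype("float32")  # placeholder embeddings

index = faiss.IndexFlatL2(dim)   # brute-force L2 index; no training step required
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 4)                  # 4 nearest chunks
print(ids[0], distances[0])
```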

Looking ahead, most of the open-source vector database companies are developing cloud products. Our research suggests that achieving robust performance in the cloud is a very hard problem across the broad design space of possible use cases. So the set of options may not change dramatically in the short term, but it likely will in the long term. The key question is whether vector databases will consolidate around one or two popular systems, as OLTP and OLAP databases did.

There is also an open question of how embeddings and vector databases will evolve as the usable context window of most models grows. It is tempting to argue that embeddings become less important because contextual data can simply be dropped into the prompt. Feedback from experts on this topic, however, suggests the opposite: embedding pipelines may become more important over time. Large context windows are a powerful tool, but they also entail significant computational cost, so making efficient use of them is imperative. We may start to see different types of embedding models become popular, trained directly for model relevance, along with vector databases designed to enable and take advantage of this.

Prompt construction/retrieval

Strategies for prompting LLMs and incorporating contextual data are becoming increasingly sophisticated, and increasingly important as a source of product differentiation. Most developers start new projects by experimenting with simple prompts consisting of direct instructions (zero-shot prompting) or perhaps a few example outputs (few-shot prompting). These prompts generally produce good results, but not the level of accuracy required for production deployments.

The next level of prompting technique is to ground the model's responses in some source of truth and to provide external context the model was not trained on. The Prompt Engineering Guide catalogs no fewer than a dozen (!) more advanced prompting strategies, including chain-of-thought, self-consistency, generated knowledge, tree of thoughts, directional stimulus, and more. These strategies can be combined to support different LLM use cases such as document Q&A, chatbots, and so on.
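To illustrate the progression, here are hypothetical zero-shot, few-shot, and chain-of-thought prompt templates for a single task; the wording is an assumption, not a prescribed format:

```python
# Illustrative prompt templates: zero-shot, few-shot, and chain-of-thought.

zero_shot = "Does this clause allow early termination? Clause: {clause}"

few_shot = """Decide whether each clause allows early termination.

Clause: Either party may terminate with 30 days written notice.
Answer: Yes

Clause: This agreement runs for a fixed term of 24 months.
Answer: No

Clause: {clause}
Answer:"""

chain_of_thought = (
    "Decide whether the clause allows early termination. "
    "Think step by step: first identify the termination conditions, "
    "then state your conclusion on the final line as 'Answer: Yes' or 'Answer: No'.\n\n"
    "Clause: {clause}"
)

prompt = few_shot.format(clause="The customer may cancel at any time for convenience.")
```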

This is where orchestration frameworks like LangChain and LlamaIndex come in. They abstract away many of the details of prompt chaining; interacting with external APIs (including determining when an API call is needed); retrieving contextual data from vector databases; and maintaining memory across multiple LLM calls. They also provide templates for many of the common applications mentioned above. Their output is a prompt, or series of prompts, submitted to the language model. These frameworks are widely used by hobbyists as well as startups looking to get an application off the ground, with LangChain the leader.

LangChain is still a relatively new project (currently at version 0.0.201), but we are already starting to see applications developed with it go into production. Some developers, especially early adopters of LLM, prefer to switch to raw Python in production to remove additional dependencies. But we expect this do-it-yourself approach to diminish for most use cases, as it has with traditional web application stacks.
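For a sense of what orchestration buys you, here is a minimal retrieval-augmented QA chain in LangChain, using APIs from the 0.0.x line mentioned above (later versions have reorganized these imports). It assumes an OpenAI API key in the environment and the faiss package installed; the example texts are made up.

```python
# A minimal retrieval-augmented QA chain (LangChain 0.0.x-era APIs; since changed).
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

texts = ["Clause 12.3 limits liability to direct damages.",
         "Either party may terminate with 30 days written notice."]

vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
)
print(qa.run("Can the agreement be terminated early?"))
```

Under the hood, this handles the embedding, retrieval, prompt templating, and model call that the earlier sketches performed by hand.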

Eagle-eyed readers will notice a seemingly odd entry in the orchestration box: ChatGPT. Normally ChatGPT is an app, not a developer tool, but it can also be accessed as an API. And if you look closely, it performs some of the same functions as the other orchestration frameworks, such as abstracting away the need for custom prompts, maintaining state, and retrieving contextual data through plugins, APIs, or other sources. While not a direct competitor to the other tools listed here, ChatGPT can be considered an alternative solution, and it may end up as a viable, simpler alternative for prompt construction.

Prompt execution/inference

Today, OpenAI leads the field of language models. Almost every developer we interviewed launched their new LLM application with the OpenAI API, usually using the gpt-4 or gpt-4-32k model. This gives a best-case scenario for application performance and is easy to use, since it operates over a wide range of input domains and usually requires no fine-tuning or self-hosting.
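A typical call looks like the following, using the pre-1.0 OpenAI Python SDK that most of these teams started with (the SDK interface has since changed); swapping models is a one-line change.

```python
# Calling a hosted model via the OpenAI API (pre-1.0 Python SDK shown; since changed).
import openai

def complete(prompt: str, model: str = "gpt-4") -> str:
    resp = openai.ChatCompletion.create(
        model=model,                      # e.g. "gpt-4", "gpt-4-32k", or "gpt-3.5-turbo"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]

print(complete("Summarize the termination clause in one sentence: ..."))
```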

Once a project is in production and starts to scale, a wider array of options can come into play. Some common questions we hear include:

  • Switching to gpt-3.5-turbo: it is roughly 50 times cheaper than GPT-4 and significantly faster. Many applications don't need GPT-4-level accuracy, but they do need low-latency inference and cost-effective support for free users.
  • Experimenting with other proprietary vendors (especially Anthropic's Claude models): Claude offers fast inference, roughly GPT-3.5-level accuracy, more customization options for large customers, and context windows of up to 100k tokens (although we found that accuracy degrades as input length increases).
  • Triaging some requests to open-source models: this is especially effective in high-volume B2C use cases like search or chat, where query complexity varies widely and free users need to be served cheaply (see the routing sketch after this list):
  1. This usually makes the most sense in combination with fine-tuning an open-source base model. We won't go deep on that tooling stack in this article, but platforms such as Databricks, Anyscale, Mosaic, Modal, and RunPod are used by a growing number of engineering teams.

  2. Open-source models can be served through a variety of inference options, including the simple API interfaces of Hugging Face and Replicate; raw compute resources from the major cloud providers; and more opinionated cloud offerings like those listed above.
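Here is a hedged sketch of that triage pattern: cheap, simple queries go to a cheaper (possibly self-hosted, open-source) model, and hard ones are reserved for GPT-4. The is_complex() heuristic and the cheap_complete() stub are illustrative placeholders, not a recommended policy.

```python
# Sketch of request triage between a frontier model and a cheaper model.
import openai

def gpt4_complete(prompt: str) -> str:
    resp = openai.ChatCompletion.create(        # pre-1.0 SDK interface
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

def cheap_complete(prompt: str) -> str:
    # Placeholder: in practice this might call a self-hosted LLaMA-family model
    # behind Hugging Face, Replicate, or an in-house inference server.
    raise NotImplementedError

def is_complex(prompt: str) -> bool:
    # Toy heuristic; real systems might use a classifier or query metadata.
    return len(prompt) > 500 or "explain" in prompt.lower()

def route(prompt: str) -> str:
    return gpt4_complete(prompt) if is_complex(prompt) else cheap_complete(prompt)
```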

Currently, open-source models lag behind proprietary products, but the gap is starting to close. Meta's LLaMa model set a new bar for open-source accuracy and spawned a range of variants. Since LLaMa is licensed for research use only, a number of new providers have stepped in to train alternative base models (e.g., Together, Mosaic, Falcon, Mistral). Meta is also still debating a truly open-source release of LLaMa 2.

When (not if) open-source LLMs reach accuracy levels comparable to GPT-3.5, we expect to see a Stable Diffusion-like moment for text, with large-scale experimentation, sharing, and productionizing of fine-tuned models. Hosting companies like Replicate are already adding tooling to make these models easier for software developers to consume. There is a growing belief among developers that smaller, fine-tuned models can reach state-of-the-art accuracy for narrow use cases.

Operational tooling for LLMs is not yet well understood by most of the developers we interviewed. Caching is relatively common (often built on Redis), because it improves application response times and reduces costs. Tools like Weights & Biases and MLflow (ported from traditional machine learning) or PromptLayer and Helicone (purpose-built for LLMs) are also fairly widely used. They can log, track, and evaluate LLM outputs, usually to improve prompt construction, tune pipelines, or select models. A number of new tools are also being developed to validate LLM output (e.g., Guardrails) or detect prompt injection attacks (e.g., Rebuff). Most of these operational tools encourage use of their own Python clients to make LLM calls, so it will be interesting to see how these solutions coexist over time.
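A minimal sketch of the Redis caching pattern mentioned above; the key scheme and TTL are assumptions, and complete_fn stands in for whatever model call the application uses:

```python
# Cache LLM responses in Redis, keyed on a hash of the prompt.
import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_complete(prompt: str, complete_fn, ttl_seconds: int = 3600) -> str:
    key = "llm:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit                           # served from cache: no model call, no cost
    result = complete_fn(prompt)
    cache.set(key, result, ex=ttl_seconds)   # expire stale answers after the TTL
    return result
```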

Finally, the static parts of an LLM application (that is, everything other than the model) also need to be hosted somewhere. The most common solutions we've seen so far are standard options like Vercel or the major cloud providers. However, two new categories are emerging. Startups like Steamship provide end-to-end hosting for LLM applications, including orchestration (LangChain), multi-tenant data contexts, asynchronous tasks, vector storage, and key management. Companies like Anyscale and Modal allow developers to host models and Python code in one place.

What about agents?

One important component missing from this reference architecture is AI agent frameworks. AutoGPT, described as "an experimental open-source attempt to make GPT-4 fully autonomous," became the fastest-growing GitHub repository in history this spring, and practically every AI project or startup today incorporates agents in some form.

Most of the developers we spoke with are extremely excited about the potential of agents. The in-context learning pattern described in this article can effectively address hallucination and data-freshness problems and thus better support content generation tasks. Agents, on the other hand, give AI applications a fundamentally new set of capabilities: solving complex problems, acting on the outside world, and learning from experience after deployment. They do this through a combination of advanced reasoning/planning, tool use, and memory/recursion/self-reflection.

As such, agents have the potential to become a core part of the LLM application architecture (or even to take over the entire stack, if you believe in recursive self-improvement). Existing frameworks like LangChain already incorporate some agent concepts. There's just one problem: agents don't really work yet. Most agent frameworks today are still at the proof-of-concept stage, capable of incredible demos but not of reliable, repeatable task completion. We are watching closely to see how agents develop in the near term.

Looking to the future

Pre-trained AI models represent the most significant change in software architecture since the Internet. They make it possible for individual developers to build incredible AI applications in days, surpassing supervised machine learning projects that used to take large teams months to build.

The tools and patterns listed here are likely a starting point for integrating LLMs, not an end state. We will update this as the underlying technology changes (for example, a shift toward model training) and publish new reference architectures where it makes sense.
