
Newsletter | Sep 25, 2025

Version Control for AI Prompts

The edge is now in the prompt, not in the execution


We have written about version control in the past, but the widespread integration of artificial intelligence (AI) into software development has created a new set of processes that require it. As developers integrate more natural language prompts into their code, they need a way to manage versions of those prompts and to evaluate how effective each version is at achieving its goal. This week, we want to explore what prompt engineering is and why new version control systems are needed to manage this workflow.

What is Prompt Engineering?

Prompt engineering is the process of guiding an AI model to produce a specific output. For example, if you tell an AI model:

“Give me a picture of a building.”

You could get almost anything back, and likely not what you had in mind. However, if you get more specific and say:

“Create a realistic, life-like image of the Empire State Building at night. The top of the building should glow with golden lighting. The background sky should be filled with stars, and there should be a large red moon prominently visible. The perspective should highlight the building against the night sky, with sharp detail and cinematic contrast.”

The result is likely to be a lot more precise. This same phenomenon exists with every interaction with an AI model. Whether you are trying to get it to engage with a code base, summarize an article, or chain together a series of functions, the more specific the user input, the better the output.

Integrating prompts into a software development workflow means they won’t just be one-off inputs; they’ll be reused repeatedly. For example, you might set up a program to run the same instruction every week: “Summarize every newsletter from Konvoy and post it on Twitter.” Over time, you’ll likely refine that instruction to improve clarity, accuracy, or style. Keeping track of how prompts evolve becomes essential, since even small changes can lead to very different outputs.
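
A minimal sketch of what this could look like follows; the class names and structure are illustrative, not any particular product’s API. Each saved edit gets a version number, a timestamp, and a note, so earlier versions can be recalled and diffed:

```python
# Illustrative sketch of prompt versioning - not a real product's API.
# Each commit records a version number, timestamp, and note so old
# versions can be recalled and compared.
import difflib
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PromptVersion:
    version: int
    text: str
    note: str
    created_at: datetime

@dataclass
class PromptHistory:
    name: str
    versions: list[PromptVersion] = field(default_factory=list)

    def commit(self, text: str, note: str = "") -> PromptVersion:
        """Save a new version of the prompt."""
        pv = PromptVersion(len(self.versions) + 1, text, note,
                           datetime.now(timezone.utc))
        self.versions.append(pv)
        return pv

    def get(self, version: int) -> PromptVersion:
        """Recall an earlier version (1-indexed)."""
        return self.versions[version - 1]

    def diff(self, a: int, b: int) -> str:
        """Show what changed between two versions."""
        return "\n".join(difflib.unified_diff(
            self.get(a).text.splitlines(),
            self.get(b).text.splitlines(),
            f"v{a}", f"v{b}", lineterm=""))

history = PromptHistory("weekly-summary")
history.commit("Summarize every newsletter from Konvoy and post it on Twitter.")
history.commit("Summarize every Konvoy newsletter in under 280 characters, "
               "in a neutral tone, and post it on Twitter.",
               note="Tighten length and tone")
print(history.diff(1, 2))
```

Storing the history is the easy part; as the rest of this piece argues, the harder questions are who edits these versions and how each one is evaluated.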

Prompt vs. Code Version Control

In traditional software development, version control systems help teams track all changes made to code over time, allowing users to recall older versions, restore files, and test new features without affecting the main project. Systems like Git, Perforce, and Diversion (a Konvoy portfolio company) help developers keep track of their changes and often have built-in tools to ensure everything is working properly when things change.

Similarly, version control for prompts can help you track changes, collaborate with other members of your team, and revert to old versions of a prompt. However, the similarities stop there: version control for code and version control for prompts involve distinctly different workflows that require different tools and processes to manage properly.

  • Technical & Non-Technical Users: Because prompts are written in natural language, non-technical team members can create and use them. These users need a familiar space to engage with prompts, as Git workflows can be overwhelming for them.
  • Prompt Management: Prompts show up in code as large blocks of text, which makes a collaborative word-processing document a better place to edit and manage them than traditional code version control.
  • AI Models: New AI models with different capabilities are released constantly. A place where you can easily manage API keys, try different models, and evaluate their performance is not something existing version control tools offer, although third-party tools like Openrouter.ai can help here.
  • Different Measurements: Existing version control tools have no central platform for monitoring usage, messages, or dollars spent on models. You can visit individual model portals, but centralizing this data for testing is critical (a minimal sketch of this kind of tracking follows this list).
  • Evaluation: Evaluating these models requires more than just metrics. You need a way to validate the clarity, accuracy, and usefulness of the outputs they generate.
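
As a rough illustration of the centralized measurement point above, a prompt platform might log every model call against the prompt version that produced it. The model names and per-token prices below are hypothetical placeholders, not real pricing:

```python
# Hypothetical sketch of centralized usage tracking per prompt version.
# Model names and prices are illustrative placeholders.
from collections import defaultdict
from dataclasses import dataclass

# Assumed price table: (input, output) dollars per 1K tokens.
PRICE_PER_1K = {
    "model-a": (0.0005, 0.0015),
    "model-b": (0.003, 0.015),
}

@dataclass
class CallRecord:
    prompt_name: str
    prompt_version: int
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def cost(self) -> float:
        p_in, p_out = PRICE_PER_1K[self.model]
        return (self.input_tokens * p_in + self.output_tokens * p_out) / 1000

def summarize(calls: list[CallRecord]) -> dict:
    """Aggregate call counts and spend by (prompt, version, model)."""
    totals = defaultdict(lambda: {"calls": 0, "cost": 0.0})
    for c in calls:
        key = (c.prompt_name, c.prompt_version, c.model)
        totals[key]["calls"] += 1
        totals[key]["cost"] += c.cost
    return dict(totals)

calls = [
    CallRecord("weekly-summary", 1, "model-a", 1200, 300),
    CallRecord("weekly-summary", 2, "model-a", 900, 280),
    CallRecord("weekly-summary", 2, "model-b", 900, 280),
]
for (name, version, model), stats in summarize(calls).items():
    print(f"{name} v{version} on {model}: "
          f"{stats['calls']} calls, ${stats['cost']:.4f}")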

Evaluation Data as a Moat

New entrants into this space can differentiate themselves from traditional version control systems by 1) offering a clean user interface, 2) aggregating a unique set of tools and data analytics, and 3) leveraging the unique data captured during the evaluation process.

Evaluating the output of an LLM is different from evaluating the output of code. For example, determining whether an LLM chatbot is providing correct information about your product while striking the right tone is not a binary check. This requires different types of testing. PromptLayer, a company attempting to solve version control for prompts, details a few methodologies that can be used here:

  • Negative Examples: Identifying examples of bad responses and setting up guardrails to make sure they do not appear in the output.
  • LLM as a Judge Rubric: Using another LLM to score the output against a set of parameters (both approaches are sketched after this list).
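
Here is a minimal sketch of both ideas. It assumes a generic `complete(model, prompt)` helper standing in for any chat-completion client; it is not a real library call, and the rubric and patterns are illustrative:

```python
# Illustrative sketch of two evaluation styles. `complete` stands in
# for any chat-completion client; it is an assumed helper, not a real
# library call.
NEGATIVE_EXAMPLES = [
    "I don't know",                # unhelpful deflection
    "As an AI language model",     # boilerplate we want to avoid
]

def passes_guardrails(output: str) -> bool:
    """Negative examples: reject outputs that echo known-bad patterns."""
    return not any(bad.lower() in output.lower() for bad in NEGATIVE_EXAMPLES)

JUDGE_RUBRIC = """You are grading a chatbot answer about our product.
Score 1-5 on each of: accuracy, clarity, tone.
Answer to grade:
{output}
Reply with three integers separated by spaces."""

def judge_score(output: str, complete) -> dict:
    """LLM-as-a-judge: ask a second model to score the output on a rubric."""
    reply = complete("judge-model", JUDGE_RUBRIC.format(output=output))
    accuracy, clarity, tone = (int(x) for x in reply.split())
    return {"accuracy": accuracy, "clarity": clarity, "tone": tone}

# Stub client so the sketch runs end to end; swap in a real API call.
def complete(model: str, prompt: str) -> str:
    return "4 5 4"

output = "The Empire State Building render is attached, with a red moon."
if passes_guardrails(output):
    print(judge_score(output, complete))
```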

To support these new styles of evaluation, new tools are required that can automatically run tests on new prompts, display results side by side, and build custom scorecards tailored to each team’s evaluation needs.
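
Building on the judge sketch above, a side-by-side comparison might run two prompt versions over the same test inputs and average the rubric scores into a simple scorecard. Again, this is purely illustrative, reusing the hypothetical `judge_score` and `complete` helpers:

```python
# Illustrative scorecard comparing two prompt versions on the same test
# inputs, reusing judge_score and complete from the previous sketch.
from statistics import mean

def scorecard(prompt_template: str, inputs: list[str], complete) -> dict:
    """Run a prompt version over test inputs and average rubric scores."""
    scores = [judge_score(complete("model-a",
                                   prompt_template.format(input=i)),
                          complete)
              for i in inputs]
    return {k: round(mean(s[k] for s in scores), 2) for k in scores[0]}

inputs = ["this week's Konvoy newsletter", "last week's Konvoy newsletter"]
v1 = "Summarize {input} and post it on Twitter."
v2 = "Summarize {input} in under 280 characters, in a neutral tone."
print("v1:", scorecard(v1, inputs, complete))  # compare side by side
print("v2:", scorecard(v2, inputs, complete))
```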

Scoring these prompts generates valuable data that can be leveraged to suggest improvements to other users looking to optimize their prompts. Depending on the complexity of the task, these new version control platforms could even be well-positioned to create a marketplace for prompts, helping users streamline their creation process and creating a data flywheel for the business.

Takeaway: As AI becomes a core part of software development, managing prompts is increasingly cumbersome and needs to be supported with purpose-built tools. Just as Git transformed how developers track and collaborate on code, new systems are needed to version, evaluate, and optimize prompts. The real moat will not come from storing prompt history alone, but from building rich evaluation datasets that guide better outputs. Teams that invest early in prompt version control will not only gain increased productivity and performance, but will also have a testing system that aligns with best practices in software development.
