It All Started with Curiosity
When the AI revolution began with the first version of ChatGPT in 2022, we were curious, trying to understand what was happening. We have been working with renowned clients on AI since the beginning of Parser. A leading global financial intelligence and risk analytics firm was one of those first clients, having begun working with us on AI more than five years ago. Our experience was primarily focused on automating machine learning pipelines and data processing. In addition to creating machine learning (ML) models, we transformed ML notebooks into deployable, production-ready microservices accessible via APIs and advanced user interfaces. And while we had used neural networks for image processing, classification tasks, and sentiment analysis, we quickly realised that using generative models (i.e. LLMs, Large Language Models), which create novel or derivative work, was substantially different from merely classifying or predicting outcomes using analytics. There was something about it that felt magical.
As engineers who had previously worked with neural networks and deep learning, we knew that such “magic” was actually an illusion created by many mathematical computations. It was, in fact, no more than thousands or even millions of pseudo-regressions combined in different ways to emulate how humans express and relate concepts. Our first step was to learn, and we often heard that “attention” was a key component. And indeed it is, but so is the transformer architecture, which allows text to be transformed into other text: questions into answers, English into Spanish, text into images, or even images into videos. There were also “embeddings”; without them, we wouldn’t understand the concepts that later became popular with Retrieval-Augmented Generation (RAG), semantic search, or sparse matching. In simpler terms, embeddings relate the meanings of similar words and phrases by placing them in a multi-dimensional vector space, where each dimension could represent aspects like gender, plurality, physical vs. abstract, domestic vs. wild, happy vs. sad, and so on. And as we are talking about thousands of dimensions for every single word, imagine the rich array of relationships that can be captured. Each of these dimensions gave the text depth, transforming raw information into knowledge, or at least a better understanding of concepts.
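To make the embedding idea concrete, here is a minimal sketch using the open-source sentence-transformers library; the model name and the example phrases are illustrative choices of ours, not part of any production system.

```python
# Minimal illustration of embeddings: semantically similar phrases end up
# close together in the vector space, unrelated ones further apart.
# Assumes the open-source sentence-transformers package and a small public model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

phrases = ["the queen", "the king", "a wild wolf", "a happy puppy"]
vectors = model.encode(phrases)  # one vector per phrase

# Cosine similarity: higher values mean the phrases are semantically closer.
scores = util.cos_sim(vectors, vectors)
for i, a in enumerate(phrases):
    for j, b in enumerate(phrases):
        if i < j:
            print(f"{a!r} vs {b!r}: {scores[i][j].item():.2f}")
```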
This understanding, however, did not come easily. We read many books, doubted the fanatics evangelising AI, and experimented with software as a way to predict software. But let’s get back to the real story—our journey through AI.
Chapter 1: The Beginnings | Hugging Face and Self-Hosted LLMs
A few years ago, we decided to embark on our own AI journey. We weren’t exactly sure what we would do, but we were certain it could help us improve our productivity. This led us to focus on our internal needs as a boutique software development company. We began automating tedious tasks like generating unit tests, fixing test cases when APIs or on-screen objects changed, and even writing user stories based on functional requirements. If we could generate user stories, why not acceptance criteria? If we could do that, why not in Gherkin, so they could be automated more easily by auto-generating those test cases? And before we knew it, we had closed a small loop (a sketch of that flow follows below).
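As a rough illustration of that flow, the sketch below turns a requirement into a user story with Gherkin acceptance criteria in a single call. It assumes, for simplicity, an OpenAI-compatible chat endpoint (which self-hosted models can also expose); the client, model name, and prompt wording are illustrative rather than our exact implementation.

```python
# Illustrative sketch: turn a functional requirement into a user story with
# Gherkin acceptance criteria via one LLM call. The model name and prompt are
# assumptions for illustration, not our exact internal tooling.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY (or a self-hosted compatible endpoint)

requirement = "Users must be able to reset their password via an emailed link."

prompt = (
    "Write a user story for the requirement below, then acceptance criteria "
    "in Gherkin (Given/When/Then) so they can be automated as test cases.\n\n"
    f"Requirement: {requirement}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```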
At that time, services like Amazon Bedrock had not yet been launched, and trusting a newcomer like OpenAI wasn’t the safest option, primarily for security reasons. So we took the prudent route—deciding to use open-source models, mostly from reputable sources like Hugging Face, and hosting them ourselves. This way, our data remained within our environment, or at least within our Cloud VPN. Moreover, it allowed us to try different models and learn in the process. Since these models were open-source, we could look at the code to truly understand how things worked.
But there was a realisation that took us by surprise: understanding how Large Language Models (LLMs) and transformers worked was different from understanding the magic behind a product like ChatGPT. It’s like understanding how a motherboard works, or even a simple operating system, and thinking that would help you understand advanced software like Illustrator or AutoCAD. The depth of knowledge needed existed in more abstract layers, and looking at the hardware didn’t explain how the application functioned, like staring at a brain in a jar, trying to understand consciousness.
Hosting Open-Source LLMs
After some time, we arrived at a powerful but manageable model named WizardCoder-15B (there is also a smaller version with 3B parameters). This LLM had been pre-trained on a corpus of both text and code, making it useful for the development lifecycle as well as for coding and testing. We built a basic infrastructure, starting locally (hence the importance of running it on our own computers) and later scaling to AWS; a minimal sketch of what that looked like follows below.
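For readers curious what self-hosting looks like in practice, here is a minimal sketch of loading such a model with Hugging Face transformers. The model ID, prompt, and generation settings are illustrative assumptions, and a GPU with enough memory (or a quantised variant) is needed for a 15B model.

```python
# Minimal sketch of self-hosting a code LLM with Hugging Face transformers.
# Model ID and generation parameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "WizardLM/WizardCoder-15B-V1.0"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Java method that validates an email address."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```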
We leveraged it for Java code generation, explanation, and refactoring, significantly speeding up our development process. It was particularly useful for quickly creating skeletons of functions, classes, and endpoints (entire APIs in seconds). The first problem we faced was the context window: inherently limited, it could process only a few files with no more than a few hundred lines of code. This led us to understand an essential pattern in software and in generative AI: it’s easy to create a proof of concept or a small demo, but scaling it for real-life usage involving dozens of files and thousands of lines of code requires exponentially more effort.
We struggled with code chunking, overlapping, and summarising processed code (a simplified sketch of the chunking idea follows below), which drove our need for what would later become RAG. Despite these challenges, we continued experimenting. Our self-hosted LLM allowed us to build a pseudo-ChatGPT, integrating it into Slack after early demos with Gradio as the UI. We even used it as a plugin in Visual Studio to translate between languages and versions, improve code readability, create tests, and refactor. Another plugin generated user stories and acceptance criteria in Gherkin for BDD from Jira tickets.
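The chunk-and-overlap idea can be summarised in a few lines of Python. The sizes and file name below are arbitrary examples; in practice, code chunking benefits from splitting on syntactic boundaries (functions, classes) rather than raw lines.

```python
# Naive sketch of splitting a large source file into overlapping chunks so
# each chunk fits the model's context window. Sizes are arbitrary examples.
def chunk_lines(lines, chunk_size=120, overlap=20):
    chunks, start = [], 0
    while start < len(lines):
        end = min(start + chunk_size, len(lines))
        chunks.append("\n".join(lines[start:end]))
        if end == len(lines):
            break
        start = end - overlap  # overlap keeps context across chunk boundaries
    return chunks

with open("BigService.java") as f:  # hypothetical file name
    chunks = chunk_lines(f.read().splitlines())
# Each chunk (plus a short summary of previous chunks) is then sent to the LLM.
```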
However, limitations became apparent, like Slack’s syntax parsing issues, the inability to maintain context, and the lack of streaming responses. The idea of “maintaining context” seems trivial, as it’s taken for granted with ChatGPT. But for true conversational interaction, knowing who the user is, saving past messages (in an encrypted database), and deciding whether to delete or summarise old data proactively are all essential. This was cumbersome until OpenAI introduced system messages, which keep the initial prompt always relevant; the sketch below illustrates the idea.
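Here is a minimal sketch of what “maintaining context” means in code, assuming an OpenAI-style chat API; the trimming strategy shown is deliberately simplistic.

```python
# Sketch of conversational memory: a pinned system message plus a rolling
# window of past turns. Real systems summarise, encrypt, and persist history.
from openai import OpenAI

client = OpenAI()
system = {"role": "system", "content": "You are Parser's internal assistant."}
history = []  # past user/assistant turns, e.g. loaded from an encrypted store

def ask(user_text, max_turns=10):
    history.append({"role": "user", "content": user_text})
    # Keep only the most recent turns so we stay within the context window.
    trimmed = history[-max_turns:]
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[system] + trimmed,
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer
```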
We were proud to have a working LLM in Slack, even without proper context. Yet it was clear that the result was inferior to ChatGPT or Claude, not to mention the lack of up-to-date information beyond the pre-training cutoff. However, with an acceptable context window, we could upload files (using OCR if needed) and interact with them, summarise technical articles, and answer questions. We even connected our humble data platform to the chatbot, allowing it to respond to natural language questions with SQL-like insights from our candidate pipeline and revenue data (a sketch of that pattern follows below). These prototypes still suffered from a lack of personalisation, scalability, and memory.
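The natural-language-to-SQL part is conceptually simple: give the model the schema and the question, then run the SQL it returns against a read-only connection. The sketch below is a hedged illustration; the schema, table names, and model are made up, and real code would validate the generated SQL before executing it.

```python
# Sketch of answering natural-language questions with generated SQL.
# Schema, table names, and model are illustrative; only ever run the result
# against a read-only replica, after validation.
import sqlite3
from openai import OpenAI

client = OpenAI()
schema = "candidates(id, name, primary_skill, status), revenue(month, amount)"

def answer(question):
    sql = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Schema: {schema}\nReturn only a SQL query answering: {question}",
        }],
    ).choices[0].message.content.strip()
    with sqlite3.connect("file:warehouse.db?mode=ro", uri=True) as conn:
        return conn.execute(sql).fetchall()
```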
We learned quickly that although the original costs seemed promising, hosting our own models became more expensive than using the newly released AWS Bedrock. Moreover, an inappropriate prompt could crash the infrastructure.
Chapter 2: Competing with Giants | Embracing Security and Infrastructure as a Service
We soon realised that managing everything in-house was costly and distracted us from what we wanted to learn and accomplish. With the releases of Bedrock, Azure AI, and stable versions of OpenAI’s APIs, we shifted focus. Just as data centres transitioned to the cloud, we recognised that AI security could also be managed externally for better compliance and scalability. Importantly, these cloud providers offered compliance with GDPR, HIPAA, and other regulations, and their models didn’t train on user data (under specific paid licenses, of course). It is key to remember that LLMs themselves store nothing; what keeps such information on the cloud providers’ servers are the user applications interacting with those LLMs (like ChatGPT) and their logging systems.
We experimented with various LLMs: Google’s Gemini, Anthropic’s Claude Sonnet, and OpenAI’s GPT models. Endless discussions followed over which path was the right one. Ultimately, we realised we needed to use them all, depending on the need. Comparing models side-by-side became crucial, as each new release brought fresh perspectives. We returned to Slack, now using Bedrock as the backend to maintain a consistent API while switching models as needed (a sketch of that pattern follows below). However, we realised Slack was limited as a UI: no tables, diagrams, or dynamic streaming of content.
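What makes Bedrock attractive as a backend is that one request shape works across model families, so swapping models is a one-line change. A minimal sketch with boto3 follows; the region and model IDs are examples, not a recommendation.

```python
# Sketch of using Amazon Bedrock as a single backend for multiple model
# families. The Converse API keeps the request shape identical, so only the
# modelId changes. Region and model IDs below are examples.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

def chat(model_id, text):
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": text}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# Same call, different models behind the same API.
print(chat("anthropic.claude-3-5-sonnet-20240620-v1:0", "Summarise RAG in one line."))
print(chat("meta.llama3-70b-instruct-v1:0", "Summarise RAG in one line."))
```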
A User Interface To Abstract Models
Thus, we decided to build our own UI, incorporating SSO (Single Sign-On) and other functionalities, such as selecting the model or the type of agent to interact with. This gave us independence and enabled us to focus on cost-effective APIs rather than the expensive per-user licenses of ChatGPT and Claude. Inspired by Anthropic’s Claude and OpenAI’s Canvas, we realised that presenting code with syntax highlighting, rendering HTML, and creating tables were key for LLM usability and needed to be included in our UI. Moving beyond the UI, we began experimenting with RAG (Retrieval-Augmented Generation) to address context limitations and minimise hallucinations.
Agent, Where Is My Assistant?
In parallel, with the release of OpenAI’s assistants, we could enhance RAG and manage context through pseudo-agents. OpenAI provided different levels of usability, from rapid prototyping with custom GPTs to more complex control using the Assistants API for company-wide use. However, custom GPTs presented downsides, like the lack of control over parameters such as temperature and model configuration, which could lead to hallucinations. Cost, especially with per-user licenses, also remained a significant concern. As an alternative, using OpenAI’s API-based assistants was more cost-effective and guaranteed that our prompts and files were not used for training purposes, allowing us to roll these out to our employees (a sketch of that route follows below).
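A minimal sketch of the Assistants-API route using the official openai package is shown below; the assistant name, instructions, question, and model choice are illustrative assumptions rather than our production configuration.

```python
# Sketch of an API-based assistant with file search, the building block
# behind our pseudo-agents. Names, instructions, and model are illustrative.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Pre-sales helper",
    instructions="Answer questions using the attached project documents only.",
    model="gpt-4o",
    tools=[{"type": "file_search"}],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user",
    content="List our fintech projects and the technologies used.",
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)
messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)  # newest message first
```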
The RAG Revolution
Using these assistants, we implemented our first version of a pre-sales agent that used RAG-based retrieval to summarise projects, list capabilities by industry, anonymise clients, and group technology experience. Eventually, this system could create sales presentations almost automatically and respond to natural language queries about our past projects to drive sales talking points, reducing bottlenecks and the knowledge loss caused by the unavailability of key individuals.
Another straightforward use case was the HR and IT agents, which combine RAG retrieval and relational database queries to search for candidates by skills and technologies. Previously, we used purely deterministic queries, but that had its shortcomings: handling the sheer amount of contextual information in resumes required a hybrid approach with both deterministic and semantic searches (a simplified sketch follows below). The agent also allowed for anonymising and summarising resumes before sharing them externally.
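As a simplified illustration of what “hybrid” means here, the sketch below applies a deterministic SQL filter first and then ranks the survivors by embedding similarity to the query. The table, column, and model names are assumptions for illustration only.

```python
# Sketch of hybrid candidate search: a deterministic SQL filter narrows the
# pool, then semantic similarity over resume embeddings ranks it.
import sqlite3
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def search(skill, query, top_k=5):
    with sqlite3.connect("candidates.db") as conn:  # hypothetical database
        rows = conn.execute(
            "SELECT id, resume_text FROM candidates WHERE primary_skill = ?",
            (skill,),
        ).fetchall()
    if not rows:
        return []
    query_vec = model.encode(query)
    resume_vecs = model.encode([text for _, text in rows])
    scores = util.cos_sim(query_vec, resume_vecs)[0]
    ranked = sorted(zip(rows, scores), key=lambda x: float(x[1]), reverse=True)
    return [(row[0], float(score)) for row, score in ranked[:top_k]]

# e.g. search("Java", "experience with payment gateways and PCI compliance")
```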
In addition, we’re training agents to answer HR or IT policy questions, troubleshoot devices, request travel approvals, and assist in drafting content. The generation of code, documentation, and architectural diagrams is now available through multiple providers behind a unified UI, with a focus on professional and corporate usability. Preconfigured assistants help with tasks like preventing prompt injection and ensuring GDPR compliance.
Chapter 3: Real Use Cases | The Devil in the Details
Ironically, automating simple human tasks often posed more challenges than tackling complex AI problems. This realisation guided our next step in our AI maturity process – tailoring our implementations to streamline internal operations and scale without proportionally increasing our operational staff.
While many internal cost-saving use cases can be implemented using OpenAI-like technologies through APIs, more complex problems involving real-time data and intricate decisions require tailored RAG implementations and Agentic AI. Over the past year, we’ve experimented with RAG systems using multi-cloud components to help solve our clients’ challenges, enabling the automation of tasks over large datasets, mainly in scanned document formats. These learnings have accelerated our journey to create a powerful RAG system for our own processes.
For safe use of LLMs to translate, generate, and refactor code, we’re experimenting with and deploying containers that bundle our own UI with open-source small language models. One alternative we are evaluating, versus the original open-source LLM from Chapter 1, is using Ollama as a platform to run Llama 3.2 (3B) and other supported models like Phi-3 Mini, hosted locally on our engineers’ laptops to mitigate the risk of exposing client code to external systems (a minimal sketch follows below). In this way, working code never leaves our engineers’ environments and cannot be used to train models. This marks the next step in our internal AI maturity: tailoring implementations and deciding which type of model to use, whether SaaS, cloud-hosted, or locally hosted, and whether proprietary or open source, large or small, all in order to better solve specific, real use cases.
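Here is a minimal sketch of that locally hosted route using the ollama Python client; it assumes the Ollama daemon is running and the model has already been pulled, and the code snippet and prompt are purely illustrative.

```python
# Sketch of running a small language model entirely on a developer laptop via
# Ollama, so client code never leaves the machine. Assumes the Ollama daemon
# is running and the model has been pulled (e.g. `ollama pull llama3.2:3b`).
import ollama

code_snippet = """
public int add(int a, int b) { return a + b; }
"""

response = ollama.chat(
    model="llama3.2:3b",  # could also be "phi3:mini" or another local model
    messages=[{
        "role": "user",
        "content": f"Refactor this Java method and add a unit test:\n{code_snippet}",
    }],
)
print(response["message"]["content"])
```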
The Path Forward
Our journey has taught us that the real value of AI isn’t in the models themselves but in how they’re implemented to solve real business problems. We’ve learned to balance the allure of cutting-edge technology with practical business needs, security requirements, and scalability demands. As we continue exploring LLMOps, small language models, model merging and advanced Agentic AI, we’re excited to help other organisations navigate their own AI transformations, armed with the lessons we’ve learned along the way.
Key Lessons for Enterprise AI Adoption
For enterprises looking to embark on their AI journey, our story offers a crucial lesson: success in AI implementation isn’t about having the latest models or the biggest computing resources – it’s about understanding how to make these technologies solve real business problems while managing the complex interplay of security, scalability, and practicality.
Whether you’re just starting your AI journey or looking to scale your existing implementations, Parser’s experience can help guide your path to successful enterprise AI adoption.