The Safer Approach to Innovating with AI: Graphs and Open Source LLMs


July 27, 2023

Ben Nussbaum

In the first few months of 2023, following the public release of ChatGPT, it’s been nearly impossible to avoid the awe, curiosity, and fear of missing out around how artificial intelligence (AI) founded on large language models (LLMs) is going to change the responsibilities of every knowledge worker, meaning any role where the employee essentially thinks for a living. It’s also been impossible to avoid influencers and tech enthusiasts who claim they’ve uncovered the best way to apply AI in your day-to-day work to avoid being rendered obsolete.

The real question most knowledge workers and data analysts should be asking is whether an LLM could actually help them solve complex challenges around their organization’s data. If you’re wading through a data lake of unstructured text data, like articles or social media posts, or need to organize your thinking around the relationships between people, places, and things, the answer just might be yes. LLMs, and AI in general, have already proven to be valuable tools for augmenting, not replacing, knowledge workers in several ways:

Generating copy, article text, or even images
Summarizing news, social media, or other unstructured text
Delivering better chatbot experiences for real-time customer support
Translating unstructured text

But before you jump headfirst into any AI solution, you need to understand what risks you’re implicitly accepting. If you begin using a tool like ChatGPT, or the API that its creator, OpenAI, provides at a cost, you’re essentially handing the value behind your data to a mysterious black box. You have no guarantees over what OpenAI does with that data. The results it returns are no longer your innovation or tactical advantage, but an out-of-the-box solution any of your competitors could stumble into as well.

Instead of following the ChatGPT hype down a path you might regret later, take time for due diligence on another quickly-changing area of AI that might’ve passed under your radar: open source LLMs.

What’s in an open source LLM?

Let’s start with what’s not. The big organizations involved in AI—think OpenAI, Google, Microsoft, and Facebook—are developing proprietary models, using a trillion parameters or more, on massive private datasets. These LLMs are most often available to the public through a commercial API, which you must pay for based on your usage, or a web application that lets you input data or a prompt to receive an answer.

There’s no visibility into the models they’ve built, the data they’ve used, or what technological magic makes their implementation stand out.

The open source LLM flips all those modalities on their head. Let’s simplify with a few definitions:

Open source: A type of licensing that allows any user to download software or its source code, and view, alter, or re-distribute the software to anyone else. The nuances of open source licenses can vary widely, and controversially, but in general, open source software is free to use for commercial purposes, subject to the terms of its license.

Large language model (LLM): A model trained on large quantities of textual data, which can then be used to perform a variety of tasks on other natural language data, such as analyzing sentiment, identifying named objects, or writing entirely new content based on your inputs.

ChatGPT, for example, is based on GPT-3.5 and GPT-4, both of which are proprietary LLMs developed by OpenAI.

With these definitions in hand: an open source LLM is developed transparently by a community of researchers and developers, with the source code, model architecture, weights, training data, and build/run instructions published on a Git provider like GitHub. You can freely download open source LLMs, use them to solve high-value data challenges in a commercial environment, and even modify how they work to suit your specific requirements.

They also aren’t nearly as well-trained as their proprietary counterparts, a function of the sheer cost, in GPU computing power and time, of pre-training. That said, after roughly a billion parameters, most readers are unable to distinguish between human and AI-generated content, which makes this training gap between proprietary and open source LLMs negligible and suggests that you aren’t missing out on much by going the open source route.

Much like the proprietary models, open source LLMs are constantly being optimized using advanced techniques:

Low-Rank Adaptation (LoRA), which its authors report can reduce the number of trainable parameters by 10,000 times and GPU memory needs by 3 times compared with fully fine-tuning GPT-3.
Self-Instruct, which allows an LLM to improve its instruction-following abilities using its generated results instead of human-written data, creating a feedback loop that significantly speeds up model development.
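To make the LoRA idea more concrete, here is a minimal sketch in plain Python. It assumes a single weight matrix W that stays frozen, plus two small trainable matrices A and B whose product is added to W's output. The function names, matrix sizes, and rank are illustrative assumptions, not taken from any particular library, and the per-matrix savings shown here will differ from whole-model figures, which depend on the model and on which layers you adapt.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x).

    W (d_out x d_in) is the frozen pre-trained weight matrix.
    A (rank x d_in) and B (d_out x rank) are the only matrices
    that would be updated during fine-tuning.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, rank=None):
    """Count trainable parameters for one weight matrix.

    With rank=None, every entry of W is trained (full fine-tuning).
    Otherwise only A and B are trained: rank * (d_in + d_out) entries.
    """
    if rank is None:
        return d_out * d_in
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Illustrative sizes only: one 4096 x 4096 matrix, adapted at rank 8.
    full = trainable_params(4096, 4096)
    lora = trainable_params(4096, 4096, rank=8)
    print(full, lora, full // lora)  # 16777216 65536 256
```

When B starts as all zeros, the adapted layer returns exactly what the frozen layer would, so fine-tuning begins from the original model's behavior and only the small A and B matrices need gradients and optimizer state.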

With just a few GPUs, you can adapt an open source LLM to your organization’s unique needs, making the potential impact of AI far more accessible to more organizations. This opportunity to optimize at a reasonable cost and time commitment is so rich that even OpenAI’s Andrej Karpathy predicts a “Cambrian explosion” in the near future that will challenge the proprietary moats created by OpenAI and its Silicon Valley peers.

It might also present your best path toward leveraging AI safely within your organization.

Open source LLMs vs. proprietary LLM APIs

While the former might seem like a complex technological effort, and the latter a simple (but costly) path toward experimenting with AI at your organization, there’s far more to the discussion than how these models are developed and trained:

Open source LLM

Free or low-cost options depending on the LLM you choose
Developed transparently by researchers/developers volunteering their time or contributing on behalf of an organization
View and modify the source code, if you have the technical skills in-house to do so
Contribute to the project directly through community-led data gathering or direct contributions to code
Likely get less refined results out of the box with generalist techniques and training data
Opportunity to fine-tune and re-train the standard model based on your unique needs
Re-train models to eliminate the knowledge cut-off present from the original corpus of training data
Validate training data, which you can use to avoid potential legal issues

Proprietary LLM (API)

Always paid, with varied costs that can become prohibitive when working on complex data challenges
Developed in a proprietary fashion by the organization selling API access
Simple access via API calls or a thin web-based user interface
Works even if you have no in-house machine learning talent or capabilities
Modeled on massive quantities of training data, which could lead to better outcomes right away
No opportunity to fine-tune a proprietary LLM to return better results for your unique needs
Handing your data over to a black box without knowing whether it’s secure or how it might be used

But the biggest difference is whether you view your approach to AI, and the models you develop and contribute to, as a key differentiator for your organization. If you’re okay with using standard-issue models and then layering other competitive advantages on top of the results, a proprietary LLM could be a great candidate. But if your business model requires the quality and diversity of your AI models to be fundamentally different from what anyone can find off the shelf, then you need to leverage the flexibility of open source LLMs.

Next up: Integrate an open source LLM with your data

How do you get any LLM, much less an open source one that doesn’t come with nearly as many “batteries included,” to talk to your organization’s existing data?

If you’re already using a composable graph + AI platform, like GraphGrid, then you’re well on your way to a ChatGPT-like experience using the data you already have, opening up new possibilities for analysis and problem-solving without taking unnecessary risk or ceding your competitive advantage.

GraphGrid helps you quickly deploy a graph database and additional pipelines to connect to your existing data lake or warehouse, extract the unstructured data most relevant to your current challenge, and analyze it with natural language processing (NLP) to return a connected network of people, places, and things.
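As a rough illustration of what a “connected network of people, places, and things” can look like in code, the sketch below links entities that appear in the same document. It is not GraphGrid’s API: the documents, entity names, and helper functions are all hypothetical, and it assumes an NLP step has already tagged the entities for you.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical output of an entity-extraction step: (name, type) pairs
# per document. These values are made up for illustration.
DOCS = {
    "doc1": [("Ada Lopez", "PERSON"), ("Acme Corp", "ORG"), ("Columbus", "PLACE")],
    "doc2": [("Ada Lopez", "PERSON"), ("Columbus", "PLACE")],
}


def build_cooccurrence_graph(docs):
    """Return {(entity_a, entity_b): count} for entities sharing a document.

    Each pair is stored once, with the two names in alphabetical order.
    """
    edges = defaultdict(int)
    for entities in docs.values():
        names = sorted({name for name, _type in entities})
        for a, b in combinations(names, 2):
            edges[(a, b)] += 1
    return dict(edges)


def neighbors(edges, entity):
    """List every entity directly connected to the given one."""
    found = set()
    for a, b in edges:
        if a == entity:
            found.add(b)
        elif b == entity:
            found.add(a)
    return sorted(found)


if __name__ == "__main__":
    graph = build_cooccurrence_graph(DOCS)
    print(graph[("Ada Lopez", "Columbus")])
    print(neighbors(graph, "Acme Corp"))
```

Co-occurrence is only the simplest way to draw edges. A graph database would typically store typed relationships and properties as well, but the shape of the result is the same: entities as nodes, connections as edges.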

Think of this process almost like data preparation, which is a common and painful challenge for organizations that need to do more with their unstructured data. GraphGrid also comes with a knowledge graph, which helps you curate your data with graph thinking and forge a golden path toward developing knowledge that’s easier to feed into an open source LLM for solving complex data challenges.

Graph data is also broad and context-rich, which makes it ideal training data for fine-tuning your unique implementation of an open source LLM—an advantage you can’t access if you forward all your data into a black box API.
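One way to picture graph data as fine-tuning material is to turn each relationship into a prompt and completion pair. The sketch below does that for a handful of made-up triples and writes them as JSON Lines. The triples, the prompt wording, and the record fields are assumptions for illustration only; the format your chosen open source LLM expects may differ.

```python
import json

# Hypothetical (subject, relationship, object) triples, as they might be
# exported from a knowledge graph. These values are made up.
TRIPLES = [
    ("Ada Lopez", "WORKS_AT", "Acme Corp"),
    ("Acme Corp", "LOCATED_IN", "Columbus"),
]


def triple_to_record(subject, relationship, obj):
    """Turn one graph triple into a prompt/completion training record."""
    relation_text = relationship.replace("_", " ").lower()
    return {
        "prompt": f"How is {subject} related to {obj}?",
        "completion": f"{subject} {relation_text} {obj}.",
    }


def to_jsonl(triples):
    """Serialize all triples as JSON Lines, one training record per line."""
    return "\n".join(json.dumps(triple_to_record(*t)) for t in triples)


if __name__ == "__main__":
    print(to_jsonl(TRIPLES))
```

Because every record is generated from data you already own and can inspect, you can review exactly what the model will be trained on before any fine-tuning run starts.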

If you’re ready to take this step toward integrating an open source LLM with your data, let’s schedule a free solutioning session. You’ll get an opportunity to chat with one of our graph + AI professionals about your unique use case and start validating which open source LLM would work best to deliver the AI-driven answers you’re after. In the meantime, you’re always welcome to download GraphGrid CDP at our freemium license level and check out all the onboarding resources we’ve created to make your first day in graph + AI an eventful one.