APIs vs. In-house NLP Models: Which is Perfect for Your Business?

Business, Natural Language Processing, Scalability

November 04, 2022

Ben Nussbaum

As your organization starts its journey into natural language processing (NLP), you immediately reach a fork in the road. Do you invest in developing and maintaining your own models based on your domain, or do you use an API from a public cloud provider under a pay-as-you-go, NLP-as-a-service model?

Your team members will start asking the familiar questions about tech debt and cost that come with implementing any new tool or service. They will also worry about whether they have the right talent to build a model in-house, or whether a suitable public API is available.

But the real question you should ask yourself is: Are the quality and diversity of your NLP models, and the results they produce, your organization’s key differentiator?

If your answer is no:

You’re probably competing around user experience with an application that doesn’t require you to customize models with your own data, which means the NLP model is a means to an end,
Or you just need a “good enough” quality that lets you focus your efforts on getting all the parts of your business working around the model.

In this case, an API is probably a viable option.

If your answer is yes:

You’re competing and differentiating based on the effectiveness and quality of your model,
Or you need a technological advantage, proprietary to your organization, that none of your competitors can match.

Then you need to seriously consider the value of owning your NLP models from conception to deployment, as using off-the-shelf models like those provided by public clouds or John Snow Labs will hamper your NLP efforts from the ground up.

Nature and quality of API vs. in-house data corpus

Using a public API for NLP is a one-sided transaction: your application queries the URL(s) where the API runs, often through a simple curl request, and waits for a response, at which point it can display the data or perform additional analysis.
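
To make that request-and-wait flow concrete, here is a minimal Python sketch using only the standard library. The endpoint URL, the `{"text": ...}` payload, and the bearer-token header are assumptions made up for illustration; every real provider defines its own URL, authentication scheme, and JSON schema.

```python
import json
import urllib.request

# Hypothetical endpoint, for illustration only. Real providers publish
# their own URLs, auth schemes, and request/response formats.
API_URL = "https://nlp.example.com/v1/entities"


def build_request(text: str, api_key: str) -> urllib.request.Request:
    """Package raw text as a JSON POST request (assumed payload shape)."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def analyze(text: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send the text, block until the provider answers, and decode the JSON reply."""
    request = build_request(text, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Everything your application knows about the model is whatever comes back from `analyze`; the training data and model design stay on the provider's side of the URL.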

The API doesn’t know anything about the type of data you’re trying to process beyond the strings you send to the API, and, more importantly, you don’t know the details about how this provider trained their models beyond API reference documentation and the broad strokes of their inputs and goals.

Let’s say you use NLP for a highly specialized use case, such as extracting and labeling relevant medical details from unstructured healthcare data like doctor’s notes or electronic health records. You can feel reasonably confident your API provider trained their models on a corpus of relevant data. But you have no visibility into what that data looks like or its overall quality, and no insight into how the provider designed their models, whether that’s for Named Entity Recognition (NER), sentiment analysis, Key Phrase Relationship Extraction, Coreference Resolution, or another task.

In using an API, you’re ceding quality control and agreeing to an opaque, one-way transaction with your provider in exchange for other benefits, like a faster time-to-market or a simpler tech stack.

But when you build NLP models in-house, you’re responsible for everything. We’ll talk more about the developer and data team implications of in-house NLP a little later, but the burden of that responsibility comes with a tactical advantage: your team’s ability to carefully control all the inputs and outputs in a continuous cycle of quality improvement.

Other considerations for your NLP optimization

How your organization differentiates itself, and how much you care about the quality of the data corpus (a provider’s versus your own), are the best waypoints for deciding which NLP road to take, but they’re not the only variables. The choice between an API and in-house models also depends on other key factors:

Can you re-train models based on new or changed data?

No data remains static over time. As your organization changes and grows, along with the industry around it, your models need to be made aware of these trends. In highly specialized industries, for example, new names and phrases can enter the lexicon unexpectedly, and your ability to differentiate depends on whether your NLP models can keep up.

With in-house NLP models, you can always take your expanded data corpus, or only your newest data, and re-train your models to identify new linguistic patterns. With an API, you’re at the provider’s mercy as to whether or how often they optimize for the latest trends.

How often do you need to improve your model?

The speed at which your data changes, which directs your improvement cycle, depends a lot on your industry, NLP model, and application. A model designed to recognize key phrases in customer support questions to power a chatbot service for your website will likely require much less change and improvement than an assertion detection model for analyzing prognoses and healthcare plans for patients with COVID-19.

When you take NLP in-house, the pace of improvement depends entirely on your data and the development/deployment infrastructure you build for your data and analytics teams.

What is the developer experience (DX) like?

Developers love APIs because they decouple complex processes—instead of writing and maintaining an internal library, or using an open-source tool, for NLP processing, they simply ping a URL and pass the response along.

Remember that going with an API doesn’t magically solve all concerns around DX and tech debt. You might be outsourcing the work to train and deploy NLP models, which simplifies a portion of your entire infrastructure, but you’re still responsible for what happens to the returned data. Do you have the connected data services required to store, analyze, and present the processed data to your users or teams? Abstracting NLP away might simplify the development side of things, but there are still ongoing operations to run: you may need a new DevOps team, or you may need to ask your current team to expand their responsibilities.

If you go the API route, you might simplify one small part of your infrastructure at the cost of increasing complexity elsewhere.

How much does it cost?

Let’s take Azure’s Cognitive Service for Language, which uses NLP to find and label information in unstructured clinical documents, as an example. Azure charges $25 per 1,000 text records, or $0.025 per record. That might not seem like much to begin with, but these fees can add up quickly as you scale your NLP pipelines. Plus, as mentioned just above, these fees don’t include any of the internal costs associated with the infrastructure you’ll need to handle the input/output.
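
To see how quickly those fees scale, here is a rough back-of-the-envelope sketch in Python. The only figure taken from above is the $25 per 1,000 text records price. The daily volume, the 30-day month, and the in-house budget are hypothetical numbers chosen for illustration, and real pricing often includes tiers or free quotas that this ignores.

```python
# Price quoted above: $25 per 1,000 text records.
PRICE_PER_1000_RECORDS = 25.0


def monthly_api_cost(records_per_day: int, days: int = 30,
                     price_per_1000: float = PRICE_PER_1000_RECORDS) -> float:
    """Estimated API fees for one month at a flat per-record price."""
    return records_per_day * days / 1000 * price_per_1000


def break_even_records(in_house_monthly_cost: float,
                       price_per_1000: float = PRICE_PER_1000_RECORDS) -> float:
    """Monthly record volume at which API fees equal a given in-house budget."""
    return in_house_monthly_cost * 1000 / price_per_1000


# Hypothetical workload: 10,000 records per day for a 30-day month.
print(monthly_api_cost(10_000))      # 7500.0

# Hypothetical in-house budget of $50,000 per month.
print(break_even_records(50_000.0))  # 2000000.0
```

Under these assumed numbers, a pipeline processing 10,000 records a day would spend $7,500 a month on API fees alone, before any of the surrounding infrastructure costs.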

How do you use a public NLP API?

If your organization truly only cares about the results from running NLP models against your data, and you think you’ll find a significant business advantage in decoupling your operations from developing or even running NLP models internally, the public NLP API landscape is getting more and more exciting.

You likely already use one of the major public cloud providers, which means you can simply tack on NLP processing to the computing power you’re already utilizing. Google offers the Natural Language API, AWS has Amazon Comprehend, and Azure offers text analytics via the Cognitive Service for Language. If you prefer a more independent source of NLP insights, you can use services like NLP Cloud.

When it comes to actually utilizing the API, you’re most likely going to use a prebuilt client library based on your organization’s development language, or interact directly with a gRPC or REST/HTTP endpoint, to integrate NLP processing. You pipe your unstructured text into the service and wait for results, which you, in turn, display in your application or send to additional services for data/analytics purposes.

What do you need to know when building NLP in-house?

When you value your ability to train bespoke NLP models that become your organization’s intellectual property and competitive advantage, you’ll prioritize building in-house.

But that doesn’t mean you need to reinvent the NLP wheel or develop an entire model from scratch. You’ll typically start with an existing model, trained by another organization and offered as a service, and move in-house when it shows one of a handful of problems:

It’s not performing adequately, in terms of the processing speed or the quality and accuracy of the results,
It’s too expensive to scale your solution,
It’s not feasible to run a service-based model where you’re operating, such as on an IoT device without reliable internet access, so you need a self-contained NLP deployment.

In this case, you must design and deploy the entire NLP development pipeline. That includes creating and maintaining your corpus of high-quality training data, which you’ll need to add to and tweak over time to address changes and new challenges in your industry. You’ll also need to perform the NLP development itself, inching toward more accurate results based on feedback and continuous collaboration between development, data, and business teams.

As the quality of your models grows and you want to scale how you leverage them, you must ensure that the pipeline continuously extracting unstructured data from your data lake/warehouse for the NLP service can handle the load. And don’t forget the output: just as when you use an NLP API, the in-house solution requires infrastructure to store and send processed text to your data/analytics teams or end users.
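
One common way to keep an extraction pipeline from overwhelming the NLP step is to feed it documents in fixed-size batches. The Python sketch below illustrates that general idea and does not describe any particular product; the `in_batches` helper and the sample documents are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def in_batches(documents: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size documents.

    Works on any iterable, including a lazy stream read from a data lake,
    so the full corpus never has to sit in memory at once.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stream of unstructured text pulled from storage.
notes = ["note one", "note two", "note three", "note four", "note five"]
for batch in in_batches(notes, batch_size=2):
    print(len(batch), batch)
```

Each batch can then be handed to the model and its results written to storage before the next batch is read, which keeps memory use and downstream load predictable as volume grows.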

That might sound like a lot of technical complexity. Still, the business advantage of having novel and powerful NLP models makes it a worthwhile endeavor—and there are solutions designed to help you maintain this entire ecosystem of data storage and processing services.

Build your in-house NLP powerhouse with GraphGrid

GraphGrid is a suite of connected data capabilities that help organizations define high-quality NLP models and differentiate their business model around them.

GraphGrid handles the entire storage and processing pipeline, including a handful of base NLP models to work from and integrations with your existing data lake/warehouse, and the ability to continuously process data in real-time. As a bonus, GraphGrid works entirely on graph data architectures, representing complex data with a common language to make the results of all your NLP work more actionable across the board.

The richness of your growing graph database informs the development and quality of your NLP models, carving out your differentiator and delivering the highest-quality, most meaningful results possible.

Get started today by downloading GraphGrid for free.