How Change Data Capture (CDC) Works within Knowledge Graphs

Collaborative knowledge graphs are an enormous step forward for knowledge workers who operate as “knowledge graphs of one,” wrestle with SQL queries in an endless struggle to make sense of their organization’s data lake, or find themselves stuck in repetitive cycles of asking technical staff to build new infrastructure for yet another data format or visualization output they need to generate meaningful insights.

Through collaboration and graph data, knowledge workers now have a platform to actively explore uncharted territory in your data with a better developer experience and impeccable durability of not just your data, but maybe more importantly, your foundational knowledge.

But even with collaborative knowledge graphs, the natural cycle of developing insights—querying or visually exploring the knowledge graph, transforming their findings into insights, and sharing those back into their organization—isn’t fully optimized.

That’s the power of change data capture (CDC). Organizations can ditch the “pull”-based insights in favor of a “push” architecture that makes knowledge workers aware of meaningful changes instantly, which means they’re generating knowledge from the most up-to-date snapshot of their data reality.

What is change data capture?

Change data capture (CDC) is the methodology and underlying software required to track every time your data is modified or updated and publish those changes across your organization. Other applications, services, or people subscribe to these changes and spin up other processes upon receiving a notification.

There are two key benefits:

  • Knowing that data has changed in real time: If a knowledge worker receives a notification that a particular node’s properties have changed, along with a few of its relationships to other people, places, or things, they can start a new round of analysis to determine if those changes provide meaningful new insights. It’s like a context-rich action item for knowledge generation and refinement.
  • Knowing the state of data over time: Think of the version history in a Google Doc or version control via Git. Understanding how a particular node’s relationships to other nodes have changed over time, or even the frequency of changes to a specific unit of data, can reveal new investigative avenues that might have otherwise gone unnoticed. Changes to connections are more difficult to detect and query than changes to the attributes of a single unit of data, but both are readily possible with graph data and CDC.
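
As a rough illustration of the second benefit, the sketch below keeps a log of change events and answers two questions from it: how one node changed over time, and which nodes change most often. The `ChangeEvent` shape and the helper names are assumptions made for this example, not a GraphGrid format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical record of one change to a node (not a GraphGrid schema)."""
    node_id: str
    version: int   # increases with each change to the same node
    field: str     # property name or relationship type that changed
    old: Any
    new: Any


def history_for(events: Iterable[ChangeEvent], node_id: str) -> list[ChangeEvent]:
    """Return one node's changes in version order, like a revision list."""
    return sorted((e for e in events if e.node_id == node_id), key=lambda e: e.version)


def change_frequency(events: Iterable[ChangeEvent]) -> Counter:
    """Count changes per node; unusually busy nodes may deserve a closer look."""
    return Counter(e.node_id for e in events)


# Events can arrive out of order, so the log is deliberately unsorted.
log = [
    ChangeEvent("customer-1", 2, "FAVORITE_DRINK", None, "flat white"),
    ChangeEvent("customer-1", 1, "name", None, "Ada"),
    ChangeEvent("customer-2", 1, "name", None, "Grace"),
]

timeline = history_for(log, "customer-1")
busiest = change_frequency(log).most_common(1)
```

Here `timeline` holds the two changes to `customer-1` in the order they happened, and `busiest` names the node with the most recorded changes.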

Now that organizations put more data analysis into the hands of AI/ML/NLP services, CDC’s significance is growing. These services often take a long time to run, but you always want them working from the freshest data possible to ensure they’re not creating knowledge and distributing insights based on last week’s data. CDC publishes changes directly to a subscribed AI/ML/NLP service for automatic re-processing—without the lag of a person checking for changes in the data and manually firing off another round of processing.

How change data capture works with a knowledge graph

We start with an assumption that your graph database, the storage model for your knowledge graph, is storing your organization’s data as nodes, edges, and properties. If you already have a data lake, your graph database uses a connector to access the bulk of your data, process it, and store only the relationship and context-based data in graph format.

For one of many reasons, your data changes. For example, a new customer creates an account via your mobile app, buys a cup of coffee, and then completes their profile, where they can specify that their favorite drink is flat white with oat milk.

Next, your reactive data architecture, which includes your CDC capabilities, recognizes the change to the set of customer nodes and dispatches an event.

Any application, service, or person within your organization that’s previously subscribed to changes to the customer nodes receives the event and can perform additional actions or processes with that information.
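
The flow described so far (a change happens, an event is dispatched, subscribers react) can be sketched with a small in-memory publish/subscribe bus. This is only a minimal stand-in for a real message broker; the `ChangeBus` class, the `Customer` label, and the event fields are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class ChangeBus:
    """In-memory stand-in for a CDC event channel (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, label: str, handler: Handler) -> None:
        """Register a handler for changes to nodes with the given label."""
        self._subscribers[label].append(handler)

    def publish(self, label: str, change: dict) -> int:
        """Push a change to every subscriber of the label; return how many received it."""
        handlers = self._subscribers.get(label, [])
        for handler in handlers:
            handler(change)
        return len(handlers)


bus = ChangeBus()
received: list[dict] = []
bus.subscribe("Customer", received.append)

# A customer completes their profile; subscribers to Customer changes are notified.
delivered = bus.publish(
    "Customer",
    {"node_id": "customer-1", "property": "favorite_drink", "new": "flat white with oat milk"},
)
# Nobody subscribed to Order changes, so this event reaches no one.
ignored = bus.publish("Order", {"node_id": "order-9", "property": "total", "new": 4.5})
```

The subscriber never polls the graph: it simply registers interest once and receives each change as it is published.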

This is where, technically speaking, the bounds of CDC end. It’s not responsible for running applications or driving insights on its own, but rather for informing subscribers about data changes and their context. Within a connected data platform, however, which includes both a reactive data architecture and a knowledge graph, CDC events can enable various new and valuable processes.

For example, you might have an adjacent AI/ML/NLP service that retrains its models based on the latest state of your data to improve their effectiveness. The search indexes you use to comb your graph’s unstructured data can get updated to help knowledge workers discover information more efficiently. Or, a knowledge worker could take that notification and jump back into data they’ve already analyzed (other customers who’ve said a flat white is their favorite drink) to explore a compelling relationship with your newest customer.

How CDC works in GraphGrid

GraphGrid is a suite of data capabilities that help organizations deploy and collaborate across knowledge graphs. A big part of those capabilities is GraphGrid Fuze, which handles the change data capture and is a key component of GraphGrid’s reactive data architecture.

GraphGrid Fuze is the integration service that makes CDC possible. Specifically, the Distributor component of Fuze listens for incoming messages containing transaction data generated by changes to your graph database. Distributor then forwards all this change data to any number of message brokers (we currently support Apache Kafka, RabbitMQ, and Amazon SQS) to notify applications, services, and people that subscribe to those updates.

This can trigger other GraphGrid services, like Continuous Processing, which automatically starts extracting data from any new nodes added to your graph, or Continuous Indexing, which indexes new nodes and stores the index data in Elasticsearch. Or, you can run natural language processing (NLP) to turn unstructured text into graph data that are explorable and query-able with tools like similarity scoring, origination, Named Entity Recognition (NER), keyphrase extraction, and more.

Part of the Fuze Distributor’s power is in policies and forwarding rules, which tell GraphGrid exactly how to transform a change message before sending it to a message broker endpoint.
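
To make the idea of forwarding rules concrete, here is a minimal sketch of matching a change against a list of rules and reshaping it per destination. It does not reproduce GraphGrid’s actual policy format; `ForwardingRule`, `route`, the endpoint names, and the message fields are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ForwardingRule:
    """Hypothetical rule: which changes to forward, how to reshape them, and where."""
    label: str                         # node label the rule applies to
    destination: str                   # assumed broker endpoint name
    transform: Callable[[dict], dict]  # reshapes the change before sending


def route(change: dict, rules: list[ForwardingRule]) -> list[tuple[str, dict]]:
    """Return (destination, message) pairs for every rule that matches the change."""
    return [
        (rule.destination, rule.transform(change))
        for rule in rules
        if rule.label == change["label"]
    ]


rules = [
    # The search indexer only needs structured fields.
    ForwardingRule("Article", "search-indexing",
                   lambda c: {"id": c["node_id"], "url": c["properties"]["url"]}),
    # The NLP service only needs the unstructured text.
    ForwardingRule("Article", "nlp-processing",
                   lambda c: {"id": c["node_id"], "text": c["properties"]["text"]}),
    ForwardingRule("Customer", "crm-sync", lambda c: {"id": c["node_id"]}),
]

change = {
    "label": "Article",
    "node_id": "article-7",
    "properties": {"url": "https://example.com/story", "text": "Full article text"},
}
outbound = route(change, rules)
```

Each destination receives only the fields its rule selects, so one change to the graph can produce several differently shaped messages.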

Let’s say you’re adding public news articles to your knowledge graph—they’ll have some structured data, like the source URL, a publish date, and author, along with the unstructured text of the article itself. When the news article and structured data are added to your knowledge graph, CDC dispatches a message. Your search indexing subscriber receives this message and automatically updates your search indexes in Elasticsearch. That sounds great, but the problem is that because your NLP processing on the unstructured text of the article takes a while, your search index isn’t complete.

With CDC configured for NLP, you ensure that your disparate endpoints—such as search indexing and NLP processing—write new properties and relationships to your knowledge graph in a coordinated fashion. Now your search indexes are updated again after NLP has finished processing the article’s full text, enriching your knowledge graph with additional information to visualize, process, and query.
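
The two-pass indexing described above can be sketched as a single subscriber that handles two kinds of events: one when the article is created and one when NLP finishes. The event types (`node_created`, `nlp_completed`), the field names, and the dictionary standing in for a search index are assumptions for this example, not GraphGrid or Elasticsearch APIs.

```python
search_index: dict[str, dict] = {}  # stand-in for a real search index


def on_article_change(event: dict) -> None:
    """Index structured fields on creation, then enrich once NLP results arrive."""
    doc = search_index.setdefault(event["node_id"], {"nlp_complete": False})
    if event["type"] == "node_created":
        doc.update(event["properties"])
    elif event["type"] == "nlp_completed":
        doc["entities"] = event["entities"]
        doc["nlp_complete"] = True


# First event: the article and its structured data land in the graph.
on_article_change({
    "type": "node_created",
    "node_id": "article-7",
    "properties": {"url": "https://example.com/story", "author": "J. Doe"},
})
partial = dict(search_index["article-7"])  # snapshot taken before NLP finishes

# Second event: NLP has processed the full text, so the index entry is updated again.
on_article_change({
    "type": "nlp_completed",
    "node_id": "article-7",
    "entities": ["GraphGrid"],
})
complete = search_index["article-7"]
```

The `partial` snapshot is searchable right away on its structured fields, while `complete` shows the same entry after the second event has added the NLP output.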

Moving toward reactive graph data

When you implement CDC as part of your larger enterprise data strategy and connect it to your knowledge graph, you’re turning routine changes into context-rich notifications to applications, services, or people, which can trigger any additional processes to make the most of your data.

You can get started with GraphGrid’s capabilities by downloading it for free.

Once you’re set up with graph data and a method of ingesting new data into your knowledge graph, you can turn on the Fuze Distributor to enable Continuous Processing with NLP and Continuous Indexing for search. Thanks to Geequel and GraphGrid’s APIs, your developers will have a pleasant experience when creating all the broker connections you’ll need to enable other data and analytical services you have at your disposal.

That’s the power of CDC: it’s process-agnostic, not dictating what you do with your data or how, but simply informing you of every meaningful change so that you can establish new systems and workflows to make the most of them. It’s a powerful tool for reducing your “time to knowledge” by replacing the old pattern, where data is constantly queried to reveal meaningful changes, with a reactive architecture that pushes changes to your knowledge workers.