Knowledge Graph Construction from Unstructured Data Using Large Language Models on Alibaba Cloud


This article examines how Alibaba Cloud's storage, large language model, vector, and graph database services compose into a pipeline that extracts entities and relationships from unstructured text and persists them as a queryable knowledge graph.
Unstructured text — contracts, clinical notes, support tickets, research papers, regulatory filings — encodes relationships between entities that relational schemas and keyword search cannot surface directly. A knowledge graph makes those relationships explicit: entities become nodes, the connections between them become typed edges, and questions that would otherwise require reading thousands of documents become graph traversals. The obstacle has always been construction. Extracting entities and relationships from free text at scale once required hand-built rules, brittle pattern matchers, or large annotated corpora to train bespoke extraction models, each expensive to produce and narrow in coverage.
Large language models change the economics of the extraction step. A model prompted with a schema and a passage of text can return the entities and relationships the passage contains as structured output, generalising across domains without a task-specific training set. The remaining engineering work is the pipeline around the model: ingesting documents, segmenting them into passages the model can process, resolving extracted entities against those already in the graph, persisting the result in a store designed for traversal, and doing so with enough provenance to trust and audit the graph that results. This article documents how that pipeline can be assembled on Alibaba Cloud.

The structure-extraction problem
A knowledge graph represents information as triples: a subject entity, a typed relation, and an object entity. Accumulated across a corpus, these triples form a graph whose value lies in traversal — answering which entities connect to a given one, what paths link two entities, and what patterns recur across the whole. Relational tables and full-text indexes answer neither multi-hop nor relationship-shaped questions efficiently, which is what motivates a graph representation in the first place.
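The triple representation and the traversal it enables can be sketched in a few lines. The example below is a minimal in-memory illustration, not the storage model of any particular service; the entity names, relation labels, and the helper names `build_adjacency` and `two_hop` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def build_adjacency(triples):
    """Index triples by subject so outgoing edges can be looked up directly."""
    adjacency = defaultdict(list)
    for t in triples:
        adjacency[t.subject].append((t.relation, t.obj))
    return adjacency


def two_hop(adjacency, start):
    """Return every two-edge path that begins at `start`."""
    paths = []
    for rel1, mid in adjacency.get(start, []):
        for rel2, end in adjacency.get(mid, []):
            paths.append((start, rel1, mid, rel2, end))
    return paths


# Hypothetical triples, as they might accumulate across a corpus.
triples = [
    Triple("Acme Corp", "ACQUIRED", "Widget Ltd"),
    Triple("Widget Ltd", "SUPPLIES", "Gadget Inc"),
    Triple("Acme Corp", "LOCATED_IN", "Hangzhou"),
]
adjacency = build_adjacency(triples)
print(two_hop(adjacency, "Acme Corp"))
```

Answering the same two-hop question from relational tables would take a self-join per hop, which is the inefficiency the graph representation avoids.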
The difficulty has always concentrated in the first step: converting prose into triples. A language model performs this extraction directly when prompted with the passage and the target schema, returning structured triples without a domain-specific training corpus. Whether extraction is constrained to a predefined schema or left open-ended materially shapes the result, and the sections that follow treat schema-bound extraction as the path to a consistent, queryable graph.

Document ingestion and segmentation
Object Storage Service (OSS) holds the raw documents in their native formats. Function Compute, triggered on object creation, parses each document into plain text — across formats such as PDF, HTML, and office documents — and segments it into passages bounded by the model's context window and by semantic boundaries such as sections or paragraphs. Passages that run too long exceed the context window and are truncated; passages cut too short sever relationships that span sentence boundaries, so segmentation balances the two rather than splitting on a fixed length. Each passage carries metadata — source document, position, ingestion time — that later attaches to every entity and edge extracted from it, establishing provenance from the outset.
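One way to balance the two failure modes is to pack whole paragraphs into passages up to a size budget. The sketch below assumes paragraphs are separated by blank lines and uses a character budget as a rough stand-in for the model's token limit; the `Passage` type, the `segment` function, and the default `max_chars` value are illustrative assumptions rather than part of any Alibaba Cloud SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Passage:
    text: str
    source: str       # source document identifier
    position: int     # order of the passage within the document
    ingested_at: str  # ISO 8601 ingestion timestamp


def segment(text, source, max_chars=1200):
    """Pack whole paragraphs into passages no longer than max_chars.

    A single paragraph longer than max_chars is kept whole here; a real
    pipeline would need a further split (for example on sentences).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    ingested_at = datetime.now(timezone.utc).isoformat()
    passages, buffer = [], []

    def flush():
        if buffer:
            joined = "\n\n".join(buffer)
            passages.append(Passage(joined, source, len(passages), ingested_at))
            buffer.clear()

    for para in paragraphs:
        # Length of the buffer once joined, plus a separator and the new paragraph.
        candidate_len = sum(len(p) for p in buffer) + 2 * len(buffer) + len(para)
        if buffer and candidate_len > max_chars:
            flush()
        buffer.append(para)
    flush()
    return passages
```

Because the split happens only at paragraph boundaries, a relationship stated across adjacent sentences of one paragraph stays within a single passage.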

Entity and relation extraction with large language models
Model Studio hosts the Qwen model family behind a managed inference endpoint. Function Compute sends each passage to the model with a prompt that specifies the target schema — the entity types and relation types the graph admits — and instructs the model to return the extracted triples as structured output, each comprising a subject entity, a typed relation, and an object entity. Constraining the model to a schema, rather than allowing open-ended extraction, keeps the resulting graph consistent: unconstrained extraction produces synonymous relation labels that fragment what should be a single edge class. Requesting structured output rather than prose lets the response be parsed deterministically, and requesting a supporting text span or confidence signal alongside each triple supports later validation, allowing low-confidence extractions to be routed for review rather than written blindly.
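The prompt-and-validate step can be sketched without committing to a particular client library. The example below assumes the model is asked for a JSON array and that its raw response text is already in hand; the schema contents, the JSON field names, the `min_confidence` default, and the functions `build_prompt` and `validate` are all hypothetical, and the call to the inference endpoint is deliberately left out.

```python
import json

# Hypothetical schema: the entity and relation types the graph admits.
ENTITY_TYPES = {"Company", "Person", "Location"}
RELATION_TYPES = {"ACQUIRED", "EMPLOYS", "LOCATED_IN"}


def build_prompt(passage):
    """Compose a schema-bound extraction prompt for one passage."""
    return (
        "Extract triples from the passage below.\n"
        f"Allowed entity types: {sorted(ENTITY_TYPES)}\n"
        f"Allowed relation types: {sorted(RELATION_TYPES)}\n"
        'Return only a JSON array of objects with the keys "subject", '
        '"subject_type", "relation", "object", "object_type", '
        '"evidence", "confidence".\n\n'
        f"Passage:\n{passage}"
    )


def validate(raw_response, passage, min_confidence=0.7):
    """Sort parsed triples into accepted, review, and rejected lists.

    json.loads raises ValueError on malformed output; the caller is
    expected to catch that and retry or log the passage.
    """
    accepted, review, rejected = [], [], []
    for item in json.loads(raw_response):
        in_schema = (
            item.get("relation") in RELATION_TYPES
            and item.get("subject_type") in ENTITY_TYPES
            and item.get("object_type") in ENTITY_TYPES
        )
        evidence = item.get("evidence", "")
        grounded = bool(evidence) and evidence in passage
        if not in_schema:
            rejected.append(item)   # outside the schema: never written
        elif grounded and item.get("confidence", 0.0) >= min_confidence:
            accepted.append(item)
        else:
            review.append(item)     # in schema, but unsupported or low confidence
    return accepted, review, rejected
```

Checking that the evidence span occurs verbatim in the passage is a cheap guard against triples the model asserts without textual support. A model-reported confidence value is only a heuristic signal and would need calibration before being relied on.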

Entity resolution and disambiguation
The same real-world entity appears under different surface forms across documents (abbreviations, alternate spellings, role descriptions), and naive insertion creates a separate node for each, fragmenting the graph. Resolution reconciles these references. Each extracted entity is embedded into a vector representation, and that vector is compared against the embeddings of entities already present, held in AnalyticDB for PostgreSQL, whose vector engine performs similarity search at scale. A candidate whose similarity to an existing entity exceeds a threshold is treated as the same entity and merged onto the existing node; otherwise it is inserted as new. Embedding-based matching captures semantic equivalence that exact string matching misses. The threshold governs the trade-off between under-merging, which leaves duplicate nodes, and over-merging, which collapses distinct entities, so it is tuned against a labelled sample rather than assumed.
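The merge-or-insert decision reduces to a nearest-neighbour lookup followed by a threshold test. The sketch below replaces the vector engine with a linear scan over an in-memory dictionary and uses tiny hand-written vectors in place of real embeddings; the `resolve` function, the node identifiers, and the 0.9 threshold are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve(candidate_vec, existing, threshold=0.9):
    """Return the id of the most similar existing node above the threshold.

    Returns None when no node qualifies, meaning the candidate should be
    inserted as a new node.
    """
    best_id, best_score = None, threshold
    for node_id, vec in existing.items():
        score = cosine(candidate_vec, vec)
        if score > best_score:
            best_id, best_score = node_id, score
    return best_id


# Hypothetical embeddings of entities already in the graph.
existing = {"acme": [1.0, 0.0, 0.0], "widget": [0.0, 1.0, 0.0]}

print(resolve([0.98, 0.10, 0.0], existing))  # close to "acme": merge
print(resolve([0.60, 0.60, 0.5], existing))  # no close match: insert as new
```

Raising the threshold in this sketch produces more new nodes (under-merging), and lowering it produces more merges (over-merging), which is the trade-off a labelled sample is used to measure.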

Graph persistence and querying
Resolved entities and relationships are written to Graph Database (GDB), which stores data under a property graph model and exposes Apache TinkerPop Gremlin as its traversal language. Nodes carry entity type and attributes; edges carry relation type along with the provenance metadata propagated from the source passage. Batch construction loads through the DataWorks GDB writer, which synchronises vertices and edges from the staged extraction output, and incremental updates from newly ingested documents follow the same path. Once persisted, the graph answers multi-hop questions — paths between entities, the neighbourhood of a node, patterns recurring across the graph — through Gremlin traversals that would be impractical to express against the original documents. Because every edge retains its provenance, an answer can be traced back to the passages that produced it.
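How provenance lets an answer be traced back to its sources can be illustrated with a small in-memory stand-in for the graph store. The sketch below is not Gremlin and does not use GDB; it runs a breadth-first search over hypothetical edges that each carry a provenance string, and the edge data and the `shortest_path` function are assumptions made for illustration.

```python
from collections import deque

# Hypothetical edges: (subject, relation, object, provenance).
EDGES = [
    ("Acme Corp", "ACQUIRED", "Widget Ltd", "contracts/a.pdf#3"),
    ("Widget Ltd", "SUPPLIES", "Gadget Inc", "filings/b.html#12"),
    ("Acme Corp", "LOCATED_IN", "Hangzhou", "contracts/a.pdf#1"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search returning the edges on a shortest directed path.

    Returns an empty list when start equals goal and None when the goal
    is unreachable.
    """
    outgoing = {}
    for edge in edges:
        outgoing.setdefault(edge[0], []).append(edge)

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for edge in outgoing.get(node, []):
            target = edge[2]
            if target not in visited:
                visited.add(target)
                queue.append((target, path + [edge]))
    return None


path = shortest_path(EDGES, "Acme Corp", "Gadget Inc")
sources = [edge[3] for edge in path]
print(sources)
```

In GDB the same question would be expressed as a Gremlin traversal, typically built from steps such as `repeat()`, `until()`, and `path()`, with the provenance read from the properties of the edges on the returned path.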

Pipeline orchestration and provenance
DataWorks orchestrates the recurring flow, scheduling ingestion, extraction, resolution, and load as a connected sequence rather than disjoint jobs. Log Service (SLS) records the extraction context for every passage (the input text, the model version, the prompt template, and the triples returned), forming an audit trail that serves two needs: investigating why a particular edge entered the graph, and reprocessing the corpus when the schema or the model changes. Cloud Monitor tracks throughput, extraction error rates, and resolution merge rates, surfacing drift such as a rising share of unresolved entities, which can signal that the schema no longer matches the incoming documents. Versioning the prompt template and the schema alongside the model allows the graph to be rebuilt reproducibly, to the extent that model output is stable for a fixed version and settings, and lets any change in extraction behaviour be attributed to a specific revision.
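Attributing a change to a specific revision is simpler when the model version, prompt template, and schema are folded into one fingerprint stored with every audit record. The sketch below shows one possible shape for such a record; the field names and the functions `revision_id` and `audit_record` are hypothetical, and writing the record to Log Service is left out.

```python
import hashlib
import json


def revision_id(model_version, prompt_template, schema):
    """Derive a short, stable fingerprint of the extraction configuration."""
    payload = json.dumps(
        {"model": model_version, "prompt": prompt_template, "schema": schema},
        sort_keys=True,  # key order must not affect the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def audit_record(passage_id, passage_text, triples,
                 model_version, prompt_template, schema):
    """Assemble the extraction context for one passage as a plain dict."""
    return {
        "passage_id": passage_id,
        "input_text": passage_text,
        "model_version": model_version,
        "prompt_template": prompt_template,
        "revision": revision_id(model_version, prompt_template, schema),
        "triples": triples,
    }
```

Two records with the same `revision` value were produced under the same model, prompt, and schema, so a shift in extraction behaviour between revisions can be isolated by comparing records on either side of the change.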

Closing observations
A knowledge graph is only as trustworthy as the extraction and resolution decisions behind it, and those decisions are probabilistic rather than exact. The contracts that hold the pipeline together are therefore as important as any single stage: the schema that constrains extraction must match the questions the graph is expected to answer; the provenance attached at ingestion must survive through to the persisted edge; the resolution threshold must reflect a measured tolerance for duplicated versus collapsed entities.
Three disciplines determine whether the result is a graph that can be trusted. Schema discipline — constraining extraction to a fixed set of entity and relation types — keeps edges consistent and the graph queryable. Provenance throughout — every node and edge traceable to a source passage, model version, and prompt — makes the graph auditable and reproducible. Resolution tuning — a threshold set against a labelled sample and monitored over time — guards against the silent corruption that a wrong merge introduces, since a collapsed entity is far harder to detect after the fact than a duplicated one. With these in place, the services described here turn a corpus of unstructured documents into a queryable graph whose every answer can be traced to its source.


Figure 1. A knowledge graph construction pipeline on Alibaba Cloud: documents in OSS are parsed and segmented by Function Compute, extracted into schema-bound triples by Qwen models on Model Studio, resolved against existing entities through vector similarity in AnalyticDB for PostgreSQL, and persisted as a property graph in Graph Database (GDB) for Gremlin traversal, with DataWorks, Log Service, and Cloud Monitor governing orchestration, provenance, and monitoring.

Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
