Certificate and Document Data Extraction | Knowledge Base

Overview

This feature is an AI-powered extraction pipeline that parses uploaded compliance documents and extracts structured data to populate product records in the platform database automatically. Supported document types are AS/NZS 5125.1 test reports, VEU registration certificates, ESS HEAB approvals, SRES / STC registration notices, Watermark certificates, AS/NZS 4234 compliance reports, and GEMS energy rating labels.

This feature is not a standalone tool; it is a shared service called by other platform features (5125.1 report analyser, client portal, product compliance tracker) whenever a document is uploaded. The user-facing flow is: upload a document, review the extracted data, confirm, and the confirmed data populates the relevant fields automatically.

The 5125.1 report analyser PRD covers the specific extraction requirements for test reports in detail. This PRD covers the general extraction pipeline, the additional document types beyond test reports, and the admin tooling for maintaining extraction schemas.

User Stories

As a manufacturer client uploading a new SRES certificate via the client portal, I want the certificate number, registration date, and expiry date to automatically populate my product’s compliance record so I don’t re-enter data I’ve just uploaded.
As an EnergyAE consultant processing a batch of documents for a client, I want each document type to be auto-detected and routed to the correct extraction schema so I don’t have to specify the document type manually for every upload.
As a user reviewing an extraction result, I want to see which fields were extracted with high confidence and which need my review so I know where to focus my attention.
As Alastair, I want to update the extraction schema for a document type myself, without waiting for a developer, when scheme administrators change their certificate format.

Document Types and Extracted Fields

AS/NZS 5125.1 test reports

Covered in detail in the 5125.1 report analyser PRD. Summary: product identity, test lab, test date, performance test points at each ambient condition (capacity, power input, COP, delivery volume).

VEU Activity 44 registration certificate

Issued by the Essential Services Commission (ESC) Victoria when a product is registered.

Extracted fields:

Product name and model number
Brand / manufacturer
Registration number
Registration date
Registration status (active / conditional)
Registered zones or activity conditions (if specified)
Certificate document reference number

ESS HEAB approval

Issued by IPART / NSW ESC.

Extracted fields:

Product name and model number
Brand / manufacturer
Approval number
Issue date
Expiry date (if applicable)
Scheme category (HEAB method variant)

SRES / STC registration notice

Issued by the Clean Energy Regulator.

Extracted fields:

Product name and model number
Brand
Registration number (SWH/HP registration number)
Effective date
Registered zones (climate zones 1-7)
STCs per unit per zone
Expiry date (if applicable)
Annual deeming year

Watermark certificate

Issued by an accredited certification body (SAI Global, WELS, etc.).

Extracted fields:

Product name and model number
Brand
Certificate number
Issue date
Expiry date
Certification body
Applicable standard (AS/NZS 3500, etc.)
Licence number (if applicable)

AS/NZS 4234 compliance report (zone energy factors)

Issued by EnergyAE or another accredited simulation body.

Extracted fields:

Product name and model number
Report date
Zone energy factor per climate zone (7 values)
Draw-off profile used
Simulation methodology version
Certifying body

GEMS energy rating label

Extracted fields:

Product name and model number
Star rating
Electricity consumption (kWh/year at standard conditions)
Registration number
Expiry date

Extraction Pipeline

Step 1: Document type detection

On upload, the system attempts to automatically identify the document type based on:

PDF metadata (document title, producer)
Text patterns: e.g. “AS/NZS 5125.1” in the text triggers the 5125.1 schema; “Essential Services Commission” triggers the VEU schema
LLM classification fallback: if pattern matching is inconclusive, send the first page to the LLM with a classification prompt

If auto-detection fails, the user is prompted to select the document type manually.
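
The detection order described above (metadata and text patterns first, LLM classification only when patterns are inconclusive, manual selection last) could be sketched roughly as follows. This is a minimal illustration, not the intended implementation: the document type keys, the pattern table, and the llm_classify callable are all hypothetical names, and the form-feed page split is an assumption about how the PDF text is produced.

```python
import re
from typing import Callable, Optional

# Hypothetical detection rules; real patterns would live in
# extraction_schemas.detection_patterns and be tuned on sample documents.
DETECTION_PATTERNS: dict[str, list[str]] = {
    "as_nzs_5125_1_test_report": [r"AS/NZS\s*5125\.1"],
    "veu_registration_certificate": [r"Essential Services Commission"],
}


def detect_document_type(
    text: str,
    metadata_title: str = "",
    llm_classify: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return a document type key, or None if the user must choose manually."""
    haystack = f"{metadata_title}\n{text}"
    matches = [
        doc_type
        for doc_type, patterns in DETECTION_PATTERNS.items()
        if any(re.search(p, haystack, re.IGNORECASE) for p in patterns)
    ]
    if len(matches) == 1:
        return matches[0]
    # Zero or several matches: pattern matching is inconclusive.
    if llm_classify is not None:
        first_page = text.split("\f", 1)[0]  # assumes form-feed page breaks
        return llm_classify(first_page)
    return None
```

A return value of None corresponds to the manual selection prompt. Treating multiple pattern hits as inconclusive is a design assumption; it avoids silently picking the wrong schema when a document mentions more than one standard or regulator.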

Step 2: Extraction

LLM-based extraction using the same pipeline as the 5125.1 analyser. Each document type has a defined schema specifying which fields to extract and what format each field should be in.

Primary extraction method: pass PDF text to the LLM with a structured extraction prompt. The LLM returns a JSON object matching the schema.

Fallback for scanned or image-based PDFs: use vision-based extraction (pass PDF page images to the LLM). Developer to assess whether this is needed for each document type based on sample documents provided by Alastair.
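
As a rough sketch of the primary (text-based) extraction path, the snippet below fills a prompt template, calls the LLM, and normalises the JSON reply to the schema's field list. The call_llm parameter is a stand-in for whichever LLM client the platform uses; it, the field keys, and the template placeholders ({fields}, {document_text}) are assumptions for illustration only.

```python
import json
from typing import Callable

# Hypothetical field keys for the SRES / STC registration notice.
SRES_FIELDS = [
    "product_name",
    "model_number",
    "brand",
    "registration_number",
    "effective_date",
]


def build_prompt(template: str, fields: list[str], document_text: str) -> str:
    """Fill the schema's prompt template with the field list and PDF text."""
    return template.format(fields=", ".join(fields), document_text=document_text)


def extract_fields(
    document_text: str,
    fields: list[str],
    template: str,
    call_llm: Callable[[str], str],
) -> dict[str, object]:
    """Run one extraction and normalise the result to the schema's fields."""
    raw = call_llm(build_prompt(template, fields, document_text))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    # Keep only schema fields; anything the LLM omitted becomes None.
    return {name: parsed.get(name) for name in fields}
```

Fields that come back as None would then surface as "not found" in the confidence step rather than failing the whole extraction.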

Step 3: Confidence flagging

Each extracted field is assigned a confidence level:

High: field cleanly parsed and value is within expected range/format
Low: field found but value is ambiguous, formatted unusually, or in an unexpected location
Not found: field not located in the document

Low-confidence and not-found fields are highlighted in the review UI. The user must explicitly confirm or correct these before the data is committed.
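
One possible way to derive the three confidence levels is to check each value against a per-field format rule. The sketch below assumes regex rules and ISO-formatted dates; the rule table and the choice to mark fields without a rule as low confidence are illustrative assumptions, not decided behaviour.

```python
import re
from typing import Optional

# Hypothetical format rules per field; real rules would come from the schema.
FORMAT_RULES = {
    "issue_date": r"\d{4}-\d{2}-\d{2}",
    "star_rating": r"\d+(\.\d)?",
}


def flag_confidence(field: str, value: Optional[str]) -> str:
    """Return 'high', 'low', or 'not_found' for one extracted field."""
    if value is None or not value.strip():
        return "not_found"
    rule = FORMAT_RULES.get(field)
    if rule is None:
        # No rule to check against, so err towards user review (assumption).
        return "low"
    return "high" if re.fullmatch(rule, value.strip()) else "low"
```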

Step 4: User review

The extraction review screen shows:

A table of extracted fields with their values and confidence indicators
Low-confidence and not-found fields are pre-selected for review
User can edit any field inline
User confirms the reviewed data before it is committed

The review step cannot be skipped for documents being used to populate a compliance record.
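
The review gate could be enforced with a check along these lines before any data is committed. ReviewedField and both function names are hypothetical; the sketch only shows the rule that every flagged field must be confirmed or corrected and that the user must give an overall confirmation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewedField:
    name: str
    value: Optional[str]
    confidence: str          # "high", "low", or "not_found"
    user_confirmed: bool = False


def fields_needing_review(fields: list[ReviewedField]) -> list[str]:
    """Names of flagged fields the user has not yet confirmed or corrected."""
    return [
        f.name
        for f in fields
        if f.confidence in ("low", "not_found") and not f.user_confirmed
    ]


def can_commit(fields: list[ReviewedField], user_confirmed_all: bool) -> bool:
    """Allow commit only after overall confirmation with no unresolved flags."""
    return user_confirmed_all and not fields_needing_review(fields)
```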

Step 5: Commitment to product record

After review and confirmation:

Extracted fields are written to the relevant section of the product record in the database
Document is stored in the product’s documents library with the extraction metadata (timestamp, confidence flags, extractor version)
If the same document type already exists on the product record, the user is prompted: “Replace existing record?” with a version history link
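
The commit step, including the "Replace existing record?" prompt, might look something like the sketch below. It uses a plain dictionary in place of the real product record and database, and the function name, return values, and the "history" list are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def commit_extraction(
    product_record: dict,
    document_type: str,
    confirmed_fields: dict,
    extractor_version: str,
    replace_existing: bool = False,
) -> str:
    """Write confirmed fields to the record, or ask before overwriting."""
    documents = product_record.setdefault("documents", {})
    history = product_record.setdefault("history", [])
    if document_type in documents and not replace_existing:
        return "confirm_replace"  # UI would show "Replace existing record?"
    if document_type in documents:
        history.append(documents[document_type])  # keep the prior version
    documents[document_type] = {
        # confirmed_fields may carry confidence flags alongside each value
        "fields": confirmed_fields,
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "extractor_version": extractor_version,
    }
    return "committed"
```

Keeping the replaced entry in a history list is one simple way to back the version history link; a real implementation would more likely version rows in the database.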

Extraction Schema Management (Admin)

Alastair must be able to update extraction schemas when certificate formats change, without developer involvement.

Each schema defines:

Document type name and description
Fields to extract (name, data type, format rules, required/optional)
Extraction prompt template (the LLM prompt used to extract data from this document type)
Pattern-matching rules for auto-detection
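
To make the schema contents concrete, here is an illustrative schema for the Watermark certificate expressed as Python dataclasses. The class names, field keys, date format rule, prompt wording, and detection pattern are all hypothetical; the real values depend on the sample documents and on how schemas are stored.

```python
from dataclasses import dataclass, field


@dataclass
class FieldDefinition:
    name: str
    data_type: str            # e.g. "string", "date"
    required: bool = True
    format_rule: str = ""     # optional regex the value should match


@dataclass
class ExtractionSchema:
    document_type: str
    description: str
    fields: list[FieldDefinition]
    prompt_template: str
    detection_patterns: list[str] = field(default_factory=list)


ISO_DATE = r"\d{4}-\d{2}-\d{2}"  # assumed normalised date format

watermark_schema = ExtractionSchema(
    document_type="watermark_certificate",
    description="Watermark certificate issued by an accredited certification body",
    fields=[
        FieldDefinition("certificate_number", "string"),
        FieldDefinition("issue_date", "date", format_rule=ISO_DATE),
        FieldDefinition("expiry_date", "date", format_rule=ISO_DATE),
        FieldDefinition("licence_number", "string", required=False),
    ],
    prompt_template="Extract these fields as JSON: {fields}\n\n{document_text}",
    detection_patterns=[r"WaterMark"],  # placeholder pattern
)
```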

Admin UI allows Alastair to:

View all extraction schemas
Edit the prompt template for any schema
Add or remove fields from a schema
Test a schema against an uploaded sample document and review the output before saving

Changes to a schema's prompt take effect immediately for new uploads; documents that have already been extracted are not re-extracted.

Integration Points

This feature has no standalone page. It is called by:

5125.1 report analyser: uses the extraction pipeline for test report parsing (covered in that PRD in detail)
Client portal: on document upload, auto-extracts and shows the extraction result in the submission review step; extracted data can be used to pre-populate the product record if the client has one
Product compliance tracker: document upload slots on checklist items trigger extraction; extracted dates (expiry, registration date) auto-populate the scheme registration fields
Direct upload: users can also trigger extraction directly from the product record documents library by uploading any supported document type

Out of Scope (v1)

Bulk extraction of multiple documents in a single operation
Extraction from Word documents or Excel files (PDF only in v1)
Automated re-extraction when a schema is updated (documents are extracted once at upload time)
NLP-based extraction of unstructured guidance documents (this is for structured certificate-format documents only; the RAG chatbot handles unstructured documents)
Integration with scheme administrator portals for live registration data (extraction from uploaded documents only)

Data Model (indicative)

document_extractions
  extraction_id
  document_id (foreign key to product documents table)
  document_type (enum)
  extracted_at
  extractor_version
  raw_llm_output (JSON: full LLM response)
  parsed_fields (JSON: field name → value, confidence)
  user_reviewed (boolean)
  user_confirmed_at (nullable)
  confirmed_by (user_id, nullable)
  committed_to_record (boolean)

extraction_schemas (admin-managed)
  schema_id
  document_type
  display_name
  detection_patterns (JSON: regex and keyword patterns for auto-detection)
  fields (JSON: array of field definitions)
  extraction_prompt_template (text)
  updated_at
  updated_by
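
For reference, the indicative model above could be expressed as SQL along these lines, shown here against an in-memory SQLite database so it can be run as-is. The column types, NOT NULL constraints, defaults, and the choice to store JSON as text are assumptions; the production database and its types may differ.

```python
import sqlite3

DDL = """
CREATE TABLE extraction_schemas (
    schema_id INTEGER PRIMARY KEY,
    document_type TEXT NOT NULL UNIQUE,
    display_name TEXT NOT NULL,
    detection_patterns TEXT NOT NULL,      -- JSON: regex and keyword patterns
    fields TEXT NOT NULL,                  -- JSON: array of field definitions
    extraction_prompt_template TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    updated_by INTEGER
);
CREATE TABLE document_extractions (
    extraction_id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,          -- FK to the product documents table
    document_type TEXT NOT NULL,
    extracted_at TEXT NOT NULL,
    extractor_version TEXT NOT NULL,
    raw_llm_output TEXT,                   -- JSON: full LLM response
    parsed_fields TEXT,                    -- JSON: field name, value, confidence
    user_reviewed INTEGER NOT NULL DEFAULT 0,
    user_confirmed_at TEXT,
    confirmed_by INTEGER,
    committed_to_record INTEGER NOT NULL DEFAULT 0
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```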

Acceptance Criteria

Open Questions

Alastair to provide sample documents for each supported type before the developer begins building extraction schemas. These are needed to write the prompt templates and define the field locations. This is the critical pre-build deliverable.
Which document types are commonly issued as scanned (image-based) PDFs rather than digital PDFs? VEU and ESS certificates may be digital; older Watermark certificates may be scanned. Alastair to assess.
For the extraction prompt templates: these will need tuning based on real documents. Alastair to commit to participating in prompt development and testing before sign-off. This is a domain knowledge contribution, not a developer task.
Should extraction run automatically on upload (fire-and-forget until the user opens the review screen), or block the upload flow until extraction completes? Automatic background extraction with a notification when ready is the better UX but requires an async job queue.
When a VEU registration certificate lists multiple products (a single certificate covering a product family), how should the extraction handle the multiple model numbers? Alastair to confirm whether this is a common scenario.