
Certificate and Document Data Extraction

Automated AI-based extraction of structured data from uploaded compliance documents — test reports, scheme certificates, and registration notices — with automatic population of product records.

Overview

The extraction pipeline is an AI-powered service that parses uploaded compliance documents and extracts structured data to populate product records in the platform database. Supported document types are AS/NZS 5125.1 test reports, VEU and ESS scheme certificates, SRES/STC registration notices issued by the Clean Energy Regulator (CER), Watermark certificates, AS/NZS 4234 compliance reports, and GEMS energy rating labels.

This feature is not a standalone tool — it is a shared service called by other platform features (5125.1 report analyser, client portal, product compliance tracker) whenever a document is uploaded. The user-facing experience is: upload a document, review the extracted data, confirm, and it populates the relevant fields automatically.

The 5125.1 report analyser PRD covers the specific extraction requirements for test reports in detail. This PRD covers the general extraction pipeline, the additional document types beyond test reports, and the admin tooling for maintaining extraction schemas.

User Stories

  • As a manufacturer client uploading a new SRES certificate via the client portal, I want the certificate number, registration date, and expiry date to automatically populate my product’s compliance record so I don’t re-enter data I’ve just uploaded.
  • As an EnergyAE consultant processing a batch of documents for a client, I want each document type to be auto-detected and routed to the correct extraction schema so I don’t have to specify the document type manually for every upload.
  • As a user reviewing an extraction result, I want to see which fields were extracted with high confidence and which need my review so I know where to focus my attention.
  • As Alastair, I want to update the extraction schema for a document type myself, without waiting for a developer, whenever a scheme administrator changes its certificate format.

Document Types and Extracted Fields

AS/NZS 5125.1 test reports

Covered in detail in the 5125.1 report analyser PRD. Summary: product identity, test lab, test date, performance test points at each ambient condition (capacity, power input, COP, delivery volume).

VEU Activity 44 registration certificate

Issued by the Essential Services Commission (ESC) in Victoria when a product is registered.

Extracted fields:

  • Product name and model number
  • Brand / manufacturer
  • Registration number
  • Registration date
  • Registration status (active / conditional)
  • Registered zones or activity conditions (if specified)
  • Certificate document reference number

ESS HEAB approval

Issued by IPART / NSW ESC.

Extracted fields:

  • Product name and model number
  • Brand / manufacturer
  • Approval number
  • Issue date
  • Expiry date (if applicable)
  • Scheme category (HEAB method variant)

SRES / STC registration notice

Issued by the Clean Energy Regulator.

Extracted fields:

  • Product name and model number
  • Brand
  • Registration number (SWH/HP registration number)
  • Effective date
  • Registered zones (climate zones 1-7)
  • STCs per unit per zone
  • Expiry date (if applicable)
  • Annual deeming year

Watermark certificate

Issued by an accredited Watermark certification body (for example, SAI Global).

Extracted fields:

  • Product name and model number
  • Brand
  • Certificate number
  • Issue date
  • Expiry date
  • Certification body
  • Applicable standard (AS/NZS 3500, etc.)
  • Licence number (if applicable)

AS/NZS 4234 compliance report (zone energy factors)

Issued by EnergyAE or another accredited simulation body.

Extracted fields:

  • Product name and model number
  • Report date
  • Zone energy factor per climate zone (7 values)
  • Draw-off profile used
  • Simulation methodology version
  • Certifying body

GEMS energy rating label

Extracted fields:

  • Product name and model number
  • Star rating
  • Electricity consumption (kWh/year at standard conditions)
  • Registration number
  • Expiry date

Extraction Pipeline

Step 1: Document type detection

On upload, the system attempts to automatically identify the document type based on:

  • PDF metadata (document title, producer)
  • Text patterns: e.g. “AS/NZS 5125.1” in the text triggers the 5125.1 schema; “Essential Services Commission” triggers the VEU schema
  • LLM classification fallback: if pattern matching is inconclusive, send the first page to the LLM with a classification prompt

If auto-detection fails, the user is prompted to select the document type manually.
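
The detection order described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the document type keys, the DETECTION_PATTERNS table, and the llm_classify callable are all hypothetical stand-ins (in practice the patterns would come from the admin-managed schemas, and the classifier would wrap whichever LLM the platform uses).

```python
import re
from typing import Callable, Optional

# Hypothetical patterns; the two examples mirror the text patterns named above.
DETECTION_PATTERNS: dict[str, list[str]] = {
    "as_nzs_5125_1_test_report": [r"AS/NZS\s*5125\.1"],
    "veu_registration_certificate": [r"Essential Services Commission"],
}


def detect_document_type(
    full_text: str,
    first_page_text: str,
    metadata_title: str = "",
    llm_classify: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return a document type key, or None if the user must select one manually."""
    haystack = f"{metadata_title}\n{full_text}"
    matches = [
        doc_type
        for doc_type, patterns in DETECTION_PATTERNS.items()
        if any(re.search(p, haystack, re.IGNORECASE) for p in patterns)
    ]
    if len(matches) == 1:
        return matches[0]
    # Zero or several matches is treated as inconclusive: try the LLM on page one.
    if llm_classify is not None:
        guess = llm_classify(first_page_text)
        if guess in DETECTION_PATTERNS:
            return guess
    return None  # caller prompts the user to pick the document type
```

Treating multiple pattern matches as inconclusive is a design assumption; a report that cites several standards would otherwise be routed by whichever pattern happened to be checked first.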

Step 2: Extraction

LLM-based extraction using the same pipeline as the 5125.1 analyser. Each document type has a defined schema specifying which fields to extract and what format each field should be in.

Primary extraction method: pass PDF text to the LLM with a structured extraction prompt. The LLM returns a JSON object matching the schema.

Fallback for scanned or image-based PDFs: use vision-based extraction (pass PDF page images to the LLM). Developer to assess whether this is needed for each document type based on sample documents provided by Alastair.
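
A rough sketch of the text-based extraction step, under stated assumptions: the schema is a plain dict with "fields" and "extraction_prompt_template" keys, the template uses {field_names} and {document_text} placeholders, call_llm is a hypothetical callable wrapping the LLM provider, and the 200-character threshold for deciding that a PDF has no usable text layer is an arbitrary placeholder to be tuned against real samples.

```python
import json
from typing import Any, Callable

MIN_TEXT_CHARS = 200  # assumed threshold; tune against sample documents


def needs_vision_fallback(pdf_text: str) -> bool:
    """Heuristic: a near-empty text layer suggests a scanned or image-based PDF."""
    return len(pdf_text.strip()) < MIN_TEXT_CHARS


def extract_fields(
    pdf_text: str,
    schema: dict[str, Any],
    call_llm: Callable[[str], str],
) -> dict[str, Any]:
    """Prompt the LLM and return one entry per schema field (None if missing)."""
    field_names = [f["name"] for f in schema["fields"]]
    prompt = schema["extraction_prompt_template"].format(
        field_names=", ".join(field_names),
        document_text=pdf_text,
    )
    raw = call_llm(prompt)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}  # unparseable output: every field becomes "not found"
    if not isinstance(data, dict):
        data = {}
    # Keep only fields the schema defines; ignore anything extra the LLM adds.
    return {name: data.get(name) for name in field_names}
```

Filtering the response down to the schema's own field names keeps unexpected keys in the LLM output from reaching the product record.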

Step 3: Confidence flagging

Each extracted field is assigned a confidence level:

  • High: field cleanly parsed and value is within expected range/format
  • Low: field found but value is ambiguous, formatted unusually, or in an unexpected location
  • Not found: field not located in the document

Low-confidence and not-found fields are highlighted in the review UI. The user must explicitly confirm or correct these before the data is committed.
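
One possible way to assign the three levels is a format check per field, sketched below. It assumes each field definition carries a regular expression describing its expected format (a hypothetical format rule); it does not cover other signals mentioned above, such as a value appearing in an unexpected location, which would need input from the LLM itself.

```python
import re
from typing import Optional


def flag_confidence(value: Optional[str], format_regex: str) -> str:
    """Return "high", "low", or "not_found" for one extracted field."""
    if value is None or not value.strip():
        return "not_found"
    if re.fullmatch(format_regex, value.strip()):
        return "high"  # value matches the expected format
    return "low"  # value present but formatted unexpectedly


def needs_review(confidence: str) -> bool:
    """Low-confidence and not-found fields must be confirmed or corrected."""
    return confidence in ("low", "not_found")
```

For example, with an ISO date rule of \d{4}-\d{2}-\d{2}, the value "2025-03-01" would be flagged high and "1 March 2025" would be flagged low.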

Step 4: User review

The extraction review screen shows:

  • A table of extracted fields with their values and confidence indicators
  • Low-confidence and not-found fields are pre-selected for review
  • User can edit any field inline
  • User confirms the reviewed data before it is committed

The review step cannot be skipped for documents being used to populate a compliance record.
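
The rule that review cannot be skipped could be enforced server-side with a check like the one below. This is a sketch only: it assumes parsed_fields has the shape field name → {"value", "confidence"} and that the UI reports which fields the user has explicitly confirmed or corrected; the function names are hypothetical.

```python
from typing import Any


def unresolved_fields(
    parsed_fields: dict[str, dict[str, Any]],
    reviewed: set[str],
) -> list[str]:
    """Fields that are not high-confidence and have not yet been reviewed."""
    return sorted(
        name
        for name, entry in parsed_fields.items()
        if entry["confidence"] != "high" and name not in reviewed
    )


def can_commit(
    parsed_fields: dict[str, dict[str, Any]],
    reviewed: set[str],
    user_confirmed: bool,
) -> bool:
    """Allow commit only after overall confirmation with nothing left unresolved."""
    return user_confirmed and not unresolved_fields(parsed_fields, reviewed)
```

Checking this on the server rather than only in the review screen means a client cannot bypass the review step by calling the commit endpoint directly.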

Step 5: Commitment to product record

After review and confirmation:

  • Extracted fields are written to the relevant section of the product record in the database
  • Document is stored in the product’s documents library with the extraction metadata (timestamp, confidence flags, extractor version)
  • If the same document type already exists on the product record, the user is prompted: “Replace existing record?” with a version history link
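
The commit behaviour above, including the replace prompt, might look like the following sketch. An in-memory dict stands in for the product record, and the "compliance" and "history" keys, the exception name, and the function signature are all illustrative assumptions rather than the platform's actual data access layer.

```python
from datetime import datetime, timezone
from typing import Any


class ReplaceConfirmationRequired(Exception):
    """Raised so the UI can show the "Replace existing record?" prompt."""


def commit_extraction(
    product_record: dict[str, Any],
    document_type: str,
    fields: dict[str, Any],
    extractor_version: str,
    replace_confirmed: bool = False,
) -> None:
    existing = product_record["compliance"].get(document_type)
    if existing is not None and not replace_confirmed:
        raise ReplaceConfirmationRequired(document_type)
    if existing is not None:
        # Keep the superseded entry so the version history link has data.
        product_record["history"].append(existing)
    product_record["compliance"][document_type] = {
        "fields": dict(fields),
        "committed_at": datetime.now(timezone.utc).isoformat(),
        "extractor_version": extractor_version,
    }
```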

Extraction Schema Management (Admin)

Alastair must be able to update extraction schemas when certificate formats change, without developer involvement.

Each schema defines:

  • Document type name and description
  • Fields to extract (name, data type, format rules, required/optional)
  • Extraction prompt template (the LLM prompt used to extract data from this document type)
  • Pattern-matching rules for auto-detection
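
As an illustration of what one schema could contain, the sketch below models the four parts listed above as Python dataclasses and fills in a partial example for the SRES / STC registration notice. The class names, the format_regex rule, the detection pattern, and the prompt wording are all placeholders; the real values depend on the sample documents and on prompt tuning.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldDefinition:
    name: str
    data_type: str  # e.g. "string", "date", "integer"
    required: bool = True
    format_regex: Optional[str] = None  # assumed way to express a format rule


@dataclass
class ExtractionSchema:
    document_type: str
    display_name: str
    description: str
    fields: list[FieldDefinition]
    extraction_prompt_template: str
    detection_patterns: list[str] = field(default_factory=list)

    def required_field_names(self) -> list[str]:
        return [f.name for f in self.fields if f.required]


ISO_DATE = r"\d{4}-\d{2}-\d{2}"

# Illustrative only: a subset of the SRES fields with placeholder rules.
SRES_SCHEMA = ExtractionSchema(
    document_type="sres_stc_registration_notice",
    display_name="SRES / STC registration notice",
    description="Registration notice issued by the Clean Energy Regulator.",
    fields=[
        FieldDefinition("product_name", "string"),
        FieldDefinition("model_number", "string"),
        FieldDefinition("registration_number", "string"),
        FieldDefinition("effective_date", "date", format_regex=ISO_DATE),
        FieldDefinition("expiry_date", "date", required=False, format_regex=ISO_DATE),
    ],
    extraction_prompt_template=(
        "Extract the following fields as JSON: {field_names}\n\n{document_text}"
    ),
    detection_patterns=[r"Clean Energy Regulator"],
)
```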

Admin UI allows Alastair to:

  • View all extraction schemas
  • Edit the prompt template for any schema
  • Add or remove fields from a schema
  • Test a schema against an uploaded sample document and review the output before saving

Changes to prompts take effect for new uploads immediately; existing extraction results are not retroactively re-extracted.

Integration Points

This feature has no standalone page. It is called by:

  • 5125.1 report analyser: uses the extraction pipeline for test report parsing (covered in that PRD in detail)
  • Client portal: on document upload, auto-extracts and shows the extraction result in the submission review step; extracted data can be used to pre-populate the product record if the client has one
  • Product compliance tracker: document upload slots on checklist items trigger extraction; extracted dates (expiry, registration date) auto-populate the scheme registration fields
  • Direct upload: users can also trigger extraction directly from the product record documents library by uploading any supported document type

Out of Scope (v1)

  • Batch extraction of multiple documents in a single operation
  • Extraction from Word documents or Excel files (PDF only in v1)
  • Automated re-extraction when a schema is updated (documents are extracted once at upload time)
  • NLP-based extraction of unstructured guidance documents (this is for structured certificate-format documents only; the RAG chatbot handles unstructured documents)
  • Integration with scheme administrator portals for live registration data (extraction from uploaded documents only)

Data Model (indicative)

document_extractions
  extraction_id
  document_id (foreign key to product documents table)
  document_type (enum)
  extracted_at
  extractor_version
  raw_llm_output (JSON: full LLM response)
  parsed_fields (JSON: field name → value, confidence)
  user_reviewed (boolean)
  user_confirmed_at (nullable)
  confirmed_by (user_id, nullable)
  committed_to_record (boolean)

extraction_schemas (admin-managed)
  schema_id
  document_type
  display_name
  detection_patterns (JSON: regex and keyword patterns for auto-detection)
  fields (JSON: array of field definitions)
  extraction_prompt_template (text)
  updated_at
  updated_by
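
For discussion, the indicative model above could be expressed as SQL roughly as follows, shown here against an in-memory SQLite database so it can be run as-is. Column types, the use of TEXT for JSON and timestamps, and the CHECK constraint tying committed_to_record to a user confirmation are assumptions layered on top of the column list; the production database and its types may differ.

```python
import sqlite3

DDL = """
CREATE TABLE extraction_schemas (
    schema_id                  INTEGER PRIMARY KEY,
    document_type              TEXT NOT NULL UNIQUE,
    display_name               TEXT NOT NULL,
    detection_patterns         TEXT NOT NULL,  -- JSON: regex and keyword patterns
    fields                     TEXT NOT NULL,  -- JSON: array of field definitions
    extraction_prompt_template TEXT NOT NULL,
    updated_at                 TEXT NOT NULL,
    updated_by                 INTEGER
);

CREATE TABLE document_extractions (
    extraction_id       INTEGER PRIMARY KEY,
    document_id         INTEGER NOT NULL,  -- FK to product documents table (not defined here)
    document_type       TEXT NOT NULL,
    extracted_at        TEXT NOT NULL,
    extractor_version   TEXT NOT NULL,
    raw_llm_output      TEXT,              -- JSON: full LLM response
    parsed_fields       TEXT,              -- JSON: field name -> value, confidence
    user_reviewed       INTEGER NOT NULL DEFAULT 0,
    user_confirmed_at   TEXT,
    confirmed_by        INTEGER,
    committed_to_record INTEGER NOT NULL DEFAULT 0,
    -- Assumed rule: nothing is committed without a recorded confirmation.
    CHECK (committed_to_record = 0 OR user_confirmed_at IS NOT NULL)
);
"""


def create_tables(conn: sqlite3.Connection) -> None:
    conn.executescript(DDL)
```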

Acceptance Criteria

  • Document type auto-detection correctly identifies document type for at least one sample of each supported type (Alastair to provide samples before build)
  • Extraction correctly populates all required fields for each document type against Alastair’s known-good reference documents
  • Low-confidence fields are correctly identified and highlighted in the review UI
  • User can edit any extracted field inline before confirmation
  • Review step cannot be bypassed for compliance record population
  • Confirmed data writes correctly to the product record (correct field mapping for each document type)
  • Uploading a document type that already exists on the product record prompts the user to confirm replacement, with a link to the version history
  • Extraction schema can be updated by Alastair via admin UI without code deploy
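  • Test-schema-against-sample function in admin UI runs the edited schema against an uploaded sample document and shows the extracted output before the schema is saved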
  • Vision-based fallback is used for scanned PDFs (developer to confirm which document types are likely to be scanned)

Open Questions

  • Alastair to provide sample documents for each supported type before the developer begins building extraction schemas. These are needed to write the prompt templates and define the field locations. This is the critical pre-build deliverable.
  • Which document types are commonly issued as scanned (image-based) PDFs rather than digital PDFs? VEU and ESS certificates may be digital; older Watermark certificates may be scanned. Alastair to assess.
  • For the extraction prompt templates: these will need tuning based on real documents. Alastair to commit to participating in prompt development and testing before sign-off. This is a domain knowledge contribution, not a developer task.
  • Should extraction run automatically on upload (fire-and-forget until the user opens the review screen), or block the upload flow until extraction completes? Automatic background extraction with a notification when ready is the better UX but requires an async job queue.
  • When a VEU registration certificate lists multiple products (a single certificate covering a product family), how should the extraction handle the multiple model numbers? Alastair to confirm whether this is a common scenario.