Engineering & AI
How SpecMake works
SpecMake is a five-stage pipeline that turns a technical PDF into structured, audited, compliance-checked product data. The extraction is grounded against the source document; the compliance and DPP scoring run against EU regulation rule sets we encoded by hand; the privacy posture is private-by-default. Here is the pipeline, stage by stage.
The pipeline
Five stages. Every upload runs stages 1 to 4. Stage 5, translation, is the only optional stage and runs when the user selects target languages.
1. Extract
The PDF is parsed with position-aware text extraction so table rows, columns, and compliance markers survive into the next stage. The extracted text layer and the raw PDF bytes are fed to the model in parallel — important when a single value in a compliance table needs to be verifiable against the source. DOCX inputs are extracted with native formatting preserved.
2. Structure
The model classifies the document's industry domain (one of 13 industrial verticals, including coatings, hydraulics, measurement instrumentation, and electrical) and structures the content into canonical JSON. Every field carries a source anchor (page number plus verbatim quote) and, optionally, a confidence score with a stated reason, so any extracted value can be verified against the original document in one click.
3. Audit
A separate AI pass compares the structured output against the source for completeness, flagging missing fields, internal contradictions, and coverage gaps. Findings are sorted by severity. The audit runs against the extracted text layer (not a re-read of the PDF), which keeps cost predictable.
4. Compliance & DPP scoring
The structured output is matched against EU regulation rule sets: CE marking, REACH (Annex XVII), RoHS (2011/65/EU), ATEX (2014/34/EU), IP rating (IEC 60529), and a product safety baseline. DPP readiness is scored against EU field catalogs for the categories in scope today, with more added as the regulation rolls out. All scoring runs against the source-language structure, so a German product data sheet (PDS) and an Italian PDS of the same product produce identical compliance reports. The scoring is language-invariant by design.
5. Translate (optional)
For each target language, descriptive values are translated with a domain-aware prompt that respects the classified industry and merges any glossary entries (per-user and per-team, with priority rules). A grammar validation pass catches agreement and declension errors. Diagrams pass through unchanged. Selecting no target languages runs only stages 1–4 and emits structured output in the source language.
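The control flow of the five stages can be sketched as follows. This is a minimal illustration, not SpecMake's implementation: `run_pipeline`, `PipelineResult`, and the stage names are hypothetical, and the stage bodies are omitted so that only the "stages 1 to 4 always, stage 5 on request" rule is shown.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineResult:
    stages_run: list[str] = field(default_factory=list)
    translations: dict[str, dict] = field(default_factory=dict)


def run_pipeline(document: bytes, target_languages: list[str]) -> PipelineResult:
    """Run stages 1-4 unconditionally; run stage 5 once per selected language."""
    result = PipelineResult()
    # Stages 1-4 always run. The real stage logic is out of scope here,
    # so this sketch only records which stages were executed.
    for stage in ("extract", "structure", "audit", "compliance_dpp"):
        result.stages_run.append(stage)
    # Stage 5 is the only optional stage: it runs when languages are selected.
    if target_languages:
        result.stages_run.append("translate")
        for lang in target_languages:
            result.translations[lang] = {}  # placeholder for translated output
    return result
```

Calling `run_pipeline(pdf_bytes, [])` records four stages and no translations; passing `["de", "it"]` adds the translate stage and one entry per language.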
What we built around the model
The pipeline calls a frontier AI model at each stage that needs judgment, but the surrounding system is the product. None of the following is an AI call.
EU regulation rule sets
The regulations covered today (CE, REACH, RoHS, ATEX, IP, product safety) are encoded by hand from regulation text — not auto-generated, not lifted from a public list. Domain-scoped so RoHS only fires on electrical equipment, ATEX only on hazardous-area products.
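Domain scoping of this kind can be sketched as a lookup from product tags to rule sets. The `RuleSet` type, the tag names, and the choice of which rule sets are unscoped are assumptions made for illustration. Only the RoHS and ATEX scoping comes from the description above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class RuleSet:
    name: str
    # None means "applies to every product" (an assumption in this sketch).
    scope: Optional[FrozenSet[str]] = None


RULE_SETS = [
    RuleSet("CE marking"),
    RuleSet("REACH"),
    RuleSet("RoHS", frozenset({"electrical"})),
    RuleSet("ATEX", frozenset({"hazardous_area"})),
]


def applicable_rule_sets(product_tags: set) -> list:
    """Return the names of rule sets whose scope matches the product's tags."""
    return [
        rs.name
        for rs in RULE_SETS
        if rs.scope is None or rs.scope & product_tags
    ]
```

With these hypothetical tags, a hydraulic fitting tagged `{"hydraulics"}` is checked against CE marking and REACH only, while a product tagged `{"electrical", "hazardous_area"}` also triggers RoHS and ATEX.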
DPP readiness catalogs
Typed field requirements for the EU Digital Product Passport categories that are currently in scope. Versioned so updates to the regulation can be tracked. Industry-to-category mapping is automatic.
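A typed, versioned catalog and a readiness score might look like the sketch below. The field names, the version string, and the scoring formula (the share of catalog fields that are present with the expected type) are all hypothetical. They are not taken from any real DPP catalog or from SpecMake's scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRequirement:
    name: str
    expected_type: type


# Hypothetical catalog; real DPP field catalogs are defined by the regulation.
CATALOG_VERSION = "example-0"
CATALOG = [
    FieldRequirement("manufacturer_name", str),
    FieldRequirement("product_mass_kg", float),
    FieldRequirement("recycled_content_pct", float),
]


def readiness(product: dict) -> tuple:
    """Return (score in [0, 1], list of missing or wrongly typed fields)."""
    missing = [
        req.name
        for req in CATALOG
        if not isinstance(product.get(req.name), req.expected_type)
    ]
    score = 1 - len(missing) / len(CATALOG)
    return score, missing
```

Returning the list of failing fields alongside the score keeps the result actionable: the score says how ready a product is, the list says what to fix.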
Language-invariant compliance
Compliance and DPP scoring run on the source-language structure, never on a translation. The same source PDS produces an identical compliance report whether you process it as German, Italian, Polish, or any of the other supported languages.
Unit normalization
Dimensional analysis across the unit systems industrial specs use — pressure, length, mass, temperature, and others. 350 bar and 35 MPa compare as equivalent. Dimension mismatches (bar vs kg) are flagged distinctly in supplier comparisons.
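The 350 bar versus 35 MPa comparison can be reproduced with a small table of conversion factors. This is a simplified sketch under stated assumptions: the unit table, `equivalent`, and `DimensionMismatch` are illustrative names, and temperature is left out because it needs an offset as well as a factor.

```python
import math

# unit -> (dimension, factor to the SI base unit of that dimension)
UNITS = {
    "Pa": ("pressure", 1.0),
    "bar": ("pressure", 1e5),
    "MPa": ("pressure", 1e6),
    "mm": ("length", 1e-3),
    "m": ("length", 1.0),
    "g": ("mass", 1e-3),
    "kg": ("mass", 1.0),
}


class DimensionMismatch(ValueError):
    """Raised when two values measure different physical dimensions."""


def to_base(value: float, unit: str) -> tuple:
    dimension, factor = UNITS[unit]
    return dimension, value * factor


def equivalent(a: float, a_unit: str, b: float, b_unit: str) -> bool:
    """Compare two quantities after converting both to SI base units."""
    dim_a, base_a = to_base(a, a_unit)
    dim_b, base_b = to_base(b, b_unit)
    if dim_a != dim_b:
        # Reported separately from "values differ", as described above.
        raise DimensionMismatch(f"{a_unit} is {dim_a}, {b_unit} is {dim_b}")
    return math.isclose(base_a, base_b, rel_tol=1e-9)
```

Here `equivalent(350, "bar", 35, "MPa")` is true because both convert to 35,000,000 Pa, while comparing bar with kg raises `DimensionMismatch` instead of returning false.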
Source-quote anchoring
Every extracted field carries a source anchor — page number and verbatim quote — so any value can be opened next to the original PDF page in one click. Works on saved documents via private signed URLs.
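One way to check that an anchor is grounded is to confirm that its quote appears on the page it cites. The sketch below assumes page text is available as a list of strings and collapses whitespace before matching. Both `SourceAnchor` and the whitespace handling are illustrative choices, not SpecMake's actual matching rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceAnchor:
    page: int   # 1-based page number
    quote: str  # verbatim quote from that page


def _collapse(text: str) -> str:
    # Collapse runs of whitespace so line breaks in the PDF text layer
    # do not break an otherwise verbatim match (an assumption of this sketch).
    return " ".join(text.split())


def anchor_is_grounded(anchor: SourceAnchor, pages: list) -> bool:
    """True if the quote is non-empty and occurs on the cited page."""
    if not 1 <= anchor.page <= len(pages):
        return False
    quote = _collapse(anchor.quote)
    return bool(quote) and quote in _collapse(pages[anchor.page - 1])
```

An anchor that cites the wrong page, a page that does not exist, or a quote that was paraphrased fails the check.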
Cross-document alignment
Two-document spec diff and N-way supplier comparison align fields across vendors that name and group properties differently. Unit normalization makes the numeric comparison meaningful, not nominal.
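Aligning fields that vendors name differently can be sketched with an alias table that maps each vendor label to a canonical field name. The alias entries and the `align` function are hypothetical, and a hand-written table is a deliberate simplification of whatever matching the product actually uses.

```python
# vendor label (lower-cased, whitespace-collapsed) -> canonical field name
ALIASES = {
    "max. pressure": "max_pressure",
    "maximum operating pressure": "max_pressure",
    "weight": "mass",
    "mass": "mass",
}


def canonical(label: str) -> str:
    key = " ".join(label.lower().split())
    # Unknown labels fall through unchanged so nothing is silently dropped.
    return ALIASES.get(key, key)


def align(docs: dict) -> dict:
    """Build {canonical_field: {vendor: value or None}} across N documents."""
    rows = {}
    for vendor, fields in docs.items():
        for label, value in fields.items():
            rows.setdefault(canonical(label), {})[vendor] = value
    vendors = list(docs)
    # Fill gaps with None so every row has one cell per vendor.
    return {
        name: {vendor: cells.get(vendor) for vendor in vendors}
        for name, cells in rows.items()
    }
```

For two suppliers that list "Max. Pressure" and "Maximum operating pressure", `align` produces a single `max_pressure` row with one cell per supplier, ready for a unit-normalized comparison.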
Glossary & template memory
Per-user and per-team glossary entries persist across uploads. Custom output templates — logo, layout, field selection — are stored per organization.
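Merging two glossary layers can be sketched as a dictionary overlay. The precedence used here, where per-user entries override per-team entries, is an assumption made only to have something concrete to show. The priority rules themselves are not specified above.

```python
def merge_glossaries(team: dict, user: dict) -> dict:
    """Overlay per-user entries on per-team entries for one target language."""
    merged = dict(team)  # start from the shared team glossary
    merged.update(user)  # assumed rule: a user's entry wins on conflict
    return merged


team_de = {"seal": "Dichtung", "housing": "Gehäuse"}
user_de = {"seal": "Abdichtung"}
glossary = merge_glossaries(team_de, user_de)
```

Under this assumed rule, `glossary` keeps the team's entry for "housing" and uses the user's entry for "seal".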
Private-by-default storage
Source PDFs and extracted diagrams live in private storage with no public read access. Every render path mints a fresh short-lived signed URL server-side. The database stores only the storage path, never a URL.
The AI we use
SpecMake calls Anthropic Claude — frontier model, strong privacy posture, EU data-handling alignment. Different parts of the pipeline route to different model tiers, picked for the accuracy-vs-cost trade-off of each task. We re-evaluate every routing decision against accuracy benchmarks on real industrial documents whenever a new model generation ships.
The choice of vendor is intentional. We chose Anthropic for its safety and privacy posture, its default of not training on API traffic, and its model quality on technical-document extraction, which we benchmarked directly against alternatives.
Privacy, in plain language
- We never train models on your data. Anthropic does not train on API traffic by default; we have not opted in to any training program and never will.
- Your documents stay private. Source PDFs and extracted diagrams live in private storage. Every internal render mints a short-lived signed URL server-side.
- You control retention. Document deletion is immediate. Account deletion is a two-step, email-confirmed flow that cascades to all your data.
- Hosted in the EU. Database and storage run in EU regions. Anthropic API calls are EU-routed.
Where SpecMake fits
SpecMake is the data preparation layer. The output is portable; integration is direct.
Pair with a DPP platform
JSON-LD output mapped to schema.org Product vocabulary feeds directly into platforms that handle passport hosting, QR generation, and registry submission.
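A JSON-LD document using the schema.org Product vocabulary might be assembled as below. The product name, manufacturer, and property are made up, and mapping each spec value to an `additionalProperty` entry of type `PropertyValue` is one plausible choice. The exact shape of SpecMake's output is not described here.

```python
import json


def to_json_ld(name: str, manufacturer: str, properties: dict) -> dict:
    """Map a name, a manufacturer, and {field: (value, unit)} to schema.org Product."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "manufacturer": {"@type": "Organization", "name": manufacturer},
        "additionalProperty": [
            {"@type": "PropertyValue", "name": key, "value": value, "unitText": unit}
            for key, (value, unit) in properties.items()
        ],
    }


# Hypothetical product used only for illustration.
doc = to_json_ld("Pump X", "Acme", {"max_pressure": (350, "bar")})
serialized = json.dumps(doc, indent=2)
```

The `serialized` string is plain JSON, so it can be handed to any platform that accepts JSON-LD without a SpecMake-specific client.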
Pair with your PIM
Structured JSON, XLSX, CSV, and PDF/DOCX outputs slot into product information management workflows. Custom field mapping is on the roadmap.
Pair with your audit workflow
Per-document audit findings, per-regulation pass/fail breakdowns, and source-anchored field verification — exported or reviewed in-browser before data enters downstream systems.
Try the pipeline on your own document
Upload a spec sheet or technical document. Structured data, audit findings, and a compliance check come back in seconds.